OpenCL™ and the ATI Stream SDK v2.0

» Overview

» History and Motivation

» What is OpenCL

» Porting CUDA to OpenCL

» Conclusion

» Useful links

Overview

AMD recently released the ATI Stream v2.0 SDK, which includes one of the first complete OpenCL runtimes, allowing software developers to implement data- and task-parallel computations using a single, unified programming model. The ATI Stream v2.0 SDK includes a CPU runtime for OpenCL, allowing OpenCL execution on x86 multi-core CPUs, as well as a GPU runtime providing OpenCL execution on AMD GPUs. This enables programmers to use all the cores in their system, both conventional processors and stream processors, with the same code base. This article will discuss the history and motivation for OpenCL and the ATI Stream SDK, give a small introduction to OpenCL programming, and describe how NVIDIA CUDA programs can be easily ported to OpenCL, enabling high-performance, cross-platform parallel computing.

History and Motivation

The ongoing move towards parallel processors is now well underway, driven by the capabilities and limitations of modern semiconductor manufacturing. Parallel processing (see Figure 1) provides continued performance increases at reasonable power consumption. However, this move towards parallel processing comes at a cost: in order for future programs to take advantage of increases in processing power, they must be written in a scalable fashion. Writing scalable, parallel programs has been a difficult exercise for many years, and now it is becoming mandatory for all computationally-intensive programs.

Figure 1: Some current Parallel Processors

Complicating this move towards parallel processing is the diversity and heterogeneity of the various parallel architectures that are now available. Traditional CPUs have become multi-core parallel processors, with two to eight cores in a socket. GPUs, which have always been very parallel, boasting hundreds of parallel execution units on a single die, have now become increasingly programmable, to the point that it is now often useful to think of GPUs as many-core processors instead of special purpose accelerators. All of this diversity has been reflected in an array of incompatible tools and programming models required for programming these architectures, resulting in a lot of developer pain and high costs when trying to write cross-platform programs.

OpenCL is an answer to this problem. When using OpenCL, developers can use a single, unified toolchain and language to target all of the parallel processors currently in use. This is done by presenting the developer with an abstract platform model that conceptualizes all of these architectures in a similar way, as well as an execution model supporting data and task parallelism across heterogeneous architectures.

OpenCL is now the preferred programming environment for AMD GPUs, and forms the core of the ATI Stream SDK. Brook+, which is a higher-level streaming programming model targeting AMD GPUs, has now been released as an open source project (http://sourceforge.net/projects/brookplus/). The ATI Compute Abstraction Layer (CAL) lives on as part of the foundation of the OpenCL runtime, but is not intended to be the primary means for developers to access AMD GPUs.

What is OpenCL

OpenCL is a standard managed by the Khronos Group, which also manages OpenGL®, the cross-platform graphics API. OpenCL contributors include AMD, Apple, Intel and NVIDIA, among others from all corners of the computer industry. Similarly to OpenGL, OpenCL provides an API and a runtime. When an OpenCL program is executed, a series of API calls configure the system for execution, an embedded Just-In-Time (JIT) compiler compiles the OpenCL code, and the runtime asynchronously coordinates execution between parallel kernels. OpenCL's strengths are its practicality, flexibility and retargetability.

OpenCL Platform Model

OpenCL sees today's heterogeneous world through the lens of an abstract, hierarchical platform model. In this model, a host coordinates execution, transferring data to and from an array of Compute Devices. Each Compute Device is composed of an array of Compute Units, and each Compute Unit is composed of an array of Processing Elements. One of OpenCL's strengths is that this model does not specify exactly what hardware constitutes a compute device. Thus, a compute device may be a GPU, such as the ATI Radeon™ HD 5870 GPU, or a CPU, such as the AMD Phenom™ II X4 processor, or other processors such as the Cell Broadband Engine. The OpenCL platform model (see Figure 2) is designed to present a uniform view of many different kinds of parallel processors.

Figure 2: OpenCL Platform Model
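For illustration, a minimal host-code sketch of walking this hierarchy follows (it assumes an OpenCL 1.x implementation such as the one in the ATI Stream SDK; error handling is omitted for brevity):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id devices[8];
        cl_uint num_devices;

        /* Find an OpenCL platform (e.g., the ATI Stream SDK runtime). */
        clGetPlatformIDs(1, &platform, NULL);

        /* Enumerate all compute devices (CPUs and GPUs) on the platform. */
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

        for (cl_uint i = 0; i < num_devices; ++i) {
            char name[128];
            cl_uint units;
            clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(units), &units, NULL);
            printf("Device %u: %s (%u compute units)\n", i, name, units);
        }
        return 0;
    }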

OpenCL Execution Model

OpenCL has a flexible execution model that incorporates both task and data parallelism. Data movements between the host and compute devices, as well as OpenCL tasks, are coordinated via command queues. Command queues provide a general way of specifying relationships between tasks, ensuring that tasks are executed in an order that satisfies the natural dependences in the computation. The OpenCL runtime is free to execute tasks in parallel if their dependencies are satisfied, which provides a general-purpose task parallel execution model. Tasks themselves can be comprised of data-parallel kernels, which apply a single function over a range of data elements, in parallel, allowing only restricted synchronization and communication during the execution of a kernel. These concepts will be further explained in this section.

Kernels

As mentioned, OpenCL kernels provide data parallelism. The kernel execution model is based on a hierarchical abstraction of the computation being performed. OpenCL kernels are executed over an index space, which can be 1, 2 or 3 dimensional. In Figure 3, we see an example of a 2 dimensional index space, which has Gx * Gy elements. For every element of the kernel index space, a work-item will be executed. All work-items execute the same program, although their execution may differ due to branching based on data characteristics or the index assigned to each work-item.

Figure 3: Executing Kernels – Work-Groups and Work-Items

The index space is regularly subdivided into work-groups, which are tilings of the entire index space. In Figure 3, we see a work-group of size Sx * Sy elements. Each work-item in the work-group receives a work-group id, labeled (wx, wy) in the figure, as well as a local id, labeled (sx, sy) in the figure. Each work-item also receives a global id, which can be derived from its work-group and local ids.
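As a small (hypothetical) illustration, here is a one-dimensional elementwise-add kernel; each work-item uses its global id to select the single element it processes:

    /* A minimal data-parallel kernel: one work-item per element. */
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        /* The global id identifies this work-item's element; it is
           equivalent to get_group_id(0) * get_local_size(0) + get_local_id(0). */
        size_t gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }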

The work-items may only communicate and synchronize locally, within a work-group, via a barrier mechanism. This provides scalability, traditionally the bane of parallel programming. Because communication and synchronization at the finest granularity is restricted in scope, the OpenCL runtime has great freedom in how work-items are scheduled and executed.

Command Queues

The division of a kernel into work-items and work-groups supports data-parallelism, but OpenCL supports another kind of parallelism as well, called task-parallelism. This is done via OpenCL command queues (see Figure 4). An OpenCL command queue is created by the developer through an API call, and associated with a specific compute device. If a developer wishes to target multiple OpenCL compute devices simultaneously, she should create multiple command queues.

Figure 4: Task Parallelism within a Command Queue

To execute a kernel, the kernel is pushed onto a particular command queue. Enqueueing a kernel is done asynchronously, so that the host program may enqueue many different kernels without waiting for any of them to complete. When enqueueing a kernel, the developer optionally specifies a list of events that must occur before the kernel executes. Events are generated by kernel completion, as well as memory read, write, and copy commands. This allows the developer to specify a dependence graph between kernel executions and memory transfers in a particular command queue or between command queues themselves, which the OpenCL runtime will traverse during execution. Figure 4 shows a task graph illustrating the power of this approach, where arrows indicate dependencies between tasks. For example, Kernel A will not execute until Write A and Write B have finished, and Kernel D will not execute until Kernel B and Kernel C have finished.
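A host-code sketch of the start of that task graph might look like the following (queue, kernel_a, buf_a, buf_b, host_a, host_b and bytes are assumed to have been created earlier; error handling is omitted):

    cl_event write_a, write_b, kernel_a_done;
    size_t global_size = 1024;

    /* Asynchronously copy input data to the device; each command
       produces an event that later commands can wait on.          */
    clEnqueueWriteBuffer(queue, buf_a, CL_FALSE, 0, bytes, host_a,
                         0, NULL, &write_a);
    clEnqueueWriteBuffer(queue, buf_b, CL_FALSE, 0, bytes, host_b,
                         0, NULL, &write_b);

    /* Kernel A runs only after both writes have completed. */
    cl_event deps[2] = { write_a, write_b };
    clEnqueueNDRangeKernel(queue, kernel_a, 1, NULL, &global_size, NULL,
                           2, deps, &kernel_a_done);

    /* The host is free to keep enqueueing; clFinish() blocks until
       every command in the queue has completed.                    */
    clFinish(queue);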

The ability to construct arbitrary task graphs is a powerful way of constructing task-parallel applications. The OpenCL runtime has the freedom to execute the task graph in parallel, as long as it respects the dependencies encoded in the task graph. Task graphs are general enough to represent the kinds of parallelism useful across the spectrum of hardware architectures, from CPUs to GPUs.

Developers are also free to construct multiple command queues, either for parallelizing an application across multiple compute devices, or for expressing more parallelism via completely independent streams of computation. OpenCL's ability to use both data and task parallelism simultaneously is a great benefit to parallel application developers, regardless of their intended hardware target.

Synchronization

Besides the task parallel constructs provided in OpenCL which allow synchronization and communication between kernels, OpenCL supports local barrier synchronizations within a work-group. This mechanism allows work-items to coordinate and share data in the local memory space using only very lightweight and efficient barriers. Work-items in different work-groups should never try to synchronize or share data, since the runtime provides no guarantee that all work-items are concurrently executing, and such synchronization easily introduces deadlocks.
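For example, a work-group can cooperatively sum its elements in local memory, with barrier() guaranteeing that every store is visible before the next round of reads (a minimal sketch, assuming the work-group size is a power of two; the scratch buffer is sized by the host):

    __kernel void partial_sum(__global const float *in,
                              __global float *group_sums,
                              __local float *scratch)
    {
        size_t lid = get_local_id(0);
        size_t lsz = get_local_size(0);

        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);      /* all stores visible to the group */

        /* Tree reduction within the work-group. */
        for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);  /* synchronize every round */
        }

        if (lid == 0)
            group_sums[get_group_id(0)] = scratch[0];
    }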

Work-items in different work-groups may coordinate execution through the use of atomic memory transactions, which are an OpenCL extension supported by some OpenCL runtimes, such as the ATI Stream SDK OpenCL runtime for the x86 multi-core compute devices. For example, work-items may append variable numbers of results to a shared queue in global memory. However, it is good practice that work-items do not, generally, attempt to communicate directly, as without careful design scalability and deadlock can become difficult problems. The hierarchy of synchronization and communication provided by OpenCL is a good fit for many of today's parallel architectures, while still providing developers the ability to write efficient code, even for parallel computations with non-trivial synchronization and communication patterns.
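A sketch of the shared-queue idea follows; atom_inc() comes from the cl_khr_global_int32_base_atomics extension, so this only runs on devices and runtimes that support it:

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

    /* Each work-item that finds a result appends it to a shared queue
       in global memory; atom_inc() returns the old counter value,
       giving each producer a unique slot.                             */
    __kernel void collect(__global const int *in,
                          __global int *queue_data,
                          __global int *queue_count)
    {
        int v = in[get_global_id(0)];
        if (v > 0) {                         /* keep only "interesting" items */
            int slot = atom_inc(queue_count);
            queue_data[slot] = v;
        }
    }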

OpenCL Memory Model

OpenCL has a relaxed consistency memory model, structured as shown in Figure 5.

Figure 5: OpenCL Memory Model

Each compute device has a global memory space, which is the largest memory space available to the device, and typically resides in off-chip DRAM. There is also a read-only, limited-size constant memory space, which allows for efficient reuse of read-only parameters in a computation. Each compute unit on the device has a local memory, which is typically on the processor die, and therefore has much higher bandwidth and lower latency than global memory. Local memory can be read and written by any work-item in a work-group, and thus allows for local communication within a work-group. Additionally, attached to each processing element is a private memory, which is typically not used directly by programmers, but is used to hold data for each work-item that does not fit in the processing element's registers.
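In kernel code these spaces appear as address space qualifiers, as in the following illustrative sketch:

    /* Program-scope constant data lives in the constant space. */
    __constant float scale[4] = { 0.5f, 1.0f, 2.0f, 4.0f };

    __kernel void spaces(__global float *data,    /* global: off-chip DRAM     */
                         __local  float *tile)    /* local: on-die, per group  */
    {
        float x = data[get_global_id(0)];         /* x lives in private memory */
        tile[get_local_id(0)] = x * scale[0];
        barrier(CLK_LOCAL_MEM_FENCE);             /* make tile visible to the group */
        data[get_global_id(0)] = tile[get_local_id(0)];
    }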

As OpenCL has a relaxed consistency model, different work-items may see a different view of global memory as the computation progresses. Within a work-item, reads and writes to all memory spaces are consistently ordered, but between work-items, synchronization is necessary in order to ensure consistency. This relaxed consistency model is an important part of OpenCL's efforts to provide parallel scalability: parallel programs that rely on strong memory consistency for synchronization and communication usually fail to execute in parallel, because memory ordering requirements force a serialization of such programs during execution, hindering scalability. Requiring explicit synchronization and communication between work-items encourages programmers to write scalable code, avoiding the trap often seen in parallel programming where code looks parallel, but ends up executing in serial due to frequent and implicit synchronization induced by reliance on a strict memory ordering model.

Additionally, OpenCL views the global memory space of each compute device as private and separate from host memory. Moving data between compute devices and the host requires the programmer to manually manage communication between the host and the compute devices. This is done through the use of explicit memory reads and writes between devices.
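In practice this means bracketing each kernel launch with explicit transfers, as in this host-code sketch (ctx, queue, kernel, host_in, host_out, bytes and global_size are assumed to exist; error handling omitted):

    /* Allocate a buffer in the device's global memory space. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);

    /* Explicitly copy input from host memory to the device... */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, host_in, 0, NULL, NULL);

    /* ...run the kernel on it... */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);

    /* ...and explicitly copy the results back to the host. */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, host_out, 0, NULL, NULL);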

Porting CUDA to OpenCL

The data-parallel programming model in OpenCL shares some commonalities with NVIDIA's C for CUDA programming model, making it relatively straightforward to convert programs from CUDA to OpenCL.

Terminology

Table 1: General Terminology

C for CUDA                 OpenCL
Thread                     Work-item
Thread block               Work-group
Global memory              Global memory
Constant memory            Constant memory
Shared memory              Local memory
Local memory               Private memory

Table 1 lists some general terminology for describing computations and memory spaces in both C for CUDA and OpenCL. These terms are fairly similar between the two systems.

Writing Kernels: Qualifiers

Table 2: Qualifiers for Kernel Functions

C for CUDA                 OpenCL
__global__ function        __kernel function
__device__ function        (no qualifier needed)
__constant__ variable      __constant variable
__device__ variable        __global variable
__shared__ variable        __local variable

Table 2 shows qualifiers that are added to functions and data when writing kernels in both CUDA and OpenCL. The biggest difference between the two is that in CUDA, __global__ functions are GPU entry points, and __device__ functions are to be executed on the GPU but are not callable from the host. In OpenCL, entry point functions are annotated with the __kernel qualifier, but non-entry point functions do not need to be annotated.
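The following hypothetical fragment shows the OpenCL side of this mapping, with the CUDA originals in comments:

    /* CUDA:   __device__ float square(float x) { return x * x; }
       OpenCL: helper functions need no qualifier at all.          */
    float square(float x) { return x * x; }

    /* CUDA:   __global__ void scale(float *out, float s) { ... }
       OpenCL: entry points are marked with __kernel (and pointer
       arguments carry their memory space, as discussed below).    */
    __kernel void scale(__global float *out, float s)
    {
        size_t i = get_global_id(0);
        out[i] = square(out[i]) * s;
    }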

Writing Kernels: Indexing

Table 3: Indexing functions for use in Kernels

C for CUDA                 OpenCL
gridDim                    get_num_groups()
blockDim                   get_local_size()
blockIdx                   get_group_id()
threadIdx                  get_local_id()
(computed manually)        get_global_id()
(computed manually)        get_global_size()

Table 3 shows the various indexing mechanisms provided by CUDA and OpenCL. CUDA provides kernel indexing via special pre-defined variables, while OpenCL provides the equivalent information through function calls. OpenCL also provides global indexing information, while CUDA requires manual computation of global indices.
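For example, the familiar CUDA global-index computation collapses to a single call in OpenCL (a hypothetical one-dimensional kernel):

    __kernel void copy(__global const float *in, __global float *out)
    {
        /* CUDA:   int i = blockIdx.x * blockDim.x + threadIdx.x;
           OpenCL: the runtime computes the global index for you.  */
        size_t i = get_global_id(0);

        /* get_group_id(0), get_local_size(0) and get_local_id(0)
           correspond to blockIdx.x, blockDim.x and threadIdx.x.   */
        out[i] = in[i];
    }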

Writing Kernels: Synchronization

Table 4: Synchronization functions for use in Kernel Functions

C for CUDA                 OpenCL
__syncthreads()            barrier()
__threadfence()            mem_fence()

Table 4 shows functions provided for synchronization in kernel functions. __syncthreads() and barrier() both provide a mechanism for synchronizing all work-items in a work-group, where calling the function suspends work-item execution until all work-items in the work-group have called the barrier.

__threadfence() and mem_fence() provide a more subtle mechanism for sharing data between work-items. Essentially, they force various orderings on outstanding memory transactions, which can allow for more sophisticated sharing of data. For example, mem_fence() forces all outstanding loads and stores to be completed before execution proceeds, disallowing the compiler, runtime, and hardware from reordering any loads and stores through the mem_fence(). This can be used to ensure that all data produced in a work-group is flushed to global memory before proceeding to signal another work-group that execution has completed, which opens up some possibilities for work-groups to communicate without terminating a kernel.
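A sketch of that flush-then-signal pattern follows; as noted earlier, cross-work-group communication requires careful design, so treat this as illustrative rather than a recommended idiom:

    __kernel void produce(__global float *results,
                          __global volatile int *ready_flags)
    {
        size_t gid = get_global_id(0);
        results[gid] = (float)gid;          /* stand-in for real work */

        /* Wait for the whole group and order its global stores...   */
        barrier(CLK_GLOBAL_MEM_FENCE);

        /* ...then one work-item publishes a flag; mem_fence() keeps
           the flag store from being reordered before the data.      */
        if (get_local_id(0) == 0) {
            mem_fence(CLK_GLOBAL_MEM_FENCE);
            ready_flags[get_group_id(0)] = 1;
        }
    }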

Important API objects

Table 5: Selected API objects used in Host code

C for CUDA (driver API)    OpenCL
CUdevice                   cl_device_id
CUcontext                  cl_context
CUmodule                   cl_program
CUfunction                 cl_kernel
CUdeviceptr                cl_mem
CUstream                   cl_command_queue

Table 5 shows some objects provided by the respective APIs, which are used in host code to control execution on various devices, manage data, and so forth. Of note is the cl_command_queue, which provides OpenCL's task parallelism capabilities by allowing the developer to declare dependences between tasks executing on a device. CUDA does not provide such flexibility; the closest thing CUDA provides is its Stream mechanism, which allows kernels and memory transactions to be placed in independent streams. This is not as general as the task parallelism provided by OpenCL command queues: streams do not allow for parallelism within a queue, and synchronizing between streams is difficult, while command queues provide parallelism both within and between queues, as well as flexible synchronization capabilities through the use of OpenCL events.

Important API Calls

Table 6: Selected API calls used in Host code

C for CUDA (driver API)    OpenCL
cuDeviceGet()              clGetDeviceIDs()
cuCtxCreate()              clCreateContext()
cuModuleLoad()             clCreateProgramWithSource() / clCreateProgramWithBinary()
cuModuleGetFunction()      clCreateKernel()
cuMemAlloc()               clCreateBuffer()
cuMemcpyHtoD()             clEnqueueWriteBuffer()
cuMemcpyDtoH()             clEnqueueReadBuffer()
cuLaunchGrid()             clEnqueueNDRangeKernel()

Table 6 lists some important API calls used in host code to set up parallel computations and execute them, as well as manage data on compute devices. For the most part, these functions are fairly similar, although sometimes functionality is divided slightly differently, as shown in the table. The biggest difference is that OpenCL supports both runtime compilation and offline compilation of programs, whereas CUDA only allows programs to be compiled offline. To precompile OpenCL, developers can use the clGetProgramInfo() API call to retrieve a compiled binary and save it for later use, along with the clCreateProgramWithBinary() call to create an OpenCL program object from a compiled binary.
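A sketch of that precompilation flow for a single device follows (it assumes program has already been built for device in context ctx; file I/O and error handling are abbreviated):

    /* After clBuildProgram() succeeds, fetch the compiled binary... */
    size_t bin_size;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(bin_size), &bin_size, NULL);

    unsigned char *bin = (unsigned char *)malloc(bin_size);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     sizeof(bin), &bin, NULL);
    /* ...save bin to disk for later runs... */

    /* On a later run, recreate the program from the saved binary
       instead of recompiling from source.                          */
    cl_int status;
    cl_program cached = clCreateProgramWithBinary(ctx, 1, &device,
                                                  &bin_size,
                                                  (const unsigned char **)&bin,
                                                  &status, NULL);
    clBuildProgram(cached, 1, &device, NULL, NULL, NULL);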

Some things to keep in mind while porting from CUDA to OpenCL

Pointers in OpenCL kernels must be annotated with their memory space. For example, a pointer to local memory would be declared as __local int *p;

This applies to kernel arguments as well: data being passed to a kernel is usually in arrays represented by __global pointers.
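A representative (hypothetical) kernel showing both kinds of pointer arguments; note that for a __local argument the host passes only a size, never data:

    /* Arrays from the host arrive as __global pointers; per-group
       scratch space is a __local pointer whose size the host sets
       with clSetKernelArg(kernel, 2, local_bytes, NULL).           */
    __kernel void example(__global const float *in,
                          __global float *out,
                          __local  float *scratch)
    {
        scratch[get_local_id(0)] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);
        out[get_global_id(0)] = scratch[get_local_id(0)];
    }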

CUDA encourages the use of scalar code in kernels. While this works in OpenCL as well, depending on the desired target architecture it may be more efficient to write programs operating on OpenCL's vector types, such as float4, as opposed to pure scalar types. This is useful for both AMD CPUs and AMD GPUs, which can operate efficiently on vector types. OpenCL also provides flexible swizzle/broadcast primitives for efficient creation and rearrangement of vector types.
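For instance, four scalar operations collapse into one vector operation, and swizzle/broadcast syntax rearranges components without temporaries (an illustrative sketch):

    __kernel void vec4_demo(__global float4 *data)
    {
        size_t i = get_global_id(0);
        float4 v = data[i];

        /* One operation across four lanes at once. */
        v = v * (float4)(2.0f, 2.0f, 2.0f, 2.0f);

        /* Swizzle: reverse the component order...  */
        float4 r = v.wzyx;

        /* ...and broadcast one component to all four lanes. */
        float4 b = (float4)(v.x);

        data[i] = r + b;
    }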

CUDA does not provide rich facilities for task parallelism, and so it may be beneficial to think about how to take advantage of OpenCL's task parallelism as you port your application.

Conclusion

The ATI Stream v2.0 SDK, focused on OpenCL, provides a powerful, cross-platform way to unlock the performance of AMD GPUs, as well as multi-core CPUs. Programming in OpenCL enables one code base to serve the needs of today's diversity of parallel hardware architectures. OpenCL provides a flexible programming model that allows developers to utilize both data parallelism as well as task parallelism. It is also fairly straightforward to map NVIDIA C for CUDA data-parallel kernels to OpenCL, which can free code from proprietary, vendor-specific toolchains and provide retargetability to data-parallel applications. With OpenCL and the ATI Stream SDK, developers are able to unlock the potential of today's diverse and powerful parallel hardware.

Useful links

OpenCL Tutorial: http://developer.amd.com/gpu/ATIStreamSDK/pages/TutorialOpenCL.aspx

ATI Stream Beta 2.0: http://developer.amd.com/streambeta

OpenCL Specification: http://www.khronos.org/registry/cl/

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
