OpenCL™ and the ATI Stream SDK v2.0
» Overview
» History and Motivation
» What is OpenCL
» Porting CUDA to OpenCL
» Conclusion
» Useful links
Overview
AMD recently released the ATI Stream v2.0 SDK, which includes one of the first complete OpenCL runtimes, allowing software developers to implement data- and task-parallel computations using a single, unified programming model. The ATI Stream v2.0 SDK includes a CPU runtime for OpenCL, allowing OpenCL execution on x86 multi-core CPUs, as well as a GPU runtime providing OpenCL execution on AMD GPUs. This enables programmers to use all the cores in their system, both conventional processors and stream processors, with the same code base. This article will discuss the history and motivation for OpenCL and the ATI Stream SDK, give a small introduction to OpenCL programming, and describe how NVIDIA CUDA programs can be easily ported to OpenCL, enabling high performance, cross-platform parallel computing.
History and Motivation
The ongoing move towards parallel processors is now well underway, driven by the capabilities and limitations of modern semiconductor manufacturing. Parallel processing (see Figure 1) provides continued performance increases at reasonable power consumption. However, this move towards parallel processing comes at a cost: in order for future programs to take advantage of increases in processing power, they must be written in a scalable fashion. Writing scalable, parallel programs has been a difficult exercise for many years, and now it is becoming mandatory for all computationally-intensive programs.
Figure 1: Some current Parallel Processors
Complicating this move towards parallel processing is the diversity and heterogeneity of the various parallel architectures that are now available. Traditional CPUs have become multi-core parallel processors, with two to eight cores in a socket. GPUs, which have always been very parallel, boasting hundreds of parallel execution units on a single die, have now become increasingly programmable, to the point that it is now often useful to think of GPUs as many-core processors instead of special purpose accelerators. All of this diversity has been reflected in an array of incompatible tools and programming models required for programming these architectures, resulting in a lot of developer pain and high costs when trying to write cross-platform programs.
OpenCL is an answer to this problem. When using OpenCL, developers can use a single, unified toolchain and language to target all of the parallel processors currently in use. This is done by presenting the developer with an abstract platform model that conceptualizes all of these architectures in a similar way, as well as an execution model supporting data and task parallelism across heterogeneous architectures.
OpenCL is now the preferred programming environment for AMD GPUs, and forms the core of the ATI Stream SDK. Brook+, which is a higher-level streaming programming model targeting AMD GPUs, has now been released as an open source project (http://sourceforge.net/projects/brookplus/). The ATI Compute Abstraction Layer (CAL) lives on as part of the foundation of the OpenCL runtime, but is not intended to be the primary means for developers to access AMD GPUs.
What is OpenCL
OpenCL is a standard managed by the Khronos Group, which also manages OpenGL®, the cross-platform graphics API. OpenCL contributors include AMD, Apple, Intel and NVIDIA, among others from all corners of the computer industry. Similarly to OpenGL, OpenCL provides an API and a runtime. When an OpenCL program is executed, a series of API calls configure the system for execution, an embedded Just-In-Time (JIT) compiler compiles the OpenCL code, and the runtime asynchronously coordinates execution between parallel kernels. OpenCL’s strengths are its practicality, flexibility and retargetability.
OpenCL Platform Model
OpenCL sees today’s heterogeneous world through the lens of an abstract, hierarchical platform model. In this model, a host coordinates execution, transferring data to and from an array of Compute Devices. Each Compute Device is composed of an array of Compute Units, and each Compute Unit is composed of an array of Processing Elements. One of OpenCL’s strengths is that this model does not specify exactly what hardware constitutes a compute device. Thus, a compute device may be a GPU, such as the ATI Radeon™ HD 5870 GPU, or a CPU, such as the AMD Phenom™ II X4 processor, or other processors such as the Cell Broadband Engine. The OpenCL platform model (see Figure 2) is designed to present a uniform view of many different kinds of parallel processors.
Figure 2: OpenCL Platform Model
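To make the platform model concrete, a minimal host program along the following lines enumerates the compute devices a platform exposes and reports how many compute units each has. This is a sketch with error handling omitted for brevity; it takes the first available platform, whatever that happens to be on the system.

    /* Enumerate OpenCL devices and their compute units (error checks omitted). */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platform;
        cl_device_id devices[8];
        cl_uint num_devices;

        /* Take the first available platform. */
        clGetPlatformIDs(1, &platform, NULL);

        /* Ask for all compute devices: CPUs and GPUs alike. */
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);

        for (cl_uint i = 0; i < num_devices; ++i) {
            char name[256];
            cl_uint units;
            clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                            sizeof(units), &units, NULL);
            printf("Device %u: %s (%u compute units)\n", i, name, units);
        }
        return 0;
    }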
OpenCL Execution Model
OpenCL has a flexible execution model that incorporates both task and data parallelism. Data movements between the host and compute devices, as well as OpenCL tasks, are coordinated via command queues. Command queues provide a general way of specifying relationships between tasks, ensuring that tasks are executed in an order that satisfies the natural dependences in the computation. The OpenCL runtime is free to execute tasks in parallel if their dependencies are satisfied, which provides a general-purpose task-parallel execution model. Tasks themselves can be comprised of data-parallel kernels, which apply a single function over a range of data elements, in parallel, allowing only restricted synchronization and communication during the execution of a kernel. These concepts will be further explained in this section.
Kernels
As mentioned, OpenCL kernels provide data parallelism. The kernel execution model is based on a hierarchical abstraction of the computation being performed. OpenCL kernels are executed over an index space, which can be 1, 2 or 3 dimensional. In Figure 3, we see an example of a 2-dimensional index space, which has Gx * Gy elements. For every element of the kernel index space, a work-item will be executed. All work-items execute the same program, although their execution may differ due to branching based on data characteristics or the index assigned to each work-item.
Figure 3: Executing Kernels – Work-Groups and Work-Items
The index space is regularly subdivided into work-groups, which are tilings of the entire index space. In Figure 3, we see a work-group of size Sx * Sy elements. Each work-item in the work-group receives a work-group id, labeled (wx, wy) in the figure, as well as a local id, labeled (sx, sy) in the figure. Each work-item also receives a global id, which can be derived from its work-group and local ids.
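A one-dimensional kernel sketch makes these relationships concrete: each work-item can query its global, work-group, and local ids, and the global id is derivable as work-group id times work-group size plus local id (assuming a zero global offset).

    /* Sketch: each work-item derives its global id from its work-group
       and local ids; the result must agree with get_global_id(). */
    __kernel void show_ids(__global int *out) {
        size_t gx = get_global_id(0);     /* global id         */
        size_t wx = get_group_id(0);      /* work-group id     */
        size_t sx = get_local_id(0);      /* local id          */
        size_t Sx = get_local_size(0);    /* work-group size   */
        out[gx] = (int)(wx * Sx + sx);    /* equals gx         */
    }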
The work-items may only communicate and synchronize locally, within a work-group, via a barrier mechanism. This provides scalability, traditionally the bane of parallel programming. Because communication and synchronization at the finest granularity is restricted in scope, the OpenCL runtime has great freedom in how work-items are scheduled and executed.
Command Queues
The division of a kernel into work-items and work-groups supports data parallelism, but OpenCL supports another kind of parallelism as well, called task parallelism. This is done via OpenCL command queues (see Figure 4). An OpenCL command queue is created by the developer through an API call, and associated with a specific compute device. If a developer wishes to target multiple OpenCL compute devices simultaneously, she should create multiple command queues.
Figure 4: Task Parallelism within a Command Queue
To execute a kernel, the kernel is pushed onto a particular command queue. Enqueueing a kernel is done asynchronously, so that the host program may enqueue many different kernels without waiting for any of them to complete. When enqueueing a kernel, the developer optionally specifies a list of events that must occur before the kernel executes. Events are generated by kernel completion, as well as memory read, write, and copy commands. This allows the developer to specify a dependence graph between kernel executions and memory transfers in a particular command queue or between command queues themselves, which the OpenCL runtime will traverse during execution. Figure 4 shows a task graph illustrating the power of this approach, where arrows indicate dependencies between tasks. For example, Kernel A will not execute until Write A and Write B have finished, and Kernel D will not execute until Kernel B and Kernel C have finished.
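The Figure 4 graph might be expressed in host code roughly as in the sketch below. It is not a complete program: it assumes a queue created with the out-of-order execution property (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) so that independent commands may overlap, and it assumes the buffers (bufA, bufB), kernels (kernelA through kernelD), sizes, and host pointers already exist.

    cl_event writeA, writeB, evA, evB, evC, evBC[2];

    /* Write A and Write B return immediately; the host does not block. */
    clEnqueueWriteBuffer(queue, bufA, CL_FALSE, 0, bytes, hostA, 0, NULL, &writeA);
    clEnqueueWriteBuffer(queue, bufB, CL_FALSE, 0, bytes, hostB, 0, NULL, &writeB);

    /* Kernel A waits on both writes. */
    cl_event writes[2] = { writeA, writeB };
    clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &global, &local, 2, writes, &evA);

    /* Kernels B and C depend only on Kernel A and may run in parallel. */
    clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &global, &local, 1, &evA, &evB);
    clEnqueueNDRangeKernel(queue, kernelC, 1, NULL, &global, &local, 1, &evA, &evC);

    /* Kernel D waits on Kernels B and C. */
    evBC[0] = evB; evBC[1] = evC;
    clEnqueueNDRangeKernel(queue, kernelD, 1, NULL, &global, &local, 2, evBC, NULL);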
The ability to construct arbitrary task graphs is a powerful way of constructing task-parallel applications. The OpenCL runtime has the freedom to execute the task graph in parallel, as long as it respects the dependencies encoded in the task graph. Task graphs are general enough to represent the kinds of parallelism useful across the spectrum of hardware architectures, from CPUs to GPUs.
Developers are also free to construct multiple command queues, either for parallelizing an application across multiple compute devices, or for expressing more parallelism via completely independent streams of computation. OpenCL’s ability to use both data and task parallelism simultaneously is a great benefit to parallel application developers, regardless of their intended hardware target.
Synchronization
Besides the task-parallel constructs provided in OpenCL which allow synchronization and communication between kernels, OpenCL supports local barrier synchronizations within a work-group. This mechanism allows work-items to coordinate and share data in the local memory space using only very lightweight and efficient barriers. Work-items in different work-groups should never try to synchronize or share data, since the runtime provides no guarantee that all work-items are concurrently executing, and such synchronization easily introduces deadlocks.
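A work-group-local reduction is a common use of this mechanism; a minimal sketch follows, assuming the work-group size is a power of two and that `scratch` is a __local buffer sized to the work-group.

    /* Sum the elements handled by one work-group, using __local scratch
       and barrier() to synchronize between rounds. */
    __kernel void group_sum(__global const float *in, __global float *out,
                            __local float *scratch) {
        size_t lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);            /* all loads into scratch done */

        for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);        /* wait before the next round  */
        }
        if (lid == 0)
            out[get_group_id(0)] = scratch[0];   /* one result per work-group   */
    }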
Work-items in different work-groups may coordinate execution through the use of atomic memory transactions, which are an OpenCL extension supported by some OpenCL runtimes, such as the ATI Stream SDK OpenCL runtime for the x86 multi-core compute devices. For example, work-items may append variable numbers of results to a shared queue in global memory. However, it is good practice that work-items do not, generally, attempt to communicate directly, as without careful design scalability and deadlock can become difficult problems. The hierarchy of synchronization and communication provided by OpenCL is a good fit for many of today’s parallel architectures, while still providing developers the ability to write efficient code, even for parallel computations with non-trivial synchronization and communication patterns.
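A sketch of the shared-queue idea might look like the following, assuming the cl_khr_global_int32_base_atomics extension is available on the target runtime; the host must zero `outCount` before launch.

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

    /* Each work-item appends zero or one result to a shared global queue. */
    __kernel void filter(__global const int *in,
                         __global int *outQueue,
                         __global int *outCount) {
        int v = in[get_global_id(0)];
        if (v > 0) {
            int slot = atom_inc(outCount);   /* atomically claim a queue slot */
            outQueue[slot] = v;
        }
    }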
OpenCL Memory Model
OpenCL has a relaxed consistency memory model, structured as shown in Figure 5.
Figure 5: OpenCL Memory Model
Each compute device has a global memory space, which is the largest memory space available to the device, and typically resides in off-chip DRAM. There is also a read-only, limited-size constant memory space, which allows for efficient reuse of read-only parameters in a computation. Each compute unit on the device has a local memory, which is typically on the processor die, and therefore has much higher bandwidth and lower latency than global memory. Local memory can be read and written by any work-item in a work-group, and thus allows for local communication between work-items in a work-group. Additionally, attached to each processing element is a private memory, which is typically not used directly by programmers, but is used to hold data for each work-item that does not fit in the processing element’s registers.
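A short kernel can show all four address spaces side by side; the names here are purely illustrative.

    /* Illustrative kernel touching all four OpenCL address spaces. */
    __kernel void spaces(__global float *data,     /* off-chip DRAM, read/write */
                         __constant float *coeff,  /* limited-size, read-only   */
                         __local float *tile) {    /* on-die, per work-group    */
        float acc;                                 /* __private by default      */
        size_t lid = get_local_id(0);
        tile[lid] = data[get_global_id(0)];        /* stage through local       */
        barrier(CLK_LOCAL_MEM_FENCE);
        acc = tile[lid] * coeff[0];
        data[get_global_id(0)] = acc;
    }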
As OpenCL has a relaxed consistency model, different work-items may see a different view of global memory as the computation progresses. Within a work-item, reads and writes to all memory spaces are consistently ordered, but between work-items, synchronization is necessary in order to ensure consistency. This relaxed consistency model is an important part of OpenCL’s efforts to provide parallel scalability: parallel programs that rely on strong memory consistency for synchronization and communication usually fail to execute in parallel, because memory ordering requirements force a serialization of such programs during execution, hindering scalability. Requiring explicit synchronization and communication between work-items encourages programmers to write scalable code, avoiding the trap often seen in parallel programming where code looks parallel, but ends up executing in serial due to frequent and implicit synchronization induced by reliance on a strict memory ordering model.
Additionally, OpenCL views the global memory space of each compute device as private and separate from host memory. Moving data between compute devices and the host requires the programmer to manually manage communication between the host and the compute devices. This is done through the use of explicit memory reads and writes between devices.
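In host code this typically looks like the following sketch, assuming a valid context and command queue, an element count n, and host arrays host_in and host_out.

    cl_int err;
    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, &err);

    /* Host -> device: blocking write of the input array. */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                         host_in, 0, NULL, NULL);

    /* ... enqueue kernels that read and write buf ... */

    /* Device -> host: blocking read of the results. */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                        host_out, 0, NULL, NULL);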
Porting CUDA to OpenCL
The data-parallel programming model in OpenCL shares some commonalities with NVIDIA’s C for CUDA programming model, making it relatively straightforward to convert programs from CUDA to OpenCL.
Terminology
Table 1: General Terminology
Table 1 lists some general terminology for describing computations and memory spaces in both C for CUDA and OpenCL. These terms are fairly similar between the two systems: for example, a CUDA thread corresponds to an OpenCL work-item and a CUDA thread block to an OpenCL work-group, while CUDA shared memory corresponds to OpenCL local memory and CUDA local memory to OpenCL private memory.
Writing Kernels: Qualifiers
Table 2: Qualifiers for Kernel Functions
Table 2 shows qualifiers that are added to functions and data when writing kernels in both CUDA and OpenCL. The biggest difference between the two is that in CUDA, __global__ functions are GPU entry points, and __device__ functions are to be executed on the GPU, but are not callable from the host. In OpenCL, entry point functions are annotated with the __kernel qualifier, but non-entry point functions do not need to be annotated.
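A hedged side-by-side sketch, with function names invented for illustration:

    /* CUDA: __global__ marks a GPU entry point; __device__ marks a
       GPU-only helper that the host cannot call. */
    __device__ float cuda_helper(float x) { return x * x; }
    __global__ void cuda_entry(float *out) { out[0] = cuda_helper(2.0f); }

    /* OpenCL: only the entry point needs a qualifier; helper functions
       are written as plain C functions. */
    float cl_helper(float x) { return x * x; }
    __kernel void cl_entry(__global float *out) { out[0] = cl_helper(2.0f); }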
Writing Kernels: Indexing
Table 3: Indexing functions for use in Kernels
Table 3 shows the various indexing mechanisms provided by CUDA and OpenCL. CUDA provides kernel indexing via special pre-defined variables, while OpenCL provides the equivalent information through function calls. OpenCL also provides global indexing information, while CUDA requires manual computation of global indices.
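For instance, the two kernel-body fragments below compute the same one-dimensional index (assuming a zero global work offset on the OpenCL side):

    /* CUDA (inside a kernel): the global index is computed by hand
       from the built-in indexing variables. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    /* OpenCL (inside a kernel): the same indices come from function
       calls, and the global index is also available directly. */
    size_t i_manual = get_group_id(0) * get_local_size(0) + get_local_id(0);
    size_t i_direct = get_global_id(0);   /* equal to i_manual */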
Writing Kernels: Synchronization
Table 4: Synchronization functions for use in Kernel Functions
Table 4 shows functions provided for synchronization in kernel functions. __syncthreads() and barrier() both provide a mechanism for synchronizing all work-items in a work-group, where calling the function suspends work-item execution until all work-items in the work-group have called the barrier.
__threadfence() and mem_fence() provide a more subtle mechanism for sharing data between work-items. Essentially, they force various orderings on outstanding memory transactions, which can allow for more sophisticated sharing of data. For example, mem_fence() forces all outstanding loads and stores to be completed before execution proceeds, disallowing the compiler, runtime, and hardware from reordering any loads and stores through the mem_fence(). This can be used to ensure that all data produced in a work-group is flushed to global memory before proceeding to signal another work-group that execution has completed, which opens up some possibilities for work-groups to communicate without terminating a kernel.
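A cautious sketch of that pattern follows; cautious because, as noted earlier, inter-work-group communication requires careful design. Each work-item publishes its result, fences, and then bumps a global counter that a consumer can poll; the global int32 atomics extension is assumed.

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

    __kernel void produce(__global float *out, __global int *done) {
        out[get_global_id(0)] = 1.0f;      /* publish this work-item's result   */
        mem_fence(CLK_GLOBAL_MEM_FENCE);   /* the store above completes before  */
        atom_inc(done);                    /* the counter that others may poll  */
    }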
Important API objects
Table 5: Selected API objects used in Host code
Table 5 shows some objects provided by the respective APIs, which are used in host code to control execution on various devices, manage data, and so forth. Of note is the cl_command_queue, which provides OpenCL’s task parallelism capabilities by allowing the developer to declare dependences between tasks executing on a device. CUDA does not provide such flexibility; the closest thing CUDA provides is its Stream mechanism, which allows kernels and memory transactions to be placed in independent streams. This is not as general as OpenCL’s task parallelism, because it does not allow for parallelism within a queue, and synchronizing between streams is difficult. Command queues, in contrast, provide parallelism both within and between queues, as well as flexible synchronization through the use of OpenCL events.
Important API Calls
Table 6: Selected API calls used in Host code
Table 6 lists some important API calls used in host code to set up parallel computations and execute them, as well as manage data on compute devices. For the most part, these functions are fairly similar, although sometimes functionality is divided slightly differently, as shown in the table. The biggest difference is that OpenCL has both a runtime compilation model as well as allowing programs to be compiled offline, whereas CUDA only allows programs to be compiled offline. To precompile OpenCL, developers can use the clGetProgramInfo() API call to retrieve a compiled binary and save it for later use, along with the clCreateProgramWithBinary() call to create an OpenCL program object from a compiled binary.
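A hedged sketch of that flow, assuming a built cl_program named program, plus the context and device it was built for (single-device case; <stdlib.h> is needed for malloc):

    size_t binSize;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(binSize), &binSize, NULL);

    unsigned char *binary = (unsigned char *)malloc(binSize);
    unsigned char *binaries[1] = { binary };
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     sizeof(binaries), binaries, NULL);
    /* ... save `binary` to disk; in a later run, load it back ... */

    cl_int binStatus, err;
    cl_program precompiled = clCreateProgramWithBinary(
        context, 1, &device, &binSize,
        (const unsigned char **)binaries, &binStatus, &err);
    clBuildProgram(precompiled, 1, &device, NULL, NULL, NULL);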
Some things to keep in mind while porting from CUDA to OpenCL
Pointers in OpenCL kernels must be annotated with their memory space. For example, a pointer to local memory would be declared as __local int *p;
This applies to kernel arguments as well: data being passed to a kernel is usually in arrays represented by __global pointers.
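On the host side, __global arguments are passed as cl_mem buffer objects, while __local arguments are passed as a size with a NULL pointer. A brief sketch, assuming a kernel handle and a buffer buf already exist:

    /* Argument 0: a __global float* backed by a cl_mem buffer object. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    /* Argument 1: a __local float* of 256 floats; there is no host
       pointer, so only the size is passed. */
    clSetKernelArg(kernel, 1, 256 * sizeof(float), NULL);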
CUDA encourages the use of scalar code in kernels. While this works in OpenCL as well, depending on the desired target architecture, it may be more efficient to write programs operating on OpenCL’s vector types, such as float4, as opposed to pure scalar types. This is useful for both AMD CPUs and AMD GPUs, which can operate efficiently on vector types. OpenCL also provides flexible swizzle/broadcast primitives for efficient creation and rearrangement of vector types.
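A brief, illustrative sketch of vector code with a swizzle and a broadcast:

    /* Each work-item processes four elements at once via float4. */
    __kernel void scale4(__global float4 *a, __global const float4 *b) {
        size_t i = get_global_id(0);
        a[i] = a[i] * 2.0f + b[i];       /* four multiply-adds per work-item */

        float4 v   = b[i];
        float4 rev = v.wzyx;             /* swizzle: lanes reversed          */
        float4 bc  = (float4)(v.x);      /* broadcast one lane to all four   */
        a[i] += rev + bc;
    }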
CUDA does not provide rich facilities for task parallelism, and so it may be beneficial to think about how to take advantage of OpenCL’s task parallelism as you port your application.
Conclusion
The ATI Stream v2.0 SDK, focused on OpenCL, provides a powerful, cross-platform way to unlock the performance of AMD GPUs, as well as multi-core CPUs. Programming in OpenCL enables one code base to serve the needs of today’s diversity of parallel hardware architectures. OpenCL provides a flexible programming model that allows developers to utilize both data parallelism as well as task parallelism. It is also fairly straightforward to map NVIDIA C for CUDA data-parallel kernels to OpenCL, which can free code from proprietary, vendor-specific toolchains and provide retargetability to data-parallel applications. With OpenCL and the ATI Stream SDK, developers are able to unlock the potential of today’s diverse and powerful parallel hardware.
Useful links
OpenCL Tutorial: http://developer.amd.com/gpu/ATIStreamSDK/pages/TutorialOpenCL.aspx
ATI Stream Beta 2.0: http://developer.amd.com/streambeta
OpenCL Specification: http://www.khronos.org/registry/cl/
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.