August 2012

CUDA DYNAMIC PARALLELISM
PROGRAMMING GUIDE
TABLE OF CONTENTS

Introduction
  Overview
  Glossary
Execution Environment and Memory Model
  Execution Environment
    Parent and Child Grids
    Scope of CUDA Primitives
    Synchronization
    Streams & Events
    Ordering and Concurrency
    Device Management
  Memory Model
    Coherence and Consistency
Programming Interface
  CUDA C/C++ Reference
    Device-Side Kernel Launch
    Streams
    Events
    Synchronization
    Device Management
    Memory Declarations
    API Errors and Launch Failures
    API Reference
  Device-Side Launch From PTX
    Kernel Launch APIs
    Parameter Buffer Layout
  Toolkit Support for Dynamic Parallelism
    Including device runtime API in CUDA code
    Compiling and Linking
Programming Guidelines
  Basics
  Performance
    Synchronization
    Dynamic-parallelism-enabled Kernel Overhead
  Implementation Restrictions & Limitations
    Runtime
INTRODUCTION
This document provides guidance on how to design and develop software that takes advantage of the new Dynamic Parallelism capabilities introduced with CUDA 5.0.
OVERVIEW
Dynamic Parallelism is an extension to the CUDA programming model enabling a CUDA kernel to create and synchronize with new work directly on the GPU. The ability to create parallelism dynamically, at whichever point in a program it is needed, offers exciting new capabilities.
The ability to create work directly from the GPU can reduce the need to transfer execution control and data between host and device, as launch configuration decisions can now be made at runtime by threads executing on the device. Additionally, data-dependent parallel work can be generated inline within a kernel at run-time, taking advantage of the GPU's hardware schedulers and load balancers dynamically and adapting in response to data-driven decisions or workloads. Algorithms and programming patterns that had previously required modifications to eliminate recursion, irregular loop structure, or other constructs that do not fit a flat, single level of parallelism may be expressed more transparently.
This document describes the extended capabilities of CUDA which enable Dynamic Parallelism, including the modifications and additions to the CUDA programming model necessary to take advantage of these, as well as guidelines and best practices for exploiting this added capability.
Dynamic Parallelism is only supported by devices of compute capability 3.5 and higher.
GLOSSARY
Definitions for terms used in this guide.

Grid
A Grid is a collection of Threads. Threads in a Grid execute a Kernel Function and are divided into Thread Blocks.

Thread Block
A Thread Block is a group of threads which execute on the same multiprocessor (SMX). Threads within a Thread Block have access to shared memory and can be explicitly synchronized.

Kernel Function
A Kernel Function is an implicitly parallel subroutine that executes under the CUDA execution and memory model for every Thread in a Grid.

Host
The Host refers to the execution environment that initially invoked CUDA, typically the thread running on a system's CPU processor.

Parent
A Parent Thread, Thread Block, or Grid is one that has launched new grid(s): the Child Grid(s). The Parent is not considered complete until all of its launched Child Grids have also completed.

Child
A Child thread, block, or grid is one that has been launched by a Parent grid. A Child grid must complete before the Parent Thread, Thread Block, or Grid is considered complete.

Thread Block Scope
Objects with Thread Block Scope have the lifetime of a single Thread Block. They only have defined behavior when operated on by Threads in the Thread Block that created the object, and are destroyed when the Thread Block that created them is complete.

Device Runtime
The Device Runtime refers to the runtime system and APIs available to enable Kernel Functions to use Dynamic Parallelism.
EXECUTION ENVIRONMENT AND MEMORY MODEL
EXECUTION ENVIRONMENT
The CUDA execution model is based on primitives of threads, thread blocks, and grids, with kernel functions defining the program executed by individual threads within a thread block and grid. When a kernel function is invoked, the grid's properties are described by an execution configuration, which has a special syntax in CUDA. Support for dynamic parallelism in CUDA extends the ability to configure, launch, and synchronize upon new grids to threads that are running on the device.
Parent and Child Grids
A device thread that configures and launches a new grid belongs to the parent grid, and the grid created by the invocation is a child grid.

The invocation and completion of child grids is properly nested, meaning that the parent grid is not considered complete until all child grids created by its threads have completed. Even if the invoking threads do not explicitly synchronize on the child grids launched, the runtime guarantees an implicit synchronization between the parent and child.
Scope of CUDA Primitives
On both host and device, the CUDA runtime offers an API for launching kernels, for waiting for launched work to complete, and for tracking dependencies between launches via streams and events. On the host system, the state of launches and the CUDA primitives referencing streams and events are shared by all threads within a process; however, processes execute independently and may not share CUDA objects.

A similar hierarchy exists on the device: launched kernels and CUDA objects are visible to all threads in a thread block, but are independent between thread blocks. This means, for example, that a stream may be created by one thread and used by any other thread in the same thread block, but may not be shared with threads in any other thread block.
Synchronization
CUDA runtime operations from any thread, including kernel launches, are visible across a thread block. This means that an invoking thread in the parent grid may perform synchronization on the grids launched by that thread, by other threads in the thread block, or on streams created within the same thread block. Execution of a thread block is not considered complete until all launches by all threads in the block have completed. If all threads in a block exit before all child launches have completed, a synchronization operation will automatically be triggered.
Streams & Events
CUDA Streams and Events allow control over dependencies between grid launches: grids launched into the same stream execute in-order, and events may be used to create dependencies between streams. Streams and events created on the device serve this exact same purpose.

Streams and events created within a grid exist within thread block scope, but have undefined behavior when used outside of the thread block where they were created. As described above, all work launched by a thread block is implicitly synchronized when the block exits; work launched into streams is included in this, with all dependencies resolved appropriately. The behavior of operations on a stream that has been modified outside of thread block scope is undefined.

Streams and events created on the host have undefined behavior when used within any kernel, just as streams and events created by a parent grid have undefined behavior if used within a child grid.
Ordering and Concurrency
The ordering of kernel launches from the device runtime follows CUDA Stream ordering semantics. Within a thread block, all kernel launches into the same stream are executed in-order. With multiple threads in the same thread block launching into the same stream, the ordering within the stream is dependent on the thread scheduling within the block, which may be controlled with synchronization primitives such as __syncthreads().

Note that because streams are shared by all threads within a thread block, the implicit 'NULL' stream is also shared. If multiple threads in a thread block launch into the implicit stream, then these launches will be executed in-order. If concurrency is desired, explicit named streams should be used.
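A minimal sketch of this pattern, in which each thread gives its own launches an independent stream (the child kernel child_work is a hypothetical name, not part of any CUDA API):

```cuda
__global__ void child_work(int *data);  // hypothetical child kernel

__global__ void parent(int *data) {
    // Each thread creates a private named stream; device-side streams
    // must be created with the cudaStreamNonBlocking flag.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // Launches from different threads into different streams may run
    // concurrently, though concurrency is never guaranteed.
    child_work<<<1, 32, 0, s>>>(data + threadIdx.x * 32);

    // cudaStreamSynchronize() is unavailable on the device; synchronize
    // on all child work launched by this block instead.
    cudaDeviceSynchronize();
    cudaStreamDestroy(s);
}
```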
Dynamic Parallelism enables concurrency to be expressed more easily within a program; however, the device runtime introduces no new concurrency guarantees within the CUDA execution model. There is no guarantee of concurrent execution between any number of different thread blocks on a device.

The lack of a concurrency guarantee extends to parent thread blocks and their child grids. When a parent thread block launches a child grid, the child is not guaranteed to begin execution until the parent thread block reaches an explicit synchronization point (e.g. cudaDeviceSynchronize()).
While concurrency will often easily be achieved, it may vary as a function of device configuration, application workload, and runtime scheduling. It is therefore unsafe to depend upon any concurrency between different thread blocks.
Device Management
There is no multi-GPU support from the device runtime; the device runtime is only capable of operating on the device upon which it is currently executing. It is permitted, however, to query properties for any CUDA-capable device in the system.
MEMORY MODEL
Parent and child grids share the same global and constant memory storage, but have distinct local and shared memory.
Coherence and Consistency
Global Memory
Parent and child grids have coherent access to global memory, with weak consistency guarantees between child and parent. There are two points in the execution of a child grid when its view of memory is fully consistent with the parent thread: when the child grid is invoked by the parent, and when the child grid completes as signaled by a synchronization API invocation in the parent thread.

All global memory operations in the parent thread prior to the child grid's invocation are visible to the child grid. All memory operations of the child grid are visible to the parent after the parent has synchronized on the child grid's completion.
In the following example, the child grid executing child_launch is only guaranteed to see the modifications to data made before the child grid was launched. Since thread 0 of the parent is performing the launch, the child will be consistent with the memory seen by thread 0 of the parent. Due to the first __syncthreads() call, the child will see data[0]=0, data[1]=1, ..., data[255]=255 (without the __syncthreads() call, only data[0] would be guaranteed to be seen by the child). When the child grid returns, thread 0 is guaranteed to see modifications made by the threads in its child grid. Those
modifications become available to the other threads of the parent grid only after the second __syncthreads() call:
__global__ void child_launch(int *data) {
    data[threadIdx.x] = data[threadIdx.x] + 1;
}

__global__ void parent_launch(int *data) {
    data[threadIdx.x] = threadIdx.x;

    __syncthreads();

    if (threadIdx.x == 0) {
        child_launch<<<1, 256>>>(data);
        cudaDeviceSynchronize();
    }

    __syncthreads();
}

void host_launch(int *data) {
    parent_launch<<<1, 256>>>(data);
}
Zero-Copy Memory
Zero-copy system memory has identical coherence and consistency guarantees to global memory, and follows the semantics detailed above. A kernel may not allocate or free zero-copy memory, but may use pointers to zero-copy memory passed in from the host program.
Constant Memory
Constants are immutable and may not be modified from the device, even between parent and child launches. That is to say, the value of all __constant__ variables must be set from the host prior to launch. Constant memory is inherited automatically by all child kernels from their respective parents.

Taking the address of a constant memory object from within a kernel thread has the same semantics as for all CUDA programs, and passing that pointer from parent to child or from a child to parent is naturally supported.
Shared and Local Memory
Shared and Local memory is private to a thread block or thread, respectively, and is not visible or coherent between parent and child. Behavior is undefined when an object in
one of these locations is referenced outside of the scope within which it belongs, and may cause an error.
The NVIDIA compiler will attempt to warn if it can detect that a pointer to local or shared memory is being passed as an argument to a kernel launch. At runtime, the programmer may use the __isGlobal() intrinsic to determine whether a pointer references global memory and so may safely be passed to a child launch.
Note that calls to cudaMemcpyAsync() or cudaMemsetAsync() may invoke new child kernels on the device in order to preserve stream semantics. As such, passing shared or local memory pointers to these APIs is illegal and will return an error.
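For instance, a copy issued from a kernel must use global memory on both sides; a minimal sketch (the buffer names are illustrative):

```cuda
__device__ int src[256];
__device__ int dst[256];

__global__ void copier() {
    if (threadIdx.x == 0) {
        // Legal: both pointers reference global memory, and only the
        // async, device-to-device form of memcpy is supported.
        cudaMemcpyAsync(dst, src, sizeof(src), cudaMemcpyDeviceToDevice);

        // Passing a __shared__ or local pointer here instead would be
        // rejected with an error rather than performing the copy.

        cudaDeviceSynchronize();  // waits for the implicit child kernel
    }
}
```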
Local Memory
Local memory is private storage for an executing thread, and is not visible outside of that thread. It is illegal to pass a pointer to local memory as a launch argument when launching a child kernel. The result of dereferencing such a local memory pointer from a child will be undefined.
For example the following is illegal, with undefined behavior if x_array is accessed by child_launch:

int x_array[10];       // Creates x_array in parent's local memory
child_launch<<<1, 1>>>(x_array);
It is sometimes difficult for a programmer to be aware of when a variable is placed into local memory by the compiler. As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap, either with cudaMalloc(), new() or by declaring __device__ storage at global scope.
For example:
// Correct - "value" is global storage
__device__ int value;
__device__ void x() {
    value = 5;
    child<<<1, 1>>>(&value);
}

// Invalid - "value" is local storage
__device__ void y() {
    int value = 5;
    child<<<1, 1>>>(&value);
}
Texture Memory
Writes to the global memory region over which a texture is mapped are incoherent with respect to texture accesses. Coherence for texture memory is enforced at the invocation of a child grid and when a child grid completes. This means that writes to memory prior to a child kernel launch are reflected in texture memory accesses of the child. Similarly,
writes to memory by a child will be reflected in the texture memory accesses by a parent, but only after the parent synchronizes on the child's completion. Concurrent accesses by parent and child may result in inconsistent data.
PROGRAMMING INTERFACE

CUDA C/C++ REFERENCE
This section describes changes and additions to the CUDA C/C++ language extensions for supporting Dynamic Parallelism.
The language interface and API available to CUDA kernels using CUDA C/C++ for Dynamic Parallelism, referred to as the Device Runtime, is substantially like that of the CUDA Runtime API available on the host. Where possible the syntax and semantics of the CUDA Runtime API have been retained in order to facilitate ease of code reuse for routines that may run in either the host or device environments.
As with all code in CUDA C/C++, the APIs and code outlined here are per-thread code. This enables each thread to make unique, dynamic decisions regarding what kernel or operation to execute next. There are no synchronization requirements between threads within a block to execute any of the provided device runtime APIs, which enables the device runtime API functions to be called in arbitrarily divergent kernel code without deadlock.
Device-Side Kernel Launch
Kernels may be launched from the device using the standard CUDA <<<>>> syntax:

kernel_name<<< Dg, Db, Ns, S >>>([kernel arguments]);

- Dg is of type dim3 and specifies the dimensions and size of the grid
- Db is of type dim3 and specifies the dimensions and size of each thread block
- Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per thread block for this call in addition to the statically allocated memory. Ns is an optional argument that defaults to 0.
- S is of type cudaStream_t and specifies the stream associated with this call. The stream must have been allocated in the same thread block where the call is being made. S is an optional argument that defaults to 0.
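Putting the four configuration arguments together in one launch (the child kernel and sizes are illustrative):

```cuda
__global__ void child(float *data);  // hypothetical child kernel

__global__ void parent(float *data) {
    if (threadIdx.x == 0) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

        dim3 grid(4);                        // Dg: 4 blocks
        dim3 block(128);                     // Db: 128 threads per block
        size_t smem = 128 * sizeof(float);   // Ns: dynamic shared memory

        child<<<grid, block, smem, s>>>(data);  // S: stream for this call

        cudaDeviceSynchronize();
        cudaStreamDestroy(s);
    }
}
```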
Launches are Asynchronous
Identical to host-side launches, all device-side kernel launches are asynchronous with respect to the launching thread. That is to say, the <<<>>> launch command will return immediately and the launching thread will continue to execute until it hits an explicit launch-synchronization point such as cudaDeviceSynchronize(). The grid launch is posted to the device and will execute independently of the parent thread. The child grid may begin execution at any time after launch, but is not guaranteed to begin execution until the launching thread reaches an explicit launch-synchronization point.
Launch Environment Configuration
All global device configuration settings (e.g. shared memory & L1 cache size as returned from cudaDeviceGetCacheConfig(), and device limits returned from cudaDeviceGetLimit()) will be inherited from the parent. That is to say if, when the parent is launched, execution is configured globally for 16k of shared memory and 48k of L1 cache, then the child's execution state will be configured identically. Likewise, device limits such as stack size will remain as-configured.
For host-launched kernels, per-kernel configurations set from the host will take precedence over the global setting. These configurations will be used when the kernel is launched from the device as well. It is not possible to reconfigure a kernel's environment from the device.
Launch From __host__ __device__ Functions

Although the device runtime enables kernel launches from either the host or device, kernel launches from __host__ __device__ functions are unsupported. The compiler will fail to compile if a __host__ __device__ function is used to launch a kernel.
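A minimal illustration of the rule (function names hypothetical): the launch is legal in a __device__ or __global__ function, but not in a __host__ __device__ one.

```cuda
__global__ void child(void);  // hypothetical child kernel

__device__ void ok(void) {
    child<<<1, 1>>>();        // legal: __device__-only function
}

__host__ __device__ void not_ok(void) {
    // child<<<1, 1>>>();     // illegal: a launch here fails to compile,
                              // because this function is also compiled
                              // for the host
}
```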
Streams
Both named and unnamed (NULL) streams are available from the device runtime. Named streams may be used by any thread within a thread-block, but stream handles may not be passed to other blocks or child/parent kernels. In other words, a stream should be treated as private to the block in which it is created. Stream handles are not guaranteed to be unique between blocks, so using a stream handle within a block that did not allocate it will result in undefined behavior.
Similar to host-side launch, work launched into separate streams may run concurrently, but actual concurrency is not guaranteed. Programs that depend upon concurrency between child kernels are not supported by the CUDA programming model and will have undefined behavior.
The host-side NULL stream's cross-stream barrier semantic is not supported on the device (see below for details). In order to retain semantic compatibility with the host runtime, all device streams must be created using the cudaStreamCreateWithFlags() API, passing the cudaStreamNonBlocking flag. The cudaStreamCreate() call is a host-runtime-only API and will fail to compile for the device.
As cudaStreamSynchronize() and cudaStreamQuery() are unsupported by the device runtime, cudaDeviceSynchronize() should be used instead when the application needs to know that stream-launched child kernels have completed.
The Implicit (NULL) Stream
Within a host program, the unnamed (NULL) stream has additional barrier synchronization semantics with other streams (see the CUDA Programming Guide for details). The device runtime offers a single implicit, unnamed stream shared between all threads in a block, but as all named streams must be created with the cudaStreamNonBlocking flag, work launched into the NULL stream will not insert an implicit dependency on pending work in any other streams.
Events
Only the inter-stream synchronization capabilities of CUDA events are supported. This means that cudaStreamWaitEvent() is supported, but cudaEventSynchronize(), cudaEventElapsedTime(), and cudaEventQuery() are not. As cudaEventElapsedTime() is not supported, cudaEvents must be created via cudaEventCreateWithFlags(), passing the cudaEventDisableTiming flag.
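A sketch of the supported usage, in which one stream waits for work in another (the child kernels producer and consumer are hypothetical names):

```cuda
__global__ void producer(int *data);  // hypothetical child kernels
__global__ void consumer(int *data);

__global__ void parent(int *data) {
    if (threadIdx.x == 0) {
        cudaStream_t s1, s2;
        cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
        cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

        // Events must be created with the cudaEventDisableTiming flag.
        cudaEvent_t e;
        cudaEventCreateWithFlags(&e, cudaEventDisableTiming);

        producer<<<1, 32, 0, s1>>>(data);
        cudaEventRecord(e, s1);
        cudaStreamWaitEvent(s2, e, 0);   // s2 waits for the producer
        consumer<<<1, 32, 0, s2>>>(data);

        cudaDeviceSynchronize();
        cudaEventDestroy(e);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }
}
```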
As for all device runtime objects, event objects may be shared between all threads within the thread-block which created them, but are local to that block and may not be passed to other kernels, or between blocks within the same kernel. Event handles are not guaranteed to be unique between blocks, so using an event handle within a block that did not create it will result in undefined behavior.
Synchronization
The cudaDeviceSynchronize() function will synchronize on all work launched by any thread in the thread-block up to the point where cudaDeviceSynchronize() was called. Note that cudaDeviceSynchronize() may be called from within divergent code (see below). It is up to the program to perform sufficient additional inter-thread synchronization, for example via a call to __syncthreads(), if the calling thread is intended to synchronize with child grids invoked from other threads.
Block-Wide Synchronization
The cudaDeviceSynchronize() function does not imply intra-block synchronization. In particular, without explicit synchronization via a __syncthreads() directive, the calling thread can make no assumptions about what work has been launched by any thread other than itself. For example, if multiple threads within a block are each launching work and synchronization is desired for all this work at once (perhaps because of event-based dependencies), it is up to the program to guarantee that this work is submitted by all threads before calling cudaDeviceSynchronize().
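The recommended ordering can be sketched as follows (the child kernel is hypothetical):

```cuda
__global__ void child(int *data);  // hypothetical child kernel

__global__ void parent(int *data) {
    // Every thread launches some work...
    child<<<1, 1>>>(data + threadIdx.x);

    // ...and the intervening __syncthreads() guarantees all launches
    // have been submitted before any thread synchronizes on them.
    __syncthreads();

    if (threadIdx.x == 0)
        cudaDeviceSynchronize();  // waits on work launched by all threads

    // Make the children's results visible to every parent thread.
    __syncthreads();
}
```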
Because the implementation is permitted to synchronize on launches from any thread in the block, it is quite possible that simultaneous calls to cudaDeviceSynchronize() by multiple threads will drain all work in the first call and then have no effect for the later calls.
Device Management
Only the device on which a kernel is running will be controllable from that kernel. This means that device APIs such as cudaSetDevice() are not supported by the device runtime. The active device as seen from the GPU (returned from cudaGetDevice()) will have the same device number as seen from the host system. The cudaGetDeviceProperty() call may request information about another device, as this API allows specification of a device ID
as a parameter of the call. Note that the catch-all cudaGetDeviceProperties() API is not offered by the device runtime; properties must be queried individually.
Memory Declarations

Device and Constant Memory
Memory declared at file scope with __device__ or __constant__ qualifiers behaves identically when using the device runtime. All kernels may read or write __device__ variables, whether the kernel was initially launched by the host or device runtime. Equivalently, all kernels will have the same view of __constant__s as declared at the module scope.
Textures & Surfaces
CUDA supports dynamically created texture and surface objects [1], where a texture reference may be created on the host, passed to a kernel, used by that kernel, and then destroyed from the host. The device runtime does not allow creation or destruction of texture or surface objects from within device code, but texture and surface objects created from the host may be used and passed around freely on the device. Regardless of where they are created, dynamically created texture objects are always valid and may be passed to child kernels from a parent.
NOTE: The device runtime does not support legacy module-scope (i.e. Fermi-style) textures and surfaces within a kernel launched from the device. Module-scope (legacy) textures may be created from the host and used in device code as for any kernel, but may only be used by a top-level kernel (i.e. one which is launched from the host).
Shared Memory Variable Declarations
In CUDA C/C++, shared memory can be declared either as a statically sized file-scope or function-scoped variable, or as an extern variable with the size determined at runtime by the kernel's caller via a launch configuration argument. Both types of declarations are valid under the device runtime.

[1] Dynamically created texture and surface objects are an addition to the CUDA memory model introduced with CUDA 5.0. Please see the CUDA Programming Guide for details.
__device__ void permute_data(int *smem, int n);  // defined elsewhere

__global__ void permute(int n, int *data) {
    extern __shared__ int smem[];
    if (n <= 1)
        return;

    smem[threadIdx.x] = data[threadIdx.x];
    __syncthreads();

    permute_data(smem, n);
    __syncthreads();

    // Write back to GMEM since we can't pass SMEM to children.
    data[threadIdx.x] = smem[threadIdx.x];
    __syncthreads();

    if (threadIdx.x == 0) {
        permute<<<1, 256, n/2*sizeof(int)>>>(n/2, data);
        permute<<<1, 256, n/2*sizeof(int)>>>(n/2, data+n/2);
    }
}

void host_launch(int *data) {
    permute<<<1, 256, 256*sizeof(int)>>>(256, data);
}
Symbol Addresses
Device-side symbols (i.e. those marked __device__) may be referenced from within a kernel simply via the "&" operator, as all global-scope device variables are in the kernel's visible address space. This also applies to __constant__ symbols, although in this case the pointer will reference read-only data.
Given that device-side symbols can be referenced directly, those CUDA runtime APIs which reference symbols (e.g. cudaMemcpyToSymbol() or cudaGetSymbolAddress()) are redundant and hence not supported by the device runtime. Note this implies that __constant__ data cannot be altered from within a running kernel, even ahead of a child kernel launch, as references to __constant__ space are read-only.
API Errors and Launch Failures
As usual for the CUDA runtime, any function may return an error code. The last error code returned is recorded and may be retrieved via the cudaGetLastError() call. Errors are recorded per-thread, so that each thread can identify the most recent error that it has generated. The error code is of type cudaError_t.
Similar to a host-side launch, device-side launches may fail for many reasons (invalid arguments, etc.). The user must call cudaGetLastError() to determine if a launch generated
an error; however, lack of an error after launch does not imply the child kernel completed successfully.
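A sketch of checking for launch errors from device code (the child kernel is hypothetical):

```cuda
__global__ void child(void);  // hypothetical child kernel

__global__ void parent(void) {
    if (threadIdx.x == 0) {
        child<<<1, 1>>>();
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            // The launch itself failed (e.g. invalid configuration).
            // Note: cudaSuccess here does NOT mean the child ran to
            // completion, only that the launch was accepted.
        }
        cudaDeviceSynchronize();
    }
}
```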
For device-side exceptions, e.g., access to an invalid address, an error in a child grid will be returned to the host instead of being returned by the parent's call to cudaDeviceSynchronize().
Launch Setup APIs
Kernel launch is a system-level mechanism exposed through the device runtime library, and as such is available directly from PTX via the underlying cudaGetParameterBuffer() and cudaLaunchDevice() APIs. It is permitted for a CUDA application to call these APIs itself, with the same requirements as for PTX. In both cases, the user is then responsible for correctly populating all necessary data structures in the correct format according to specification. Backwards compatibility is guaranteed in these data structures.
As with host-side launch, the device-side operator <<<>>> maps to underlying kernel launch APIs. This is so that users targeting PTX will be able to enact a launch, and so that the compiler front-end can translate <<<>>> into these calls.
Runtime API Launch Functions   Description of Difference From Host Runtime Behaviour
                               (behaviour is identical if no description)

cudaGetParameterBuffer         Generated automatically from <<<>>>. Note different
                               API to host equivalent.
cudaLaunchDevice               Generated automatically from <<<>>>. Note different
                               API to host equivalent.

New device-only launch implementation functions
The APIs for these launch functions are different to those of the CUDA Runtime API, and are defined as follows:

extern __device__ cudaError_t cudaGetParameterBuffer(void **params);
extern __device__ cudaError_t cudaLaunchDevice(void *kernel,
                                               void *params, dim3 gridDim,
                                               dim3 blockDim,
                                               unsigned int sharedMemSize = 0,
                                               cudaStream_t stream = 0);
API Reference
The portions of the CUDA Runtime API supported in the device runtime are detailed here. Host and device runtime APIs have identical syntax; semantics are the same except where indicated. The table below provides an overview of the API relative to the version available from the host.
Runtime API Functions        Details

cudaDeviceSynchronize        Synchronizes on work launched from thread's own
                             block only
cudaDeviceGetCacheConfig
cudaDeviceGetLimit
cudaGetLastError             Last error is per-thread state, not per-block state
cudaPeekAtLastError
cudaGetErrorString
cudaGetDeviceCount
cudaGetDeviceProperty        Will return properties for any device
cudaGetDevice                Always returns current device ID as would be seen
                             from host
cudaStreamCreateWithFlags    Must pass cudaStreamNonBlocking flag
cudaStreamDestroy
cudaStreamWaitEvent
cudaEventCreateWithFlags     Must pass cudaEventDisableTiming flag
cudaEventRecord
cudaEventDestroy
cudaFuncGetAttributes
cudaMemcpyAsync              Notes about all memcpy/memset functions:
cudaMemcpy2DAsync            - Only async memcpy/set functions are supported
cudaMemcpy3DAsync            - Only device-to-device memcpy is permitted
cudaMemsetAsync              - May not pass in local or shared memory pointers
cudaMemset2DAsync
cudaMemset3DAsync
cudaRuntimeGetVersion
cudaMalloc                   May not call cudaFree on the device on a pointer
cudaFree                     created on the host, and vice-versa

Supported API functions
DEVICE-SIDE LAUNCH FROM PTX
This section is for the programming language and compiler implementers who target Parallel Thread Execution (PTX) and plan to support Dynamic Parallelism in their language. It provides the low-level details related to supporting kernel launches at the PTX level.
Kernel Launch APIs
Device-side kernel launches can be implemented using the following two APIs accessible from PTX: cudaLaunchDevice() and cudaGetParameterBuffer(). cudaLaunchDevice() launches the specified kernel with the parameter buffer that is obtained by calling cudaGetParameterBuffer() and filled with the parameters to the launched kernel. The parameter buffer can be NULL, i.e., there is no need to invoke cudaGetParameterBuffer(), if the launched kernel does not take any parameters.
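Using the CUDA-level declarations of these two functions given later in this section, a device-side launch might be sketched as follows (the child kernel and its parameters are hypothetical):

```cuda
__global__ void child(int x, float y);  // hypothetical child kernel

__device__ void launch_via_api(void) {
    // Request a buffer for an int and a float; the alignment passed is
    // the largest alignment of any parameter (4 bytes here).
    void *buf = cudaGetParameterBuffer(4, sizeof(int) + sizeof(float));

    // Populate the buffer following the parameter buffer layout rules:
    // int at offset 0, float at offset 4.
    *(int *)buf = 42;
    *(float *)((char *)buf + sizeof(int)) = 3.0f;

    // Launch one block of 32 threads, no dynamic shared memory, stream 0.
    cudaLaunchDevice((void *)child, buf, dim3(1), dim3(32), 0, 0);
}
```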
cudaLaunchDevice
At the PTX level, cudaLaunchDevice() needs to be declared in one of the two forms shown below before it is used.
// When .address_size is 64
.extern .func(.param .b32 func_retval0) cudaLaunchDevice
(
  .param .b64 func,
  .param .b64 parameterBuffer,
  .param .align 4 .b8 gridDimension[12],
  .param .align 4 .b8 blockDimension[12],
  .param .b32 sharedMemSize,
  .param .b64 stream
)
;

PTX-level declaration of cudaLaunchDevice() when .address_size is 64
// When .address_size is 32
.extern .func(.param .b32 func_retval0) cudaLaunchDevice
(
  .param .b32 func,
  .param .b32 parameterBuffer,
  .param .align 4 .b8 gridDimension[12],
  .param .align 4 .b8 blockDimension[12],
  .param .b32 sharedMemSize,
  .param .b32 stream
)
;

PTX-level declaration of cudaLaunchDevice() when .address_size is 32
The CUDA-level declaration below is mapped to one of the aforementioned PTX-level declarations and is found in the system header file cuda_device_runtime_api.h. The function is defined in the cudadevrt system library, which must be linked with a program in order to use device-side kernel launch functionality.
extern "C" __device__
cudaError_t cudaLaunchDevice(void *func, void *parameterBuffer,
                             dim3 gridDimension, dim3 blockDimension,
                             unsigned int sharedMemSize,
                             cudaStream_t stream);

CUDA-level declaration of cudaLaunchDevice()
The first parameter is a pointer to the kernel to be launched, and the second parameter is the parameter buffer that holds the actual parameters to the launched kernel. The layout of the parameter buffer is explained in "Parameter Buffer Layout", below. Other parameters specify the launch configuration, i.e., grid dimension, block dimension, shared memory size, and the stream associated with the launch (please refer to the CUDA Programming Guide for the detailed description of launch configuration, and of cudaLaunchDevice() specifically).
cudaGetParameterBuffer
cudaGetParameterBuffer() needs to be declared at the PTX level before it's used. The PTX-level declaration must be in one of the two forms given below, depending on address size:
// When .address_size is 64
.extern .func(.param .b64 func_retval0) cudaGetParameterBuffer
(
  .param .b64 alignment,
  .param .b64 size
)
;

PTX-level declaration of cudaGetParameterBuffer() when .address_size is 64
// When .address_size is 32
.extern .func(.param .b32 func_retval0) cudaGetParameterBuffer
(
.param .b32 alignment,
.param .b32 size
)
;
PTX-level declaration of cudaGetParameterBuffer() when .address_size is 32
The following CUDA-level declaration of cudaGetParameterBuffer() is mapped to the aforementioned PTX-level declaration:
extern "C" __device__
void *cudaGetParameterBuffer(size_t alignment, size_t size);
CUDA-level declaration of cudaGetParameterBuffer()
The first parameter specifies the alignment requirement of the parameter buffer and the second parameter the size requirement in bytes. In the current implementation, the parameter buffer returned by cudaGetParameterBuffer() is always guaranteed to be 64-byte aligned, and the alignment requirement parameter is ignored. However, it is recommended to pass the correct alignment requirement value (which is the largest alignment of any parameter to be placed in the parameter buffer) to cudaGetParameterBuffer() to ensure portability in the future.
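As an illustrative sketch (not taken from the toolkit headers; childKernel and launchChild are hypothetical names), the following device code launches a two-parameter child kernel through this API pair instead of the <<<>>> syntax. The parameter offsets follow the layout rule described in "Parameter Buffer Layout", below:

```cuda
// Hypothetical sketch: launch childKernel(int, float) via the device
// runtime API rather than the <<<>>> launch syntax.
__global__ void childKernel(int a, float b);

__device__ void launchChild(int a, float b)
{
    // Largest parameter alignment is 4 bytes; total size is 8 bytes.
    void *buf = cudaGetParameterBuffer(4, 8);
    if (buf == NULL)
        return;                         // launch pool may be exhausted

    *(int *)buf = a;                    // offset 0: first parameter
    *(float *)((char *)buf + 4) = b;    // offset 4: next multiple of 4

    // One block of one thread, no dynamic shared memory, NULL stream.
    cudaLaunchDevice((void *)childKernel, buf, dim3(1), dim3(1),
                     0, (cudaStream_t)0);
}
```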
Parameter Buffer Layout
Parameter reordering in the parameter buffer is prohibited, and each individual parameter placed in the parameter buffer is required to be aligned. That is, each parameter must be placed at the nth byte in the parameter buffer, where n is the smallest multiple of the parameter size that is greater than the offset of the last byte taken by the preceding parameter. The maximum size of the parameter buffer is 4KB.
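The alignment rule above can be sketched in host-side C++ (illustrative only; paramOffsets is not a toolkit function):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compute each parameter's offset in the parameter buffer: a parameter
// of size s is placed at the smallest multiple of s that lies beyond
// the last byte taken by the preceding parameter.
std::vector<std::size_t> paramOffsets(const std::vector<std::size_t>& sizes)
{
    std::vector<std::size_t> offsets;
    std::size_t end = 0;                          // one past the last used byte
    for (std::size_t s : sizes) {
        std::size_t off = (end + s - 1) / s * s;  // round end up to a multiple of s
        offsets.push_back(off);
        end = off + s;
    }
    return offsets;
}
```

For example, an int, a double, and another int (sizes 4, 8, and 4 bytes) are placed at offsets 0, 8, and 16, occupying 20 bytes of the 4KB buffer.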
For a more detailed description of PTX code generated by the CUDA compiler, please refer to the PTX-3.5 specification.
TOOLKIT SUPPORT FOR DYNAMIC PARALLELISM
Including Device Runtime API in CUDA Code
Similar to the host-side runtime API, prototypes for the CUDA device runtime API are included automatically during program compilation. There is no need to include cuda_device_runtime_api.h explicitly.
Compiling and Linking
CUDA programs are automatically linked with the host runtime library when compiled with nvcc, but the device runtime is shipped as a static library which must explicitly be linked with a program which wishes to use it.
The device runtime is offered as a static library (cudadevrt.lib on Windows, libcudadevrt.a under Linux and MacOS), against which a GPU application that uses the device runtime must be linked. Linking of device libraries can be accomplished through nvcc and/or nvlink. Two simple examples are shown below.
A device runtime program may be compiled and linked in a single step, if all required source files can be specified from the command line:
$ nvcc -arch=sm_35 -rdc=true hello_world.cu -o hello -lcudadevrt
It is also possible to compile CUDA .cu source files first to object files, and then link these together in a two-stage process:
$ nvcc -arch=sm_35 -dc hello_world.cu -o hello_world.o
$ nvcc -arch=sm_35 -rdc=true hello_world.o -o hello -lcudadevrt
Please see the "Using Separate Compilation" section of "The CUDA Driver Compiler NVCC" guide for more details.
PROGRAMMING GUIDELINES
BASICS
The device runtime is a functional subset of the host runtime. API-level device management, kernel launching, device memcpy, stream management, and event management are exposed from the device runtime.
Programming for the device runtime should be familiar to someone who already has experience with CUDA. Device runtime syntax and semantics are largely the same as that of the host API, with any exceptions detailed earlier in this document.
The following example shows a simple "Hello World" program incorporating dynamic parallelism:

#include <cstdio>

__global__ void childKernel()
{
    printf("Hello ");
}

__global__ void parentKernel()
{
    // launch child
    childKernel<<<1,1>>>();
    if (cudaSuccess != cudaGetLastError()) {
        return;
    }

    // wait for child to complete
    if (cudaSuccess != cudaDeviceSynchronize()) {
        return;
    }

    printf("World!\n");
}

int main(int argc, char *argv[])
{
    // launch parent
    parentKernel<<<1,1>>>();
    if (cudaSuccess != cudaGetLastError()) {
        return 1;
    }

    // wait for parent to complete
    if (cudaSuccess != cudaDeviceSynchronize()) {
        return 2;
    }

    return 0;
}
This program may be built in a single step from the command line as follows:
$ nvcc -arch=sm_35 -rdc=true hello_world.cu -o hello -lcudadevrt
PERFORMANCE
Synchronization
Synchronization by one thread may impact the performance of other threads in the same Thread Block, even when those other threads do not call cudaDeviceSynchronize() themselves. This impact will depend upon the underlying implementation.
Dynamic-parallelism-enabled Kernel Overhead
System software which is active when controlling dynamic launches may impose an overhead on any kernel which is running at the time, whether or not it invokes kernel launches of its own. This overhead arises from the device runtime's execution tracking and management software, and may result in decreased performance for e.g. library calls when made from the device compared to from the host side. This overhead is, in general, incurred for applications that link against the device runtime library.
IMPLEMENTATION RESTRICTIONS & LIMITATIONS
Dynamic Parallelism guarantees all semantics described in this document; however, certain hardware and software resources are implementation-dependent and limit the scale, performance and other properties of a program which uses the device runtime.
Runtime
Memory Footprint
The device runtime system software reserves memory for various management purposes, in particular one reservation which is used for saving parent-grid state during synchronization, and a second reservation for tracking pending grid launches. Configuration controls are available to reduce the size of these reservations in exchange for certain launch limitations. See Configuration Options, below, for details.
The majority of reserved memory is allocated as backing-store for parent kernel state, for use when synchronizing on a child launch. Conservatively, this memory must support storing of state for the maximum number of live threads possible on the device. This means that each parent generation at which cudaDeviceSynchronize() is callable may
require up to 150MB of device memory, depending on the device configuration, which will be unavailable for program use even if it is not all consumed.
Nesting & Synchronization Depth
Using the device runtime, one kernel may launch another kernel, and that kernel may launch another, and so on. Each subordinate launch is considered a new "nesting level", and the total number of levels is the "nesting depth" of the program. The "synchronization depth" is defined as the deepest level at which the program will explicitly synchronize on a child launch. Typically this is one less than the nesting depth of the program, but if the program does not need to call cudaDeviceSynchronize() at all levels then the synchronization depth might be substantially different to the nesting depth.
The overall maximum nesting depth is limited to 24, but practically speaking the real limit will be the amount of memory required by the system for each new level (see Memory Footprint above). Any launch which would result in a kernel at a deeper level than the maximum will fail. Note that this may also apply to cudaMemcpyAsync(), which might itself generate a kernel launch. See Configuration Options, below, for details.
By default, sufficient storage is reserved for two levels of synchronization. This maximum synchronization depth (and hence reserved storage) may be controlled by calling cudaDeviceSetLimit() and specifying cudaLimitDevRuntimeSyncDepth. The number of levels to be supported must be configured before the top-level kernel is launched from the host, in order to guarantee successful execution of a nested program. Calling cudaDeviceSynchronize() at a depth greater than the specified maximum synchronization depth will return an error.
An optimization is permitted where the system detects that it need not reserve space for the parent's state in cases where the parent kernel never calls cudaDeviceSynchronize(). In this case, because explicit parent/child synchronization never occurs, the memory footprint required for a program will be much less than the conservative maximum. Such a program could specify a shallower maximum synchronization depth to avoid over-allocation of backing store.
Pending Kernel Launches
When a kernel is launched, all associated configuration and parameter data is tracked until the kernel completes. This data is stored within a system-managed launch pool. The size of the launch pool is configurable by calling cudaDeviceSetLimit() from the host and specifying cudaLimitDevRuntimePendingLaunchCount.
Configuration Options
Resource allocation for the device runtime system software is controlled via the cudaDeviceSetLimit() API from the host program. Limits must be set before any kernel is launched, and may not be changed while the GPU is actively running programs.
The following named limits may be set:
Limit: cudaLimitDevRuntimeSyncDepth
Behavior: Sets the maximum depth at which cudaDeviceSynchronize() may be called. Launches may be performed deeper than this, but explicit synchronization deeper than this limit will return cudaErrorLaunchMaxDepthExceeded. The default maximum sync depth is 2.

Limit: cudaLimitDevRuntimePendingLaunchCount
Behavior: Controls the amount of memory set aside for buffering kernel launches which have not yet begun to execute, due either to unresolved dependencies or lack of execution resources. When the buffer is full, launches will set the thread's last error to cudaErrorLaunchPendingCountExceeded. The default pending launch count is 2048 launches.
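As a host-side configuration sketch (the limit values and parentKernel are illustrative), both limits can be raised before the first kernel launch:

```cuda
// Illustrative host-side configuration: reserve backing store for four
// levels of cudaDeviceSynchronize() and room for 4096 pending launches.
// These calls must run before any kernel is launched.
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

parentKernel<<<1, 1>>>();   // the top-level launch sees the new limits
```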
Memory Allocation and Lifetime
cudaMalloc() and cudaFree() have distinct semantics between the host and device environments. When invoked from the host, cudaMalloc() allocates a new region from unused device memory. When invoked from the device runtime, these functions map to device-side malloc() and free(). This implies that within the device environment the total allocatable memory is limited to the device malloc() heap size, which may be smaller than the available unused device memory. Also, it is an error to invoke cudaFree() from the host program on a pointer which was allocated by cudaMalloc() on the device, or vice-versa.
                       cudaMalloc() on Host    cudaMalloc() on Device
cudaFree() on Host     Supported               Not Supported
cudaFree() on Device   Not Supported           Supported
Allocation limit       Free device memory      cudaLimitMallocHeapSize
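A minimal device-side sketch of the pairing rule above (the kernel name is illustrative):

```cuda
// Sketch: memory obtained from cudaMalloc() on the device comes from the
// device malloc() heap, and must be released with cudaFree() on the device.
__global__ void worker()
{
    int *scratch = NULL;
    if (cudaSuccess == cudaMalloc((void **)&scratch, 256 * sizeof(int))) {
        // ... use scratch ...
        cudaFree(scratch);  // freeing this pointer from the host is an error
    }
}
```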
SM Id and Warp Id
Note that in PTX %smid and %warpid are defined as volatile values. The device runtime may reschedule thread blocks onto different SMs in order to more efficiently manage resources. As such, it is unsafe to rely upon %smid or %warpid remaining unchanged across the lifetime of a thread or thread block.
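For illustration, %smid can be read via inline PTX as sketched below (currentSmId is a hypothetical helper); given the volatility described above, the result should be treated only as a transient hint, never as a stable identifier:

```cuda
// Sketch: read the volatile %smid special register via inline PTX.
__device__ unsigned int currentSmId()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;   // may differ on the next read if the block is rescheduled
}
```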
ECC Errors
No notification of ECC errors is available to code within a CUDA kernel. ECC errors are reported at the host side once the entire launch tree has completed. Any ECC errors which arise during execution of a nested program will either generate an exception or continue execution (depending upon error and configuration).
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA, the NVIDIA logo, and are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.
Copyright
© 2012 NVIDIA Corporation. All rights reserved.