



August 2012

CUDA DYNAMIC PARALLELISM
PROGRAMMING GUIDE






TABLE OF CONTENTS

Introduction
    Overview
    Glossary
Execution Environment and Memory Model
    Execution Environment
        Parent and Child Grids
        Scope of CUDA Primitives
        Synchronization
        Streams & Events
        Ordering and Concurrency
        Device Management
    Memory Model
        Coherence and Consistency
Programming Interface
    CUDA C/C++ Reference
        Device-Side Kernel Launch
        Streams
        Events
        Synchronization
        Device Management
        Memory Declarations
        API Errors and Launch Failures
        API Reference
    Device-Side Launch From PTX
        Kernel Launch APIs
        Parameter Buffer Layout
    Toolkit Support for Dynamic Parallelism
        Including device runtime API in CUDA code
        Compiling and Linking
Programming Guidelines
    Basics
    Performance
        Synchronization
        Dynamic-parallelism-enabled Kernel overhead
    Implementation Restrictions & Limitations
        Runtime






INTRODUCTION

This document provides guidance on how to design and develop software that takes advantage of the new Dynamic Parallelism capabilities introduced with CUDA 5.0.

OVERVIEW

Dynamic Parallelism is an extension to the CUDA programming model enabling a CUDA kernel to create and synchronize with new work directly on the GPU. Creating parallelism dynamically, at whatever point in a program it is needed, offers exciting new capabilities.

The ability to create work directly from the GPU can reduce the need to transfer execution control and data between host and device, as launch configuration decisions can now be made at runtime by threads executing on the device. Additionally, data-dependent parallel work can be generated inline within a kernel at run-time, taking advantage of the GPU's hardware schedulers and load balancers dynamically and adapting in response to data-driven decisions or workloads. Algorithms and programming patterns that had previously required modifications to eliminate recursion, irregular loop structure, or other constructs that do not fit a flat, single level of parallelism may be expressed more transparently.

This document describes the extended capabilities of CUDA which enable Dynamic Parallelism, including the modifications and additions to the CUDA programming model necessary to take advantage of these, as well as guidelines and best practices for exploiting this added capacity.

Dynamic Parallelism is only supported by devices of compute capability 3.5 and higher.


GLOSSARY

Definitions for terms used in this guide.

Grid
A Grid is a collection of Threads. Threads in a Grid execute a Kernel Function and are divided into Thread Blocks.

Thread Block
A Thread Block is a group of threads which execute on the same multiprocessor (SMX). Threads within a Thread Block have access to shared memory and can be explicitly synchronized.

Kernel Function
A Kernel Function is an implicitly parallel subroutine that executes under the CUDA execution and memory model for every Thread in a Grid.

Host
The Host refers to the execution environment that initially invoked CUDA, typically the thread running on a system's CPU processor.

Parent
A Parent Thread, Thread Block, or Grid is one that has launched new grid(s), the Child Grid(s). The Parent is not considered complete until all of its launched Child Grids have also completed.

Child
A Child thread, block, or grid is one that has been launched by a Parent grid. A Child grid must complete before the Parent Thread, Thread Block, or Grid is considered complete.

Thread Block Scope
Objects with Thread Block Scope have the lifetime of a single Thread Block. They only have defined behavior when operated on by Threads in the Thread Block that created the object, and are destroyed when the Thread Block that created them is complete.

Device Runtime
The Device Runtime refers to the runtime system and APIs available to enable Kernel Functions to use Dynamic Parallelism.






EXECUTION ENVIRONMENT AND MEMORY MODEL

EXECUTION ENVIRONMENT

The CUDA execution model is based on primitives of threads, thread blocks, and grids, with kernel functions defining the program executed by individual threads within a thread block and grid. When a kernel function is invoked the grid's properties are described by an execution configuration, which has a special syntax in CUDA. Support for dynamic parallelism in CUDA extends the ability to configure, launch, and synchronize upon new grids to threads that are running on the device.



Parent and Child Grids

A device thread that configures and launches a new grid belongs to the parent grid, and the grid created by the invocation is a child grid.

The invocation and completion of child grids is properly nested, meaning that the parent grid is not considered complete until all child grids created by its threads have completed. Even if the invoking threads do not explicitly synchronize on the child grids launched, the runtime guarantees an implicit synchronization between the parent and child.
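As a minimal sketch of this nesting guarantee (the kernel names here are hypothetical, not part of any API), a parent may launch a child and never synchronize on it explicitly; the parent grid still does not complete until the child has finished:

    __global__ void child_kernel()
    {
        // ... child work ...
    }

    __global__ void parent_kernel()
    {
        // Launch a child grid with no explicit synchronization.
        child_kernel<<<1, 32>>>();

        // No cudaDeviceSynchronize() here: completion of parent_kernel is
        // nevertheless deferred until child_kernel completes (implicit
        // synchronization between parent and child).
    }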






Scope of CUDA Primitives

On both host and device, the CUDA runtime offers an API for launching kernels, for waiting for launched work to complete, and for tracking dependencies between launches via streams and events. On the host system, the state of launches and the CUDA primitives referencing streams and events are shared by all threads within a process; however processes execute independently and may not share CUDA objects.

A similar hierarchy exists on the device: launched kernels and CUDA objects are visible to all threads in a thread block, but are independent between thread blocks. This means for example that a stream may be created by one thread and used by any other thread in the same thread block, but may not be shared with threads in any other thread block.



Synchronization

CUDA runtime operations from any thread, including kernel launches, are visible across a thread block. This means that an invoking thread in the parent grid may perform synchronization on the grids launched by that thread, by other threads in the thread block, or on streams created within the same thread block. Execution of a thread block is not considered complete until all launches by all threads in the block have completed. If all threads in a block exit before all child launches have completed, a synchronization operation will automatically be triggered.




Streams & Events

CUDA Streams and Events allow control over dependencies between grid launches: grids launched into the same stream execute in-order, and events may be used to create dependencies between streams. Streams and events created on the device serve this exact same purpose.

Streams and events created within a grid exist within thread block scope but have undefined behavior when used outside of the thread block where they were created. As described above, all work launched by a thread block is implicitly synchronized when the block exits; work launched into streams is included in this, with all dependencies resolved appropriately. The behavior of operations on a stream that has been modified outside of thread block scope is undefined.

Streams and events created on the host have undefined behavior when used within any kernel, just as streams and events created by a parent grid have undefined behavior if used within a child grid.



Ordering and Concurrency

The ordering of kernel launches from the device runtime follows CUDA Stream ordering semantics. Within a thread block, all kernel launches into the same stream are executed in-order. With multiple threads in the same thread block launching into the same stream, the ordering within the stream is dependent on the thread scheduling within the block, which may be controlled with synchronization primitives such as __syncthreads().

Note that because streams are shared by all threads within a thread block, the implicit 'NULL' stream is also shared. If multiple threads in a thread block launch into the implicit stream, then these launches will be executed in-order. If concurrency is desired, explicit named streams should be used.
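A sketch of that per-thread named-stream pattern follows (child_kernel and its argument are hypothetical); each thread launches into its own stream so that the child grids become eligible to run concurrently:

    __global__ void child_kernel(int *data);

    __global__ void parent_kernel(int *data)
    {
        // Each thread creates its own named stream. Device streams must
        // be created with the cudaStreamNonBlocking flag (see Streams,
        // in the Programming Interface chapter).
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

        // Launches from different threads now target different streams,
        // so they *may* run concurrently (concurrency is not guaranteed).
        child_kernel<<<1, 32, 0, s>>>(data);

        cudaStreamDestroy(s);   // resources released once work completes
    }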

Dynamic Parallelism enables concurrency to be expressed more easily within a program; however, the device runtime introduces no new concurrency guarantees within the CUDA execution model. There is no guarantee of concurrent execution between any number of different thread blocks on a device.

The lack of concurrency guarantee extends to parent thread blocks and their child grids. When a parent thread block launches a child grid, the child is not guaranteed to begin execution until the parent thread block reaches an explicit synchronization point (e.g. cudaDeviceSynchronize()).


While concurrency will often easily be achieved, it may vary as a function of device configuration, application workload, and runtime scheduling. It is therefore unsafe to depend upon any concurrency between different thread blocks.



Device Management

There is no multi-GPU support from the device runtime; the device runtime is only capable of operating on the device upon which it is currently executing. It is permitted, however, to query properties for any CUDA capable device in the system.



MEMORY MODEL

Parent and child grids share the same global and constant memory storage, but have distinct local and shared memory.



Coherence and Consistency

Global Memory

Parent and child grids have coherent access to global memory, with weak consistency guarantees between child and parent. There are two points in the execution of a child grid when its view of memory is fully consistent with the parent thread: when the child grid is invoked by the parent, and when the child grid completes as signaled by a synchronization API invocation in the parent thread.

All global memory operations in the parent thread prior to the child grid's invocation are visible to the child grid. All memory operations of the child grid are visible to the parent after the parent has synchronized on the child grid's completion.

In the following example, the child grid executing child_launch is only guaranteed to see the modifications to data made before the child grid was launched. Since thread 0 of the parent is performing the launch, the child will be consistent with the memory seen by thread 0 of the parent. Due to the first __syncthreads() call, the child will see data[0]=0, data[1]=1, ..., data[255]=255 (without the __syncthreads() call, only data[0] would be guaranteed to be seen by the child). When the child grid returns, thread 0 is guaranteed to see modifications made by the threads in its child grid. Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call:

    __global__ void child_launch(int *data) {
        data[threadIdx.x] = data[threadIdx.x] + 1;
    }

    __global__ void parent_launch(int *data) {
        data[threadIdx.x] = threadIdx.x;

        __syncthreads();

        if (threadIdx.x == 0) {
            child_launch<<<1, 256>>>(data);
            cudaDeviceSynchronize();
        }

        __syncthreads();
    }

    void host_launch(int *data) {
        parent_launch<<<1, 256>>>(data);
    }



Zero-Copy Memory

Zero-copy system memory has identical coherence and consistency guarantees to global memory, and follows the semantics detailed above. A kernel may not allocate or free zero-copy memory, but may use pointers to zero-copy memory passed in from the host program.
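As a host-side sketch of that pattern (buffer names and parent_kernel are hypothetical), zero-copy memory is allocated on the host with the mapped-allocation APIs and its device pointer is passed into the launch tree:

    // Host code: allocate mapped (zero-copy) memory, get its device pointer.
    float *h_buf = NULL, *d_buf = NULL;
    cudaHostAlloc((void **)&h_buf, 1024 * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);

    parent_kernel<<<1, 256>>>(d_buf);   // kernel may forward d_buf to children
    cudaDeviceSynchronize();
    cudaFreeHost(h_buf);                // allocation/free stay on the host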



Constant Memory

Constants are immutable and may not be modified from the device, even between parent and child launches. That is to say, the value of all __constant__ variables must be set from the host prior to launch. Constant memory is inherited automatically by all child kernels from their respective parents.

Taking the address of a constant memory object from within a kernel thread has the same semantics as for all CUDA programs, and passing that pointer from parent to child or from a child to parent is naturally supported.



Shared and Local Memory

Shared and Local memory is private to a thread block or thread, respectively, and is not visible or coherent between parent and child. Behavior is undefined when an object in one of these locations is referenced outside of the scope within which it belongs, and may cause an error.

The NVIDIA compiler will attempt to warn if it can detect that a pointer to local or shared memory is being passed as an argument to a kernel launch. At runtime, the programmer may use the __isGlobal() intrinsic to determine whether a pointer references global memory and so may safely be passed to a child launch.

Note that calls to cudaMemcpyAsync() or cudaMemsetAsync() may invoke new child kernels on the device in order to preserve stream semantics. As such, passing shared or local memory pointers to these APIs is illegal and will return an error.
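A minimal sketch of such a runtime check (the kernel name child is hypothetical):

    __global__ void child(int *p);

    __device__ void launch_if_safe(int *p)
    {
        // __isGlobal() returns nonzero if the generic address points into
        // global memory, i.e. it is safe to hand to a child grid.
        if (__isGlobal(p)) {
            child<<<1, 1>>>(p);
        }
    }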

Local Memory

Local memory is private storage for an executing thread, and is not visible outside of that thread. It is illegal to pass a pointer to local memory as a launch argument when launching a child kernel. The result of dereferencing such a local memory pointer from a child will be undefined.

For example the following is illegal, with undefined behavior if x_array is accessed by child_launch:

    int x_array[10];    // Creates x_array in parent's local memory
    child_launch<<<1, 1>>>(x_array);

It is sometimes difficult for a programmer to be aware of when a variable is placed into local memory by the compiler. As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap, either with cudaMalloc(), new() or by declaring __device__ storage at global scope.

For example:

    // Correct - "value" is global storage
    __device__ int value;
    __device__ void x() {
        value = 5;
        child<<<1, 1>>>(&value);
    }

    // Invalid - "value" is local storage
    __device__ void y() {
        int value = 5;
        child<<<1, 1>>>(&value);
    }





Texture Memory

Writes to the global memory region over which a texture is mapped are incoherent with respect to texture accesses. Coherence for texture memory is enforced at the invocation of a child grid and when a child grid completes. This means that writes to memory prior to a child kernel launch are reflected in texture memory accesses of the child. Similarly, writes to memory by a child will be reflected in the texture memory accesses by a parent, but only after the parent synchronizes on the child's completion. Concurrent accesses by parent and child may result in inconsistent data.








PROGRAMMING INTERFACE

CUDA C/C++ REFERENCE

This section describes changes and additions to the CUDA C/C++ language extensions for supporting Dynamic Parallelism.

The language interface and API available to CUDA kernels using CUDA C/C++ for Dynamic Parallelism, referred to as the Device Runtime, is substantially like that of the CUDA Runtime API available on the host. Where possible the syntax and semantics of the CUDA Runtime API have been retained in order to facilitate ease of code reuse for routines that may run in either the host or device environments.

As with all code in CUDA C/C++, the APIs and code outlined here are per-thread code. This enables each thread to make unique, dynamic decisions regarding what kernel or operation to execute next. There are no synchronization requirements between threads within a block to execute any of the provided device runtime APIs, which enables the device runtime API functions to be called in arbitrarily divergent kernel code without deadlock.



Device-Side Kernel Launch

Kernels may be launched from the device using the standard CUDA <<<>>> syntax (a fully configured launch is sketched after the parameter list):

    kernel_name<<< Dg, Db, Ns, S >>>([kernel arguments]);

- Dg is of type dim3 and specifies the dimensions and size of the grid
- Db is of type dim3 and specifies the dimensions and size of each thread block
- Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per thread block for this call, in addition to statically allocated memory. Ns is an optional argument that defaults to 0.
- S is of type cudaStream_t and specifies the stream associated with this call. The stream must have been allocated in the same thread block where the call is being made. S is an optional argument that defaults to 0.
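The following minimal sketch shows a device-side launch specifying all four configuration arguments; the child kernel and its argument are hypothetical:

    __global__ void child(int *data);   // hypothetical child kernel

    __global__ void parent(int *data)
    {
        // Device streams must be created with cudaStreamNonBlocking.
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

        // Dg = 2 blocks, Db = 128 threads, Ns = 128*sizeof(int) bytes of
        // dynamic shared memory per block, S = the stream created above.
        child<<<2, 128, 128 * sizeof(int), s>>>(data);

        cudaStreamDestroy(s);   // resources released once work completes
    }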



Launches are Asynchronous

Identical to host-side launches, all device-side kernel launches are asynchronous with respect to the launching thread. That is to say, the <<<>>> launch command will return immediately and the launching thread will continue to execute until it hits an explicit launch-synchronization point such as cudaDeviceSynchronize(). The grid launch is posted to the device and will execute independently of the parent thread. The child grid may begin execution at any time after launch, but is not guaranteed to begin execution until the launching thread reaches an explicit launch-synchronization point.



Launch Environment Configuration

All global device configuration settings (e.g. shared memory & L1 cache size as returned from cudaDeviceGetCacheConfig(), and device limits returned from cudaDeviceGetLimit()) will be inherited from the parent. That is to say if, when the parent is launched, execution is configured globally for 16k of shared memory and 48k of L1 cache, then the child's execution state will be configured identically. Likewise, device limits such as stack size will remain as-configured.

For host-launched kernels, per-kernel configurations set from the host will take precedence over the global setting. These configurations will be used when the kernel is launched from the device as well. It is not possible to reconfigure a kernel's environment from the device.
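As a host-side sketch (the kernel name parent is hypothetical), a per-kernel cache preference set before the top-level launch is also the configuration that any device-side launches of that kernel will use:

    // Host code: a per-kernel cache preference takes precedence over the
    // global setting and applies to device-side launches of this kernel too.
    cudaFuncSetCacheConfig(parent, cudaFuncCachePreferShared);
    parent<<<1, 256>>>(data);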



Launch From __host__ __device__ Functions

Although the device runtime enables kernel launches from either the host or device, kernel launches from __host__ __device__ functions are unsupported. The compiler will fail to compile if a __host__ __device__ function is used to launch a kernel.






Streams

Both named and unnamed (NULL) streams are available from the device runtime. Named streams may be used by any thread within a thread-block, but stream handles may not be passed to other blocks or child/parent kernels. In other words, a stream should be treated as private to the block in which it is created. Stream handles are not guaranteed to be unique between blocks, so using a stream handle within a block that did not allocate it will result in undefined behavior.

Similar to host-side launch, work launched into separate streams may run concurrently, but actual concurrency is not guaranteed. Programs that depend upon concurrency between child kernels are not supported by the CUDA programming model and will have undefined behavior.

The host-side NULL stream's cross-stream barrier semantic is not supported on the device (see below for details). In order to retain semantic compatibility with the host runtime, all device streams must be created using the cudaStreamCreateWithFlags() API, passing the cudaStreamNonBlocking flag. The cudaStreamCreate() call is a host-runtime-only API and will fail to compile for the device.

As cudaStreamSynchronize() and cudaStreamQuery() are unsupported by the device runtime, cudaDeviceSynchronize() should be used instead when the application needs to know that stream-launched child kernels have completed.
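A sketch of that substitution (kernel names hypothetical):

    __global__ void child_a(void);
    __global__ void child_b(void);

    __global__ void parent(void)
    {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

        child_a<<<1, 32, 0, s>>>();
        child_b<<<1, 32, 0, s>>>();   // in-order with child_a in stream s

        // cudaStreamSynchronize(s) is unavailable on the device; wait on
        // all work launched by this thread block instead.
        cudaDeviceSynchronize();
        cudaStreamDestroy(s);
    }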



The Implicit (NULL) Stream

Within a host program, the unnamed (NULL) stream has additional barrier synchronization semantics with other streams (see the CUDA Programming Guide for details). The device runtime offers a single implicit, unnamed stream shared between all threads in a block, but as all named streams must be created with the cudaStreamNonBlocking flag, work launched into the NULL stream will not insert an implicit dependency on pending work in any other streams.



Events

Only the inter-stream synchronization capabilities of CUDA events are supported. This means that cudaStreamWaitEvent() is supported, but cudaEventSynchronize(), cudaEventElapsedTime(), and cudaEventQuery() are not. As cudaEventElapsedTime() is not supported, cudaEvents must be created via cudaEventCreateWithFlags(), passing the cudaEventDisableTiming flag.




As for all device runtime objects, event objects may be shared between all threads within the thread-block which created them, but are local to that block and may not be passed to other kernels, or between blocks within the same kernel. Event handles are not guaranteed to be unique between blocks, so using an event handle within a block that did not create it will result in undefined behavior.
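A sketch of the supported inter-stream usage (the producer/consumer kernels are hypothetical):

    __global__ void producer(void);
    __global__ void consumer(void);

    __global__ void parent(void)
    {
        cudaStream_t s1, s2;
        cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
        cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

        // Device events must be created with cudaEventDisableTiming.
        cudaEvent_t e;
        cudaEventCreateWithFlags(&e, cudaEventDisableTiming);

        producer<<<1, 32, 0, s1>>>();
        cudaEventRecord(e, s1);          // marks completion of producer
        cudaStreamWaitEvent(s2, e, 0);   // s2 waits on the event
        consumer<<<1, 32, 0, s2>>>();

        cudaDeviceSynchronize();         // drain before destroying objects
        cudaEventDestroy(e);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }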



Synchronization

The cudaDeviceSynchronize() function will synchronize on all work launched by any thread in the thread-block up to the point where cudaDeviceSynchronize() was called. Note that cudaDeviceSynchronize() may be called from within divergent code (see below). It is up to the program to perform sufficient additional inter-thread synchronization, for example via a call to __syncthreads(), if the calling thread is intended to synchronize with child grids invoked from other threads.



Block-Wide Synchronization

The cudaDeviceSynchronize() function does not imply intra-block synchronization. In particular, without explicit synchronization via a __syncthreads() directive the calling thread can make no assumptions about what work has been launched by any thread other than itself. For example if multiple threads within a block are each launching work and synchronization is desired for all this work at once (perhaps because of event-based dependencies), it is up to the program to guarantee that this work is submitted by all threads before calling cudaDeviceSynchronize().

Because the implementation is permitted to synchronize on launches from any thread in the block, it is quite possible that simultaneous calls to cudaDeviceSynchronize() by multiple threads will drain all work in the first call and then have no effect for the later calls.
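A sketch of the submit-then-synchronize pattern (the kernel name child is hypothetical):

    __global__ void child(int i);

    __global__ void parent(void)
    {
        // Every thread submits its own child grid.
        child<<<1, 32>>>(threadIdx.x);

        // Guarantee all threads have submitted their launches...
        __syncthreads();

        // ...before one thread waits on everything launched by the block.
        if (threadIdx.x == 0) {
            cudaDeviceSynchronize();
        }
        __syncthreads();
    }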



Device Management

Only the device on which a kernel is running will be controllable from that kernel. This means that device APIs such as cudaSetDevice() are not supported by the device runtime. The active device as seen from the GPU (returned from cudaGetDevice()) will have the same device number as seen from the host system. The cudaGetDeviceProperty() call may request information about another device, as this API allows specification of a device ID as a parameter of the call. Note that the catch-all cudaGetDeviceProperties() API is not offered by the device runtime; properties must be queried individually.
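As a brief sketch using only supported device-runtime calls:

    __global__ void report_device(void)
    {
        int dev = -1, count = 0;
        cudaGetDevice(&dev);          // same device ID as seen from the host
        cudaGetDeviceCount(&count);   // other devices may be queried, not used

        if (threadIdx.x == 0) {
            printf("kernel running on device %d of %d\n", dev, count);
        }
    }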



Memory Declarations

Device and Constant Memory

Memory declared at file scope with __device__ or __constant__ qualifiers behaves identically when using the device runtime. All kernels may read or write __device__ variables, whether the kernel was initially launched by the host or device runtime. Equivalently, all kernels will have the same view of __constant__s as declared at the module scope.



Textures & Surfaces

CUDA supports dynamically created texture and surface objects[1], where a texture reference may be created on the host, passed to a kernel, used by that kernel, and then destroyed from the host. The device runtime does not allow creation or destruction of texture or surface objects from within device code, but texture and surface objects created from the host may be used and passed around freely on the device. Regardless of where they are created, dynamically created texture objects are always valid and may be passed to child kernels from a parent.

NOTE: The device runtime does not support legacy module-scope (i.e. Fermi-style) textures and surfaces within a kernel launched from the device. Module-scope (legacy) textures may be created from the host and used in device code as for any kernel, but may only be used by a top-level kernel (i.e. the one which is launched from the host).
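As a host-side sketch of the supported pattern (d_data, n, and the parent kernel are hypothetical), a texture object is created on the host and handed into the launch tree:

    // Host code: wrap a linear device buffer in a texture object.
    cudaTextureObject_t tex = 0;

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_data;                       // device buffer
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

    parent<<<1, 256>>>(tex);         // kernel may pass tex on to child grids
    cudaDeviceSynchronize();
    cudaDestroyTextureObject(tex);   // destroyed from the host, as required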



Shared Memory Variable Declarations

In CUDA C/C++ shared memory can be declared either as a statically sized file-scope or function-scoped variable, or as an extern variable with the size determined at runtime by the kernel's caller via a launch configuration argument. Both types of declarations are valid under the device runtime.

[1] Dynamically created texture and surface objects are an addition to the CUDA memory model introduced with CUDA 5.0. Please see the CUDA Programming Guide for details.




    __global__ void permute(int n, int *data) {
        extern __shared__ int smem[];
        if (n <= 1)
            return;

        smem[threadIdx.x] = data[threadIdx.x];
        __syncthreads();

        permute_data(smem, n);
        __syncthreads();

        // Write back to GMEM since we can't pass SMEM to children.
        data[threadIdx.x] = smem[threadIdx.x];
        __syncthreads();

        if (threadIdx.x == 0) {
            permute<<<1, 256, n/2*sizeof(int)>>>(n/2, data);
            permute<<<1, 256, n/2*sizeof(int)>>>(n/2, data+n/2);
        }
    }

    void host_launch(int *data) {
        permute<<<1, 256, 256*sizeof(int)>>>(256, data);
    }



Symbol Addresses

Device-side symbols (i.e. those marked __device__) may be referenced from within a kernel simply via the '&' operator, as all global-scope device variables are in the kernel's visible address space. This also applies to __constant__ symbols, although in this case the pointer will reference read-only data.

Given that device-side symbols can be referenced directly, those CUDA runtime APIs which reference symbols (e.g. cudaMemcpyToSymbol() or cudaGetSymbolAddress()) are redundant and hence not supported by the device runtime. Note this implies that __constant__ data cannot be altered from within a running kernel, even ahead of a child kernel launch, as references to __constant__ space are read-only.
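For example, a sketch (symbol and kernel names hypothetical) of taking a __device__ symbol's address and passing it to a child:

    __device__ int counter;   // global-scope device symbol

    __global__ void child(int *p)
    {
        atomicAdd(p, 1);
    }

    __global__ void parent(void)
    {
        // '&counter' is valid in device code; the symbol is in the kernel's
        // visible address space, so the pointer may be passed to a child.
        child<<<1, 32>>>(&counter);
        cudaDeviceSynchronize();
    }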



API Errors and Launch Failures

As usual for the CUDA runtime, any function may return an error code. The last error code returned is recorded and may be retrieved via the cudaGetLastError() call. Errors are recorded per-thread, so that each thread can identify the most recent error that it has generated. The error code is of type cudaError_t.

Similar to a host-side launch, device-side launches may fail for many reasons (invalid arguments, etc). The user must call cudaGetLastError() to determine if a launch generated an error; however, lack of an error after launch does not imply the child kernel completed successfully.

For device-side exceptions, e.g., access to an invalid address, an error in a child grid will be returned to the host instead of being returned by the parent's call to cudaDeviceSynchronize().
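A sketch of per-thread launch-error checking (the child kernel is hypothetical):

    __global__ void child(void);

    __global__ void parent(void)
    {
        child<<<1, 32>>>();

        // Launch errors (e.g. an invalid configuration) surface via the
        // per-thread last error, not via the <<<>>> expression itself.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            printf("child launch failed: %s\n", cudaGetErrorString(err));
            return;
        }
        cudaDeviceSynchronize();   // still does not report child exceptions
    }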



Launch Setup APIs

Kernel launch is a system-level mechanism exposed through the device runtime library, and as such is available directly from PTX via the underlying cudaGetParameterBuffer() and cudaLaunchDevice() APIs. It is permitted for a CUDA application to call these APIs itself, with the same requirements as for PTX. In both cases, the user is then responsible for correctly populating all necessary data structures in the correct format according to specification. Backwards compatibility is guaranteed in these data structures.

As with host-side launch, the device-side operator <<<>>> maps to underlying kernel launch APIs. This is so that users targeting PTX will be able to enact a launch, and so that the compiler front-end can translate <<<>>> into these calls.



Runtime API Launch Functions    Description of Difference From Host Runtime Behaviour (behaviour is identical if no description)

cudaGetParameterBuffer          Generated automatically from <<<>>>. Note different API to host equivalent.
cudaLaunchDevice                Generated automatically from <<<>>>. Note different API to host equivalent.

Table: New Device-only launch implementation functions



The APIs for these launch functions are different to those of the CUDA Runtime API, and are defined as follows:

    extern __device__ cudaError_t cudaGetParameterBuffer(void **params);
    extern __device__ cudaError_t cudaLaunchDevice(void *kernel,
                                                   void *params, dim3 gridDim,
                                                   dim3 blockDim,
                                                   unsigned int sharedMemSize = 0,
                                                   cudaStream_t stream = 0);






API Reference

The portions of the CUDA Runtime API supported in the device runtime are detailed here. Host and device runtime APIs have identical syntax; semantics are the same except where indicated. The table below provides an overview of the API relative to the version available from the host.



Runtime API Functions        Details

cudaDeviceSynchronize        Synchronizes on work launched from thread's own block only
cudaDeviceGetCacheConfig
cudaDeviceGetLimit
cudaGetLastError             Last error is per-thread state, not per-block state
cudaPeekAtLastError
cudaGetErrorString
cudaGetDeviceCount
cudaGetDeviceProperty        Will return properties for any device
cudaGetDevice                Always returns current device ID as would be seen from host
cudaStreamCreateWithFlags    Must pass cudaStreamNonBlocking flag
cudaStreamDestroy
cudaStreamWaitEvent
cudaEventCreateWithFlags     Must pass cudaEventDisableTiming flag
cudaEventRecord
cudaEventDestroy
cudaFuncGetAttributes
cudaMemcpyAsync              Notes about all memcpy/memset functions:
cudaMemcpy2DAsync            - Only async memcpy/set functions are supported
cudaMemcpy3DAsync            - Only device-to-device memcpy is permitted
cudaMemsetAsync              - May not pass in local or shared memory pointers
cudaMemset2DAsync
cudaMemset3DAsync
cudaRuntimeGetVersion
cudaMalloc                   May not call cudaFree on the device on a pointer created
cudaFree                     on the host, and vice-versa

Table: Supported API functions






DEVICE-SIDE LAUNCH FROM PTX

This section is for the programming language and compiler implementers who target Parallel Thread Execution (PTX) and plan to support Dynamic Parallelism in their language. It provides the low-level details related to supporting kernel launches at the PTX level.



Kernel Launch APIs

Device-side kernel launches can be implemented using the following two APIs accessible from PTX: cudaLaunchDevice() and cudaGetParameterBuffer(). cudaLaunchDevice() launches the specified kernel with the parameter buffer that is obtained by calling cudaGetParameterBuffer() and filled with the parameters to the launched kernel. The parameter buffer can be NULL, i.e., there is no need to invoke cudaGetParameterBuffer(), if the launched kernel does not take any parameters.



cudaLaunchDevice

At the PTX level, cudaLaunchDevice() needs to be declared in one of the two forms shown below before it is used.

    // When .address_size is 64
    .extern .func(.param .b32 func_retval0) cudaLaunchDevice
    (
      .param .b64 func,
      .param .b64 parameterBuffer,
      .param .align 4 .b8 gridDimension[12],
      .param .align 4 .b8 blockDimension[12],
      .param .b32 sharedMemSize,
      .param .b64 stream
    )
    ;

PTX-level declaration of cudaLaunchDevice() when .address_size is 64



    // When .address_size is 32
    .extern .func(.param .b32 func_retval0) cudaLaunchDevice
    (
      .param .b32 func,
      .param .b32 parameterBuffer,
      .param .align 4 .b8 gridDimension[12],
      .param .align 4 .b8 blockDimension[12],
      .param .b32 sharedMemSize,
      .param .b32 stream
    )
    ;

PTX-level declaration of cudaLaunchDevice() when .address_size is 32



The CUDA-level declaration below is mapped to one of the aforementioned PTX-level declarations and is found in the system header file cuda_device_runtime_api.h. The function is defined in the cudadevrt system library, which must be linked with a program in order to use device-side kernel launch functionality.

    extern "C" __device__
    cudaError_t cudaLaunchDevice(void *func, void *parameterBuffer,
                                 dim3 gridDimension, dim3 blockDimension,
                                 unsigned int sharedMemSize,
                                 cudaStream_t stream);

CUDA-level declaration of cudaLaunchDevice()

The first parameter is a pointer to the kernel to be launched, and the second parameter is the parameter buffer that holds the actual parameters to the launched kernel. The layout of the parameter buffer is explained in "Parameter Buffer Layout", below. Other parameters specify the launch configuration, i.e., grid dimension, block dimension, shared memory size, and the stream associated with the launch (please refer to the CUDA Programming Guide for the detailed description of launch configuration, and of cudaLaunchDevice() specifically).



cudaGetParameterBuffer

cudaGetParameterBuffer() needs to be declared at the PTX level before it's used. The PTX-level declaration must be in one of the two forms given below, depending on address size:

    // When .address_size is 64
    .extern .func(.param .b64 func_retval0) cudaGetParameterBuffer
    (
      .param .b64 alignment,
      .param .b64 size
    )
    ;

PTX-level declaration of cudaGetParameterBuffer() when .address_size is 64







    // When .address_size is 32
    .extern .func(.param .b32 func_retval0) cudaGetParameterBuffer
    (
      .param .b32 alignment,
      .param .b32 size
    )
    ;

PTX-level declaration of cudaGetParameterBuffer() when .address_size is 32

The following CUDA-level declaration of cudaGetParameterBuffer() is mapped to the aforementioned PTX-level declaration:

    extern "C" __device__
    void *cudaGetParameterBuffer(size_t alignment, size_t size);

CUDA-level declaration of cudaGetParameterBuffer()

The first parameter specifies the alignment requirement of the parameter buffer and the second parameter the size requirement in bytes. In the current implementation, the parameter buffer returned by cudaGetParameterBuffer() is always guaranteed to be 64-byte aligned, and the alignment requirement parameter is ignored. However, it is recommended to pass the correct alignment requirement value, which is the largest alignment of any parameter to be placed in the parameter buffer, to cudaGetParameterBuffer() to ensure portability in the future.



Parameter Buffer Layout

Parameter reordering in the parameter buffer is prohibited, and each individual parameter placed in the parameter buffer is required to be aligned. That is, each parameter must be placed at the nth byte in the parameter buffer, where n is the smallest multiple of the parameter size that is greater than the offset of the last byte taken by the preceding parameter. The maximum size of the parameter buffer is 4KB.
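As an illustrative sketch under these rules (the kernel child and its parameters are hypothetical), a manual launch equivalent to child<<<1, 32>>>(a, b) might populate the buffer as follows:

    __global__ void child(int a, float b);   // hypothetical kernel

    __device__ void manual_launch(int a, float b)
    {
        // Largest parameter alignment is 4 bytes; total size is 8 bytes.
        void *buf = cudaGetParameterBuffer(4, 8);
        if (buf != NULL) {
            *(int   *)((char *)buf + 0) = a;  // offset 0
            *(float *)((char *)buf + 4) = b;  // next multiple of 4 past byte 3
            cudaLaunchDevice((void *)child, buf, dim3(1), dim3(32), 0, 0);
        }
    }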

For a more detailed description of PTX code generated by the CUDA compiler, please refer to the PTX-3.5 specification.






TOOLKIT SUPPORT FOR DYNAMIC PARALLELISM

Including device runtime API in CUDA code

Similar to the host-side runtime API, prototypes for the CUDA device runtime API are included automatically during program compilation. There is no need to include cuda_device_runtime_api.h explicitly.



Compiling and Linking

CUDA programs are automatically linked with the host runtime library when compiled with nvcc, but the device runtime is shipped as a static library which must explicitly be linked with a program which wishes to use it.

The device runtime is offered as a static library (cudadevrt.lib on Windows, libcudadevrt.a under Linux and MacOS), against which a GPU application that uses the device runtime must be linked. Linking of device libraries can be accomplished through nvcc and/or nvlink. Two simple examples are shown below.

A device runtime program may be compiled and linked in a single step, if all required source files can be specified from the command line:

    $ nvcc -arch=sm_35 -rdc=true hello_world.cu -o hello -lcudadevrt

It is also possible to compile CUDA .cu source files first to object files, and then link these together in a two-stage process:

    $ nvcc -arch=sm_35 -dc hello_world.cu -o hello_world.o
    $ nvcc -arch=sm_35 -rdc=true hello_world.o -o hello -lcudadevrt

Please see the "Using Separate Compilation" section of "The CUDA Driver Compiler NVCC" guide for more details.








PROGRAMMING GUIDELINES

BASICS

The device runtime is a functional subset of the host runtime. API level device management, kernel launching, device memcpy, stream management, and event management are exposed from the device runtime.

Programming for the device runtime should be familiar to someone who already has experience with CUDA. Device runtime syntax and semantics are largely the same as that of the host API, with any exceptions detailed earlier in this document.






The following example shows a simple "Hello World" program incorporating dynamic parallelism:

    #include <stdio.h>

    __global__ void childKernel()
    {
        printf("Hello ");
    }

    __global__ void parentKernel()
    {
        // launch child
        childKernel<<<1,1>>>();
        if (cudaSuccess != cudaGetLastError()) {
            return;
        }

        // wait for child to complete
        if (cudaSuccess != cudaDeviceSynchronize()) {
            return;
        }

        printf("World!\n");
    }

    int main(int argc, char *argv[])
    {
        // launch parent
        parentKernel<<<1,1>>>();
        if (cudaSuccess != cudaGetLastError()) {
            return 1;
        }

        // wait for parent to complete
        if (cudaSuccess != cudaDeviceSynchronize()) {
            return 2;
        }

        return 0;
    }



This program may be built in a single step from the command line as follows:

    $ nvcc -arch=sm_35 -rdc=true hello_world.cu -o hello -lcudadevrt




PERFORMANCE

Synchronization

Synchronization by one thread may impact the performance of other threads in the same Thread Block, even when those other threads do not call cudaDeviceSynchronize() themselves. This impact will depend upon the underlying implementation.



Dynamic-parallelism-enabled Kernel Overhead

System software which is active when controlling dynamic launches may impose an overhead on any kernel which is running at the time, whether or not it invokes kernel launches of its own. This overhead arises from the device runtime's execution tracking and management software and may result in decreased performance for e.g. library calls when made from the device compared to from the host side. This overhead is, in general, incurred for applications that link against the device runtime library.



IMPLEMENTATION RESTRICTIONS & LIMITATIONS

Dynamic Parallelism guarantees all semantics described in this document; however, certain hardware and software resources are implementation-dependent and limit the scale, performance and other properties of a program which uses the device runtime.

Runtime

Memory Footprint

The device runtime system software reserves memory for various management purposes, in particular one reservation which is used for saving parent-grid state during synchronization, and a second reservation for tracking pending grid launches. Configuration controls are available to reduce the size of these reservations in exchange for certain launch limitations. See Configuration Options, below, for details.

The majority of reserved memory is allocated as backing-store for parent kernel state, for use when synchronizing on a child launch. Conservatively, this memory must support storing of state for the maximum number of live threads possible on the device. This means that each parent generation at which cudaDeviceSynchronize() is callable may require up to 150MB of device memory, depending on the device configuration, which will be unavailable for program use even if it is not all consumed.



Nesting & Synchronization Depth

Using the device runtime, one kernel may launch another kernel, and that kernel may launch another, and so on. Each subordinate launch is considered a new "nesting level", and the total number of levels is the "nesting depth" of the program. The "synchronization depth" is defined as the deepest level at which the program will explicitly synchronize on a child launch. Typically this is one less than the nesting depth of the program, but if the program does not need to call cudaDeviceSynchronize() at all levels then the synchronization depth might be substantially different to the nesting depth.

The overall maximum nesting depth is limited to 24, but practically speaking the real limit will be the amount of memory required by the system for each new level (see Memory Footprint above). Any launch which would result in a kernel at a deeper level than the maximum will fail. Note that this may also apply to cudaMemcpyAsync(), which might itself generate a kernel launch. See Configuration Options, below, for details.

By default, sufficient storage is reserved for two levels of synchronization. This maximum synchronization depth (and hence reserved storage) may be controlled by calling cudaDeviceSetLimit() and specifying cudaLimitDevRuntimeSyncDepth. The number of levels to be supported must be configured before the top-level kernel is launched from the host, in order to guarantee successful execution of a nested program. Calling cudaDeviceSynchronize() at a depth greater than the specified maximum synchronization depth will return an error.

An optimization is permitted where the system detects that it need not reserve space for the parent's state in cases where the parent kernel never calls cudaDeviceSynchronize(). In this case, because explicit parent/child synchronization never occurs, the memory footprint required for a program will be much less than the conservative maximum. Such a program could specify a shallower maximum synchronization depth to avoid over-allocation of backing store.



Pending Kernel Launches

When a kernel is launched, all associated configuration and parameter data is tracked until the kernel completes. This data is stored within a system-managed launch pool. The size of the launch pool is configurable by calling cudaDeviceSetLimit() from the host and specifying cudaLimitDevRuntimePendingLaunchCount.




Configuration Options

Resource allocation for the device runtime system software is controlled via the cudaDeviceSetLimit() API from the host program. Limits must be set before any kernel is launched, and may not be changed while the GPU is actively running programs.

The following named limits may be set:



Limit                                      Behaviour

cudaLimitDevRuntimeSyncDepth               Sets the maximum depth at which
                                           cudaDeviceSynchronize() may be
                                           called. Launches may be performed
                                           deeper than this, but explicit
                                           synchronization deeper than this
                                           limit will return
                                           cudaErrorLaunchMaxDepthExceeded.
                                           The default maximum sync depth
                                           is 2.

cudaLimitDevRuntimePendingLaunchCount      Controls the amount of memory set
                                           aside for buffering kernel
                                           launches which have not yet begun
                                           to execute, due either to
                                           unresolved dependencies or lack of
                                           execution resources. When the
                                           buffer is full, launches will set
                                           the thread's last error to
                                           cudaErrorLaunchPendingCountExceeded.
                                           The default pending launch count
                                           is 2048 launches.
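As a host-side sketch (the top-level kernel and its argument are hypothetical), both limits may be raised before the top-level launch:

    // Host code, before launching the top-level kernel:
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);

    parent<<<1, 256>>>(data);   // nested launches may now sync to depth 4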



Memory Allocation and Lifetime

cudaMalloc() and cudaFree() have distinct semantics between the host and device environments. When invoked from the host, cudaMalloc() allocates a new region from unused device memory. When invoked from the device runtime these functions map to device-side malloc() and free(). This implies that within the device environment the total allocatable memory is limited to the device malloc() heap size, which may be smaller than the available unused device memory. Also, it is an error to invoke cudaFree() from the host program on a pointer which was allocated by cudaMalloc() on the device, or vice-versa.

                          cudaMalloc() on Host    cudaMalloc() on Device
    cudaFree() on Host    Supported               Not Supported
    cudaFree() on Device  Not Supported           Supported
    Allocation limit      Free device memory      cudaLimitMallocHeapSize
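A device-side sketch of these semantics (kernel name hypothetical):

    __global__ void scratch_kernel(int n)
    {
        // Device-side cudaMalloc() draws from the device malloc() heap
        // (sized by cudaLimitMallocHeapSize), not from free device memory.
        int *tmp = NULL;
        if (cudaMalloc((void **)&tmp, n * sizeof(int)) == cudaSuccess) {
            // ... use tmp, possibly passing it to child grids ...
            cudaFree(tmp);   // must be freed on the device, never the host
        }
    }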




SM Id and Warp Id

Note that in PTX %smid and %warpid are defined as volatile values. The device runtime may reschedule thread blocks onto different SMs in order to more efficiently manage resources. As such, it is unsafe to rely upon %smid or %warpid remaining unchanged across the lifetime of a thread or thread block.



ECC Errors

No notification of ECC errors is available to code within a CUDA kernel. ECC errors are reported at the host side once the entire launch tree has completed. Any ECC errors which arise during execution of a nested program will either generate an exception or continue execution (depending upon error and configuration).






Notice

ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA, the NVIDIA logo, and CUDA are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright

© 2012 NVIDIA Corporation. All rights reserved.
