Java锁优化
1.同步的原理
JVM规范规定JVM基于进入和退出Monitor对象来实现方法同步和代码块同步,但两者的实现细节不一样。代码块同步是使用monitorenter和monitorexit指令实现,而方法同步是使用另外一种方式实现的,细节在JVM规范里并没有详细说明,但是方法的同步同样可以使用这两个指令来实现。monitorenter指令是在编译后插入到同步代码块的开始位置,而monitorexit是插入到方法结束处和异常处,JVM要保证每个monitorenter必须有对应的monitorexit与之配对。任何对象都有一个monitor与之关联,当且一个monitor被持有后,它将处于锁定状态。线程执行到monitorenter指令时,将会尝试获取对象所对应的monitor的所有权,即尝试获得对象的锁。
2.Java对象头
锁存在Java对象头里。如果对象是数组类型,则虚拟机用3个Word(字宽)存储对象头,如果对象是非数组类型,则用2字宽存储对象头。在32位虚拟机中,一字宽等于四字节,即32bit。
长度 内容 说明
32/64bit MarkWord 存储对象的hashCode或锁信息等
32/64bit ClassMetadataAddress 存储到对象类型数据的指针
32/64bit Arraylength 数组的长度(如果当前对象是数组)
Java对象头里的MarkWord里默认存储对象的HashCode,分代年龄和锁标记位。32位JVM的MarkWord的默认存储结构如下:
25bit 4bit
1bit
是否是偏向锁
2bit
锁标志位
无锁状态 对象的hashCode 对象分代年龄 0 01
在运行期间MarkWord里存储的数据会随着锁标志位的变化而变化。MarkWord可能变化为存储以下4种数据:
3.几种锁的类型
线程的阻塞和唤醒需要CPU从用户态转为核心态,频繁的阻塞和唤醒对CPU来说是一件负担很重的工作。
JavaSE1.6为了减少获得锁和释放锁所带来的性能消耗,引入了“偏向锁”和“轻量级锁”,所以在JavaSE1.6里锁一共有四种状态,无锁状态,偏向锁状态,轻量级锁状态和重量级锁状态,它会随着竞争情况逐渐升级。
锁可以升级但不能降级,意味着偏向锁升级成轻量级锁后不能降级成偏向锁。这种锁升级却不能降级的策略,目的是为了提高获得锁和释放锁的效率。
3.1偏向锁
Hotspot的作者经过以往的研究发现大多数情况下锁不仅不存在多线程竞争,而且总是由同一线程多次获得。偏向锁的目的是在某个线程获得锁之后,消除这个线程锁重入(CAS)的开销,看起来让这个线程得到了偏护。
偏向锁的进一步理解
偏向锁的释放不需要做任何事情,这也就意味着加过偏向锁的MarkValue会一直保留偏向锁的状态,因此即便同一个线程持续不断地加锁解锁,也是没有开销的。
另一方面,偏向锁比轻量锁更容易被终结,轻量锁是在有锁竞争出现时升级为重量锁,而一般偏向锁是在有不同线程申请锁时升级为轻量锁,这也就意味着假如一个对象先被线程1加锁解锁,再被线程2加锁解锁,这过程中没有锁冲突,也一样会发生偏向锁失效,不同的是这回要先退化为无锁的状态,再加轻量锁,如图:
另外,JVM对那种会有多线程加锁,但不存在锁竞争的情况也做了优化,听起来比较拗口,但在现实应用中确实是可能出现这种情况,因为线程之前除了互斥之外也可能发生同步关系,被同步的两个线程(一前一后)对共享对象锁的竞争很可能是没有冲突的。对这种情况,JVM用一个epoch表示一个偏向锁的时间戳(真实地生成一个时间戳代价还是蛮大的,因此这里应当理解为一种类似时间戳的identifier),对epoch,官方是这么解释的:
Asimilarmechanism,calledbulkrebiasing,optimizessituationsinwhichobjectsofaclassarelockedandunlockedbydifferentthreadsbutneverconcurrently.Itinvalidatesthebiasofallinstancesofaclasswithoutdisablingbiasedlocking.Anepochvalueintheclassactsasatimestampthatindicatesthevalidityofthebias.Thisvalueiscopiedintotheheaderworduponobjectallocation.Bulkrebiasingcanthenefficientlybeimplementedasanincrementoftheepochintheappropriateclass.Thenexttimeaninstanceofthisclassisgoingtobelocked,thecodedetectsadifferentvalueintheheaderwordandrebiasestheobjecttowardsthecurrentthread.
偏向锁的获取
当一个线程访问同步块并获取锁时,会在对象头和栈帧中的锁记录里存储锁偏向的线程ID,以后该线程在进入和退出同步块时不需要花费CAS操作来加锁和解锁,而只需简单的测试一下对象头的MarkWord里是否存储着指向当前线程的偏向锁,如果测试成功,表示线程已经获得了锁,如果测试失败,则需要再测试下MarkWord中偏向锁的标识是否设置成1(表示当前是偏向锁),如果没有设置,则使用CAS竞争锁,如果设置了,则尝试使用CAS将对象头的偏向锁指向当前线程。
偏向锁的撤销
偏向锁使用了一种等到竞争出现才释放锁的机制,所以当其他线程尝试竞争偏向锁时,持有偏向锁的线程才会释放锁。偏向锁的撤销,需要等待全局安全点(在这个时间点上没有字节码正在执行),它会首先暂停拥有偏向锁的线程,然后检查持有偏向锁的线程是否活着,如果线程不处于活动状态,则将对象头设置成无锁状态,如果线程仍然活着,拥有偏向锁的栈会被执行,遍历偏向对象的锁记录,栈中的锁记录和对象头的MarkWord要么重新偏向于其他线程,要么恢复到无锁或者标记对象不适合作为偏向锁,最后唤醒暂停的线程。下图中的线程1演示了偏向锁初始化的流程,线程2演示了偏向锁撤销的流程。
偏向锁的设置
关闭偏向锁:偏向锁在Java6和Java7里是默认启用的,但是它在应用程序启动几秒钟之后才激活,如有必要可以使用JVM参数来关闭延迟-XX:BiasedLockingStartupDelay=0。如果你确定自己应用程序里所有的锁通常情况下处于竞争状态,可以通过JVM参数关闭偏向锁-XX:-UseBiasedLocking=false,那么默认会进入轻量级锁状态。
3.2自旋锁
线程的阻塞和唤醒需要CPU从用户态转为核心态,频繁的阻塞和唤醒对CPU来说是一件负担很重的工作。同时我们可以发现,很多对象锁的锁定状态只会持续很短的一段时间,例如整数的自加操作,在很短的时间内阻塞并唤醒线程显然不值得,为此引入了自旋锁。
所谓“自旋”,就是让线程去执行一个无意义的循环,循环结束后再去重新竞争锁,如果竞争不到继续循环,循环过程中线程会一直处于running状态,但是基于JVM的线程调度,会出让时间片,所以其他线程依旧有申请锁和释放锁的机会。
自旋锁省去了阻塞锁的时间空间(队列的维护等)开销,但是长时间自旋就变成了“忙式等待”,忙式等待显然还不如阻塞锁。所以自旋的次数一般控制在一个范围内,例如10,100等,在超出这个范围后,自旋锁会升级为阻塞锁。
对自旋锁周期的选择上,HotSpot认为最佳时间应是一个线程上下文切换的时间,但目前并没有做到。经过调查,目前只是通过汇编暂停了几个CPU周期,除了自旋周期选择,HotSpot还进行许多其他的自旋优化策略,具体如下:
如果平均负载小于CPUs则一直自旋
如果有超过(CPUs/2)个线程正在自旋,则后来线程直接阻塞
如果正在自旋的线程发现Owner发生了变化则延迟自旋时间(自旋计数)或进入阻塞如果CPU处于节电模式则停止自旋
自旋时间的最坏情况是CPU的存储延迟(CPUA存储了一个数据,到CPUB得知这个数据直接的时间差)
3.3轻量级锁
轻量级锁加锁
线程在执行同步块之前,JVM会先在当前线程的栈桢中创建用于存储锁记录的空间,并将对象头中的MarkWord复制到锁记录中,官方称为DisplacedMarkWord。然后线程尝试使用CAS将对象头中的MarkWord替换为指向锁记录的指针。如果成功,当前线程获得锁,如果失败,则自旋获取锁,当自旋获取锁仍然失败时,表示存在其他线程竞争锁(两条或两条以上的线程竞争同一个锁),则轻量级锁会膨胀成重量级锁。
轻量级锁解锁
轻量级解锁时,会使用原子的CAS操作来将DisplacedMarkWord替换回到对象头,如果成功,则表示同步过程已完成。如果失败,表示有其他线程尝试过获取该锁,则要在释放锁的同时唤醒被挂起的线程。下图是两个线程同时争夺锁,导致锁膨胀的流程图。
3.4重量级锁
重量锁在JVM中又叫对象监视器(Monitor),它很像C中的Mutex,除了具备Mutex互斥的功能,它还负责实现了Semaphore的功能,也就是说它至少包含一个竞争锁的队列,和一个信号阻塞队列(wait队列),前者负责做互斥,后一个用于做线程同步。
4.锁的优缺点对比
锁 优点 缺点 适用场景
偏向锁 加锁和解锁不需要额外的消耗,和执行非同步方法比仅存在纳秒级的差距 如果线程间存在锁竞争,会带来额外的锁撤销的消耗 适用于只有一个线程访问同步块场景
轻量级锁 竞争的线程不会阻塞,提高了程序的响应速度 如果始终得不到锁竞争的线程使用自旋会消耗CPU 追求响应时间,锁占用时间很短
重量级锁 线程竞争不使用自旋,不会消耗CPU 线程阻塞,响应时间缓慢 追求吞吐量,锁占用时间较长
BiasedLockinginHotSpot
ByDaveonAug18,2006
ThebiasedlockingschemeinHotSpotarosefromthefollowingpaperauthoredbymyself,MarkMoir,andBillScherer.Briefly,thecompare-and-swap(CAS)operationsnormallyusedtoacquireaJavamonitorincursconsiderablelocallatencyintheexecutingCPU.Sincemostobjectsarelockedbyatmostonethreadduringtheirlifetime,weallowthatthreadtobiasanobjecttowarditself.Oncebiased,thatthreadcansubsequentlylockandunlocktheobjectwithoutresortingtoexpensiveatomicinstructions.Obviously,anobjectcanbebiasedtowardatmostonethreadatanygiventime.(Werefertothatthreadasthebiasholdingthread).Ifanotherthreadtriestoacquireabiasedobject,however,weneedtorevokethebiasfromtheoriginalthread.(Atthisjuncturewecaneitherrebiastheobjectorsimplyreverttonormallockingfortheremainderoftheobject''slifetime).Thekeychallengeinrevocationistocoordinatetherevokerandtherevokee(thebiasholdingthread)--wemustensurethattherevokeedoesn''tlockorunlocktheobjectduringrevocation.Asnotedinthepaper,revocationcanbeimplementedinvariousways-signals,suspension,andsafepointstonameafew.Critically,forbiasedlockingtoproveprofitable,thebenefitconferredfromavoidingtheatomicinstructionsmustexceedtherevocationcosts.Anotherequivalentwaytoconceptualizebiasedlockingisthattheoriginalownersimplydefersunlockingtheobjectuntilitencounterscontention.Biasedlockingbearssomesemblancetotheconceptoflockreservation,whichiswell-describedinKawachiya''sdissertation.It''salsosimilarinspiritto"opportunisticlocks"(oplocks)foundinvariousfilesystems;theschemedescribeinMikeBurrows''s"Howtoimplementunnecessarymutexes"(InComputerSystems:Theory,TechnologyandApplications,Dec.2003);orChristianSiebert''sone-sidedmutualexclusionprimitives.
BiasedlockingisstrictlyaresponsetothelatencyofCAS.It''simportanttonotethatCASincurslocallatency,butdoesnotimpactscalabilityonmodernprocessors.AcommonmythisthateachCASoperation"goesonthebus",and,giventhattheinterconnectisafixedacontendedresource,useofCAScanimpairscalability.That''sfalse.CAScanbeaccomplishedlocally--thatis,withnobustransactions--ifthelineisalreadyinM-state.CASisusuallyimplementedontopoftheexistingMESIsnoop-basedcachecoherenceprotocol,butintermsofthebus,CASisnodifferentthanastore.Forexampleletssayyouhaveatrue16-waysystem.YoulaunchathreadthatCASes1billiontimestoathread-privatelocation,measuringtheelapsedtime.Ifyouthenlaunch16threads,allCASingtothread-privatelocations,theelapsedtimewillbethesame.Thethreadsdon''tinterferewithorimpedeeachotherinanyway.Ifyoulaunch16threadsallCASingtothesamelocationyou''lltypicallyseeamassiveslow-downbecauseofinterconnecttraffic.(ThesoleexceptiontothatclaimisSun''sNiagara,whichcangracefullytoleratesharingonamassivescaleastheL2$servesastheinterconnect).IfyouthenchangethatCAStoanormalstoreyou''llalsoseeasimilarslow-down;asnotedbefore,intermsofcoherencybustraffic,CASisn''tappreciablydifferentthananormalstore.SomeofthemisinformationregardingCASprobablyarisesfromtheoriginalimplementationoflock:cmpxchg(CAS)onIntelprocessors.Thelock:prefixcausedtheLOCK#signaltobeasserted,acquiringexclusiveaccesstothebus.Thisdidn''tscaleofcourse.Subsequentimplementationsoflock:cmpxchgleveragecachecoherencyprotocol--typicallysnoop-basedMESI--anddon''tassertLOCK#.Notethatlock:cmpxchgwillstilldriveLOCK#inoneextremelyexoticcase--whenthememoryaddressismisalignedandspans2cachelines.Finally,wecansafelyusecmpxchgonuniprocessorsbutmustuselock:cmpxchgonmultiprocessorsystems.Lock:cmpxchgincursmorelatency,butthenagainit''safundamentallydifferentinstructionthatcmpxchg.Lock:cmpxchgisserializing,providingbidirectionalmfence-equivalentsemantics.(Fenceorbarrierinstructionsareneverneededforuniprocessors)ThisfactmightalsohavecontributedtothemyththatCASismoreexpensiveonMPsystems.Butofcourselock:cmpxchgincursnomorelatencyona2xsystemthanonan8xsystem.
Whiledigressingonthetopicofbusoperations,letssaythataloadisfollowedcloselyinprogramorderbyastoreorCAStothesamecacheline.Ifthecachelineisnotpresentintheissuingprocessortheloadwillgeneratearequest-to-sharetransactiontogetthelineinS-stateandthestoreorCASwillresultinasubsequentrequest-to-owntransactiontoforcethelineintoM-state.This2ndtransactioncanbeavoidedonsomeplatformsbyusingaprefetch-for-writeinstructionbeforetheload,whichwillforcethelinedirectlyintoM-state.It''salsoworthmentioningthatontypicalclassicSMPsystems,pureread-sharingisveryefficient.Alltherequestingprocessorscanhavethecacheline(s)replicatedintheircaches.Butifevenoneprocessoriswritingtoasharedcacheline,thosewriteswillgenerateconsiderablecachecoherencetraffic;assumingawrite-invalidatecachecoherencepolicy(asopposedtowrite-update)thereaderswillcontinuallyre-loadthecachelinejusttohaveitsubsequentlyinvalidatedbythewriter(s).Putdifferently,loadstoacachelinearecheapifotherprocessorsareloadingfrombutnotstoringtothatsameline.Storesarecheaponlyifnootherprocessorsareconcurrentlystoringtoorloadingfromthatsameline.(Wecandrawanimpreciseanalogybetweencachecoherencyprotocolsandread-writelocksinthatforagivencachelinetherecanonlybeonewriteratanygiventime.That''stheprocessorwiththelineinM-state.Multiplereadersofthelineallowedandofcoursethelifetimeofareadercan''toverlapawrite.Unliketraditionalread-writelocks,however,thecachecoherencyprotocolallowswriterstoinvalidatereaders,sowecan''tpushtheanalogytoofar.Inatwistedsense,thecoherencyprotocolisobstruction-free).Coherencybandwidthisafixedandcontendedglobalresource,soinadditiontolocallatency,excessivesharingtrafficwillimpactoverallscalabilityandimpedetheprogressofthreadsrunningonotherprocessors.Aso-calledcoherencymiss--forexamplealoadonprocessorP1whereprocessorP2hasthecachelineinM-state--istypicallymuchslowerthananormalmiss(exceptonNiagara).Recalltoo,www.baiyuewang.netthatacquiringalockinvolvesastore(CAS,really)tothelockmetadata,soifyouhavethreadsonprocessorsP1andP2iterating,acquiringthesame,thelockacquisitionitselfwillgeneratecoherencytrafficandresultinthecache"sloshing"oftheline(s)holdingthemetadata.Generally,excessivecoherencytrafficistobeavoidedonclassicSMPsystems.Butasusual,there''sanexceptiontoanyrule,andinthiscasethatexceptionisSun''sNiagara,whichcantoleratesharinggracefully.WeshouldnowreturnbacktothetopicofCASlatency,asthatissuemotivatestheneedforbiasedlocking.
CASandthefence(AKAbarrierorMEMBAR)instructionsaresaidtobeserializing.Onmultiprocessorsystemsserializinginstructionscontroltheorderinwhichwhichstoresareloadsarevisiblewithrespecttomemory.Suchinstructionsareoftenimplementedinacrudefashiononmostcurrentprocessors.Typically,aserializinginstructionwillbringtheCPUtoanearhalt,killandinhibitanyout-of-order(OoO)instructions,andwaitforthelocalstorebuffertodrain.That''sOKforsimpleprocessors,butitresultsinconsiderablelossofperformanceonmoresophisticatedOoOprocessors.There''snofundamentalreasonwhyCASorfenceshouldbeslow.Untilrecently,thetrendwastowardworseCASandfenceperformance,butitappearsthatthistrendmaybereversing.IBMmadeconsiderableimprovementsinfencelatencybetweenthePower4andPower5.Likewise,Intelmaderemarkableimprovementsforbothlock:cmpxchgandmfencelatencyintherecentCore2processors.
Inthecaseofbiasedlocking--sinceweknowthattheobjectisunshared--it''sreasonabletoassumethattheactionsofthebiasholdingthreadwouldkeepcachelinesunderlyingtheobject''slockwordanddatainM-orE-state
Interestingly,ifaamemorybarrier,or"fence"instructionissufficientlyfasterthanCASwecanimplementbiasedlockingrevocationwithaDekker-likemechanism.Ifyou''reinterestedindelvingdeeperyoumightfindthefollowingofinterest:AsymmetricDekkerSynchronization.(ThatsamedocumentalsodescribesanoptimizationusedinHotSpotwherewewereabletoeliminateaMEMBARfromthestatetransitioncodepathonmultiprocessorsystems).
KennethRussellandDavidDetlefs(formerlyofSunLabs)haveapaperappearinginOOPSLA''06--EliminatingSynchronization-RelatedAtomicOperationswithBiasedLockingandBulkRebiasing--wheretheydescribeaschemetoreducethecostofrevocation.
It''simportanttonotethatbiasedlockingispermittedunderthetheJavaMemoryModelasclarifiedbyJSR133.RefertotheJavaLanguageSpecification,3rdedition,Chapter17,Section4.DougLea''sJSR-133CookbookisanexcellentresourceforJVMdevelopersorcuriousJavaprogrammers.
Finally,there''snotcurrentlyspaceinthemarkwordtosupportbothanidentityhashCode()valueaswellasthethreadIDneededforthebiasedlockingencoding.Giventhat,youcanavoidbiasedlockingonaper-objectbasisbycallingSystem.identityHashCode(o).Iftheobjectisalreadybiased,assigninganidentityhashCodewillresultinrevocation,otherwise,theassignmentofahashCode()willmaketheobjectineligibleforsubsequentbiasedlocking.Thispropertyisanartifactofourcurrentimplementation,ofcourse.
Asanaside,thelatencyofatomicread-modify-writeoperationssuchasCASwasinpartwhatmotivatedmyselfandAlexGarthwaitetodevelopourMostlyLock-FreeMalloc(ISMM2002)mechanism,whichusesprocessor-localallocationwithoutrequiringatomicsorthebindingofthreadstoprocessors.
SeealsoUS07814488(2002).
Category:GeneralTags:nonePermanentlinktothisentry
?Foraglimpseatthe...|Main|Letssayyou''re...?
Comments:
Thanksforwritingthisup,hopethere''smoretocome.
PostedbyMikaelGueckonAugust19,2006at12:39AMEDT#
btw,ifyouknowRussianthereareseveralotherpostsrelatedtoJIToptimization(includingBiasedLocking)here:http://blogs.sun.com/vmrobot/category/Compilers
PostedbykatyaonSeptember03,2006at01:14AMEDT#
Hi,
Ireadthisarticlewithinterest.BiasedlockingwasintroducedinJDK1.6andimproveduncontendedsynchronizationbyremovingatomiclocks.HoweverIhavereadelsewherethatAtomiclocks(AtomicIntergeretc)areusedinthejava.util.concurrentlibsandimproveperformance.Canyouexplaintheconflictplease.
Thankyou
PostedbyandyblackonJune22,2009at05:25AMEDT#
HelloAndy,
Insteadof''atomiclocks''letssay''atomicinstructions'',which,onx86typicallyhaveaLOCK:prefix(exceptXCHG).
AtomicIntegercouldlegitimatelybeimplementedinanumberofways,suchasvialocks("synchronized"oraj.u.c.ReentrantLock)orviaatomicread-modify-writeinstructionssuchasCAS/LOCK:CMPXCHGetc.WehappentousethelatterinHotSpot.Thatgivesusanon-blockingimplementationbutatthecostofanatomicinstruction.(Atomics,BTW,arebecomingcheaperasprocessordesignersputmoreeffortintheimplementation.Intelhasclearlypaidcloseattentiontoatomiclatency--it''smuchbetteronNehalem,forinstance).Thenon-blockingpropertyisgenerallydesirable.
Hypothetically,ifweimplementedAtomicIntegerwithsynchronizedthenitcouldalsobenefitfrombiasedlockinginthecasewhereonlyonethreadeveraccessedagiveninstance.Butbroadly,weexpectthej.u.cprimitivestobeusedinthecasewherethere''srealsharing.Thatis,they''reoptimizedforthecaseofrealconcurrentaccess,whichistherighttrade-off.
Regards,-Dave
PostedbyguestonJune22,2009at05:54AMEDT#
IsAtomicInteger.getAndIncrement()guaranteedtocompleteonallhwplatforms.ThewayIseeitforhwtoprovidethisguaranteelock:addAndIncrementonthatplatformmustnotallowthreadcontextswitchuntilitcompletes.Soona16threadsystemletsassumeall16threadswentforthisatthesametimetheyareallguaranteedtocompletebeforeotherthreadsrunonthosehwthreads.
PostedbybankkusonDecember11,2009at08:29PMEST#
Theprogresspropertiesaren''tspecifiedastheydependonthecapabilitiesoftheJVMandandthetheunderlyingplatform.Onanx86withLOCK:XADDitcouldbewait-free.OnotherplatformswithCASit''llbelock-free(LD...CASinaloop).OnsomeplatformsandJVMstheimplementationmaybelock-based.
Regardingcontextswitching,ifyouuseLOCK:XADDorsomeothersimilaratomictheneithertheatomicinstructioncompletesoritdoesn''t.IftheimplementationusesaLD...CASorLL..SCidiomthenwecanbepreemptedbetweentheLL/LDorCAS/SC.That''sharmless,however,asatworstwe''lljustretry.
PostedbydavediceonDecember13,2009at12:51AMEST#
HiDave
Thanksfortheclarification.CurioushowweimplementFairlocksinJUC?Clearlyticketlockingcannotbeimplementedonhwthatdoesnthaveawait-freefetchAndIncrement.
PostedbybankkusonDecember13,2009at08:22AMEST#
HiBank,goodquestion.
Thefairlocksare"fair"inthesensethatsuccessionismanagedbypassingownershipdirectlytoanotherthread,insteadofsimplyreleasingthelockandandusingcompetitivesuccession,whichcanallowso-called"barging"byotherthreadstryingtogainthelock.Wetypicallyuseanexplicitlistofthreadsblockedonalockaswewanttounparkthespecificsuccessor.(Wedothisbecausewecan''tallowunboundedspinning).
IIRC,fairlocksuseCASinthe"doorway"protocol(atleastonourreferenceplatforms),soit''spossiblethatathreadthatarrived1stinthedoorwaymightultimatelyenqueuebehindsomeotherthreadthatarrivedmorerecentlyatthelockdoorway.Soit''snotstrictlyfairinthatsense.
Regards
Dave
PostedbyDavidDiceonDecember14,2009at02:34AMEST#
InterestingIhadasimilarimplementationforaFairLock(orIthoughtwasfair)whereIwouldntreleasethelockvarinsteaddolockhandover.ThiswaytryLockstaysfairtooiftherearequeuedcontenders.
WhatIrealizedisthatwithhandovertolistofcontenderstheresstillaproblem.Addingathreadtothelistitselfisnotwait-free.PerhapstheJDKimplementationdoesntsufferfromthis.
BtwthisbloghasbeenveryeducationalformewishIhadfoudnitearlier:-)
PostedbybankkusonDecember14,2009at06:43AMEST#
HiBanks,
MCSandCLHaregoodalgorithms(atleastforstartingpoints)ifyouwanttoconstructalockwithawait-freedoorway.CLHarguablyhasbetterexecutionpropertiesbutalsorequiresasparedummynodeperlockinstance.
Regards,-Dave.
p.s.,someofthelockqueuesfor"synchronized"etc.inhotspotnativecodearelock-free.Iimplementedthemassuchtoavoid"doublecontention"whichcanoccurifwehaveahotlockandwheresome"inner"lockthatprotectsthequeuesisalsohot.
PostedbydavediceonDecember14,2009at07:00AMEST#
Dave-IreadwithinterestyourQRLpaperandtheotherassociatedmaterialsinyourpost.Forthosewhoareinterested,youcangetattheOOPSLApaperdirectlyfromSun,incaseyoudon''thaveanACMaccount:
http://java.sun.com/javase/technologies/hotspot/publications/biasedlocking_oopsla2006_wp.pdf
Iamspecificallyinterestedintheschemeyoumentionin3.2oftheQRLpaperregardingusingahalf-wordstorefollowedbyafull-wordreadtoensurethereadcannotbere-orderedbeforethewrite.
Inparticular,I''vebeentryingtodetermineifthiswillworkunderthex86memorymodel.IntelprovidesaprettydetaileddescriptionoftheIA-32behaviorin:
http://www.intel.com/Assets/PDF/manual/253668.pdf
...section8.2-buttheredoesn''tseemtobeenoughinformationtodeterminewhetherare-orderingthatwouldbreaktheQRLsemanticsispossible.
Forexample,under"8.2.3.5Intra-ProcessorForwardingIsAllowed",wehaveascenariowhereaCPUexecutingawritemayseethisright"before"writesofotherCPUs-andthatthoseotherCPUswillseetheirwritesfirst.ThisseemsdangerousfortheQRLtrick,sinceitmaybepossibleforthebiasedthreadtoseeitsownhalf-wordwritebeforeasuccessfulrevokingCASbyanotherCPUisseen,buttherevokingCPUmayseetheoppositecase-CASsucceededandnohalf-wordwriteyet.
ThisleadstotheoddcasewhereboththewriteandCAShavesucceeded,whichisclearlyimpossibleundersequentialconsistency,oranymodelwherememoryprocessesthroughaseriesofwell-definedstates.ItisnotcleartomewhetheritisallowedundertheIntelmodelhowever,because(a)theydon''tdiscussedextensivelytherelationshipbetweenregular(noLOCK/fence)reads/storesandinterlockedoperationsondifferentCPUs(theydopointoutthattheycan''tberelativelyreorderedonasingleCPU),or(b)whethertworead/storesareconsideredtobeat"sameaddress"iftheoperandswww.wang027.comhavedifferentlengths,butoneiscontainedwiththeotherand(c)don''tspecificallydiscussCASwhichisacompound,conditionalread-then-writeoperation-andwhetheritispossibleforanapparentre-orderingofastoreononeCPU(duetointra-processorforwarding)thatwouldhavecausedaCAStofail,toallowthetoCAStosucceed,butalsoletthelocalCPUseethestore.
Wellthatwasamouthful!Anyinsightyouhavewouldbeappreciated.ItismyunderstandingthatthecurrentmechanismintheJVMisnottousestoresatall,favoringinsteadaninvasiverevocationwhichretroactivelymodifiesthelockingrecords,soIguessthistopicdoesn''tdirectlythere(butI''malsocuriouswhythisapproachwasfavoredoveronewhichusestheregular-storetrickyoudescribe,whichispresumablyquitecheap).
PostedbyTravisDownsonJanuary04,2010at08:09AMEST#
HiTravis,
Thecollocationtrick(sometimesreferredto"wordtearing",althoughthattermisslightlyoverloaded),whileinteresting,isn''tguaranteedsafe.Asapracticalconcern,ifyouexecuteasubwordSTfollowedbyafullwordLDthatcoversthatstored-toaddress,mostprocessorswillstall,allowingthestoretodrainbeforetheloadissatisfied.(We''reassumingafullwordstoretotheaddressisn''talreadyinthestorebuffer,inwhichcaseanimplementation\might\mergethefullwordstorewiththesubsequentsubwordstoreandthensatisfytheloadbylookaside).InotherwordsonmostimplementationstheST;LDXtrickhasMEMBAR#storeloadsemantics,butthispropertyisn''tformallydefinedbythearchitecture,sowe''dneverconsiderusingitinaproduct.
AndinfactsincethiswaswrittenAMDandIntelhaveclarifiedtheirmemorymodels,clearlyindicatingthatsuchunsavorytricksaren''tsafeinthelongterm.(Perhapsitworkstoday,butitmaynotontomorrow''sprocessor).
Furthermore,theperformanceofthetechnique,evenifitweresomehowdeemedsafe,isn''tprofoundlybetterthanatraditionalfence,astheprocessorneedstostall.
Seealsohttp://blogs.sun.com/dave/entry/instruction_selection_for_volatile_fencesforsomemoreuptodateinformationonfence/barrier/membarcosts.
Asyoupointout,inourcurrentimplementationinhotspotweuseasafepointrendezvoustorevokebias.That''sratherheavy-handedbutit''sentirelysafe.
Regards,Dave
PostedbyDavidDiceonJanuary04,2010at02:28PMEST#
Thanks,Dave,foryourquickandcomprehensiveresponse.
Iguessthewarningin8.1.2.2ofthecurrentIntelguideisalongtheselines:
"Softwareshouldaccesssemaphores(sharedmemoryusedforsignallingbetween
multipleprocessors)usingidenticaladdressesandoperandlengths.Forexample,if
oneprocessoraccessesasemaphoreusingawordaccess,otherprocessorsshould
notaccessthesemaphoreusingabyteaccess."
Notrigorousbyanymeans(whytheystartreferringto"semaphores"hereratherthansimplememorylocationaccessasintheremainderofthesectionisbeyondme)-butahintthatthisso-called"wordtearing"maybeano-no.
PostedbyTravisDownsonJanuary04,2010at05:30PMEST#
HiTravis,Iagree--there''sagoodbitofinterpretationrequiredwhenreadingthevariousmemorymodeldocuments.BTW,Ihighlyrecommendthefollowing:http://www.cl.cam.ac.uk/~so294/documents/draft-scTSO.pdf.
Regards,Dave
PostedbyDavidDiceonJanuary05,2010at12:33AMEST#
Iwasre-readingyourpapertoday,andlookingattheHotSpotsourcecode,andwascuriousthatwhenrevokingabiasduetoalockattemptonanobjectbyathreadotherthanthebias-owneranddiscoveringthebias-ownerhastheobjectlocked,thatinadditiontomakingthebias-owner''sthreadappeartohavelightweightlockedtheobject,youdon''talsoinflatetheobject''smonitor.
Isthistoavoidmixinglockingandrevokingconcerns,orperhapsinthehopethatthebias-ownermightunlocktheobjectbeforetherevokingthreadactuallyruns.
Itjustseemslikethatsinceyou''realreadyatasafepointyoucouldavoidsomeofthespin-nastinessassociatedwithinflating.
PostedbygrahamsandersononJanuary06,2010at03:07PMEST#
HiGraham,
You''recorrectthatifobjectOisbiasedtoT1,T1hasOlogicallylocked,andT2triestolockOresultinginrevocationofbiasT1''sbiasfromO,thenwecouldtransitionOdirectlytoinflatedstate.Broadly,though,weweretryingtokeepthestatetransitionsassimpleaspossible.Asyoupointout,forthesakeofcodestewardshipsometimesit''sbettertoavoidmixingconcerns.Also,theinflationpath,asyou''dexpect,isburriedintheslow-pathcontended"enter"mechanism.It''sprudenttoleaveitthatway.Also,thedominantcaseisthatO,whilebiased,islogicallyunlocked.
Regards,-Dave
PostedbyDavidDiceonJanuary07,2010at07:07AMEST#
SomeinfoonSystem.identityHashCodeandusingittoavoidthebiasedlockingcausesanotherproblemw/thedefaultsettings:itusesos:random()thatcontendsonasingletonstate.
Thelatterresultsintoalotofsharing.-XX:hashCode=-1disablesitandreliesonMarsaliga''srandomthathasathreadlocalstate,i.e.concurrentfriendly.
PostedbyguestonApril04,2013at07:16AMEDT#
SomeinfoonSystem.identityHashCodeandusingittoavoidthebiasedlockingcausesanotherproblemw/thedefaultsettings:itusesos:random()thatcontendsonasingletonstate.
Thelatterresultsintoalotofsharing.-XX:hashCode=-1disablesitandreliesonMarsaliga''srandomthathasathreadlocalstate,i.e.concurrentfriendly.
PostedbybestsssonApril04,2013at07:17AMEDT#
ThanksforthecommentregardingidentityHashCode.Iimplementedthealternative-XX:hashCode=NvariantsafewyearsagobuttheJ2SEteamhasn''thadtimetoswitchtothedefaults.Asnoted,thedefaultisracy--that''slegalandarguablybenign--butundesirable.Theglobalstateforos::random()isalsoacoherencehotspot.WecanseetheimpactofthiswithasimplebenchmarkwhereotherwiseindependentthreadsassignhashCodes(weassignlazily)atahighrate.Thecoherencetrafficontheos::random()variablesquicklychokesscalability.Usingalternativeschemesprovidesrelief.That''spreciselywhyIprovidedtheMarsagliashift-xorPRNG.
Regards,-Dave
PostedbyguestonJune03,2013at10:13AMEDT#
Hello,
thanksforthisnicearticle.
IhavequestiontoQuicklyReacquirableLocks.Doesiunderstandit
wellthatThreadperformingRevokewillspinlockonline47untilthebiasthreadreleasesthelock,possiblyafterlongtime?
PostedbyjhronApril26,2015at06:01PMEDT#
Regardingthespinatline47,therevokerneedstowaitforthelocktobereleased.Forthepurposesofexplicationwejustusedtest-and-test-and-setlockswithunboundedspinning.Youwouldn''tdothatinpractice,butit''sfineforthepurposesofexplainingthebiasingmechanism.Regards,-Dave
PostedbyguestonApril27,2015at08:52AMEDT#
|
|