配色: 字号:
Java锁优化
2016-10-11 | 阅:  转:  |  分享 
  
Java锁优化



1.同步的原理



JVM规范规定JVM基于进入和退出Monitor对象来实现方法同步和代码块同步,但两者的实现细节不一样。代码块同步是使用monitorenter和monitorexit指令实现,而方法同步是使用另外一种方式实现的,细节在JVM规范里并没有详细说明,但是方法的同步同样可以使用这两个指令来实现。monitorenter指令是在编译后插入到同步代码块的开始位置,而monitorexit是插入到方法结束处和异常处,JVM要保证每个monitorenter必须有对应的monitorexit与之配对。任何对象都有一个monitor与之关联,当且一个monitor被持有后,它将处于锁定状态。线程执行到monitorenter指令时,将会尝试获取对象所对应的monitor的所有权,即尝试获得对象的锁。



2.Java对象头



锁存在Java对象头里。如果对象是数组类型,则虚拟机用3个Word(字宽)存储对象头,如果对象是非数组类型,则用2字宽存储对象头。在32位虚拟机中,一字宽等于四字节,即32bit。



长度 内容 说明

32/64bit MarkWord 存储对象的hashCode或锁信息等

32/64bit ClassMetadataAddress 存储到对象类型数据的指针

32/64bit Arraylength 数组的长度(如果当前对象是数组)

Java对象头里的MarkWord里默认存储对象的HashCode,分代年龄和锁标记位。32位JVM的MarkWord的默认存储结构如下:



25bit 4bit

1bit



是否是偏向锁



2bit



锁标志位



无锁状态 对象的hashCode 对象分代年龄 0 01

在运行期间MarkWord里存储的数据会随着锁标志位的变化而变化。MarkWord可能变化为存储以下4种数据:





3.几种锁的类型



线程的阻塞和唤醒需要CPU从用户态转为核心态,频繁的阻塞和唤醒对CPU来说是一件负担很重的工作。



JavaSE1.6为了减少获得锁和释放锁所带来的性能消耗,引入了“偏向锁”和“轻量级锁”,所以在JavaSE1.6里锁一共有四种状态,无锁状态,偏向锁状态,轻量级锁状态和重量级锁状态,它会随着竞争情况逐渐升级。



锁可以升级但不能降级,意味着偏向锁升级成轻量级锁后不能降级成偏向锁。这种锁升级却不能降级的策略,目的是为了提高获得锁和释放锁的效率。



3.1偏向锁



Hotspot的作者经过以往的研究发现大多数情况下锁不仅不存在多线程竞争,而且总是由同一线程多次获得。偏向锁的目的是在某个线程获得锁之后,消除这个线程锁重入(CAS)的开销,看起来让这个线程得到了偏护。





偏向锁的进一步理解



偏向锁的释放不需要做任何事情,这也就意味着加过偏向锁的MarkValue会一直保留偏向锁的状态,因此即便同一个线程持续不断地加锁解锁,也是没有开销的。



另一方面,偏向锁比轻量锁更容易被终结,轻量锁是在有锁竞争出现时升级为重量锁,而一般偏向锁是在有不同线程申请锁时升级为轻量锁,这也就意味着假如一个对象先被线程1加锁解锁,再被线程2加锁解锁,这过程中没有锁冲突,也一样会发生偏向锁失效,不同的是这回要先退化为无锁的状态,再加轻量锁,如图:





另外,JVM对那种会有多线程加锁,但不存在锁竞争的情况也做了优化,听起来比较拗口,但在现实应用中确实是可能出现这种情况,因为线程之前除了互斥之外也可能发生同步关系,被同步的两个线程(一前一后)对共享对象锁的竞争很可能是没有冲突的。对这种情况,JVM用一个epoch表示一个偏向锁的时间戳(真实地生成一个时间戳代价还是蛮大的,因此这里应当理解为一种类似时间戳的identifier),对epoch,官方是这么解释的:



Asimilarmechanism,calledbulkrebiasing,optimizessituationsinwhichobjectsofaclassarelockedandunlockedbydifferentthreadsbutneverconcurrently.Itinvalidatesthebiasofallinstancesofaclasswithoutdisablingbiasedlocking.Anepochvalueintheclassactsasatimestampthatindicatesthevalidityofthebias.Thisvalueiscopiedintotheheaderworduponobjectallocation.Bulkrebiasingcanthenefficientlybeimplementedasanincrementoftheepochintheappropriateclass.Thenexttimeaninstanceofthisclassisgoingtobelocked,thecodedetectsadifferentvalueintheheaderwordandrebiasestheobjecttowardsthecurrentthread.



偏向锁的获取



当一个线程访问同步块并获取锁时,会在对象头和栈帧中的锁记录里存储锁偏向的线程ID,以后该线程在进入和退出同步块时不需要花费CAS操作来加锁和解锁,而只需简单的测试一下对象头的MarkWord里是否存储着指向当前线程的偏向锁,如果测试成功,表示线程已经获得了锁,如果测试失败,则需要再测试下MarkWord中偏向锁的标识是否设置成1(表示当前是偏向锁),如果没有设置,则使用CAS竞争锁,如果设置了,则尝试使用CAS将对象头的偏向锁指向当前线程。



偏向锁的撤销



偏向锁使用了一种等到竞争出现才释放锁的机制,所以当其他线程尝试竞争偏向锁时,持有偏向锁的线程才会释放锁。偏向锁的撤销,需要等待全局安全点(在这个时间点上没有字节码正在执行),它会首先暂停拥有偏向锁的线程,然后检查持有偏向锁的线程是否活着,如果线程不处于活动状态,则将对象头设置成无锁状态,如果线程仍然活着,拥有偏向锁的栈会被执行,遍历偏向对象的锁记录,栈中的锁记录和对象头的MarkWord要么重新偏向于其他线程,要么恢复到无锁或者标记对象不适合作为偏向锁,最后唤醒暂停的线程。下图中的线程1演示了偏向锁初始化的流程,线程2演示了偏向锁撤销的流程。



偏向锁的设置



关闭偏向锁:偏向锁在Java6和Java7里是默认启用的,但是它在应用程序启动几秒钟之后才激活,如有必要可以使用JVM参数来关闭延迟-XX:BiasedLockingStartupDelay=0。如果你确定自己应用程序里所有的锁通常情况下处于竞争状态,可以通过JVM参数关闭偏向锁-XX:-UseBiasedLocking=false,那么默认会进入轻量级锁状态。



3.2自旋锁



线程的阻塞和唤醒需要CPU从用户态转为核心态,频繁的阻塞和唤醒对CPU来说是一件负担很重的工作。同时我们可以发现,很多对象锁的锁定状态只会持续很短的一段时间,例如整数的自加操作,在很短的时间内阻塞并唤醒线程显然不值得,为此引入了自旋锁。



所谓“自旋”,就是让线程去执行一个无意义的循环,循环结束后再去重新竞争锁,如果竞争不到继续循环,循环过程中线程会一直处于running状态,但是基于JVM的线程调度,会出让时间片,所以其他线程依旧有申请锁和释放锁的机会。



自旋锁省去了阻塞锁的时间空间(队列的维护等)开销,但是长时间自旋就变成了“忙式等待”,忙式等待显然还不如阻塞锁。所以自旋的次数一般控制在一个范围内,例如10,100等,在超出这个范围后,自旋锁会升级为阻塞锁。



对自旋锁周期的选择上,HotSpot认为最佳时间应是一个线程上下文切换的时间,但目前并没有做到。经过调查,目前只是通过汇编暂停了几个CPU周期,除了自旋周期选择,HotSpot还进行许多其他的自旋优化策略,具体如下:



如果平均负载小于CPUs则一直自旋

如果有超过(CPUs/2)个线程正在自旋,则后来线程直接阻塞

如果正在自旋的线程发现Owner发生了变化则延迟自旋时间(自旋计数)或进入阻塞如果CPU处于节电模式则停止自旋

自旋时间的最坏情况是CPU的存储延迟(CPUA存储了一个数据,到CPUB得知这个数据直接的时间差)

3.3轻量级锁



轻量级锁加锁



线程在执行同步块之前,JVM会先在当前线程的栈桢中创建用于存储锁记录的空间,并将对象头中的MarkWord复制到锁记录中,官方称为DisplacedMarkWord。然后线程尝试使用CAS将对象头中的MarkWord替换为指向锁记录的指针。如果成功,当前线程获得锁,如果失败,则自旋获取锁,当自旋获取锁仍然失败时,表示存在其他线程竞争锁(两条或两条以上的线程竞争同一个锁),则轻量级锁会膨胀成重量级锁。



轻量级锁解锁



轻量级解锁时,会使用原子的CAS操作来将DisplacedMarkWord替换回到对象头,如果成功,则表示同步过程已完成。如果失败,表示有其他线程尝试过获取该锁,则要在释放锁的同时唤醒被挂起的线程。下图是两个线程同时争夺锁,导致锁膨胀的流程图。





3.4重量级锁



重量锁在JVM中又叫对象监视器(Monitor),它很像C中的Mutex,除了具备Mutex互斥的功能,它还负责实现了Semaphore的功能,也就是说它至少包含一个竞争锁的队列,和一个信号阻塞队列(wait队列),前者负责做互斥,后一个用于做线程同步。



4.锁的优缺点对比



锁 优点 缺点 适用场景

偏向锁 加锁和解锁不需要额外的消耗,和执行非同步方法比仅存在纳秒级的差距 如果线程间存在锁竞争,会带来额外的锁撤销的消耗 适用于只有一个线程访问同步块场景

轻量级锁 竞争的线程不会阻塞,提高了程序的响应速度 如果始终得不到锁竞争的线程使用自旋会消耗CPU 追求响应时间,锁占用时间很短

重量级锁 线程竞争不使用自旋,不会消耗CPU 线程阻塞,响应时间缓慢 追求吞吐量,锁占用时间较长



BiasedLockinginHotSpot

ByDaveonAug18,2006



ThebiasedlockingschemeinHotSpotarosefromthefollowingpaperauthoredbymyself,MarkMoir,andBillScherer.Briefly,thecompare-and-swap(CAS)operationsnormallyusedtoacquireaJavamonitorincursconsiderablelocallatencyintheexecutingCPU.Sincemostobjectsarelockedbyatmostonethreadduringtheirlifetime,weallowthatthreadtobiasanobjecttowarditself.Oncebiased,thatthreadcansubsequentlylockandunlocktheobjectwithoutresortingtoexpensiveatomicinstructions.Obviously,anobjectcanbebiasedtowardatmostonethreadatanygiventime.(Werefertothatthreadasthebiasholdingthread).Ifanotherthreadtriestoacquireabiasedobject,however,weneedtorevokethebiasfromtheoriginalthread.(Atthisjuncturewecaneitherrebiastheobjectorsimplyreverttonormallockingfortheremainderoftheobject''slifetime).Thekeychallengeinrevocationistocoordinatetherevokerandtherevokee(thebiasholdingthread)--wemustensurethattherevokeedoesn''tlockorunlocktheobjectduringrevocation.Asnotedinthepaper,revocationcanbeimplementedinvariousways-signals,suspension,andsafepointstonameafew.Critically,forbiasedlockingtoproveprofitable,thebenefitconferredfromavoidingtheatomicinstructionsmustexceedtherevocationcosts.Anotherequivalentwaytoconceptualizebiasedlockingisthattheoriginalownersimplydefersunlockingtheobjectuntilitencounterscontention.Biasedlockingbearssomesemblancetotheconceptoflockreservation,whichiswell-describedinKawachiya''sdissertation.It''salsosimilarinspiritto"opportunisticlocks"(oplocks)foundinvariousfilesystems;theschemedescribeinMikeBurrows''s"Howtoimplementunnecessarymutexes"(InComputerSystems:Theory,TechnologyandApplications,Dec.2003);orChristianSiebert''sone-sidedmutualexclusionprimitives.



BiasedlockingisstrictlyaresponsetothelatencyofCAS.It''simportanttonotethatCASincurslocallatency,butdoesnotimpactscalabilityonmodernprocessors.AcommonmythisthateachCASoperation"goesonthebus",and,giventhattheinterconnectisafixedacontendedresource,useofCAScanimpairscalability.That''sfalse.CAScanbeaccomplishedlocally--thatis,withnobustransactions--ifthelineisalreadyinM-state.CASisusuallyimplementedontopoftheexistingMESIsnoop-basedcachecoherenceprotocol,butintermsofthebus,CASisnodifferentthanastore.Forexampleletssayyouhaveatrue16-waysystem.YoulaunchathreadthatCASes1billiontimestoathread-privatelocation,measuringtheelapsedtime.Ifyouthenlaunch16threads,allCASingtothread-privatelocations,theelapsedtimewillbethesame.Thethreadsdon''tinterferewithorimpedeeachotherinanyway.Ifyoulaunch16threadsallCASingtothesamelocationyou''lltypicallyseeamassiveslow-downbecauseofinterconnecttraffic.(ThesoleexceptiontothatclaimisSun''sNiagara,whichcangracefullytoleratesharingonamassivescaleastheL2$servesastheinterconnect).IfyouthenchangethatCAStoanormalstoreyou''llalsoseeasimilarslow-down;asnotedbefore,intermsofcoherencybustraffic,CASisn''tappreciablydifferentthananormalstore.SomeofthemisinformationregardingCASprobablyarisesfromtheoriginalimplementationoflock:cmpxchg(CAS)onIntelprocessors.Thelock:prefixcausedtheLOCK#signaltobeasserted,acquiringexclusiveaccesstothebus.Thisdidn''tscaleofcourse.Subsequentimplementationsoflock:cmpxchgleveragecachecoherencyprotocol--typicallysnoop-basedMESI--anddon''tassertLOCK#.Notethatlock:cmpxchgwillstilldriveLOCK#inoneextremelyexoticcase--whenthememoryaddressismisalignedandspans2cachelines.Finally,wecansafelyusecmpxchgonuniprocessorsbutmustuselock:cmpxchgonmultiprocessorsystems.Lock:cmpxchgincursmorelatency,butthenagainit''safundamentallydifferentinstructionthatcmpxchg.Lock:cmpxchgisserializing,providingbidirectionalmfence-equivalentsemantics.(Fenceorbarrierinstructionsareneverneededforuniprocessors)ThisfactmightalsohavecontributedtothemyththatCASismoreexpensiveonMPsystems.Butofcourselock:cmpxchgincursnomorelatencyona2xsystemthanonan8xsystem.



Whiledigressingonthetopicofbusoperations,letssaythataloadisfollowedcloselyinprogramorderbyastoreorCAStothesamecacheline.Ifthecachelineisnotpresentintheissuingprocessortheloadwillgeneratearequest-to-sharetransactiontogetthelineinS-stateandthestoreorCASwillresultinasubsequentrequest-to-owntransactiontoforcethelineintoM-state.This2ndtransactioncanbeavoidedonsomeplatformsbyusingaprefetch-for-writeinstructionbeforetheload,whichwillforcethelinedirectlyintoM-state.It''salsoworthmentioningthatontypicalclassicSMPsystems,pureread-sharingisveryefficient.Alltherequestingprocessorscanhavethecacheline(s)replicatedintheircaches.Butifevenoneprocessoriswritingtoasharedcacheline,thosewriteswillgenerateconsiderablecachecoherencetraffic;assumingawrite-invalidatecachecoherencepolicy(asopposedtowrite-update)thereaderswillcontinuallyre-loadthecachelinejusttohaveitsubsequentlyinvalidatedbythewriter(s).Putdifferently,loadstoacachelinearecheapifotherprocessorsareloadingfrombutnotstoringtothatsameline.Storesarecheaponlyifnootherprocessorsareconcurrentlystoringtoorloadingfromthatsameline.(Wecandrawanimpreciseanalogybetweencachecoherencyprotocolsandread-writelocksinthatforagivencachelinetherecanonlybeonewriteratanygiventime.That''stheprocessorwiththelineinM-state.Multiplereadersofthelineallowedandofcoursethelifetimeofareadercan''toverlapawrite.Unliketraditionalread-writelocks,however,thecachecoherencyprotocolallowswriterstoinvalidatereaders,sowecan''tpushtheanalogytoofar.Inatwistedsense,thecoherencyprotocolisobstruction-free).Coherencybandwidthisafixedandcontendedglobalresource,soinadditiontolocallatency,excessivesharingtrafficwillimpactoverallscalabilityandimpedetheprogressofthreadsrunningonotherprocessors.Aso-calledcoherencymiss--forexamplealoadonprocessorP1whereprocessorP2hasthecachelineinM-state--istypicallymuchslowerthananormalmiss(exceptonNiagara).Recalltoo,www.baiyuewang.netthatacquiringalockinvolvesastore(CAS,really)tothelockmetadata,soifyouhavethreadsonprocessorsP1andP2iterating,acquiringthesame,thelockacquisitionitselfwillgeneratecoherencytrafficandresultinthecache"sloshing"oftheline(s)holdingthemetadata.Generally,excessivecoherencytrafficistobeavoidedonclassicSMPsystems.Butasusual,there''sanexceptiontoanyrule,andinthiscasethatexceptionisSun''sNiagara,whichcantoleratesharinggracefully.WeshouldnowreturnbacktothetopicofCASlatency,asthatissuemotivatestheneedforbiasedlocking.

CASandthefence(AKAbarrierorMEMBAR)instructionsaresaidtobeserializing.Onmultiprocessorsystemsserializinginstructionscontroltheorderinwhichwhichstoresareloadsarevisiblewithrespecttomemory.Suchinstructionsareoftenimplementedinacrudefashiononmostcurrentprocessors.Typically,aserializinginstructionwillbringtheCPUtoanearhalt,killandinhibitanyout-of-order(OoO)instructions,andwaitforthelocalstorebuffertodrain.That''sOKforsimpleprocessors,butitresultsinconsiderablelossofperformanceonmoresophisticatedOoOprocessors.There''snofundamentalreasonwhyCASorfenceshouldbeslow.Untilrecently,thetrendwastowardworseCASandfenceperformance,butitappearsthatthistrendmaybereversing.IBMmadeconsiderableimprovementsinfencelatencybetweenthePower4andPower5.Likewise,Intelmaderemarkableimprovementsforbothlock:cmpxchgandmfencelatencyintherecentCore2processors.



Inthecaseofbiasedlocking--sinceweknowthattheobjectisunshared--it''sreasonabletoassumethattheactionsofthebiasholdingthreadwouldkeepcachelinesunderlyingtheobject''slockwordanddatainM-orE-state



Interestingly,ifaamemorybarrier,or"fence"instructionissufficientlyfasterthanCASwecanimplementbiasedlockingrevocationwithaDekker-likemechanism.Ifyou''reinterestedindelvingdeeperyoumightfindthefollowingofinterest:AsymmetricDekkerSynchronization.(ThatsamedocumentalsodescribesanoptimizationusedinHotSpotwherewewereabletoeliminateaMEMBARfromthestatetransitioncodepathonmultiprocessorsystems).



KennethRussellandDavidDetlefs(formerlyofSunLabs)haveapaperappearinginOOPSLA''06--EliminatingSynchronization-RelatedAtomicOperationswithBiasedLockingandBulkRebiasing--wheretheydescribeaschemetoreducethecostofrevocation.



It''simportanttonotethatbiasedlockingispermittedunderthetheJavaMemoryModelasclarifiedbyJSR133.RefertotheJavaLanguageSpecification,3rdedition,Chapter17,Section4.DougLea''sJSR-133CookbookisanexcellentresourceforJVMdevelopersorcuriousJavaprogrammers.

Finally,there''snotcurrentlyspaceinthemarkwordtosupportbothanidentityhashCode()valueaswellasthethreadIDneededforthebiasedlockingencoding.Giventhat,youcanavoidbiasedlockingonaper-objectbasisbycallingSystem.identityHashCode(o).Iftheobjectisalreadybiased,assigninganidentityhashCodewillresultinrevocation,otherwise,theassignmentofahashCode()willmaketheobjectineligibleforsubsequentbiasedlocking.Thispropertyisanartifactofourcurrentimplementation,ofcourse.



Asanaside,thelatencyofatomicread-modify-writeoperationssuchasCASwasinpartwhatmotivatedmyselfandAlexGarthwaitetodevelopourMostlyLock-FreeMalloc(ISMM2002)mechanism,whichusesprocessor-localallocationwithoutrequiringatomicsorthebindingofthreadstoprocessors.



SeealsoUS07814488(2002).



Category:GeneralTags:nonePermanentlinktothisentry

?Foraglimpseatthe...|Main|Letssayyou''re...?

Comments:



Thanksforwritingthisup,hopethere''smoretocome.

PostedbyMikaelGueckonAugust19,2006at12:39AMEDT#



btw,ifyouknowRussianthereareseveralotherpostsrelatedtoJIToptimization(includingBiasedLocking)here:http://blogs.sun.com/vmrobot/category/Compilers

PostedbykatyaonSeptember03,2006at01:14AMEDT#



Hi,



Ireadthisarticlewithinterest.BiasedlockingwasintroducedinJDK1.6andimproveduncontendedsynchronizationbyremovingatomiclocks.HoweverIhavereadelsewherethatAtomiclocks(AtomicIntergeretc)areusedinthejava.util.concurrentlibsandimproveperformance.Canyouexplaintheconflictplease.



Thankyou



PostedbyandyblackonJune22,2009at05:25AMEDT#



HelloAndy,



Insteadof''atomiclocks''letssay''atomicinstructions'',which,onx86typicallyhaveaLOCK:prefix(exceptXCHG).



AtomicIntegercouldlegitimatelybeimplementedinanumberofways,suchasvialocks("synchronized"oraj.u.c.ReentrantLock)orviaatomicread-modify-writeinstructionssuchasCAS/LOCK:CMPXCHGetc.WehappentousethelatterinHotSpot.Thatgivesusanon-blockingimplementationbutatthecostofanatomicinstruction.(Atomics,BTW,arebecomingcheaperasprocessordesignersputmoreeffortintheimplementation.Intelhasclearlypaidcloseattentiontoatomiclatency--it''smuchbetteronNehalem,forinstance).Thenon-blockingpropertyisgenerallydesirable.



Hypothetically,ifweimplementedAtomicIntegerwithsynchronizedthenitcouldalsobenefitfrombiasedlockinginthecasewhereonlyonethreadeveraccessedagiveninstance.Butbroadly,weexpectthej.u.cprimitivestobeusedinthecasewherethere''srealsharing.Thatis,they''reoptimizedforthecaseofrealconcurrentaccess,whichistherighttrade-off.



Regards,-Dave



PostedbyguestonJune22,2009at05:54AMEDT#



IsAtomicInteger.getAndIncrement()guaranteedtocompleteonallhwplatforms.ThewayIseeitforhwtoprovidethisguaranteelock:addAndIncrementonthatplatformmustnotallowthreadcontextswitchuntilitcompletes.Soona16threadsystemletsassumeall16threadswentforthisatthesametimetheyareallguaranteedtocompletebeforeotherthreadsrunonthosehwthreads.



PostedbybankkusonDecember11,2009at08:29PMEST#



Theprogresspropertiesaren''tspecifiedastheydependonthecapabilitiesoftheJVMandandthetheunderlyingplatform.Onanx86withLOCK:XADDitcouldbewait-free.OnotherplatformswithCASit''llbelock-free(LD...CASinaloop).OnsomeplatformsandJVMstheimplementationmaybelock-based.



Regardingcontextswitching,ifyouuseLOCK:XADDorsomeothersimilaratomictheneithertheatomicinstructioncompletesoritdoesn''t.IftheimplementationusesaLD...CASorLL..SCidiomthenwecanbepreemptedbetweentheLL/LDorCAS/SC.That''sharmless,however,asatworstwe''lljustretry.



PostedbydavediceonDecember13,2009at12:51AMEST#



HiDave

Thanksfortheclarification.CurioushowweimplementFairlocksinJUC?Clearlyticketlockingcannotbeimplementedonhwthatdoesnthaveawait-freefetchAndIncrement.



PostedbybankkusonDecember13,2009at08:22AMEST#



HiBank,goodquestion.



Thefairlocksare"fair"inthesensethatsuccessionismanagedbypassingownershipdirectlytoanotherthread,insteadofsimplyreleasingthelockandandusingcompetitivesuccession,whichcanallowso-called"barging"byotherthreadstryingtogainthelock.Wetypicallyuseanexplicitlistofthreadsblockedonalockaswewanttounparkthespecificsuccessor.(Wedothisbecausewecan''tallowunboundedspinning).



IIRC,fairlocksuseCASinthe"doorway"protocol(atleastonourreferenceplatforms),soit''spossiblethatathreadthatarrived1stinthedoorwaymightultimatelyenqueuebehindsomeotherthreadthatarrivedmorerecentlyatthelockdoorway.Soit''snotstrictlyfairinthatsense.



Regards

Dave



PostedbyDavidDiceonDecember14,2009at02:34AMEST#



InterestingIhadasimilarimplementationforaFairLock(orIthoughtwasfair)whereIwouldntreleasethelockvarinsteaddolockhandover.ThiswaytryLockstaysfairtooiftherearequeuedcontenders.



WhatIrealizedisthatwithhandovertolistofcontenderstheresstillaproblem.Addingathreadtothelistitselfisnotwait-free.PerhapstheJDKimplementationdoesntsufferfromthis.



BtwthisbloghasbeenveryeducationalformewishIhadfoudnitearlier:-)



PostedbybankkusonDecember14,2009at06:43AMEST#



HiBanks,



MCSandCLHaregoodalgorithms(atleastforstartingpoints)ifyouwanttoconstructalockwithawait-freedoorway.CLHarguablyhasbetterexecutionpropertiesbutalsorequiresasparedummynodeperlockinstance.



Regards,-Dave.

p.s.,someofthelockqueuesfor"synchronized"etc.inhotspotnativecodearelock-free.Iimplementedthemassuchtoavoid"doublecontention"whichcanoccurifwehaveahotlockandwheresome"inner"lockthatprotectsthequeuesisalsohot.



PostedbydavediceonDecember14,2009at07:00AMEST#



Dave-IreadwithinterestyourQRLpaperandtheotherassociatedmaterialsinyourpost.Forthosewhoareinterested,youcangetattheOOPSLApaperdirectlyfromSun,incaseyoudon''thaveanACMaccount:



http://java.sun.com/javase/technologies/hotspot/publications/biasedlocking_oopsla2006_wp.pdf



Iamspecificallyinterestedintheschemeyoumentionin3.2oftheQRLpaperregardingusingahalf-wordstorefollowedbyafull-wordreadtoensurethereadcannotbere-orderedbeforethewrite.



Inparticular,I''vebeentryingtodetermineifthiswillworkunderthex86memorymodel.IntelprovidesaprettydetaileddescriptionoftheIA-32behaviorin:



http://www.intel.com/Assets/PDF/manual/253668.pdf



...section8.2-buttheredoesn''tseemtobeenoughinformationtodeterminewhetherare-orderingthatwouldbreaktheQRLsemanticsispossible.



Forexample,under"8.2.3.5Intra-ProcessorForwardingIsAllowed",wehaveascenariowhereaCPUexecutingawritemayseethisright"before"writesofotherCPUs-andthatthoseotherCPUswillseetheirwritesfirst.ThisseemsdangerousfortheQRLtrick,sinceitmaybepossibleforthebiasedthreadtoseeitsownhalf-wordwritebeforeasuccessfulrevokingCASbyanotherCPUisseen,buttherevokingCPUmayseetheoppositecase-CASsucceededandnohalf-wordwriteyet.



ThisleadstotheoddcasewhereboththewriteandCAShavesucceeded,whichisclearlyimpossibleundersequentialconsistency,oranymodelwherememoryprocessesthroughaseriesofwell-definedstates.ItisnotcleartomewhetheritisallowedundertheIntelmodelhowever,because(a)theydon''tdiscussedextensivelytherelationshipbetweenregular(noLOCK/fence)reads/storesandinterlockedoperationsondifferentCPUs(theydopointoutthattheycan''tberelativelyreorderedonasingleCPU),or(b)whethertworead/storesareconsideredtobeat"sameaddress"iftheoperandswww.wang027.comhavedifferentlengths,butoneiscontainedwiththeotherand(c)don''tspecificallydiscussCASwhichisacompound,conditionalread-then-writeoperation-andwhetheritispossibleforanapparentre-orderingofastoreononeCPU(duetointra-processorforwarding)thatwouldhavecausedaCAStofail,toallowthetoCAStosucceed,butalsoletthelocalCPUseethestore.



Wellthatwasamouthful!Anyinsightyouhavewouldbeappreciated.ItismyunderstandingthatthecurrentmechanismintheJVMisnottousestoresatall,favoringinsteadaninvasiverevocationwhichretroactivelymodifiesthelockingrecords,soIguessthistopicdoesn''tdirectlythere(butI''malsocuriouswhythisapproachwasfavoredoveronewhichusestheregular-storetrickyoudescribe,whichispresumablyquitecheap).



PostedbyTravisDownsonJanuary04,2010at08:09AMEST#



HiTravis,



Thecollocationtrick(sometimesreferredto"wordtearing",althoughthattermisslightlyoverloaded),whileinteresting,isn''tguaranteedsafe.Asapracticalconcern,ifyouexecuteasubwordSTfollowedbyafullwordLDthatcoversthatstored-toaddress,mostprocessorswillstall,allowingthestoretodrainbeforetheloadissatisfied.(We''reassumingafullwordstoretotheaddressisn''talreadyinthestorebuffer,inwhichcaseanimplementation\might\mergethefullwordstorewiththesubsequentsubwordstoreandthensatisfytheloadbylookaside).InotherwordsonmostimplementationstheST;LDXtrickhasMEMBAR#storeloadsemantics,butthispropertyisn''tformallydefinedbythearchitecture,sowe''dneverconsiderusingitinaproduct.



AndinfactsincethiswaswrittenAMDandIntelhaveclarifiedtheirmemorymodels,clearlyindicatingthatsuchunsavorytricksaren''tsafeinthelongterm.(Perhapsitworkstoday,butitmaynotontomorrow''sprocessor).



Furthermore,theperformanceofthetechnique,evenifitweresomehowdeemedsafe,isn''tprofoundlybetterthanatraditionalfence,astheprocessorneedstostall.



Seealsohttp://blogs.sun.com/dave/entry/instruction_selection_for_volatile_fencesforsomemoreuptodateinformationonfence/barrier/membarcosts.



Asyoupointout,inourcurrentimplementationinhotspotweuseasafepointrendezvoustorevokebias.That''sratherheavy-handedbutit''sentirelysafe.



Regards,Dave



PostedbyDavidDiceonJanuary04,2010at02:28PMEST#



Thanks,Dave,foryourquickandcomprehensiveresponse.



Iguessthewarningin8.1.2.2ofthecurrentIntelguideisalongtheselines:



"Softwareshouldaccesssemaphores(sharedmemoryusedforsignallingbetween

multipleprocessors)usingidenticaladdressesandoperandlengths.Forexample,if

oneprocessoraccessesasemaphoreusingawordaccess,otherprocessorsshould

notaccessthesemaphoreusingabyteaccess."



Notrigorousbyanymeans(whytheystartreferringto"semaphores"hereratherthansimplememorylocationaccessasintheremainderofthesectionisbeyondme)-butahintthatthisso-called"wordtearing"maybeano-no.



PostedbyTravisDownsonJanuary04,2010at05:30PMEST#



HiTravis,Iagree--there''sagoodbitofinterpretationrequiredwhenreadingthevariousmemorymodeldocuments.BTW,Ihighlyrecommendthefollowing:http://www.cl.cam.ac.uk/~so294/documents/draft-scTSO.pdf.



Regards,Dave



PostedbyDavidDiceonJanuary05,2010at12:33AMEST#



Iwasre-readingyourpapertoday,andlookingattheHotSpotsourcecode,andwascuriousthatwhenrevokingabiasduetoalockattemptonanobjectbyathreadotherthanthebias-owneranddiscoveringthebias-ownerhastheobjectlocked,thatinadditiontomakingthebias-owner''sthreadappeartohavelightweightlockedtheobject,youdon''talsoinflatetheobject''smonitor.



Isthistoavoidmixinglockingandrevokingconcerns,orperhapsinthehopethatthebias-ownermightunlocktheobjectbeforetherevokingthreadactuallyruns.



Itjustseemslikethatsinceyou''realreadyatasafepointyoucouldavoidsomeofthespin-nastinessassociatedwithinflating.



PostedbygrahamsandersononJanuary06,2010at03:07PMEST#



HiGraham,



You''recorrectthatifobjectOisbiasedtoT1,T1hasOlogicallylocked,andT2triestolockOresultinginrevocationofbiasT1''sbiasfromO,thenwecouldtransitionOdirectlytoinflatedstate.Broadly,though,weweretryingtokeepthestatetransitionsassimpleaspossible.Asyoupointout,forthesakeofcodestewardshipsometimesit''sbettertoavoidmixingconcerns.Also,theinflationpath,asyou''dexpect,isburriedintheslow-pathcontended"enter"mechanism.It''sprudenttoleaveitthatway.Also,thedominantcaseisthatO,whilebiased,islogicallyunlocked.



Regards,-Dave



PostedbyDavidDiceonJanuary07,2010at07:07AMEST#



SomeinfoonSystem.identityHashCodeandusingittoavoidthebiasedlockingcausesanotherproblemw/thedefaultsettings:itusesos:random()thatcontendsonasingletonstate.

Thelatterresultsintoalotofsharing.-XX:hashCode=-1disablesitandreliesonMarsaliga''srandomthathasathreadlocalstate,i.e.concurrentfriendly.



PostedbyguestonApril04,2013at07:16AMEDT#



SomeinfoonSystem.identityHashCodeandusingittoavoidthebiasedlockingcausesanotherproblemw/thedefaultsettings:itusesos:random()thatcontendsonasingletonstate.

Thelatterresultsintoalotofsharing.-XX:hashCode=-1disablesitandreliesonMarsaliga''srandomthathasathreadlocalstate,i.e.concurrentfriendly.



PostedbybestsssonApril04,2013at07:17AMEDT#



ThanksforthecommentregardingidentityHashCode.Iimplementedthealternative-XX:hashCode=NvariantsafewyearsagobuttheJ2SEteamhasn''thadtimetoswitchtothedefaults.Asnoted,thedefaultisracy--that''slegalandarguablybenign--butundesirable.Theglobalstateforos::random()isalsoacoherencehotspot.WecanseetheimpactofthiswithasimplebenchmarkwhereotherwiseindependentthreadsassignhashCodes(weassignlazily)atahighrate.Thecoherencetrafficontheos::random()variablesquicklychokesscalability.Usingalternativeschemesprovidesrelief.That''spreciselywhyIprovidedtheMarsagliashift-xorPRNG.



Regards,-Dave



PostedbyguestonJune03,2013at10:13AMEDT#



Hello,

thanksforthisnicearticle.

IhavequestiontoQuicklyReacquirableLocks.Doesiunderstandit

wellthatThreadperformingRevokewillspinlockonline47untilthebiasthreadreleasesthelock,possiblyafterlongtime?



PostedbyjhronApril26,2015at06:01PMEDT#



Regardingthespinatline47,therevokerneedstowaitforthelocktobereleased.Forthepurposesofexplicationwejustusedtest-and-test-and-setlockswithunboundedspinning.Youwouldn''tdothatinpractice,butit''sfineforthepurposesofexplainingthebiasingmechanism.Regards,-Dave



PostedbyguestonApril27,2015at08:52AMEDT#

献花(0)
+1
(本文系thedust79首藏)