Memory is stored within the cache system in units known as cache lines. A cache line is a power-of-2 number of contiguous bytes, typically 32 to 256 bytes in size. The most common cache line size is 64 bytes. False sharing is the term for threads unwittingly impacting each other's performance while modifying independent variables that share the same cache line. Write contention on cache lines is the single most limiting factor on achieving scalability for parallel threads of execution in an SMP system. I’ve heard false sharing described as the silent performance killer because it is far from obvious when looking at code.
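To make the "same cache line" condition concrete, here is a minimal sketch that maps byte addresses to cache line indices. It assumes a 64-byte line, and the addresses are made up for illustration; Java does not expose real object addresses, so this only demonstrates the arithmetic.

```java
final class CacheLineMath
{
    // Assumption: 64-byte lines, the most common size mentioned above.
    static final int CACHE_LINE_SIZE = 64;

    /** Index of the cache line containing the given non-negative byte address. */
    static long lineIndex(final long address)
    {
        return address / CACHE_LINE_SIZE;
    }

    /** True when both addresses fall inside the same cache line. */
    static boolean sameLine(final long a, final long b)
    {
        return lineIndex(a) == lineIndex(b);
    }

    public static void main(final String[] args)
    {
        final long x = 0x1008L; // hypothetical address of variable X
        final long y = 0x1010L; // hypothetical address of Y, 8 bytes after X
        final long z = 0x1048L; // hypothetical address 64 bytes after X

        System.out.println("X and Y share a line: " + sameLine(x, y));
        System.out.println("X and Z share a line: " + sameLine(x, z));
    }
}
```

With these made-up addresses, X and Y land in the same line while X and Z do not, which is why two adjacent longs are candidates for false sharing.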
To achieve linear scalability with the number of threads, we must ensure no two threads write to the same variable or cache line. Two threads writing to the same variable can be tracked down at a code level. To know whether independent variables share the same cache line we need to know the memory layout, or we can get a tool to tell us. Intel VTune is such a profiling tool. In this article I’ll explain how memory is laid out for Java objects and how we can pad out our cache lines to avoid false sharing.
Figure 1.
Figure 1 above illustrates the issue of false sharing. A thread running on core 1 wants to update variable X while a thread on core 2 wants to update variable Y. Unfortunately, these two hot variables reside in the same cache line. Each thread will race for ownership of the cache line so it can update it. If core 1 gets ownership then the cache sub-system will need to invalidate the corresponding cache line for core 2. When core 2 gets ownership and performs its update, core 1 will be told to invalidate its copy of the cache line. The line will ping pong back and forth via the L3 cache, greatly impacting performance. The issue is further exacerbated if the competing cores are on different sockets and additionally have to cross the socket interconnect.
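The ping pong described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: it tracks a single "owner" per cache line and counts how often ownership changes hands, and it is not a model of any real coherence protocol. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class OwnershipModel
{
    /**
     * Counts ownership changes for a sequence of writes.
     * writers[i] is the core performing write i and lines[i] is the
     * cache line it writes to (toy model, one owner per line).
     */
    static int ownershipTransfers(final int[] writers, final int[] lines, final int lineCount)
    {
        final int[] owner = new int[lineCount];
        Arrays.fill(owner, -1); // -1 means no core owns the line yet

        int transfers = 0;
        for (int i = 0; i < writers.length; i++)
        {
            final int line = lines[i];
            if (owner[line] != -1 && owner[line] != writers[i])
            {
                transfers++; // the previous owner's copy has to be invalidated
            }
            owner[line] = writers[i];
        }
        return transfers;
    }

    public static void main(final String[] args)
    {
        final int[] writers = {1, 2, 1, 2, 1, 2, 1, 2}; // cores 1 and 2 alternate
        final int[] shared = {0, 0, 0, 0, 0, 0, 0, 0};   // X and Y in the same line
        final int[] separate = {0, 1, 0, 1, 0, 1, 0, 1}; // X and Y in different lines

        System.out.println("shared line transfers:    " + ownershipTransfers(writers, shared, 2));
        System.out.println("separate lines transfers: " + ownershipTransfers(writers, separate, 2));
    }
}
```

In this model the shared layout transfers ownership on 7 of the 8 writes, while the separate layout never transfers it. Real hardware is more subtle, but the contrast is the point.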
Java Memory Layout
For the HotSpot JVM, all objects have a 2-word header. The first is the “mark” word, which is made up of 24 bits for the hash code and 8 bits for flags such as the lock state, or it can be swapped for lock objects. The second is a reference to the class of the object. Arrays have an additional word for the size of the array. Every object is aligned to an 8-byte granularity boundary for performance. Therefore, to pack efficiently, the object fields are re-ordered from declaration order into the following order based on size in bytes:
- doubles (8) and longs (8)
- ints (4) and floats (4)
- shorts (2) and chars (2)
- booleans (1) and bytes (1)
- references (4/8)
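The header and alignment rules above are enough for a rough size estimate. The sketch below assumes a 16-byte header (two 8-byte words) and only primitive fields packed largest first, so there are no internal gaps; real JVMs can differ (for example with compressed references), so treat the numbers as estimates rather than what HotSpot will report.

```java
final class LayoutSketch
{
    static final int ALIGNMENT = 8;     // objects are aligned to 8 bytes
    static final int HEADER_BYTES = 16; // assumption: two 8-byte header words

    /** Rounds size up to the next multiple of ALIGNMENT. */
    static int align(final int size)
    {
        return (size + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    /** Estimated instance size for an object with the given primitive field sizes. */
    static int instanceSize(final int... fieldSizes)
    {
        int size = HEADER_BYTES;
        for (final int fieldSize : fieldSizes)
        {
            size += fieldSize;
        }
        return align(size);
    }

    public static void main(final String[] args)
    {
        // One long field on its own.
        System.out.println("single long:       " + instanceSize(8));
        // One long followed by six longs of padding.
        System.out.println("long + 6 padding:  " + instanceSize(8, 8, 8, 8, 8, 8, 8));
        // A long, an int and a byte: 29 bytes rounded up to 32.
        System.out.println("long + int + byte: " + instanceSize(8, 4, 1));
    }
}
```

Under these assumptions an object holding a single long is about 24 bytes, so two or more of them can sit in one 64-byte line. With six extra longs it grows to about 72 bytes, which is larger than the line, so the first fields of two neighbouring objects can no longer share one.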
To show the performance impact let’s take a few threads each updating their own independent counters. These counters will be volatile longs so the world can see their progress.
public final class FalseSharing
    implements Runnable
{
    public final static int NUM_THREADS = 4; // change
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;

    private static VolatileLong[] longs = new VolatileLong[NUM_THREADS];
    static
    {
        for (int i = 0; i < longs.length; i++)
        {
            longs[i] = new VolatileLong();
        }
    }

    public FalseSharing(final int arrayIndex)
    {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception
    {
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException
    {
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new FalseSharing(i));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        long i = ITERATIONS + 1;
        while (0 != --i)
        {
            longs[arrayIndex].value = i;
        }
    }

    public final static class VolatileLong
    {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // comment out
    }
}
Results
Running the above code while ramping the number of threads and adding/removing the cache line padding, I get the results depicted in Figure 2 below. These are the durations of test runs on my 4-core Nehalem at home.
Figure 2.
The impact of false sharing can clearly be seen by the increased execution time required to complete the test. Without the cache line contention we achieve near linear scale up with threads.
This is not a perfect test because we cannot be sure where the VolatileLongs will be laid out in memory. They are independent objects. However experience shows that objects allocated at the same time tend to be co-located.
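The ramp described above can be automated rather than done by editing and re-running the class by hand. The following is a minimal sketch, not the harness used for Figure 2: the class name, the `run` helper and the reduced iteration count are my own assumptions, and it has the same caveat that object placement in memory is up to the JVM. It also makes no attempt at JIT warm-up, so short runs are only indicative.

```java
final class FalseSharingRamp
{
    /** Counter with no padding: neighbouring instances may share a cache line. */
    static class Counter
    {
        volatile long value = 0L;
    }

    /** Same counter with six longs of padding, mirroring the listing above. */
    static final class PaddedCounter extends Counter
    {
        long p1, p2, p3, p4, p5, p6;
    }

    /** Runs numThreads writers, each counting down on its own counter, and returns the counters. */
    static Counter[] run(final int numThreads, final long iterations, final boolean padded)
    {
        final Counter[] counters = new Counter[numThreads];
        final Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < numThreads; i++)
        {
            final Counter counter = padded ? new PaddedCounter() : new Counter();
            counters[i] = counter;
            threads[i] = new Thread(() ->
            {
                long n = iterations + 1;
                while (0 != --n)
                {
                    counter.value = n; // volatile write to this thread's own counter
                }
            });
        }

        for (Thread t : threads)
        {
            t.start();
        }

        try
        {
            for (Thread t : threads)
            {
                t.join();
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }

        return counters;
    }

    public static void main(final String[] args)
    {
        final long iterations = 10L * 1000L * 1000L; // assumption: shorter than the original run

        for (int threads = 1; threads <= 4; threads++)
        {
            for (final boolean padded : new boolean[] {false, true})
            {
                final long start = System.nanoTime();
                run(threads, iterations, padded);
                final long duration = System.nanoTime() - start;
                System.out.println(threads + " thread(s), padded=" + padded + ", duration = " + duration);
            }
        }
    }
}
```

Each line of output corresponds to one point on a chart like Figure 2. The absolute numbers will vary by machine and JVM, so compare the padded and unpadded durations at the same thread count rather than reading too much into any single value.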
So there you have it. False sharing can be a silent performance killer.
From http://mechanical-sympathy./2011/07/false-sharing.html
Martin has had a passion for pushing the boundaries of software and electronics since childhood. He was the sort of kid who literally took the video recorder apart and then put it back together again. Since then he has worked on PC-based stock market data feeds in the early 90s and later the first generation of Internet banks. He worked on the development of one of the largest content management systems and then the world's largest betting exchange at Betfair. Martin has always been attracted to business problems where high-performance computing can open new opportunities. Often in this work he has been able to achieve results that were previously thought to be impossible. He brings his mechanical sympathy for hardware to the software that he creates, which has taken him deep into the subjects of concurrency and parallel computing, where he has established a reputation and considerable expertise. More recently Martin was a founder, and now CTO, of LMAX, a retail financial trading company, where he works on the creation of the world’s first retail financial exchange using a radical new architecture that focuses on simplicity and elegance in design and takes it all back to basics. Martin is a DZone MVB and is not an employee of DZone.