Welcome to jeffreypicard.com!

Latest Blog Post

False Sharing

False Sharing

2014-11-17

I recently read Alexander Krizhanovsky's Linux Journal article on Lock-Free producer consumer queues. The article is very infromative, but what I want to focus on is a separate performance issue that he addressed without mention in his code, false sharing. His queue has members to keep track of the head and tail, which likely would be laid consecuitively in memory and thus likely in the same cache line. This can create an issue where updating one forces the other out of the cache, even though there is no need to since they are being operated on separately. Using gcc's aligned attribute we can force these variables to be each on their own cache line and avoid this issue.

I decided to look more into this and ended up finding a C++ today article about it. This was of course written in C++, and used boost (ick), and his solution was to expand the arrays beyond the needed size to force them to not be in the same cache, line whereas the aligned attribute is much more elegant. So, I took a crack a putting together some sample code in pure C myself.

My code is on github for the curious. It's a very simple program, we declare two static int arrays of length 8 (4*8=32 bytes), so this should be one cacheline on most modern 64 bit systems (and was on the one I tested on). Then we start two threads and have each loop over the arrays for i = 1...1000000000 and insert the current i into i % V_SIZE (Krizhanovsky's article used a trick to declare the sizes as a power of 2 and then use a mask = size - 1 to avoid the division, but I looked at the assembly output and on -O2 it optimizes the mod to an and, more on this later.) Now let's take a look at the performance results, this is entirely unscientific but I believe is still representative of reality.

Without aligning on cache boundaries, we have the following times.
Note: 0x7fff92506610 - 0x7fff925065f0 = 32
$ time ./a.out
0x7fff925065f0, 0x7fff92506610

real	0m6.632s
user	0m13.055s
sys	0m0.106s
After aligning on cache boundaries.
Note: 0x7fffd7865900 - 0x7fffd78658c0 = 64
$ time ./a.out
0x7fffd78658c0, 0x7fffd7865900

real	0m2.078s
user	0m3.845s
sys	0m0.251s
That's a 3x speedup, not too shabby.

Now let's take a look at the % vs & issue. At first I used the mask trick, so my indexing code was:

#define V_SIZE 8
...
const unsigned int V_MASK = V_SIZE - 1;
for (i = 0; i < N; i++) {
	v[i & V_MASK] = i;
}

Taking a look at the assembly output (gcc -Wall -pthread -O2 -S false_sharing.c -o false_sharing.s), this compile as you would expect, here's our compute function.

  6 compute:
...
 12 .L2:
 13         movq    %rax, %rdx
 14         andl    $7, %edx
 15         movl    %eax, (%rdi,%rdx,4)
 16         addl    $1, %eax
 17         cmpl    $1000000000, %eax
 18         jne     .L2
 19         xorl    %eax, %eax
 20         ret

It turns out, we can do out naive code for indexing:

for (i = 0; i < N; i++) {
	v[i % V_SIZE] = i;
}

And when we look at the assembler output, it's exactly the same. I think this is a case where it comes down to, do we want to always trust the compiler to optimize this? v.s. do we want to write slightly less readable code? Granted, once a programmer is used to bit manipulation and what integers looks like in memory, anding with a mask probably isn't much less readable.

Other Blog Posts

2014-10-13 matasano crypto challenges 2

2014-09-09 matasano crypto challenges 1

2014-09-03 Solving an Ad Tech Machine Learning Problem using Clustering and Viterbi

2014-08-08 Pig Combined Input Format

About

I am a software engineer living in New Hampshire. I work mostly on Big Data and Open Source Blockchain software.

My main academic interests are in Algorithms and High Performance Computing. My other interests include GNU/Linux for use on servers, workstations and desktop computers, Computer Security, Combinatorial Optimization, and programming in general.

My preferred programming language is C, although I am comfortable with C++, Java and Python. I also have some experience with Scheme, PHP and MATLAB.

email: jeff <at> jeffreypicard <dot> com