I recently read Alexander Krizhanovsky's Linux Journal article on lock-free producer-consumer queues. The article is very informative, but what I want to focus on is a separate performance issue that he addressed in his code without mentioning it: false sharing. His queue has members to keep track of the head and tail, which would likely be laid out consecutively in memory and thus likely land in the same cache line. This creates a problem where updating one forces the other out of the cache, even though there is no need to, since they are operated on separately. Using gcc's aligned attribute we can force each of these variables onto its own cache line and avoid the issue.
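As a concrete illustration (this is just a sketch of the idea, not Krizhanovsky's actual code, and 64 bytes is an assumed cache-line size), the attribute looks something like this:

struct lf_queue {
    /* head and tail are updated by different threads, so give each
     * its own 64-byte cache line to avoid false sharing */
    unsigned long head __attribute__((aligned(64)));
    unsigned long tail __attribute__((aligned(64)));
    /* ... rest of the queue ... */
};

With the attribute on each member, head sits at offset 0 and tail gets pushed out to offset 64, so writes to one never invalidate the cache line holding the other.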
I decided to look into this further and ended up finding a C++ today article about it. That was of course written in C++, and used boost (ick), and the author's solution was to pad the arrays beyond their needed size to force them out of the same cache line, whereas the aligned attribute is much more elegant. So, I took a crack at putting together some sample code in pure C myself.
My code is on github for the curious. It's a very simple program: we declare two static int arrays of length 8 (4*8 = 32 bytes each), so together the two arrays occupy 64 bytes, which is one cache line on most modern 64-bit systems (and was on the one I tested on). Then we start two threads and have each one loop over its own array for a billion iterations, writing the current i into index i % V_SIZE. (Krizhanovsky's article used a trick of declaring the sizes as a power of 2 and then using a mask = size - 1 to avoid the division, but I looked at the assembly output and at -O2 the compiler optimizes the mod to an and anyway; more on this later.) A rough sketch of the test is below, and after that the performance results. This is entirely unscientific, but I believe it is still representative of reality.
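For reference, here's the kind of program described above (an approximation, not the exact code from the repository; the aligned attribute is what gets toggled between the two runs):

#include <pthread.h>
#include <stdio.h>

#define V_SIZE 8
#define N 1000000000

/* Aligned version: each array starts on its own 64-byte cache line.
 * Remove the attribute and the two 32-byte arrays can end up sharing
 * a single cache line, which is the false-sharing case. */
static int v1[V_SIZE] __attribute__((aligned(64)));
static int v2[V_SIZE] __attribute__((aligned(64)));

static void *compute(void *arg)
{
    int *v = arg;
    unsigned int i;
    for (i = 0; i < N; i++)
        v[i % V_SIZE] = i;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    printf("%p, %p\n", (void *)v1, (void *)v2);
    pthread_create(&t1, NULL, compute, v1);
    pthread_create(&t2, NULL, compute, v2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

If you want to check the cache-line size on your own machine, getconf LEVEL1_DCACHE_LINESIZE will report it on Linux.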
Without aligning on cache boundaries, we get the following times:

$ time ./a.out
0x7fff925065f0, 0x7fff92506610

real    0m6.632s
user    0m13.055s
sys     0m0.106s

After aligning on cache boundaries:

$ time ./a.out
0x7fffd78658c0, 0x7fffd7865900

real    0m2.078s
user    0m3.845s
sys     0m0.251s

That's a 3x speedup, not too shabby.
Now let's take a look at the % vs & issue. At first I used the mask trick, so my indexing code was:
#define V_SIZE 8

...

const unsigned int V_MASK = V_SIZE - 1;

for (i = 0; i < N; i++) {
    v[i & V_MASK] = i;
}
Taking a look at the assembly output (gcc -Wall -pthread -O2 -S false_sharing.c -o false_sharing.s), this compiles as you would expect; here's our compute function.
compute:
    ...
.L2:
    movq    %rax, %rdx
    andl    $7, %edx
    movl    %eax, (%rdi,%rdx,4)
    addl    $1, %eax
    cmpl    $1000000000, %eax
    jne     .L2
    xorl    %eax, %eax
    ret
It turns out we can just write the naive indexing code:
for (i = 0; i < N; i++) {
    v[i % V_SIZE] = i;
}
And when we look at the assembler output, it's exactly the same. I think this comes down to a question of whether we want to always trust the compiler to optimize this, versus writing slightly less readable code ourselves. Granted, once a programmer is used to bit manipulation and what integers look like in memory, ANDing with a mask probably isn't much less readable.