Simple Benchmark

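A specified number of threads each run for a large number of iterations, incrementing a thread-specific counter on every iteration. When the counters are stored next to each other in memory they share a cache line, so false sharing occurs between them. Optionally, each counter is embedded inside a padding object so that it occupies its own cache line and false sharing is avoided.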
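A minimal C++17 sketch of this setup is given below. It is an illustration, not the benchmark's actual code: the names `PackedCounter`, `PaddedCounter`, `run_counters` and the 64-byte `kCacheLine` are assumptions made for the example. The counters are relaxed atomics so that, in practice, the compiler performs a real memory write on every iteration instead of keeping the count in a register.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache-line size. 64 bytes is common on x86-64 but not universal.
constexpr std::size_t kCacheLine = 64;

// Unpadded counter: neighbouring array elements can land on the same cache line.
struct PackedCounter {
    std::atomic<std::uint64_t> value{0};
};

// Padded counter: alignas rounds the object size up to one full cache line,
// so each array element gets a line of its own.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "padded counter should fill exactly one assumed cache line");

// Runs num_threads threads, each incrementing its own counter `iterations`
// times, and returns the sum of all counters as a sanity check.
template <typename Counter>
std::uint64_t run_counters(unsigned num_threads, std::uint64_t iterations) {
    assert(num_threads > 0);
    std::vector<Counter> counters(num_threads);

    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counters, t, iterations] {
            for (std::uint64_t i = 0; i < iterations; ++i) {
                // Each thread touches only counters[t]; any slowdown in the
                // packed variant comes from cache-line contention alone.
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }

    std::uint64_t total = 0;
    for (auto& c : counters) {
        total += c.value.load(std::memory_order_relaxed);
    }
    return total;
}
```

To compare the two cases, time `run_counters<PackedCounter>(n, iters)` against `run_counters<PaddedCounter>(n, iters)` with `std::chrono::steady_clock`. Both calls return `n * iters`; only the memory layout of the counters differs.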

The false sharing results are similar with 3 and 4 threads because (for reasons unknown) the CPU seemed to be using mostly 3 cores in both cases. I can't explain why the running time is so high with two threads in the case without false sharing.

(Slightly) More Complicated Benchmark

A specified number of threads split the work of counting the number of occurrences of some value in a large array. In the false sharing scenario, the threads increment counters that sit on the same cache line. Otherwise, each thread uses a counter allocated on its own stack, which avoids false sharing.
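The two scenarios could be sketched in C++17 roughly as follows. This is a hedged illustration and not the benchmark's actual code: `count_occurrences`, its `use_local` flag, and the one-contiguous-chunk-per-thread split are assumptions made for the example. The shared slots are relaxed atomics so that, in practice, every increment in the false sharing variant is a real write to the shared line.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Counts how many elements of `data` equal `target`, using one contiguous
// chunk of the array per thread.
//   use_local == false: every match increments results[t] directly; the slots
//                       are adjacent, so they likely share a cache line.
//   use_local == true:  every match increments a stack variable, and
//                       results[t] is written once at the end.
std::size_t count_occurrences(const std::vector<int>& data, int target,
                              unsigned num_threads, bool use_local) {
    assert(num_threads > 0);
    std::vector<std::atomic<std::size_t>> results(num_threads);
    for (auto& r : results) {
        r.store(0, std::memory_order_relaxed);
    }

    const std::size_t n = data.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, static_cast<std::size_t>(t) * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        threads.emplace_back([&data, &results, target, begin, end, t, use_local] {
            if (use_local) {
                std::size_t local = 0;  // lives on this thread's stack
                for (std::size_t i = begin; i < end; ++i) {
                    if (data[i] == target) ++local;
                }
                results[t].store(local, std::memory_order_relaxed);
            } else {
                for (std::size_t i = begin; i < end; ++i) {
                    if (data[i] == target) {
                        results[t].fetch_add(1, std::memory_order_relaxed);
                    }
                }
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }

    std::size_t total = 0;
    for (auto& r : results) {
        total += r.load(std::memory_order_relaxed);
    }
    return total;
}
```

Both variants return the same count. The difference to measure is how often each thread writes to the shared `results` array: once per match in the false sharing variant, once per thread otherwise.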