A demonstration of cache invalidation in action. Key words: 'thrashing' and 'false sharing'.
Here's a program:
#include <thread>

long arr[9];

void foo(int i) {
    // Each thread hammers a single element of the array a billion times.
    for (int j = 0; j < 1000000000; j++)
        arr[i] += 1;
}

int main() {
    std::jthread one([]{ foo(0); });
    std::jthread two([]{ foo(XXX); }); // 'XXX' is the index under test; see below.
}
Note the 'XXX'. If it's '1', the program takes about 5 seconds to run on my system, on average. The same holds for any value from '2' through '7'. If it's '8', it takes around 1.6 seconds. Why? Cache invalidation.
Cache lines on my system are 64 bytes. A 'long' is 8 bytes, so 8 longs are 64 bytes. That means arr[0] through arr[7] all sit on the same cache line (assuming the array starts at a cache-line boundary), while arr[8] lands on the next one.
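If you'd rather not hard-code 64, C++17 exposes a hint for this in <new>: std::hardware_destructive_interference_size. It's an implementation-chosen compile-time constant rather than a runtime query, so treat the sketch below as a hint that may not match the actual CPU you run on (and older compilers may not provide the constant at all):

#include <iostream>
#include <new>

int main() {
    // The implementation's guess at the smallest offset that avoids false
    // sharing; typically 64 on x86-64. Compile-time hint, not a hardware probe.
    std::cout << std::hardware_destructive_interference_size << '\n';
}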
Since we're running in two threads (presumably on separate cores), each core wants that same cache line in its own private cache, which leads to 'thrashing': the line is constantly invalidated as each write forces it to bounce between caches. This is called 'false sharing' because the threads never actually share any single piece of memory; the only thing 'shared' is the cache line they both want to write.
There are tradeoffs in any design, and this is one notable disadvantage of larger cache lines: they make this situation more likely.
A performance-conscious programmer optimizing a multithreaded system may want to keep this in mind.
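One common mitigation is to give each thread's counter its own cache line by padding or aligning the data. Here is a minimal sketch, assuming 64-byte lines as above; the PaddedCounter struct is my illustration, not part of the original program:

#include <thread>

// alignas(64) makes each element both 64-byte aligned and 64 bytes in size,
// so adjacent counters can no longer land on the same cache line.
struct alignas(64) PaddedCounter {
    long value = 0;
};

PaddedCounter counters[2];

void foo(int i) {
    for (int j = 0; j < 1000000000; j++)
        counters[i].value += 1;
}

int main() {
    std::jthread one([]{ foo(0); });
    std::jthread two([]{ foo(1); });
}

With this layout, the '1' case should behave like the '8' case in the original program, at the cost of some wasted space per counter.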