Cache is king 👑


We often talk about CPU speed, but rarely about where the data lives.

Performance is dominated by the proximity of the data. Registers, L1, L2, L3, main memory - each step adds latency and drops throughput. A main memory access can take 200 cycles, 50x slower than L1 cache.

When your working set fits in cache, your code flies. When it doesn’t, the CPU just waits.
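
Here is a minimal pointer-chasing sketch in C++ to make that concrete (an illustration, not a rigorous benchmark; the buffer sizes are assumptions for a typical desktop-class CPU). The same dependent-load loop runs once over a working set small enough for L1 and once over one large enough to spill to main memory; on most machines the per-access time jumps by an order of magnitude or more.

```cpp
// Minimal sketch: chase a random permutation through a buffer and time one
// dependent load per step. 32 KiB is assumed to fit in L1; 256 MiB is assumed
// to spill to DRAM. Sizes and iteration counts are illustrative.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Average nanoseconds per access when chasing through n_elems 4-byte slots.
static double chase_ns(std::size_t n_elems, std::size_t steps) {
    std::vector<std::uint32_t> next(n_elems);
    std::iota(next.begin(), next.end(), 0u);

    // Sattolo's algorithm: a single-cycle permutation, so every load depends
    // on the previous one and the hardware prefetcher cannot help.
    std::mt19937 rng{42};
    for (std::size_t i = n_elems - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    std::uint32_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < steps; ++s) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();

    volatile std::uint32_t sink = idx;  // keep the chain from being optimized out
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    const std::size_t steps = 20'000'000;
    std::printf("  32 KiB working set: %5.1f ns/access\n", chase_ns(32 * 1024 / 4, steps));
    std::printf(" 256 MiB working set: %5.1f ns/access\n", chase_ns(256ull * 1024 * 1024 / 4, steps));
}
```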

In packet processing, this difference decides everything. Each packet triggers table lookups. If those tables stay in cache, you can push millions of packets per second. If not, throughput collapses.
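
To sketch what "keeping the tables in cache" can mean in practice, here is a hypothetical flow-table layout in C++. The names, the 16 K-entry sizing, and the open-addressing scheme are assumptions made up for this example, not any particular stack's design; the point is simply that a flat array of small entries stays resident in L2/L3 instead of being chased through the heap.

```cpp
// Sketch of a cache-conscious per-packet lookup table (illustrative only).
#include <array>
#include <cstdint>
#include <cstdio>
#include <memory>
#include <optional>

struct Flow {
    std::uint64_t key;      // e.g. a hashed 5-tuple; 0 is reserved as "empty"
    std::uint32_t out_port; // forwarding decision
    std::uint32_t counter;  // per-flow packet count
};                          // 16 bytes -> four flows per 64-byte cache line

class FlowTable {
public:
    // Open addressing with a short linear probe keeps lookups on adjacent
    // cache lines instead of chasing pointers through the heap.
    std::optional<std::uint32_t> lookup(std::uint64_t key) const {
        std::size_t i = key & kMask;
        for (std::size_t probe = 0; probe < kMaxProbe; ++probe) {
            const Flow& f = slots_[(i + probe) & kMask];
            if (f.key == key) return f.out_port;
            if (f.key == 0)   return std::nullopt;  // hit an empty slot: miss
        }
        return std::nullopt;
    }

    bool insert(std::uint64_t key, std::uint32_t out_port) {
        std::size_t i = key & kMask;
        for (std::size_t probe = 0; probe < kMaxProbe; ++probe) {
            Flow& f = slots_[(i + probe) & kMask];
            if (f.key == 0 || f.key == key) {
                f = Flow{key, out_port, 0};
                return true;
            }
        }
        return false;  // probe chain full; a real table would resize or evict
    }

private:
    static constexpr std::size_t kSlots    = 16 * 1024;  // power of two
    static constexpr std::size_t kMask     = kSlots - 1;
    static constexpr std::size_t kMaxProbe = 8;
    std::array<Flow, kSlots> slots_{};  // 16 K * 16 B = 256 KiB, contiguous
};

int main() {
    auto table = std::make_unique<FlowTable>();  // keep the 256 KiB table off the stack
    table->insert(0x1234'5678'9abcULL, 3);
    if (auto port = table->lookup(0x1234'5678'9abcULL))
        std::printf("flow -> port %u\n", *port);
}
```

Four 16-byte entries share one 64-byte cache line, so even a probe sequence of a few slots usually touches a single line, and the whole table fits comfortably in a modern L2 or L3.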

So the next time you design a data structure, ask:

Will it fit in cache?

Because in performance-critical systems, cache isn’t just an optimization - it defines the system.


And not only data but also instructions! I've seen HFT engineers talk about strategies where they keep the hot path firing all the time, and only switch the network card on when a packet actually needs to leave the system. This keeps their instruction cache hot as well.

Keeping the instruction cache hot is just as critical as keeping the data hot, especially in workloads where predictability matters. Shaping your hot path so the CPU never falls out of the I-cache is important, because even a small stall can dominate tail latency. It is an excellent reminder that architecture design is really about keeping both instructions and data as close to the core as possible.
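
Purely as an illustration, that pattern might look something like the sketch below; every function in it is a hypothetical stub, since the real plumbing is proprietary. What matters is the control flow: the decision-and-encode path executes on every iteration, keeping its instructions and branch-predictor state warm, and only the final transmit is gated on a real trigger.

```cpp
#include <optional>

struct Tick  { double price = 0.0; };
struct Order { char payload[64] = {}; };

// Hypothetical stubs standing in for the real market-data and NIC plumbing.
std::optional<Tick> poll_market_data() { return std::nullopt; }
bool should_trade(const Tick&)         { return false; }
Order build_order(const Tick&)         { return Order{}; }
void nic_send(const Order&)            {}

void hot_loop() {
    Tick last{};
    // Bounded here so the sketch terminates; a production loop spins forever.
    for (int i = 0; i < 1'000'000; ++i) {
        bool fire = false;
        if (auto tick = poll_market_data()) {  // non-blocking poll
            last = *tick;
            fire = should_trade(last);
        }
        // The encode path runs every iteration, even on stale data,
        // so its code and branch-predictor state stay hot.
        Order order = build_order(last);
        if (fire) nic_send(order);             // only the send is conditional
    }
}

int main() { hot_loop(); }
```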

So many technical decision makers are set on a blanket strategy - e.g. cloud for everything - that they assume any virtualized workload can run in any virtualization environment, and that the underlying hardware and hypervisor are just commodities. That doesn't hold for virtualized network functions. The vendors knew that exclusive core pinning gives the execution threads exclusive use of the CPU cache. They knew that interrupt coalescing in virtualized environments decreases "CPU usage" at the expense of latency. They knew about NUMA locality, and they even put it all in their docs.

Of course, when the sales guys come around, they want to be aligned with the high-level strategy, quote the best optimized benchmarks, and have a separate discussion about cloud or hypervisor support without any of that nuance. Yeah, that will work* (*fine print: you'll need 3x the licenses/hardware and still won't get optimal performance). There is such a lack of interest in low-level performance, and such a skills gap, that it seems to be addressed by adding layers of abstraction and vendors to obscure accountability. If Everest were the test of tech leadership or vendor accountability, it would be nice to know who would die on the hill and who would sell parkas at the bottom.

Totally. Once you rely on cache behavior, core pinning, and NUMA locality, the platform stops being interchangeable. The low-level details matter far more than most high-level strategies.
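
For reference, the core-pinning part of that is only a few lines on Linux. Here is a minimal sketch using glibc's pthread_setaffinity_np (the core id is an arbitrary choice for the example; real deployments also isolate the core from the scheduler and keep it on the NUMA node that owns the NIC and the packet buffers):

```cpp
// build (Linux/glibc): g++ -O2 -pthread pin_example.cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

// Pin the calling thread to a single core so it keeps exclusive use of that
// core's private L1/L2 (assuming nothing else is scheduled there).
void pin_current_thread_to_core(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
}

int main() {
    std::thread worker([] {
        pin_current_thread_to_core(2);  // illustrative core id
        // ... the busy-poll packet-processing loop would live here ...
        std::printf("worker now running on core %d\n", sched_getcpu());
    });
    worker.join();
}
```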

Most heavy AI workloads still run straight into the same memory-hierarchy limits. The models keep getting bigger, but the physics of moving data around the chip haven't changed much. Understanding locality is still a big part of getting good performance.

Arrays give the CPU exactly what it wants: contiguous memory and predictable access patterns. That means the prefetcher can actually do its job, the cache lines get used efficiently, and you avoid the pointer-chasing penalties you get with scattered structures. It is one of the simplest ways to stay cache-friendly.
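
A small sketch of that contrast, with sizes chosen arbitrarily for the example: the same values are summed once from a contiguous std::vector and once by walking a linked list whose nodes are linked in shuffled order, so every step is a dependent load to an address the prefetcher cannot anticipate.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <memory>
#include <numeric>
#include <random>
#include <vector>

struct Node { long value; Node* next; };

int main() {
    constexpr std::size_t n = 1 << 22;  // ~4M elements (illustrative size)

    // Contiguous layout: one flat allocation, predictable stride.
    std::vector<long> arr(n, 1);

    // Scattered layout: individually allocated nodes, linked in shuffled order
    // so traversal jumps to an unpredictable address on every step.
    std::vector<std::unique_ptr<Node>> storage;
    storage.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        storage.push_back(std::make_unique<Node>(Node{1, nullptr}));

    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), std::mt19937{7});
    for (std::size_t i = 0; i + 1 < n; ++i)
        storage[order[i]]->next = storage[order[i + 1]].get();
    Node* head = storage[order[0]].get();

    // Time a callable and print the sum so the work cannot be optimized away.
    auto time_ms = [](auto&& fn) {
        auto t0 = std::chrono::steady_clock::now();
        long sum = fn();
        auto t1 = std::chrono::steady_clock::now();
        std::printf("sum=%ld, ", sum);
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    };

    std::printf("array: %.2f ms\n", time_ms([&] {
        long s = 0;
        for (long v : arr) s += v;
        return s;
    }));
    std::printf("list:  %.2f ms\n", time_ms([&] {
        long s = 0;
        for (Node* p = head; p; p = p->next) s += p->value;
        return s;
    }));
}
```

Both loops do the same arithmetic; on typical hardware the list traversal is several times slower purely because of where the data lives and how it is reached.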

The same applies in multi-axis motion control for robotics: the first axis warms up the cache and absorbs the cache misses, so the calculation for the next axis takes half the time.

The IBM Telum processor can confirm that, with L2 converted to virtual L3 on demand and an L4 cache accessible from any other CPU. Plus, the clock speed is always 5.5 GHz. The chip includes ten 36 MB Level-2 caches and expanded virtual Level-3 (360 MB) and Level-4 (2.8 GB) caches.

That’s a fascinating chip. The cache sizes are enormous compared to most architectures, and it makes me wonder how that affects access latency at each level. I couldn’t find any published latencies for Telum’s caches, which is a pity, because it would be interesting to see how IBM balances size, fabric distance, and hit latency in practice.

Blog: The Hidden Engine of Performance: It’s All About Where the Data Lives (Cache is the King)
Blog (Chinese version): 性能的隐藏引擎: 一切都取决于数据存储的位置(缓存为王)

Steem to the Moon🚀!

Support me, thank you!

Why should you vote for me? My contributions
Please vote for me as a witness or set me as a proxy via https://steemitwallet.com/~witnesses
