Not only because caching is complex in CPU land, but because GPUs have a completely different set of caches (and registers) that depends entirely on the architecture.
Case in point: GPUs often give each lane access to 256 architectural registers, i.e. 1024 bytes of register space at 4 bytes per register. How much of that you actually get depends on the occupancy your GPU code is targeting (maybe 4-occupancy?), but there's a lot you can do with even 64 registers (i.e. 4-occupancy and 256 bytes of register space), and that register budget is key to making those blazing fast FP16 matrix-multiplication kernels "for AI" everyone's so hot about right now.
For something like FP16 matrix multiply (a very cache-friendly problem), the entire SIMD (all 32 lanes) of the GPU streaming multiprocessor works together on the problem. So we're talking about an effective 32kB of __register space__ (32 lanes x 1024 bytes), let alone cache or other memory in the hierarchy.
Even before FP16 matrix-multiply instructions, this absurd register space advantage is why GPUs were the king of matrix multiplication.
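To make the register arithmetic above concrete, here's a rough back-of-the-envelope sketch in Python. It assumes 32-bit (4-byte) registers and that occupancy simply divides the 256-register budget evenly, which is a simplification; `register_bytes` is just a name I made up for illustration.

```python
BYTES_PER_REGISTER = 4    # assumption: 32-bit registers
REGISTERS_PER_LANE = 256  # per-lane figure quoted above
LANES = 32                # SIMD width quoted above

def register_bytes(occupancy):
    """Return (registers per lane, bytes per lane, bytes across all 32 lanes)
    when the register budget is split evenly across `occupancy` resident groups."""
    regs = REGISTERS_PER_LANE // occupancy
    per_lane = regs * BYTES_PER_REGISTER
    return regs, per_lane, per_lane * LANES

for occupancy in (1, 2, 4):
    regs, per_lane, per_simd = register_bytes(occupancy)
    print(f"{occupancy}-occupancy: {regs} regs/lane, "
          f"{per_lane} B/lane, {per_simd // 1024} kB across 32 lanes")
```

So at 1-occupancy the 32 lanes hold 32kB between them, and even at 4-occupancy each group of 32 lanes still has 8kB of registers to itself.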
--------
GPUs are worse at larger working-set sizes, say 1MB or 2MB. At 1MB+, a modern CPU core's L2 cache can hold all of that (Intel P-cores, AMD Zen5, and even Intel E-cores all have L2 in the MB range).
GPUs have a secret though: a crossbar at the __shared__ memory level. Rearranging memory and data across your lanes can be done through this crossbar (including many-to-one reductions for atomics, or one-to-many broadcasts, in just a single clock tick). So your GPU lanes have incredible communication available to them, and this crossbar is the key to the ballot / voting-based horizontal compute of modern GPU programming styles. This is only ~64kB of space and shared between all GPU lanes, but with 1024 lanes supporting communication it's an important element of GPU memory.
CPU L3 cache is very nice from a latency perspective, but bandwidth-wise L3 cache is on the order of a GPU's GDDR6X or HBM: 500GB/s to 2000GB/s, depending on the technology.
Finally, CPU DDR5 RAM may be the slowest in this discussion, but it's the biggest. 2TB+ Xeon servers aren't even that expensive and can be assumed for any serious tech firm these days (i.e. all-RAM databases and whatnot).
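If the three communication patterns above are unfamiliar, here's a plain-Python model of what they compute across 32 lanes. This is not GPU code and says nothing about how the hardware does it in one tick; the function names are mine. On real hardware the rough equivalents would be intrinsics like CUDA's `__ballot_sync` and `__shfl_sync`.

```python
LANES = 32

def ballot(predicates):
    """Vote: pack one boolean per lane into a bitmask (bit i = lane i's vote)."""
    mask = 0
    for lane, vote in enumerate(predicates):
        if vote:
            mask |= 1 << lane
    return mask

def broadcast(values, src_lane):
    """One-to-many: every lane ends up with the value held by src_lane."""
    return [values[src_lane]] * len(values)

def reduce_add(values):
    """Many-to-one: every lane contributes to a single sum, like an atomic add."""
    return sum(values)

values = list(range(LANES))                  # lane i holds the value i
mask = ballot([v % 2 == 0 for v in values])  # even lanes vote "yes"
print(hex(mask), bin(mask).count("1"))       # which lanes voted, and how many
print(broadcast(values, 5)[:4])              # first few lanes after the broadcast
print(reduce_add(values))                    # sum over all 32 lanes
```

The point is that every lane sees the result of the vote / broadcast / reduction, so the lanes can make a collective decision without going out to slower memory.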
---------
So at different sizes and in different use scenarios, GPUs and CPUs will trade places. I'd expect CPUs to win most red-black tree races, but GPUs will win matrix multiplication. Both take advantage of "cache" but in very different ways.