When my son was 5 or 6 I had a great discussion about salt containers - the little one you have on the table, the big packet in the pantry, the pallet that gets delivered to the supermarket, the vast piles of salt at the salt mine, and all that salt in the ocean.
Next time we had an egg and he wanted salt, I scratched my head and asked him what we should do .. take our egg to the supermarket, or take a packed lunch to the beach and maybe wave the egg around in the water? "No daddy, remember, we have a little salt cache in the kitchen." hehe.
I guess a lot of the world can be seen through the lens of caching data or physical things.
Farming is similar: a tonne of potatoes is $200, yet one 25kg bag is $10. Quite a difference between what the farmer gets and what the end product costs, although not as bad as salt.
I was "enlightened" many years ago when I asked a colleague (a great electronic engineer) why we didn't do fast task switching by having two sets of registers. His reply was that this would require every regular access to a register to go through an extra gate (to decide which bank of registers you want to hit), making every access slightly slower.
Larger registers/caches/memories are slower because they need more address decoding, with the decode time growing by roughly one gate level each time the storage doubles in size.
For all the siblings describing various forms of register banking: you're actually describing SMT, in which one core executes two or more tasks at the same time, dividing the register set and other processor resources between the threads and effectively switching between the tasks every cycle. Modern x86, Power, and SPARC CPUs implement this, as have basically all GPUs for a while now, all for very good reasons.
Which is to say, it's quite easy for low-level architectural knowledge to become obsolete :p
> Which is to say, it's quite easy for low-level architectural knowledge to become obsolete
Great insight. It's also a reminder that "best practice" is for a given set of assumptions, and it's good to periodically revisit whether these assumptions still hold. If the assumption is that the speed of register access is the limiting factor, it's probably good intuition to avoid anything that would slow this down. If instead the situation has changed so that memory access is hundreds of times slower than register access (rather than about the same), giving up some speed of register access for increased parallelism might be a good tradeoff.
It's interesting to think of hyperthreading[1] as being an alternative to adding layers of cache, with GPUs being at the extreme. It's also interesting to ask why 2x hyperthreading has been chosen by Intel, while Power8 allows up to 8x. Is this just market differentiation, or is there something architectural that makes this the right choice for each? And does support for hyperthreading come at a cost in clock speed, or is this already too constrained by heat and power?
[1] While I like avoiding made-up marketing terms, I'm using 'hyperthreading' rather than "SMT" here to distinguish multiple register sets sharing a single core from multiple cores sharing a single cache, and from multiple processors sharing access to RAM (SMP). Is there a separate acronym to distinguish "multicore simultaneous multi threading"?
> Which is to say, it's quite easy for low-level architectural knowledge to become obsolete
Truer words were never spoken! It doesn't help that the industry has an interest in keeping advances as trade secrets. Knowledge becomes highly siloed and outmoded rather quickly.
I'm familiar with the term "register banking". Is this the same as SMT? I.e., if a functional unit is free, can I execute another instruction there that would otherwise be waiting?
As cesarb points out the ARM did that, and so did the Z-80 IIRC: basically, your friend was either trolling you or had no idea what he was saying. Addressing is done in parallel. The address decoding isn't the part that drinks time.
No, he's right, although I can imagine that the impact wasn't too bad for early ARM and Z-80.
No matter how it is implemented concretely, decoding or "using" an address of N-bits needs N levels of logical circuitry. Each level of circuitry takes time for signals to pass through. Of course, the time taken within each level is less than the clock cycle time of your CPU, but it adds up. Too many levels, and you can't do the addressing in your allocated cycle time any more.
So there's a trade-off. That trade-off may well have been worth it for the Z-80 and early ARM, but you'll note that apparently ARM switched away from that model as well.
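The scaling in that trade-off can be seen with a toy model (purely illustrative, not from any real design): selecting one of N rows needs roughly one gate level per address bit, so each doubling of the register file adds about one level of propagation delay to every access.

```python
import math

def decode_levels(num_entries):
    """Toy estimate: one 2-input gate level per address bit needed
    to select one of num_entries storage rows."""
    return math.ceil(math.log2(num_entries))

# Doubling the register file adds roughly one level of logic,
# and every level adds propagation delay to every access.
for n in (16, 32, 64, 128):
    print(n, "registers ->", decode_levels(n), "decode levels")
```

Note the delay grows with the logarithm of the entry count, which is exactly why "just add another bank" is not free.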
You still have to then AND the output of that decoder with the data lines which go to the registers right? OP's friend just mentioned 1 extra gate in the data path, and I presume this would be it.
Yep, but a 7-input AND gate will have basically the same delay time as an 8-input AND gate. The delay doesn't scale linearly as it would if you stacked 6 or 7 gates in series (although that's typically how you'd model it in math or programming).
Oh of course yeah. I was picturing it as a separate AND when it could be bunged in with the ANDs for individual register selects and write/load enables and so on also. Cool, thanks.
Yep, the long bitlines are a large factor in timing for register files. The timing is something like O(n) for n storage lines because it grows linearly with capacitance.
You're implying that OP's friend thinks an extra gate would require another clock cycle or something? He's just talking about propagation delay which is obviously increased with extra logic on the bus (in the multiplexor, or the register-bank-select from a decoder anded with the data lines, say.)
The 32-bit ARM architecture actually does this. It has a "fast interrupt" mode which switches half of the register file to a separate bank. (The newer 64-bit ARM architecture doesn't have this feature.)
I thought the fast interrupt skipped some of the pushing and popping when entering and leaving the interrupt service routine. In other words, the routine was not allowed to touch certain registers, so they didn't have to be saved off.
It can. On non-M ARMs, pushing and popping is done by the interrupt handler, not by the processor. So if you want a really low latency FIQ, you can just rely on the half of the register bank that gets swapped in, and avoid the parts that don't.
If I can afford to have double the number of registers (in one or two sets), surely the extra time cost of some extra logic is still far quicker than hitting the first-level cache? Hitting a cache doesn't just incur the access cost; we also have to store the value in a register somewhere before we can act on it.
It's a quantitative tradeoff question -- for many programs it may be true at first that increasing the register count (say, from 16 to 32) allows more live values in-register and reduces memory traffic, but at a certain point there are diminishing returns, because the extra access time eventually becomes another cycle of latency and this hits all programs. Notable as well is that for many programs (and especially modern ones with large data sets), most cache misses occur because of accesses to large heap data structures, not because there were slightly too few registers and a few locals spilled to the stack.
There are also practical issues with increasing the architectural register set -- you'd have to define a new encoding for instructions, and this (i) adds significant cost (area, power, timing) to the instruction decoding logic, and (ii) imposes a giant cost on the software ecosystem -- new binaries, operating system support for context-switching, etc.
Cache on modern processors is so insanely fast it's already essentially an extension of the register file. IIRC, under ideal circumstances a L1 cache hit on Core i7 can take as little as 2-3 cycles, and averages about 4 cycles.
Sounds similar to register windows on SPARC, which are instead used for procedure calls. It isn't so much that it's slow, just that when you have a decent level 1 cache it doesn't add much. You're right about why small caches are faster.
What's the "cost" of registers on an x86-64 chip? How hard would it be to introduce 10 more general purpose registers? Ones that are used by the kernel, say, and one set for user space.
Because you say kernel, I'm assuming you mean ability to switch a set of registers instead of pushing them on the stack when servicing an interrupt request.
Unfortunately you can't get away with just one set of spare registers in x86 interrupt handling. An interrupt can be interrupted by another interrupt, so practically they have to save registers anyway. Unless there'd be one set for each level...
So those "10 extra registers" would be practically useless.
Having written kernel device driver interrupt handlers, I can also say CPU time is not spent saving and restoring registers, but waiting for glacially slow PCI-e MMIO register loads and stores (don't have exact figures at hand, but I remember one access can take 300-800 nanoseconds, that's thousands of CPU clock cycles). During one such fetch you might be able to store and restore registers tens, if not hundreds of times. X86 interrupt handling is a bit like stopping a freight train to pick up a single letter.
However, that kind of feature can be very useful on a microcontroller to help lower interrupt handling latency.
Then all interrupts need to be prepared for it and check for it anyways. Crucially, the code path that does the check might also get interrupted, even multiple times. No synchronization is possible, and even if it was, it'd be slower than simply saving the registers.
It doesn't matter if that code path gets interrupted. There's no need to synchronize anything. If interrupt A sees that the extra registers are idle, and then gets interrupted, that's okay. By the time the inner interrupt returns, even if the inner interrupt used them, they will be put back to idle, and A can safely switch to them. Everything else works the same as current hardware.
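A toy simulation of the scheme just described (all names invented): one spare bank with an idle flag, no locking, and nested handlers that find the bank busy fall back to saving registers on the stack.

```python
spare_bank_idle = True  # hardware flag: is the spare register bank free?
log = []

def handle_irq(name, nested=None):
    """Model an interrupt handler that claims the spare bank if idle,
    otherwise spills registers to the stack. A nested interrupt may
    arrive while this handler runs."""
    global spare_bank_idle
    used_bank = spare_bank_idle
    if used_bank:
        spare_bank_idle = False               # claim the spare bank
        log.append(f"{name}: used spare bank")
    else:
        log.append(f"{name}: saved registers to stack")
    if nested:                                # nested interrupt fires here
        handle_irq(*nested)
    if used_bank:
        spare_bank_idle = True                # release on return

handle_irq("IRQ-A", nested=("IRQ-B", ("IRQ-C",)))
print(log)
```

Because each nested handler runs to completion before the one it interrupted resumes, the check-then-claim sequence never races, which matches the "no synchronization needed" point above.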
There's the cost of a branch, but you could remove that by making it a hardware feature. It's a branch that should predict well though.
Nothing to do with (local) DRAM, everything to do with PCI-e bus.
Reading and writing memory-mapped I/O registers is how you send commands to peripherals, read their status, etc. These registers are not really memory at all, but a representation of various logic states on the peripheral. So what you write is often not what you read back.
Note that I/O memory mapping has nothing to do with usermode memory mapping.
There are about 180 registers in the register file already. There are only names for 16 or so registers; the CPU internally does renaming for them. It would be possible to specify an ABI that leaves part of the named registers reserved for the kernel, but that wouldn't really help.
When an interrupt or system call happens, all the registers get pushed to the stack. The stack is typically in L1 cache (and subject to various CPU optimizations), so it's really fast to push the 16*64 bit registers to the stack.
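Back-of-the-envelope arithmetic for that register push (a 64-byte cache line is assumed): the whole architectural register state is small enough to fit in a couple of cache lines.

```python
# 16 general-purpose registers of 64 bits each, 64-byte cache lines.
regs, bits_per_reg, line_size = 16, 64, 64
total_bytes = regs * bits_per_reg // 8
print(total_bytes, "bytes =", total_bytes // line_size, "cache lines")
# 128 bytes = 2 cache lines
```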
System calls, interrupts and context switches are "slow" not because of trivial overheads like pushing registers. What's really consuming the time is secondary effects like TLB flushes, changes to the page tables, polluting the branch predictor, etc. It takes a very long time for the CPU to "warm up" again after a context switch.
Can you elaborate on both there being 180 registers and the CPU does renaming of them? Surely the RAX register in an X86_64 CPU is always the same physical location no?
> Surely the RAX register in an X86_64 CPU is always the same physical location no?
No, it is not.
Modern CPUs all use Physical Register Files or PRFs for implementing OoO. The way they work is that there is one backing register file of ~150-200 registers, and a renaming table in the frontend. Each register in the backing file can be in one of 3 states -- waiting for data, containing data, or clean. Every time an instruction that writes data to a register is executed, during the register rename stage the renamer picks one register from the clean set, assigns that as the output of the uop and the current value of the logical register, and sets it as waiting for data.
That is, let's say you execute:
add RAX, RAX
and the current value of RAX in the rename table was PRF#1, with PRF#1 carrying data and all other physical registers clean. In the renamer, this would then turn into:
add PRF#2, PRF#1, PRF#1
if this instruction was immediately followed by another add RAX, RAX, it would then turn into:
add PRF#3, PRF#2, PRF#2, and after renaming that instruction RAX points to PRF#3.
and so on. In PRF machines, architectural registers only exist as pointers into the PRF.
The reason for this design is that it makes OoO execution simple, and reduces unnecessary data movement. An instruction is ready to execute once all its inputs have data in the PRF, and when it executes there is no need to move data into some special architectural register.
Registers are reaped from the "has data" set into the clean set once no instruction or architectural register points at them. Note that multiple architectural registers can point to the same physical register: mov rax, rbx is resolved in the frontend just by pointing rax to the same register as rbx, and there is typically a special register for the value 0 that all common register-clearing operations use.
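The rename walkthrough above can be condensed into a minimal Python sketch (the free-list policy and register numbering are invented for illustration; real renamers also track readiness and reclamation):

```python
class Renamer:
    def __init__(self, num_physical=8):
        self.free = list(range(1, num_physical + 1))  # "clean" PRF entries
        self.table = {}                               # arch name -> PRF#

    def rename(self, dest, *sources):
        srcs = [self.table[s] for s in sources]  # read current mappings
        prf = self.free.pop(0)                   # pick a clean register
        self.table[dest] = prf                   # dest now points here
        return (prf, *srcs)

r = Renamer()
r.table["RAX"] = 1      # RAX currently maps to PRF#1, which has data
r.free.remove(1)
print(r.rename("RAX", "RAX", "RAX"))  # add RAX, RAX -> (2, 1, 1)
print(r.rename("RAX", "RAX", "RAX"))  # next add     -> (3, 2, 2)
```

The two printed tuples correspond to `add PRF#2, PRF#1, PRF#1` and `add PRF#3, PRF#2, PRF#2` in the example above; after the second rename, RAX points to PRF#3.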
This was a fantastic explanation so thanks. A couple of questions:
PRF is Physical register file? As opposed to the logical which can point to different backing store(the actual SRAM.)
Register renaming and OoO operations are really enabled by micro-ops? In other words on a CPU with completely hard-wired control unit such things would never be possible.
> PRF is Physical register file? As opposed to the logical which can point to different backing store(the actual SRAM.)
Yep. PRF means the single backing store where all values go, and PRF systems are named for it because they are usually contrasted to the other very common way to implement OoO, ROB-backed systems, where there is a specific location for the final architectural value, and possibly multiple in-progress values in the ROB. Intel used values in the ROB in all their CPUs up to Sandy Bridge, which was their first PRF design.
(The main difference between the types is that ROB-backed is faster in circuit delay, but moves more data than a PRF design, so on the same process they potentially clock higher but produce more heat. PRF-based designs are also easier to scale wider than the ROB-based design.)
> Register renaming and OoO operations are really enabled by micro-ops? In other words on a CPU with completely hard-wired control unit such things would never be possible.
No, you can do renaming and OoO well without uops, it just requires that your instruction set is designed to be amenable to it. Before x86 had uops, it couldn't do OoO and the competing RISC CPUs were way faster because of this. PPro added uops precisely because they made it possible for an x86 cpu to do OoO, and eventually x86 beat all the competing RISC chips at their own game.
> Register renaming and OoO operations are really enabled by micro-ops? In other words on a CPU with completely hard-wired control unit such things would never be possible.
No, they're orthogonal concepts. The first OoO processor (the System 360 model 91) didn't have uops.
Wow, this is kind of blowing my mind that they were doing this with a hard-wired control unit in 1969 (I think I have the date right). Big Blue: it's easy nowadays to forget just how amazing they were. Do you have links you can recommend? Thanks.
Actually, the Mill uses just the forwarding network which I completely ignored in this explanation.
The really short explanation about the forwarding network is that writing/reading the PRF takes too much time for you to read it and do work on the same cycle. If you operated on it alone, all operations that were dependent on each other would have multi-cycle latencies.
Since this sucks, in addition to the PRF the machines have a forwarding network, where the result of every instruction is broadcast to all the execution units of a similar type. If the broadcasted PRF number matches the register you want, it's read from the forwarding network instead of the PRF.
AFAICT, the Mill works by completely eschewing the final register file, and exclusively using the forwarding network with some local storage at each execution unit for the past values on the network.
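A tiny sketch of the forwarding idea described above (values and numbering invented): read an operand from the same-cycle broadcast when its PRF number matches, otherwise fall back to the slower register file read.

```python
def read_operand(prf_number, prf, broadcast):
    """broadcast maps PRF numbers to results produced this cycle;
    if our source is on the broadcast, bypass the PRF entirely."""
    if prf_number in broadcast:
        return broadcast[prf_number]      # bypass: same-cycle forward
    return prf[prf_number]                # fall back to the PRF

prf = {1: 10, 2: None}                    # PRF#2 not yet written back
broadcast = {2: 42}                       # ...but just computed this cycle
print(read_operand(1, prf, broadcast))    # 10 (from the PRF)
print(read_operand(2, prf, broadcast))    # 42 (forwarded)
```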
I am not who you responded to, but: the registers are always in the same physical location, but internally, they are virtually renamed for efficient reuse and to avoid resource contention. If the CPU knows the stream of assembly instructions to be executed, it can abstract away actual register names and show a larger number of registers than what's physically available.
All that extra state is there to help overlap instructions on a time scale of nanoseconds. Every instruction has its own name->register mapping. If you're switching to kernel mode or maybe if you miss a branch or for whatever other reason, the CPU consolidates you back down to running one instruction. Once that happens, you only have one name->register mapping and all the registers have their correct values. The hidden state is reduced to nothing and you only have to save the "real" state of those 16.
To expand on another answer here -- there are more values in the PRF than there are names because "old" instructions in flight can refer to outdated versions of architectural registers (the names). But a new instruction can only see the latest versions of the architectural registers, so only those 16 actually matter w.r.t. future execution: when we switch back to the task, its new instructions will only need the 16 live values.
As far as I understand it, the x86 uses many more registers than you think "under the hood", and it's exposing just a few "virtual registers" that are mapped to any of the underlying ones. So what you're thinking of is already done to an extent, except the at the kernel-level it's invisible.
(Exposing more virtual registers has two main costs, then: you lose backwards compatibility for all programs that use the extended register set, and to back them up with the same technique you need several more real registers.)
Hardly relevant in (obvious) interrupt service routine context. These registers you're talking about are only used for reordering instruction stream to avoid stalls. Nice for general performance, but useless when servicing interrupts.
register: a tomato in your hand
level 1 cache: a tomato on the counter
level 2 cache: a tomato in the refrigerator
level 3 cache: a tomato at the store
main memory: a tomato on the plant at the farm
disk: a tomato seed being planted
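To put rough numbers behind the analogy, here's an average-memory-access-time style calculation (latencies and hit rates are ballpark guesses for illustration, not measurements): the fast levels absorb most accesses, so the expected cost stays close to the L1 latency even though RAM is two orders of magnitude slower.

```python
# (name, latency in cycles, hit rate among accesses reaching this level)
levels = [
    ("L1",    4, 0.90),
    ("L2",   12, 0.80),
    ("L3",   40, 0.70),
    ("RAM", 200, 1.00),   # last level catches everything
]

def average_access_time(levels):
    expected, p_reach = 0.0, 1.0
    for name, latency, hit_rate in levels:
        expected += p_reach * hit_rate * latency
        p_reach *= 1 - hit_rate
    return expected

print(round(average_access_time(levels), 2))  # ~6.32 cycles on these numbers
```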
I'm not aware of any rigorous experiments that demonstrate that tomatoes are better left at room temperature than in a refrigerator. The French study referenced indicates that volatiles are reduced (no surprise) but that the effect is partially negated by allowing them to come back to room temperature before consumption. From what I can tell, this study does not compare quality of tomatoes left at room temperature for days to refrigerated tomatoes.
I don't consider 2 separate testings with 16 subjects to be a poor experiment. Could it be even better? Sure. Is it a lot more rigorous than the anecdotal claims people propagate about refrigerated tomatoes? Absolutely.
We are limited to two hands. The kitchen can only hold so many tomatoes and other things you need to cook. You could buy all the produce you needed for the year and store it at your house, but it is wasteful (you actually don't need it, storing it is expensive, and it displaces other things you need in your house like your toilet).
In the old days, humans would live near their wild food and not cache much. Then the Neolithic revolution happened, we started caching seeds and then surplus produce in cities, and eventually refrigeration came along and we could cache more. Each new level of caching allowed us to do more (increase population). Ya, we could still live as hunter gatherers, but we would be doing less. We could still be using Eniacs, also, we would just be computing slower.
More hands - Imagine the brain power required to coordinate more than 2 hands. It's already really hard to do two separate tasks. (You may also ask why not slice and dice a couple of tomatoes simultaneously)
Add the energy allocated just for the brain and vascular system and you'd understand why having a tomato on the table is way simpler.
The right analogy would be bigger hands. Another reason besides those already mentioned, IIRC, is the wiring on the CPU; it's a problem of geometry: access speed decreases with physical distance on the core.
The other problems might be neatly visualized with geometric examples as well, but I'm not sure I understand them right. I'm comfortable to assume it's black magic, when fgiesen aka ryg is talking about it.
The post the author wrote goes more in depth than this. Per your analogy, it doesn't quite explain if multiple people (cores) want to make a salad. For example, how does the tomato on the counter and refrigerator get allocated?
Obviously the few-sentence answer here is much simpler than the (reposted) over-my-head article discussing cache strategies. If the article contains the answer, it's not up front. It starts with a quote along the lines of "I know why cache is necessary ...", so I didn't read much further to find out whether the explanation would be repeated, because I remember reading the article before. I jumped to the conclusion that the necessity of a cache also explains the necessity of caches of caches.
Edit: I was trying to say, the in depth explanation there and the ad-hoc hint given here both have their place.
On the other hand, maybe not, if one should read the article first, but who does that, right? /s
> For a “Haswell” or later Core i7 at 3GHz, we’re talking aggregate code+data L1 bandwidths well over 300GB/s per core if you have just the right instruction mix; very unlikely in practice, but you still get bursts of very hot activity sometimes.
> I find the real world analogies quite weak and unnecessary.
I actually thought they were good, I always loved when teachers/professors took the time to explain things using real world analogies. Something about them just makes me remember things better.
I generally like real-world analogies, but in this case I do find it too elaborate and unnecessary, too.
As a straight answer to the question, it would have sufficed to explain that accessing a larger cache takes more time and resources than accessing a small cache.
Then one could have compared that to desk vs. cabinet once to make it visual, but there's no need to extend that analogy for each individual cache level.
That just exhausts the reader and makes it near-impossible to tell which parts of the analogy are relevant/accurate and which parts are just fluff to make the analogy work.
Alternatively, if you really want to explain each individual detail of caches, then do go with such an elaborate analogy, but then explain at every step to what it corresponds and which part of the analogy is relevant.
You shouldn't write out a page-worth of mostly accurate text and then write a paragraph afterwards to explain how the analogy fits.
Chances are you've lost half of your readership at that point and many (myself included to be honest) will quit reading at exactly that point, because they feel like you're repeating yourself.
> it would have sufficed to explain that accessing a larger cache takes more time and resources than accessing a small cache.
I think the point of the long analogy is to hammer in the intuition that physical locality has a concrete price in the real world, and the cache hierarchy is simply a consequence.
I also dislike most real-world computing analogies because it encourages the use of mental heuristics that are usually just not that relevant to the situation at hand. I remember when there was a big scandal about MBA applicants getting their admissions status by feeding in query strings to the URL, there was a big debate about whether it was like "breaking a lock" or "opening an unlocked door" which doesn't really respect the unique moral reasoning that you need to use to understand the actual situation at hand.
Sometimes long analogies like that of the article have trouble making obvious distinctions between design choices and The Way Things Work.
For instance, this article's analogy makes assumptions about cache eviction policy and cache coherence that map nicely to an office setting. Other design choices might not map well. So intuition developed here might not be flexible enough to reason about caches in the real world.
Not to say the analogy detracts from the article. The author includes a list of caveats. I really enjoyed the piece.
This kind of thing reminds me of people trying to teach pointers through clumsy analogies to envelopes and letters, or parcels and addresses, or cars and door handles, or some other awful mess.
As with pointers, when it comes to caches, a clear explanation without analogies is far superior.
They're different kinds of storage, with principal differences in access speed, sharing between cores, physical location and so on. No mathematics needed for a working understanding.
The differences in access speed, density, and power between different types of caches are due to how our physical world works.
Therefore, using a physical analogy is totally appropriate since we see the same constraints in different contexts; namely, there's an energetic cost to be paid for increased physical locality, and this is intuitively obvious to most people.
The specific details (e.g. eviction policy, cache topology, etc.) are mostly irrelevant for understanding this key point.
Analogies are unhelpful here. I don't need an analogy to help people understand that different kinds of memory have different characteristics. I can just say it.
It is easier to understand if I simply state that there are different kinds of memory, and that they have different constraints in terms of speed, density and power. That is really, really easy to understand and from there I can accurately reason about it.
I don't need an analogy to understand any of this. Just saying that different kinds of memory have different speed, densities and power consumptions explains everything and provides all the intuition needed.
I thought having multiple cache levels was about a trade-off between performance and cost. The closer to the CPU (i.e., the faster the cache level), the more expensive it is.
Yes. I don't know where he gets the idea that a large L1 cache is to a CPU the same as a 150mx150m desk to a human. Address decoding is done in parallel, not sequentially. And desks are as large as people are comfortable to produce and use.
Likewise, if SRAM were as cheap to produce as DRAM, RAM would be as fast as the CPU (since it uses the same technology as the CPU) and we would not need the cache at all. Imagine gigabytes of L1 cache!
Well, address decoding can be started in parallel if your page size lets you do virtually indexed, physically tagged caches which applies to only some processors. But that's a separate issue from the relationship between cache size and cache speed. That's governed by three things.
First, the larger your cache the more layers of muxing you need to select the data you need, meaning more FO4s of transistor delay.
Second, the larger your cache the physically bigger it is. That means more physical distance between the memory location and where it is used. That means more speed of light delay.
And third there's the issue of resolving contention for shared versus unshared caches.
So even though you're using the same SRAM in both your L1 and L3, access to the former takes 4 clock cycles while access to the latter takes 80.
There's also the fact that as you go down the cache hierarchy, the cache becomes more complicated. An L1 does lookups for a single processor and responds to snoops. An L3 probably has several processors hanging off it and may deal with running the cache coherency protocol (e.g. implementing a directory of what lines are where and sending clean or invalidation snoops when someone wants to upgrade a line from shared to unique). As a result you've got layers of buffering, arbitration and hazarding to get through before you can even touch the memory array.
> And desks are as large as people are comfortable to produce and use.
Think about what this implies though -- a desk that is too large becomes difficult for a person to use (for one, the person would have to start walking to access certain parts of it).
Likewise, L1 cache sizes are bounded, because the larger the cache becomes, the more difficult it is to address a particular location, and the cache also becomes physically larger such that speed-of-light propagation delays will slow the entire cache down.
No, a cell of L1 cache is exactly as expensive as a cell of L3 cache (ignoring weird stuff like eDRAM).
Now, SRAM is these days made with 6 or 8 transistors per cell, while the DRAM you use in your main memory takes only 1 transistor per cell. Also your DRAM is built on a different sort of silicon process, so it's cheaper on a transistor-to-transistor basis. But generally the dollar cost is the same for memory in any given location.
Given a fixed area which fits a fixed number of transistors at the same cost, you allocate some portion of those transistors to compute and memory cells.
If you want to maintain your number of memory cells without decreasing the number of compute transistors, you need to grow your area which increases costs. That can be a very expensive thing here.
Additionally engineer time around layout and architectural costs are different for those different placements and cache requirements, so the cost is not uniform, but amortized it is not as significant as things like chip area.
Changing an existing chip from 8kB of L1 Dcache to 16kB might be far more expensive in design terms than making a similar change to the L3 cache, but starting from a blank slate, would either one be more expensive to design in the first place? When I look at the layout of a late-model x86, the regular structures of the caches stand out in the die photos among the irregular hand-tuned logic. Yes, there are follow-on effects on the layout from changes in cache size, but I don't see any reason a priori to say whether increasing the L1 size will tend to make designing the rest of the core logic harder or easier.
So I still don't see any reason to back off from saying that a cell of L1 costs as much as a cell of L3, modulo concerns about keeping the cache size a power of 2.
Why have L1 caches been the same size for quite a few generations of Intel Core processors?
"Currently Intel's L1D (level 1 data) cache is 512 lines with 64 bytes each, 32 kB. Been that way for a pretty long time.
L1D latency with a pointer is mostly 4 cycles. Not sure, but I think having 1024 entries would increase that to 5 cycles.." - vardump, 550 days ago:
https://news.ycombinator.com/item?id=9001238
The total L1 cache increases as you increase the number of cores though.
That is, the reason you don't use battery backed DRAM for all of your photos is not because you don't want to, but because 8TB of the stuff would be very expensive. And so most of us have RAM leading to SSD leading to spinning platters.
So the reason to have a CPU cache at all is (insert interesting explanations of caches here). But the reason to have more than one CPU cache is because of the relative cost of the first cache, right ?
If cost was no object, wouldn't you just have a huge primary cache ?
Speed is a big issue, as well as address space. Caches can't be huge and really fast, partly because a fast cache is crazy power hungry, and a big cache has large distances to travel. So your on-die caches are hierarchical in part so that you can have a fast cache and a big cache.
The second problem, address space- larger address space requires more addressing bits, which makes the CPU bigger, and bigger means slower. This constrains cache as well as memory. The final level, disk, has that as an advantage over DRAM because it is not memory-mapped. It's not frequently a problem, but for example Nehalem, despite being a "64-bit" architecture, actually only supports 44-bits of address, setting a hard limit at 16 tebibytes of RAM (not considering MMIO etc)
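The 44-bit figure mentioned above is straightforward arithmetic to check:

```python
# 44 physical address bits -> 2**44 addressable bytes.
addr_bits = 44
tib = 2**addr_bits // 2**40   # 2**40 bytes per tebibyte
print(tib, "TiB")  # 16 TiB
```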
Not really; SRAM (what most caches use) isn't just more costly, it also consumes a lot more power. You'd be hitting thermal limits much sooner than with traditional DRAM.
Is that right? Traditionally DRAM has used much more power because it requires a periodic refresh whereas the power consumption of SRAM is purely leakage. Now, maybe leakage power has grown so much in recent years that this is no longer true but if so I'd find that very surprising. Do you have any numbers?
In SRAM you get Leakage + Switching current. For cases like a cache where you're going to be constantly churning bits this can drive up power quite a bit.
I don't claim to be an expert so there are probably cases where it's lower. However when I was looking into fast RAM for storing data from an FPGA for a simple Logic Analyzer most of the SRAM was almost 2-4x what DRAM was for power consumption at the same storage size(with SRAM being a blazing fast cycle access).
A hardware project I worked on some years ago included a chip that contained a small SRAM and a lithium battery, all soldered right onto the board, as a form of nonvolatile storage. Some SRAM can indeed be very low-power, but caches are probably a different animal.
If you're going to combine all the CPU caches into L1 why not also put the RAM and the SSD into the CPU L1 cache as well and just have 1TB of L1 CPU cache?
Working out the cost and power consumption and die size of that might be instructive -- as a kind of reductio ad absurdum...