When my son was 5 or 6 I had a great discussion about salt containers - the little one you have on the table, the big packet in the pantry, the pallet that gets delivered to the supermarket, the vast piles of salt at the salt mine, and all that salt in the ocean.
Next time we had an egg and he wanted salt, I scratched my head and asked him what we should do .. take our egg to the supermarket, or take a packed lunch to the beach and maybe wave the egg around in the water? "No daddy, remember, we have a little salt cache in the kitchen." hehe.
I guess a lot of the world can be seen through the lens of caching data or physical things.
Farming is similar: a tonne of potatoes is $200, yet one 25kg bag is $10. Quite a difference between what the farmer gets and what the end product costs, although not as bad as salt.
I was "enlightened" many years ago when I asked a colleague (a great electronic engineer) why we didn't do fast task switching by having two sets of registers. His reply was that this would require every regular access to a register to go through an extra gate (to decide which bank of registers you want to hit), making every access slightly slower.
Larger registers/caches/memories are slower because they need more address decoding, with the decode time growing by roughly one gate level each time the storage doubles in size.
For all the siblings describing various forms of register banking: you're actually describing SMT, in which one core executes two or more tasks at the same time, dividing the register set and other processor resources between the threads and effectively switching between the tasks every cycle. Modern x86, Power, and SPARC CPUs implement this, as have basically all GPUs for a while now, all for very good reasons.
Which is to say, it's quite easy for low-level architectural knowledge to become obsolete :p
> Which is to say, it's quite easy for low-level architectural knowledge to become obsolete
Great insight. It's also a reminder that "best practice" is for a given set of assumptions, and it's good to periodically revisit whether these assumptions still hold. If the assumption is that the speed of register access is the limiting factor, it's probably good intuition to avoid anything that would slow this down. If instead the situation has changed so that memory access is hundreds of times slower than register access (rather than about the same), giving up some speed of register access for increased parallelism might be a good tradeoff.
It's interesting to think of hyperthreading[1] as being an alternative to adding layers of cache, with GPUs being at the extreme. It's also interesting to ask why 2x hyperthreading has been chosen by Intel, while Power8 allows up to 8x. Is this just market differentiation, or is there something architectural that makes this the right choice for each? And does support for hyperthreading come at a cost in clock speed, or is this already too constrained by heat and power?
[1] While I like avoiding made-up marketing terms, I'm using 'hyperthreading' rather than "SMT" here to distinguish multiple register sets sharing a single core from multiple cores sharing a single cache, and from multiple processors sharing access to RAM (SMP). Is there a separate acronym to distinguish "multicore simultaneous multi threading"?
> Which is to say, it's quite easy for low-level architectural knowledge to become obsolete
Truer words were never spoken! It doesn't help that the industry has an interest in keeping advances as trade secrets. Knowledge becomes highly siloed and outmoded rather quickly.
I'm familiar with the term "register banking". Is this the same as SMT? I.e., if a functional unit is free, can I execute another instruction there that would otherwise be waiting?
As cesarb points out the ARM did that, and so did the Z-80 IIRC: basically, your friend was either trolling you or had no idea what he was saying. Addressing is done in parallel. The address decoding isn't the part that drinks time.
No, he's right, although I can imagine that the impact wasn't too bad for early ARM and Z-80.
No matter how it is implemented concretely, decoding or "using" an address of N-bits needs N levels of logical circuitry. Each level of circuitry takes time for signals to pass through. Of course, the time taken within each level is less than the clock cycle time of your CPU, but it adds up. Too many levels, and you can't do the addressing in your allocated cycle time any more.
So there's a trade-off. That trade-off may well have been worth it for the Z-80 and early ARM, but you'll note that apparently ARM switched away from that model as well.
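The scaling in that trade-off can be seen with a toy model (purely illustrative, not from any real design): selecting one of N rows needs roughly one gate level per address bit, so each doubling of the register file adds about one level of propagation delay to every access.

```python
import math

def decode_levels(num_entries):
    """Toy estimate: one 2-input gate level per address bit needed
    to select one of num_entries storage rows."""
    return math.ceil(math.log2(num_entries))

# Doubling the register file adds roughly one level of logic,
# and every level adds propagation delay to every access.
for n in (16, 32, 64, 128):
    print(n, "registers ->", decode_levels(n), "decode levels")
```

Note the delay grows with the logarithm of the entry count, which is exactly why "just add another bank" is not free.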
You still have to then AND the output of that decoder with the data lines which go to the registers right? OP's friend just mentioned 1 extra gate in the data path, and I presume this would be it.
Yep, but a 7-input AND gate will have basically the same delay time as an 8-input AND gate. The delay doesn't scale linearly as it would if you stacked 6 or 7 gates in series (although that's typically how you'd model it in math or programming).
Oh of course yeah. I was picturing it as a separate AND when it could be bunged in with the ANDs for individual register selects and write/load enables and so on also. Cool, thanks.
Yep, the long bitlines are a large factor in timing for register files. The timing is something like O(n) for n storage lines because it grows linearly with capacitance.
You're implying that OP's friend thinks an extra gate would require another clock cycle or something? He's just talking about propagation delay which is obviously increased with extra logic on the bus (in the multiplexor, or the register-bank-select from a decoder anded with the data lines, say.)
The 32-bit ARM architecture actually does this. It has a "fast interrupt" mode which switches half of the register file to a separate bank. (The newer 64-bit ARM architecture doesn't have this feature.)
I thought the fast interrupt skipped some of the pushing and popping when entering and leaving the interrupt service routine. In other words, the routine was not allowed to touch certain registers, so they didn't have to be saved off.
It can. On non-M ARMs, pushing and popping is done by the interrupt handler, not by the processor. So if you want a really low latency FIQ, you can just rely on the half of the register bank that gets swapped in, and avoid the parts that don't.
If I can afford to have double the number of registers (in one or two sets), surely the extra time cost of some extra logic is still far quicker than hitting the first-level cache? Hitting a cache doesn't just incur the access cost; we also have to store the value in a register somewhere before we can act on it.
It's a quantitative tradeoff question -- for many programs it may be true at first that increasing the register count (say, from 16 to 32) allows more live values in-register and reduces memory traffic, but at a certain point there are diminishing returns, because the extra access time eventually becomes another cycle of latency and this hits all programs. Notable as well is that for many programs (and especially modern ones with large data sets), most cache misses occur because of accesses to large heap data structures, not because there were slightly too few registers and a few locals spilled to the stack.
There are also practical issues with increasing the architectural register set -- you'd have to define a new encoding for instructions, and this (i) adds significant cost (area, power, timing) to the instruction decoding logic, and (ii) imposes a giant cost on the software ecosystem -- new binaries, operating system support for context-switching, etc.
Cache on modern processors is so insanely fast it's already essentially an extension of the register file. IIRC, under ideal circumstances a L1 cache hit on Core i7 can take as little as 2-3 cycles, and averages about 4 cycles.
Sounds similar to register windows on SPARC, which are instead used for procedure calls. It isn't so much that it's slow, just that when you have a decent level 1 cache it doesn't add much. You're right about why small caches are faster.
What's the "cost" of registers on an x86-64 chip? How hard would it be to introduce 10 more general purpose registers? Ones that are used by the kernel, say, and one set for user space.
Because you say kernel, I'm assuming you mean ability to switch a set of registers instead of pushing them on the stack when servicing an interrupt request.
Unfortunately you can't get away with just one set of spare registers in x86 interrupt handling. An interrupt can be interrupted by another interrupt, so practically they have to save registers anyway. Unless there'd be one set for each level...
So those "10 extra registers" would be practically useless.
Having written kernel device driver interrupt handlers, I can also say CPU time is not spent saving and restoring registers, but waiting for glacially slow PCI-e MMIO register loads and stores (don't have exact figures at hand, but I remember one access can take 300-800 nanoseconds, that's thousands of CPU clock cycles). During one such fetch you might be able to store and restore registers tens, if not hundreds of times. X86 interrupt handling is a bit like stopping a freight train to pick up a single letter.
However, that kind of feature can be very useful on a microcontroller to help lower interrupt handling latency.
Then all interrupts need to be prepared for it and check for it anyways. Crucially, the code path that does the check might also get interrupted, even multiple times. No synchronization is possible, and even if it was, it'd be slower than simply saving the registers.
It doesn't matter if that code path gets interrupted. There's no need to synchronize anything. If interrupt A sees that the extra registers are idle, and then gets interrupted, that's okay. By the time the inner interrupt returns, even if the inner interrupt used them, they will be put back to idle, and A can safely switch to them. Everything else works the same as current hardware.
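A toy simulation of the scheme just described (all names invented): one spare bank with an idle flag, no locking, and nested handlers that find the bank busy fall back to saving registers on the stack.

```python
spare_bank_idle = True  # hardware flag: is the spare register bank free?
log = []

def handle_irq(name, nested=None):
    """Model an interrupt handler that claims the spare bank if idle,
    otherwise spills registers to the stack. A nested interrupt may
    arrive while this handler runs."""
    global spare_bank_idle
    used_bank = spare_bank_idle
    if used_bank:
        spare_bank_idle = False               # claim the spare bank
        log.append(f"{name}: used spare bank")
    else:
        log.append(f"{name}: saved registers to stack")
    if nested:                                # nested interrupt fires here
        handle_irq(*nested)
    if used_bank:
        spare_bank_idle = True                # release on return

handle_irq("IRQ-A", nested=("IRQ-B", ("IRQ-C",)))
print(log)
```

Because each nested handler runs to completion before the one it interrupted resumes, the check-then-claim sequence never races, which matches the "no synchronization needed" point above.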
There's the cost of a branch, but you could remove that by making it a hardware feature. It's a branch that should predict well though.
Nothing to do with (local) DRAM, everything to do with PCI-e bus.
Reading and writing memory-mapped I/O registers is how you send commands to peripherals, read their status, etc. These registers are not really memory at all, but a representation of various logic states on the peripheral. So what you write is often not what you read back.
Note that I/O memory mapping has nothing to do with usermode memory mapping.
There are about 180 registers in the register file already. There are only names for 16 or so registers; the CPU internally does renaming for them. It would be possible to specify an ABI that leaves part of the named registers reserved for the kernel, but that wouldn't really help.
When an interrupt or system call happens, all the registers get pushed to the stack. The stack is typically in L1 cache (and subject to various CPU optimizations), so it's really fast to push the 16*64 bit registers to the stack.
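Back-of-the-envelope arithmetic for that register push (a 64-byte cache line is assumed): the whole architectural register state is small enough to fit in a couple of cache lines.

```python
# 16 general-purpose registers of 64 bits each, 64-byte cache lines.
regs, bits_per_reg, line_size = 16, 64, 64
total_bytes = regs * bits_per_reg // 8
print(total_bytes, "bytes =", total_bytes // line_size, "cache lines")
# 128 bytes = 2 cache lines
```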
System calls, interrupts and context switches are "slow" not because of trivial overheads like pushing registers. What's really consuming the time is secondary effects like TLB flushes, changes to the page tables, polluting the branch predictor, etc. It takes a very long time for the CPU to "warm up" again after a context switch.
Can you elaborate on both there being 180 registers and the CPU does renaming of them? Surely the RAX register in an X86_64 CPU is always the same physical location no?
> Surely the RAX register in an X86_64 CPU is always the same physical location no?
No, it is not.
Modern CPUs all use Physical Register Files or PRFs for implementing OoO. The way they work is that there is one backing register file of ~150-200 registers, and a renaming table in the frontend. Each register in the backing file can be in one of 3 states -- waiting for data, containing data, or clean. Every time an instruction that writes data to a register is executed, during the register rename stage the renamer picks one register from the clean set, assigns that as the output of the uop and the current value of the logical register, and sets it as waiting for data.
That is, let's say you execute:
add RAX, RAX
and the current value of RAX in the rename table was PRF#1, with PRF#1 carrying data and all other physical registers clean. In the renamer, this would then turn into:
add PRF#2, PRF#1, PRF#1
if this instruction was immediately followed by another add RAX, RAX, it would then turn into:
add PRF#3, PRF#2, PRF#2, and after renaming that instruction RAX points to PRF#3.
and so on. In PRF machines, architectural registers only exist as pointers into the PRF.
The reason for this design is that it makes OoO execution simple, and reduces unnecessary data movement. An instruction is ready to execute once all its inputs have data in the PRF, and when it executes there is no need to move data into some special architectural register.
Registers are reaped from the "has data" set into the clean set once no instruction or architectural register points at them. Note that multiple architectural registers can point to the same physical register: mov rax, rbx is resolved in the frontend just by pointing rax to the same register as rbx, and there is typically a special register for the value 0 that all common register-clearing operations use.
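The rename walkthrough above can be condensed into a minimal Python sketch (the free-list policy and register numbering are invented for illustration; real renamers also track readiness and reclamation):

```python
class Renamer:
    def __init__(self, num_physical=8):
        self.free = list(range(1, num_physical + 1))  # "clean" PRF entries
        self.table = {}                               # arch name -> PRF#

    def rename(self, dest, *sources):
        srcs = [self.table[s] for s in sources]  # read current mappings
        prf = self.free.pop(0)                   # pick a clean register
        self.table[dest] = prf                   # dest now points here
        return (prf, *srcs)

r = Renamer()
r.table["RAX"] = 1      # RAX currently maps to PRF#1, which has data
r.free.remove(1)
print(r.rename("RAX", "RAX", "RAX"))  # add RAX, RAX -> (2, 1, 1)
print(r.rename("RAX", "RAX", "RAX"))  # next add     -> (3, 2, 2)
```

The two printed tuples correspond to `add PRF#2, PRF#1, PRF#1` and `add PRF#3, PRF#2, PRF#2` in the example above; after the second rename, RAX points to PRF#3.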
This was a fantastic explanation so thanks. A couple of questions:
PRF is Physical register file? As opposed to the logical which can point to different backing store(the actual SRAM.)
Register renaming and OoO operations are really enabled by micro-ops? In other words on a CPU with completely hard-wired control unit such things would never be possible.
> PRF is Physical register file? As opposed to the logical which can point to different backing store(the actual SRAM.)
Yep. PRF means the single backing store where all values go, and PRF systems are named for it because they are usually contrasted to the other very common way to implement OoO, ROB-backed systems, where there is a specific location for the final architectural value, and possibly multiple in-progress values in the ROB. Intel used values in the ROB in all their CPUs up to Sandy Bridge, which was their first PRF design.
(The main difference between the types is that ROB-backed is faster in circuit delay, but moves more data than a PRF design, so on the same process they potentially clock higher but produce more heat. PRF-based designs are also easier to scale wider than the ROB-based design.)
> Register renaming and OoO operations are really enabled by micro-ops? In other words on a CPU with completely hard-wired control unit such things would never be possible.
No, you can do renaming and OoO well without uops, it just requires that your instruction set is designed to be amenable to it. Before x86 had uops, it couldn't do OoO and the competing RISC CPUs were way faster because of this. PPro added uops precisely because they made it possible for an x86 cpu to do OoO, and eventually x86 beat all the competing RISC chips at their own game.
> Register renaming and OoO operations are really enabled by micro-ops? In other words on a CPU with completely hard-wired control unit such things would never be possible.
No, they're orthogonal concepts. The first OoO processor (the System 360 model 91) didn't have uops.
Wow, this is kind of blowing my mind that they were doing this with a hard-wired control unit in 1969 (I think I have the date right). Big Blue: it's easy nowadays to forget just how amazing they were. Do you have links you can recommend? Thanks.
Actually, the Mill uses just the forwarding network which I completely ignored in this explanation.
The really short explanation about the forwarding network is that writing/reading the PRF takes too much time for you to read it and do work on the same cycle. If you operated on it alone, all operations that were dependent on each other would have multi-cycle latencies.
Since this sucks, in addition to the PRF the machines have a forwarding network, where the result of every instruction is broadcast to all the execution units of a similar type. If the broadcasted PRF number matches the register you want, it's read from the forwarding network instead of the PRF.
AFAICT, the Mill works by completely eschewing the final register file, and exclusively using the forwarding network with some local storage at each execution unit for the past values on the network.
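A tiny sketch of the forwarding idea described above (values and numbering invented): read an operand from the same-cycle broadcast when its PRF number matches, otherwise fall back to the slower register file read.

```python
def read_operand(prf_number, prf, broadcast):
    """broadcast maps PRF numbers to results produced this cycle;
    if our source is on the broadcast, bypass the PRF entirely."""
    if prf_number in broadcast:
        return broadcast[prf_number]      # bypass: same-cycle forward
    return prf[prf_number]                # fall back to the PRF

prf = {1: 10, 2: None}                    # PRF#2 not yet written back
broadcast = {2: 42}                       # ...but just computed this cycle
print(read_operand(1, prf, broadcast))    # 10 (from the PRF)
print(read_operand(2, prf, broadcast))    # 42 (forwarded)
```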
I am not who you responded to, but: the registers are always in the same physical location, but internally, they are virtually renamed for efficient reuse and to avoid resource contention. If the CPU knows the stream of assembly instructions to be executed, it can abstract away actual register names and show a larger number of registers than what's physically available.
All that extra state is there to help overlap instructions on a time scale of nanoseconds. Every instruction has its own name->register mapping. If you're switching to kernel mode or maybe if you miss a branch or for whatever other reason, the CPU consolidates you back down to running one instruction. Once that happens, you only have one name->register mapping and all the registers have their correct values. The hidden state is reduced to nothing and you only have to save the "real" state of those 16.
To expand on another answer here -- there are more values in the PRF than there are names because "old" instructions in flight can refer to outdated versions of architectural registers (the names). But a new instruction can only see the latest versions of the architectural registers, so only those 16 actually matter w.r.t. future execution: when we switch back to the task, its new instructions will only need the 16 live values.
As far as I understand it, the x86 uses many more registers than you think "under the hood", and it's exposing just a few "virtual registers" that are mapped to any of the underlying ones. So what you're thinking of is already done to an extent, except the at the kernel-level it's invisible.
(Exposing more virtual registers has two main costs, then: you lose backwards compatibility for all programs that use the extended register set, and to back them up with the same technique you need several more real registers.)
Hardly relevant in (obvious) interrupt service routine context. These registers you're talking about are only used for reordering instruction stream to avoid stalls. Nice for general performance, but useless when servicing interrupts.
register: a tomato in your hand
level 1 cache: a tomato on the counter
level 2 cache: a tomato in the refrigerator
level 3 cache: a tomato at the store
main memory: a tomato on the plant at the farm
disk: a tomato seed being planted
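To put rough numbers behind the analogy, here's an average-memory-access-time style calculation (latencies and hit rates are ballpark guesses for illustration, not measurements): the fast levels absorb most accesses, so the expected cost stays close to the L1 latency even though RAM is two orders of magnitude slower.

```python
# (name, latency in cycles, hit rate among accesses reaching this level)
levels = [
    ("L1",    4, 0.90),
    ("L2",   12, 0.80),
    ("L3",   40, 0.70),
    ("RAM", 200, 1.00),   # last level catches everything
]

def average_access_time(levels):
    expected, p_reach = 0.0, 1.0
    for name, latency, hit_rate in levels:
        expected += p_reach * hit_rate * latency
        p_reach *= 1 - hit_rate
    return expected

print(round(average_access_time(levels), 2))  # ~6.32 cycles on these numbers
```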
I'm not aware of any rigorous experiments that demonstrate that tomatoes are better left at room temperature than in a refrigerator. The French study referenced indicates that volatiles are reduced (no surprise) but that the effect is partially negated by allowing them to come back to room temperature before consumption. From what I can tell, this study does not compare quality of tomatoes left at room temperature for days to refrigerated tomatoes.
I don't consider 2 separate testings with 16 subjects to be a poor experiment. Could it be even better? Sure. Is it a lot more rigorous than the anecdotal claims people propagate about refrigerated tomatoes? Absolutely.
We are limited to two hands. The kitchen can only hold so many tomatoes and other things you need to cook. You could buy all the produce you needed for the year and store it at your house, but it is wasteful (you actually don't need it, storing it is expensive, and it displaces other things you need in your house like your toilet).
In the old days, humans would live near their wild food and not cache much. Then the Neolithic revolution happened, we started caching seeds and then surplus produce in cities, and eventually refrigeration came along and we could cache more. Each new level of caching allowed us to do more (increase population). Ya, we could still live as hunter gatherers, but we would be doing less. We could still be using Eniacs, also, we would just be computing slower.
More hands - Imagine the brain power required to coordinate more than 2 hands. It's already really hard to do two separate tasks. (You may also ask why not slice and dice a couple of tomatoes simultaneously)
Add the energy allocated just for the brain and vascular system and you'd understand why having a tomato on the table is way simpler.
The right analogy would be bigger hands. Another reason besides those already mentioned, IIRC, is the wiring on the CPU; it's a problem of geometry: access speed decreases with physical distance on the core.
The other problems might be neatly visualized with geometric examples as well, but I'm not sure I understand them right. I'm comfortable to assume it's black magic, when fgiesen aka ryg is talking about it.
The post the author wrote goes more in depth than this. Per your analogy, it doesn't quite explain if multiple people (cores) want to make a salad. For example, how does the tomato on the counter and refrigerator get allocated?
Obviously the few-sentence answer here is much simpler than the (reposted) over-my-head article discussing cache strategies. If the article contains the answer, it's not up front. It starts with a quote along the lines of "I know why cache is necessary ...", so I didn't read much further to find out whether the explanation would be repeated, because I remember reading the article before. I jumped to the conclusion that the necessity of a cache also explains the necessity of caches of caches.
Edit: I was trying to say, the in depth explanation there and the ad-hoc hint given here both have their place.
On the other hand, maybe not, if one should read the article first, but who does that, right? /s
> For a “Haswell” or later Core i7 at 3GHz, we’re talking aggregate code+data L1 bandwidths well over 300GB/s per core if you have just the right instruction mix; very unlikely in practice, but you still get bursts of very hot activity sometimes.
> I find the real world analogies quite weak and unnecessary.
I actually thought they were good, I always loved when teachers/professors took the time to explain things using real world analogies. Something about them just makes me remember things better.
I generally like real-world analogies, but in this case I do find it too elaborate and unnecessary, too.
As a straight answer to the question, it would have sufficed to explain that accessing a larger cache takes more time and resources than accessing a small cache.
Then one could have compared that to desk vs. cabinet once to make it visual, but there's no need to extend that analogy for each individual cache level.
That just exhausts the reader and makes it near-impossible to tell which parts of the analogy are relevant/accurate and which parts are just fluff to make the analogy work.
Alternatively, if you really want to explain each individual detail of caches, then do go with such an elaborate analogy, but then explain at every step to what it corresponds and which part of the analogy is relevant.
You shouldn't write out a page-worth of mostly accurate text and then write a paragraph afterwards to explain how the analogy fits.
Chances are you've lost half of your readership at that point and many (myself included to be honest) will quit reading at exactly that point, because they feel like you're repeating yourself.
> it would have sufficed to explain that accessing a larger cache takes more time and resources than accessing a small cache.
I think the point of the long analogy is to hammer in the intuition that physical locality has a concrete price in the real world, and the cache hierarchy is simply a consequence.
I also dislike most real-world computing analogies because it encourages the use of mental heuristics that are usually just not that relevant to the situation at hand. I remember when there was a big scandal about MBA applicants getting their admissions status by feeding in query strings to the URL, there was a big debate about whether it was like "breaking a lock" or "opening an unlocked door" which doesn't really respect the unique moral reasoning that you need to use to understand the actual situation at hand.
Sometimes long analogies like that of the article have trouble making obvious distinctions between design choices and The Way Things Work.
For instance, this article's analogy makes assumptions about cache eviction policy and cache coherence that map nicely to an office setting. Other design choices might not map well. So intuition developed here might not be flexible enough to reason about caches in the real world.
Not to say the analogy detracts from the article. The author includes a list of caveats. I really enjoyed the piece.
This kind of thing reminds me of people trying to teach pointers through clumsy analogies to envelopes and letters, or parcels and addresses, or cars and door handles, or some other awful mess.
As with pointers, when it comes to caches, a clear explanation without analogies is far superior.
They're different kinds of storage, with principal differences in access speed, sharing between cores, physical location and so on. No mathematics needed for a working understanding.
The differences in access speed, density, and power between different types of caches are due to how our physical world works.
Therefore, using a physical analogy is totally appropriate since we see the same constraints in different contexts; namely, there's an energetic cost to be paid for increased physical locality, and this is intuitively obvious to most people.
The specific details (e.g. eviction policy, cache topology, etc.) are mostly irrelevant for understanding this key point.
Analogies are unhelpful here. I don't need an analogy to help people understand that different kinds of memory have different characteristics. I can just say it.
It is easier to understand if I simply state that there are different kinds of memory, and that they have different constraints in terms of speed, density and power. That is really, really easy to understand and from there I can accurately reason about it.
I don't need an analogy to understand any of this. Just saying that different kinds of memory have different speed, densities and power consumptions explains everything and provides all the intuition needed.
I thought having multiple cache levels was about a trade-off between performance and cost. The closer to the CPU (i.e., the faster the cache level), the more expensive it is.
Yes. I don't know where he gets the idea that a large L1 cache is to a CPU the same as a 150mx150m desk to a human. Address decoding is done in parallel, not sequentially. And desks are as large as people are comfortable to produce and use.
Likewise, if SRAM were as cheap to produce as DRAM, RAM would be as fast as the CPU (since it uses the same technology as the CPU) and we would not need the cache at all. Imagine gigabytes of L1 cache!
Well, address decoding can be started in parallel if your page size lets you do virtually indexed, physically tagged caches which applies to only some processors. But that's a separate issue from the relationship between cache size and cache speed. That's governed by three things.
First, the larger your cache the more layers of muxing you need to select the data you need, meaning more FO4s of transistor delay.
Second, the larger your cache the physically bigger it is. That means more physical distance between the memory location and where it is used. That means more speed of light delay.
And third there's the issue of resolving contention for shared versus unshared caches.
So even though you're using the same SRAM in both your L1 and L3, access to the former takes 4 clock cycles while access to the latter takes 80.
There's also the fact that as you go down the cache hierarchy, the cache becomes more complicated. An L1 does lookups for a single processor and responds to snoops. An L3 probably has several processors hanging off it and may deal with running the cache coherency protocol (e.g. implementing a directory of what lines are where and sending clean or invalidation snoops when someone wants to upgrade a line from shared to unique). As a result you've got layers of buffering, arbitration and hazarding to get through before you can even touch the memory array.
> And desks are as large as people are comfortable to produce and use.
Think about what this implies though -- a desk that is too large becomes difficult for a person to use (for one, the person would have to start walking to access certain parts of it).
Likewise, L1 cache sizes are bounded, because the larger the cache becomes, the more difficult it is to address a particular location, and the cache also becomes physically larger such that speed-of-light propagation delays will slow the entire cache down.
No, a cell of L1 cache is exactly as expensive as a cell of L3 cache (ignoring weird stuff like eDRAM).
Now, SRAM is these days made with 6 or 8 transistors per cell, while the DRAM you use in your main memory takes only 1 transistor per cell. Also your DRAM is built on a different sort of silicon process, so it's cheaper on a transistor-to-transistor basis. But generally the dollar cost is the same for memory in any given location.
Given a fixed area which fits a fixed number of transistors at the same cost, you allocate some portion of those transistors to compute and memory cells.
If you want to maintain your number of memory cells without decreasing the number of compute transistors, you need to grow your area which increases costs. That can be a very expensive thing here.
Additionally engineer time around layout and architectural costs are different for those different placements and cache requirements, so the cost is not uniform, but amortized it is not as significant as things like chip area.
Changing an existing chip from 8kB of L1 Dcache to 16kB might be far more expensive in design terms than making a similar change to the L3 cache, but starting from a blank slate, would either one be more expensive to design in the first place? When I look at the layout of a late-model x86, the regular structures of the caches stand out in the die photos among the irregular hand-tuned logic. Yes, there are follow-on effects on the layout from changes in cache size, but I don't see any reason a priori to say whether increasing the L1 size will tend to make designing the rest of the core logic harder or easier.
So I still don't see any reason to back off from saying that a cell of L1 costs as much as a cell of L3, modulo concerns about keeping the cache size a power of 2.
Why have L1 caches been the same size for quite a few generations of Intel Core processors?
"Currently Intel's L1D (level 1 data) cache is 512 lines with 64 bytes each, 32 kB. Been that way for a pretty long time.
L1D latency with a pointer is mostly 4 cycles. Not sure, but I think having 1024 entries would increase that to 5 cycles.." - vardump, 550 days ago:
https://news.ycombinator.com/item?id=9001238
The total L1 cache increases as you increase the number of cores though.
That is, the reason you don't use battery backed DRAM for all of your photos is not because you don't want to, but because 8TB of the stuff would be very expensive. And so most of us have RAM leading to SSD leading to spinning platters.
So the reason to have a CPU cache at all is (insert interesting explanations of caches here). But the reason to have more than one CPU cache is because of the relative cost of the first cache, right ?
If cost was no object, wouldn't you just have a huge primary cache ?
Speed is a big issue, as well as address space. Caches can't be huge and really fast, partly because a fast cache is crazy power hungry, and a big cache has large distances to travel. So your on-die caches are hierarchical in part so that you can have a fast cache and a big cache.
The second problem, address space- larger address space requires more addressing bits, which makes the CPU bigger, and bigger means slower. This constrains cache as well as memory. The final level, disk, has that as an advantage over DRAM because it is not memory-mapped. It's not frequently a problem, but for example Nehalem, despite being a "64-bit" architecture, actually only supports 44-bits of address, setting a hard limit at 16 tebibytes of RAM (not considering MMIO etc)
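The 44-bit figure mentioned above is straightforward arithmetic to check:

```python
# 44 physical address bits -> 2**44 addressable bytes.
addr_bits = 44
tib = 2**addr_bits // 2**40   # 2**40 bytes per tebibyte
print(tib, "TiB")  # 16 TiB
```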
Not really; SRAM (what most caches use) isn't just more costly, it also consumes a lot more power. You'd be hitting thermal limits much sooner than with traditional DRAM.
Is that right? Traditionally DRAM has used much more power because it requires a periodic refresh whereas the power consumption of SRAM is purely leakage. Now, maybe leakage power has grown so much in recent years that this is no longer true but if so I'd find that very surprising. Do you have any numbers?
In SRAM you get Leakage + Switching current. For cases like a cache where you're going to be constantly churning bits this can drive up power quite a bit.
I don't claim to be an expert so there are probably cases where it's lower. However when I was looking into fast RAM for storing data from an FPGA for a simple Logic Analyzer most of the SRAM was almost 2-4x what DRAM was for power consumption at the same storage size(with SRAM being a blazing fast cycle access).
A hardware project I worked on some years ago included a chip that contained a small SRAM and a lithium battery, all soldered right onto the board, as a form of nonvolatile storage. Some SRAM can indeed be very low-power, but caches are probably a different animal.
If you're going to combine all the CPU caches into L1 why not also put the RAM and the SSD into the CPU L1 cache as well and just have 1TB of L1 CPU cache?
Working out the cost and power consumption and die size of that might be instructive -- as a kind of reductio ad absurdum...