
> and the case was decided against AMD because of the single FPU per core module:

this has always been a massive oversimplification and misrepresentation from the AMD fanclub. the case was decided because of shared resources including the L2 and the frontend (fetch-and-decode, etc) which very legitimately do impact performance in a way that "independent cores" do not.

Citing the actual judgement:

> Plaintiffs allege that the Bulldozer CPUs, advertised as having eight cores, actually contain eight “sub-processors” which share resources, such as L2 memory caches and floating point units (“FPUs”). Id. ¶ 37–49. Plaintiffs allege that the sharing of resources in the Bulldozer CPUs results in bottlenecks during data processing, inhibiting the chips from “simultaneously multitask[ing].” Id. ¶¶ 38, 41. Plaintiffs allege that, because resources are shared between two “cores,” the Bulldozer CPUs functionally only have four cores. Id. ¶ 38–43. Therefore, Plaintiffs claim the products they purchased are inferior to the products as represented by the Defendant. Id. ¶ 39

This is completely correct: the chip has only one frontend per module, which has to alternate between servicing the two "cores", and that does bottleneck their independent operation. It is, for example, not the same thing as a "core" on Phenom, and performance drops significantly whenever the "second thread" on a given module is in use.
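To make the frontend point concrete, here's a toy model (the 4-wide decode width and cycle count are illustrative numbers, not real Bulldozer figures): one shared frontend alternating between a module's two "cores" each cycle, versus two truly independent frontends.

```python
# Toy model of frontend fetch/decode bandwidth. Not a real pipeline
# simulator; DECODE_WIDTH and CYCLES are made-up illustrative values.

DECODE_WIDTH = 4   # instructions decoded per cycle
CYCLES = 1000

def shared_frontend(active_threads):
    # One frontend services the module's threads in strict alternation,
    # one thread per cycle (the CMT-module situation).
    per_thread = [0] * active_threads
    for cycle in range(CYCLES):
        per_thread[cycle % active_threads] += DECODE_WIDTH
    return per_thread

def independent_frontends(active_threads):
    # Each real core has its own frontend every cycle (the Phenom situation).
    return [DECODE_WIDTH * CYCLES] * active_threads

print(shared_frontend(1))        # one thread gets the full bandwidth
print(shared_frontend(2))        # each thread's fetch bandwidth halves
print(independent_frontends(2))  # two independent cores lose nothing
```

Under this (simplified) model, lighting up the second "core" in a module halves each thread's fetch bandwidth, which is exactly the kind of coupling that independent cores don't have.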

It's fine to build hardware this way; SPARC uses the same approach, for example. But SPARC isn't marketed as (e.g.) a "64 core processor"; it's marketed as "8 cores, 64 threads". AMD wanted the marketing bang of having "twice the cores" of Intel (and you can see people repeating that even in this thread: programmers bought AMD for compiling because "it had twice the cores", etc). That is not how anyone else has marketed CMT before or since, because it's really not a fair representation of what the hardware is doing.

Alternately... if that's a core, then Intel's SMT threads are definitely cores too, because CMT is basically just SMT with some resources pinned to specific threads, if you want to look at it that way. After all, where is the definition that says a dedicated integer unit alone is what constitutes a core? Isn't it enough that there are two independent execution contexts that can execute simultaneously, and isn't it better for one of them to use as many resources as possible rather than stall because an execution unit is "pinned" to another thread? If you accept AMD's very expansive definition of "core", a lot of weird stuff shakes out, and I think consumers would have found it very deceptive if Intel had marketed the way AMD did. That's obviously not what a "core" is, but it is if you believe AMD.

AMD did a sketchy thing and got caught, end of story. There's no reason they should call it anything different than the other companies who implement CMT.

I hate hate hate the "AMD got skewered because they didn't have an FPU" narrative. No, it was way more than the FPU, and the plaintiffs said as much, and it's completely deceptive and misrepresentative to pretend that's the actual basis for the case. That's the fanclub downplaying and minimizing again, like they do every time AMD pulls a sketchy move (like any company, there have been a fair few over the years). And that certainly can include El Reg too. Everyone loves an underdog story.

Companies getting their wrists slapped when they do sketchy shit is how the legal system prevents it from happening again and downplaying it as a fake suit over crap is bad for consumers as a whole and should not be done even for the “underdog”. The goal shouldn’t be to stay just on the right side of a grey area, it should be to market honestly and fairly… like the other companies that use CMT did. Simple as.

To wit: NVIDIA had to pay out on the 3.5GB lawsuit even though their cards really do have 4GB. Why? Because it affected performance, and the expectation isn't mere technical correctness, it's that you stay well on the right side of the line with your marketing's honesty. It was sketchy and they got their wrist slapped. As did AMD.



The UltraSPARC T1 shared one FPU and one logical L2 between all 32 threads/8 cores. L2 is very commonly shared between cores; the world has more or less converged on a shared L2 per core complex, so 4 to 8 cores. And you still see vector units shared between cores where it makes sense, too: for instance, Apple's AMX unit is shared between all of its cores.

It's really only the frontend and its data path to L1 that's a good argument here, but that's not actually listed in the complaint.

And even then, I can see where AMD was going. The main point of SMT is to share backend resources that would otherwise be unused on a given cycle, but these have dedicated execution units so it really is a different beast.


> And even then, I can see where AMD was going. The main point of SMT is to share backend resources that would otherwise be unused on a given cycle, but these have dedicated execution units so it really is a different beast.

Sure, but wouldn't it be ideal that if a thread wasn't using its integer unit and the other thread had code that could run on it, you'd allow the other thread to run?

"CMT" is literally just "SMT with dedicated resources" and that's a suboptimal choice because it impairs per-thread performance in situations where there's not anything to run on that unit. Sharing is better.

If the scheduler is insufficiently fair, that's a problem that can be solved. Guarantee that if there is enough work, that each thread gets one of the integer units, or guarantee a maximum latency of execution. But preventing a thread from using an integer unit that's available is just wasted cycles, and that's what CMT does.
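That wasted-cycles claim can be sketched with a toy model (illustrative only: two integer units, thread A always has two ready ops, thread B has work only half the time; the workload mix is an assumption, not measured data). Pinning one unit per thread, CMT-style, leaves B's unit idle on B's quiet cycles, while a shared SMT-style pool lets A soak up the slack:

```python
# Toy scheduling model comparing CMT (pinned integer units) against
# SMT (shared pool). Not a real pipeline simulator; the workload
# (A always has 2 ready ops, B has work on even cycles only) is a
# made-up example to illustrate the wasted-cycles argument.

CYCLES = 1000

def thread_b_has_work(cycle):
    return cycle % 2 == 0  # thread B is busy only half the time

def cmt_ops_completed():
    # CMT: each thread owns exactly one integer unit.
    done = 0
    for c in range(CYCLES):
        done += 1                              # A's pinned unit is always busy
        if thread_b_has_work(c):
            done += 1                          # B's unit works...
        # ...otherwise B's unit idles, even though A has ops ready.
    return done

def smt_ops_completed():
    # SMT: both integer units are a shared pool.
    done = 0
    for c in range(CYCLES):
        if thread_b_has_work(c):
            done += 2                          # one unit each, fair split
        else:
            done += 2                          # A uses both units
    return done

print(cmt_ops_completed())  # pinned units leave cycles on the table
print(smt_ops_completed())  # shared pool keeps every unit busy
```

Under these assumptions the shared pool completes 2000 ops to CMT's 1500 over the same cycles; the gap is exactly the idle time of B's pinned unit.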

Again: CMT is not that different from SMT. It's SMT where resources are fixed to certain threads, and that's suboptimal from a scheduling perspective. And if you think that's enough to be called a "core", well, Intel has been making 8-core chips for a while then. Just 2 cores per module ;)

Consumers would not agree that's a core. And pinning some resources to a particular thread (while sharing most of the rest of the datapath) does not change that, actually it makes it worse.

> It's really only the frontend and it's data path to L1 that's a good argument here, but that's not actually listed in the complaint.

That's just a summary ;) El Reg themselves discussed the shared datapath when the suit was greenlit.

https://www.theregister.com/2019/01/22/judge_green_lights_am...

And you can note the "such as" in the summary, even. That is an expansive term, meaning "including but not limited to".

If you feel that was not addressed in the lawsuit and it was incorrectly settled... please cite.

Again: it's pretty simple, stay far far clear of deceptive marketing and it won't be a problem. Just like NVIDIA got slapped for "3.5GB" even though their cards did actually have 4GB.

With AMD, "cores" that have to alternate their datapath on every other cycle are pretty damn bottlenecked and that's not what consumers generally think of as "independent cores".


> Sure, but wouldn't it be ideal that if a thread wasn't using its integer unit and the other thread had code that could run on it, you'd allow the other thread to run?

> "CMT" is literally just "SMT with dedicated resources" and that's a suboptimal choice because it impairs per-thread performance in situations where there's not anything to run on that unit. Sharing is better.

> If the scheduler is insufficiently fair, that's a problem that can be solved. Guarantee that if there is enough work, that each thread gets one of the integer units, or guarantee a maximum latency of execution. But preventing a thread from using an integer unit that's available is just wasted cycles, and that's what CMT does.

Essentially, no, what you're suggesting is a really poor choice for the gate count and number of execution units in a Bulldozer module. The most expensive parts are the ROBs and their associated bypass networks between the execution units. Doubling that combinatorial complexity would probably lead to a much larger, hotter single core that wouldn't clock nearly as fast (or would need so many pipeline stages that branches become way more expensive, aka the NetBurst model).

> And you can note the "such as" in the summary, even. That is an expansive term, meaning "including but not limited to".

Well, except that I argue it doesn't include those at all; shared L2 is extremely common, and shared FPU is common enough that people don't really bat an eye at it.

> If you feel that was not addressed in the lawsuit and it was incorrectly settled... please cite.

I'm going off your own citation. If you feel these were brought up in the court case itself beyond that, you're more than welcome to cite another example (ideally not a literal tabloid, but something keeping the standards of the court documents you cited before).

> With AMD, "cores" that have to alternate their datapath on every other cycle are pretty damn bottlenecked and that's not what consumers generally think of as "independent cores".

That's not how these work. OoO cores are rarely cranking their frontends at full tilt; instead they tend to work in batches, filling up a ROB with work that is then executed as memory dependencies are resolved. The modern solution to exploiting that is to aggressively downclock the frontend when it isn't needed, to save power, but I can see keeping it clocked with the rest of the logic and simply sharing it between two backends as a valid alternative.


But even this article states that the L2 and front end aren't bottlenecks on simultaneous operation.

Perhaps it'd be more accurate to say the case was lost primarily on the strength of the argument that there are only four FPUs, considering there had been other examples of independent cores sharing L2 and other resources.



