Maybe not. Superscalar is when say... Think of the following assembly code. Add ...

bjourne · on Nov 30, 2024

Agree with this. Calling SIMD superscalar is a misnomer since it is single instruction (multiple data) with very wide data paths. Superscalar implies multiple different instructions in parallel, such as adding a pair of numbers, while subtracting another pair (or even dividing).

pyrolistical · on Nov 29, 2024

I wish hardware exposed an api that allowed us to submit a tree of instructions so the hardware doesn’t need figure out which instructions are independent.

Lots of this kind of work can be done during compilation but cannot be communicated to hardware due to code being linear

dragontamer · on Nov 29, 2024

That's called VLIW and Intel Itanium is considered one of the biggest chip failures of all time.

There is an argument that today's compilers are finally good enough for VLIW to go mainstream, but good luck convincing anyone in today's market to go for it.

------

A big problem with VLIW is that it's impossible to predict L1, L2, L3 or DRAM access. Meaning all loads/stores are impossible to schedule by the compiler.

NVidia has interesting barriers that get compiled into its SASS (a level lower than PTX assembly). These barriers seem to allow the compiler to assist in the dependency management process but ultimately still require a decoder in the NVidia core final level before execution.

neerajsi · on Nov 29, 2024

Vliw is kind of the dual of what pyrolistical was asking for. Vliw lets you bundle instructions that are known to be independent rather than encode instructions to mark known dependencies.

The idea pyrolistical mentioned is closer to explicit data graph execution: https://en.m.wikipedia.org/wiki/Explicit_data_graph_executio....

creato · on Nov 30, 2024

VLIW is still in use in multiple DSP products on the market today, and they are good successful products in their niche.

They work very well if your code can be written as a loop without branches (or very limited branches) in the body, and a lot of instruction level parallelism in the body.

Unfortunately for Intel, most code doesn't look like that. But for most workloads that happen to also be a good case for SIMD, it is (can be) great.