plus doesn't the M1 have specialized hardware? what's the Neural Engine or whatever it is they call it for speeding up ML? i imagine at its core it's a bunch of instructions for doing vector operations.
sidebar: is there documentation for the instruction set or ABI for that hardware?
The Apple Neural Engine is separate from the CPU; it's not additional registers and instructions for the CPU, like a vector unit. You go through the Core ML framework to use it, just like you go through Metal or OpenGL to use the GPU.
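For concreteness, here's a minimal Swift sketch of what that looks like (the model name is hypothetical). You never target the Neural Engine directly; you just tell Core ML it's an eligible backend and the framework decides where the model actually runs:

```swift
import Foundation
import CoreML

// A minimal sketch, assuming a compiled model "MyModel.mlmodelc" ships in
// the app bundle (the name is hypothetical). There's no ISA-level access;
// Core ML schedules work onto the Neural Engine on your behalf.
let config = MLModelConfiguration()
config.computeUnits = .all  // CPU, GPU, and Neural Engine are all eligible

let url = Bundle.main.url(forResource: "MyModel", withExtension: "mlmodelc")!
let model = try MLModel(contentsOf: url, configuration: config)
```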
The value of SIMD on a CPU these days is really as a middle ground where you value latency over throughput; with a separate accelerator you'd probably face the same trade-off as moving data to and from a GPU.
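To make the latency point concrete, here's a toy Swift sketch: the vector math below stays entirely in CPU registers, so there's no transfer cost to pay before you get a result, unlike dispatching the same work to the GPU or the Neural Engine:

```swift
import simd

// Four multiply-adds in one shot; operands and result live in CPU
// registers, so there's no round-trip to an accelerator to amortize.
let a = SIMD4<Float>(1, 2, 3, 4)
let b = SIMD4<Float>(5, 6, 7, 8)
let c = SIMD4<Float>(repeating: 0.5)
let result = a * b + c  // elementwise: [5.5, 12.5, 21.5, 32.5]
print(result)
```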
That definitely changes the calculus, but as I mentioned in a different comment, there doesn't seem to be any microarchitectural documentation to read, so I (not owning an M1) have nothing to go on, unfortunately.
I'll make a wild guess that getting data to the Neural Engine is still probably not quick, because I assume it's some kind of statically scheduled affair (exposed pipeline?). We seem to know almost nothing about it, sadly.