Yeah, totally agree that memory is at the heart of the problem. Not only the need for a unified view, but also the bandwidth constraints that seem to be the bottleneck in so much code (the von Neumann bottleneck).
I think a real solution would not be software-only; it would include a hardware component (or be tuned to work with a particular chip): an SoC with a few fat CPU cores, many lite cores, a GPU, and some FPGA fabric, all sharing a large cache subsystem and memory.