
AMD is in a really good position right now.

Intel still has the lead in low idle power, which is good in laptops.

Ryzen lets different cores have different max frequencies, so if your code is single threaded and your operating system isn't new enough to schedule for that, it could be a reason to go Intel. Likewise if it's single threaded and can take advantage of AVX-512 or the Intel Math Kernel Library.

But otherwise?



The main selling point for Intel isn't speed anymore.

Intel still has superior performance counters and debugging features. Mozilla's rr (record-and-replay debugger) only works on Intel, for example, and Intel VTune is a very good tool. AVX512 is also an advantage, as you've noted.

There are other instruction set advantages: I think Intel has a faster division / modulus instruction, and also has single-clock pext / pdep (useful in chess programs).

For most people however, who might not be using those tools, I'd argue that AMD's offerings are superior.


It also has superior vulnerabilities


For the consumer market, these small advantages are not worth considering, imho. AMD processors do more and cost less, and power consumption is being optimised with each iteration.

For the server/pro market they might be worth considering, but then again, the huge BOM cut that you get by choosing an AMD processor might be worth the performance penalty.


Is AVX512 really an advantage though?

Intel is infamous for severely downclocking the processor for these and other AVX/SSE family instructions, to a point where sometimes using them makes the program slower than it would be otherwise, especially if you're constantly provoking frequency switches between them and regular instructions.

AMD might not have implemented AVX512 specifically yet (there's nothing legally keeping them from doing so however, they have patent sharing agreements with Intel regarding the entire x86/x64 ISA and extensions), but what they currently DO have is all common SIMD extensions implemented (up to SSE4 and AVX2 if I'm not mistaken) without incurring any frequency penalties on clock speeds for using them.

I can live without AVX512 for now, even though I'd be happier to have it. But I would really rather not have it if it came out in the same crap implementation that Intel has.


You can look at binning statistics for non-avx/avx2/avx512 clock speeds: https://siliconlottery.com/pages/statistics

For example, the worst 7980XEs do 4.1/3.8/3.6 GHz for each of these respectively. 0.5 GHz down clock isn't too bad. You can change these settings in your bios however you'd like on the unlocked CPUs (ie, the HEDT lineup + W3175X). I do find those down clocks are necessary; I've passed 100C with a 360mm radiator and roaring fans. AVX2 loads don't get anywhere near that hot.

But for all that, I do see a substantial benefit on many workloads from avx512 -- at least 50% better performance than what I'd get from avx2.

I definitely think it's nice to have, especially if you enjoy vectorizing code and looking at assembly. With much bigger performance wins on the line, it's more rewarding and more fun -- and you have more tools to play with, like vector masking (unfortunately, gather/scatter have been disappointing). Fun or not though, if you offer me avx512 on one hand vs twice the cores with full avx2 for the same price on the other, I'd have a hard time rationalizing avx512.


> I've passed 100C with a 360mm radiator and roaring fans

Damn, which radiator? Was there a GPU under load in your water loop?


Celsius S36. My GPU is on a separate loop and was idle at the time.

I was running benchmarks of Intel MKL's zgemm vs zgemm3m because of a Julia PR that recommended replacing the former with the latter. I don't think anything hits a CPU quite as hard as a good BLAS.

I think my thermal paste may be bad, because the CPU idles hot -- nearly 35 C. I ordered a Direct Die-X and MO-RA-420 radiator, so I'm planning on swapping the AIO for an open loop with way more radiator area and flow through the fins.

Running dgemm, that CPU would hit just a tad below 2 teraflops. I'd like to get it just over that (and run much cooler).


I personally run AMD at home, but in all the benchmarks I have seen, Intel wins by a large margin in AVX enabled tasks - often double the speed. Downclocking did not seem to be a factor here.

I think saying that Intel wins in AVX tasks is absolutely fair.

For example, I had a simulation that had to run on the CPU for reasons, but made use of AVX. Intel was consistently faster on any system I tested.


The 7nm Ryzen parts should have double the AVX2 throughput of the older parts. Zen1 has half the throughput per cycle (or twice the reciprocal throughput) when using ymm (256-bit) registers vs xmm (128-bit) in general.

https://www.agner.org/optimize/instruction_tables.pdf

If you want to look at `vmov`s or arithmetic like `vadd` or `vmul`: particularly glaring is that for moves between memory and a register (xmm vs ymm), Zen1 has a reciprocal throughput of (1 vs 2), i.e. on average it can complete an xmm-memory move once per cycle, and a 256-bit move once every two cycles. Skylake-X instead has 0.5 for xmm/ymm/zmm-memory. That is, it can move up to 512 bits between a register and memory twice per cycle, which is 8 times the throughput.

Arithmetic isn't as bad, but Zen1's reciprocal throughput goes from 0.5 to 1 on xmm to ymm, while Intel stays at 0.5 independent of vector size.

I haven't seen data on the 7nm Ryzen parts, but their marketing claimed it was supposed to have full width avx2, so I imagine things are different now, and that 7nm Ryzen will do just as well for avx workloads per core and clock as all the Intel parts without avx512.

EDIT: Some instructions on Intel get slower with wider vectors, like vdiv, vsqrtpd, vgather...


> sometimes using them makes the program slower than it would be otherwise

oh no, that program will be super fast.

It would make other programs that run concurrently with the AVX one slower.


Even if you do, in most cases you won't use them on all your workloads. This means not all your boxes need to be Intel.


I would hope that your production hardware matches your developer / staging / testing hardware.

Let's say production is 50% slower than what's tested in staging / developer test cases. Is it the data in production that causes this performance loss? Or is it hardware differences?

If you are using Intel tools to debug performance problems in the developer / testing stages, you probably want to keep using those Intel tools in staging / production. There are enough cache and instruction-level differences between the chips (speed of the division instruction, PEXT vs PDEP, branch predictors, TLBs).

Intel has interesting optimizations: an Intel Ethernet card drops the data off in L3 cache (bypassing DDR4 RAM entirely). These little differences in the driver / motherboard / CPU can have a huge difference in performance, and complicate performance testing / performance debugging.

If you are deploying to AMD hardware for production, you probably want to be running AMD hardware in testing / developer stages as well. You want all your hardware performing as similarly as possible.


Said interesting networking optimization is a gaping security hole that has already been exploited in the wild.


From my understanding, that vulnerability exists only if RDMA is also enabled.

RDMA, the ability to share RAM as if it were local RAM (through a memory-mapped IO mechanism) across Ethernet, is not a common setup. The fact that you can perform cache-timing attacks over RDMA + Intel L3 cache is a testament to how efficient the system is, if anything.

Consider this interpretation: RDMA + DDIO is so fast, you can perform cache-timing attacks over Gigabit Ethernet(!!). NetCAT (the "vulnerability" you describe) is proof of it.

Cache-timing / side channel attacks aren't exactly the kind of vulnerabilities that most people think of, though. It's kinda cool, but it's nothing as crazy as Meltdown / Spectre were.


>Mozilla's rr (Record and Replay framework) only works on Intel for example

Can you source that? I can't find it on the Wikipedia page[1] or its homepage[2].

[1]: https://en.wikipedia.org/w/index.php?title=Rr_(debugging)&ol... [2]: https://rr-project.org/



I looked for some numbers on low idle power; apparently the Ryzen 3000 parts significantly improved things: https://www.reddit.com/r/AMD_Stock/comments/ado0ix/ryzen_mob.... This reviewer seems impressed, something about only using 10W at idle on desktop: https://youtu.be/M5pHUHGZ7hU?t=363.

And a head-to-head test of a laptop available in AMD and Intel variants says it has better battery life, although the screen panel could be the reason: https://www.notebookcheck.net/Lenovo-ThinkPad-T495-Review-bu...

I'd like more info, and Intel still probably has a lot of firmware tweaks etc. that AMD has to implement to win microbenchmarks, but to a first order it's not clear Intel has a lead there anymore.


The BLAS & Lapack subset of the API of the Intel Math Kernel Library (MKL) is very well implemented in open source projects such as OpenBLAS and BLIS:

https://github.com/flame/blis

Both are well optimized for AMD CPUs.


I work in this space... and let's just say that MKL is definitely NOT well optimized for AMD's chips. You'll be lucky to get 10-20% efficiency. Never mind OpenBLAS.


This is well documented: https://www.agner.org/optimize/blog/read.php?i=49#49, https://www.agner.org/optimize/blog/read.php?i=49#112.

It goes very far back to MMX: https://yro.slashdot.org/comments.pl?sid=155593&cid=13042922

tldr: Intel's compiler doesn't optimize using standardized instructions on non-Intel hardware.


Intel optimizes these libraries down to the stepping level of the CPUs, so it's not surprising if they are not optimized at all for AMD.


Is it anything like the way their compiler detected SSEn in a way that guaranteed it wouldn't use those instructions on AMD processors even if they supported them?


Of course. It's very much intentional, and "not optimized for AMD" is putting it very very mildly. They don't need to optimize purely for stepping level, they could provide sane codepaths for when the CPU flags indicate certain features.

See my other comment on this topic.


Yes, but it wasn't a simple "if not AMD" check. It was a series of checks based on specific families of Intel CPUs, such as Haswell, Sandy Bridge, etc. So it was never actually querying whether the CPU supported instruction X; it was asking what family it belonged to and then applying static rules based on that. Maybe nuance, but it also has the potential to hurt their own processors if not kept up on, so maybe less malice and more convenience?


They've been explicit about their motivations in this regard (claiming innocence). Then they backtracked when convenient (surprise!), but in a way that still broke AMD processors. See here: https://www.agner.org/optimize/blog/read.php?i=49#49

By the way, it's interesting to note that Intel has a disclaimer on every MKL documentation page about this; my speculation: this was required by terms of a settlement.

From the above link:

>The Intel CPU dispatcher does not only check the vendor ID string and the instruction sets supported. It also checks for specific processor models. In fact, it will fail to recognize future Intel processors with a family number different from 6. When I mentioned this to the Intel engineers they replied:

> > You mentioned we will not support future Intel processors with non-'6' family designations without a compiler update. Yes, that is correct and intentional. Our compiler produces code which we have high confidence will continue to run in the future. This has the effect of not assuming anything about future Intel or AMD or other processors. You have noted we could be more aggressive. We believe that would not be wise for our customers, who want a level of security that their code (built with our compiler) will continue to run far into the future. Your suggested methods, while they may sound reasonable, are not conservative enough for our highly optimizing compiler. Our experience steers us to issue code conservatively, and update the compiler when we have had a chance to verify functionality with new Intel and new AMD processors. That means there is a lag sometime in our production release support for new processors.

> In other words, they claim that they are optimizing for specific processor models rather than for specific instruction sets. If true, this gives Intel an argument for not supporting AMD processors properly. But it also means that all software developers who use an Intel compiler have to recompile their code and distribute new versions to their customers every time a new Intel processor appears on the market. Now, this was three years ago. What happens if I try to run a program compiled with an old version of Intel's compiler on the newest Intel processors? You guessed it: It still runs the optimal code path. But the reason is more difficult to guess: Intel have manipulated the CPUID family numbers on new processors in such a way that they appear as known models to older Intel software. I have described the technical details elsewhere.


Parent said OpenBLAS and Blis, not MKL, are optimized for AMD.


I feel like at this point, if you use an Intel library or compiler, you should know it's Intel-only. If you aren't using it in a controlled environment, stick to clang/gcc.

I can’t really blame them. Why support your competitor?


does this remain true on the zen2 cpus which finally do avx properly?


Intel is famous for checking the CPUID vendor string ("GenuineIntel") instead of CPU feature bits, and taking a generic and slow code path if the vendor isn't Intel.


I think most server operators look at overall performance. Once you start buying hardware specifically for one purpose you're cornering yourself.

Besides, who spends $15,000 on a mid-high end server to run single threaded applications anyway?


I happen to know of several companies doing physics problems that scale poorly across cores that spend far north of that, usually building out small clusters. Then you run 100s of independent simulations since each individual one doesn't really scale.


You seem to be saying that both

a) Single instance of application doesn't scale over multiple cores, and

b) Multiple instances of application scales well over multiple independent servers

Can you explain why they are unable to efficiently run multiple instances of the application on the same CPU (with multiple cores)?

The only thing I could think of would be running up against IO/Memory bandwidth limits.


They can, what I'm saying is that a single application doesn't scale well over multiple cores. Multiple instances on a single cpu generally works fine, but the biggest impact on performance is per core speed.

Edit: I was really just responding to "who spends $15,000 on a mid-high end server to run single threaded applications anyway?". I would absolutely consider this a "single threaded application".


What are the physics problems?


Fluid flow and most particle simulations with a large number of particles. The limiting factor is the inter particle interactions, so all the calculations have to feed back into each other.


Both of those problems are well worn and can scale to as many cores as we can put in a single computer.

Whether it is a navier-stokes grid/image fluid simulation, arbitrary points in space that work off of nearest neighbors or a combination of both (by rasterizing into a grid and using that to move the particles), there are many straightforward ways to use lots of CPUs.

Fork-join parallelism is a start. Sorting particles into a kd-tree is done by recursively partitioning, and the partitions can be distributed among cores. The sorting structure can be read but not written by as many cores as you want, and thus neighbors can be searched and found by all cores at once.


simulations that don't scale, that do scale after all.


If you spawn 100 independent instances, it's not really the problem itself scaling. The point is that given a single set of operating conditions you won't see any meaningful gains going from 2 to 100 cores. Using idle resources for other simulations doesn't make the problem itself scale.


I would say most apps don't benefit from multiple cores. So single threaded performance is still important.


I hear a lot about AVX-512 being really good.

Is there any software that's commonly used that has a measurable performance boost with it? Or is it more specialised stuff?


> I hear a lot about AVX-512 being really good.

It's a great instruction set. Absolutely great: AVX512 supports gather/scatter, a whole slew of efficient processing instructions, etc. etc.

However, AVX512 has poor implementations right now. Skylake-X is one of the only implementations, and running it drops the clock-rate in ways that are difficult to predict. (One core running AVX512 drops the clock of other cores, slowing down the throughput of the entire server).

Traditionally, the first implementation of these instruction sets is always somewhat "emulated". For example, the gather/scatter instructions aren't much faster than scalar loads/stores in practice.

So while the AVX512 instruction set could theoretically be efficiently implemented, it seems like Skylake-X's implementation leaves much to be desired. Hopefully future implementations will be better.

------

The other major implementation of AVX512 is Xeon Phi, which has been deemed end-of-life. I like the idea of Xeon Phi, but it just didn't seem to work out in practice.


the clock rate has nothing to do with a bad implementation, all that computation just makes a lot of heat. the performance increase is still massive, often more than 2x avx2, which also throttles btw


The clock rate issue isn't the fact that it downclocks a core when moving to AVX512, it's that it downclocked all the other cores on the processor at the same time.


From what I have read, AVX512 only affects the one core (downclocking to license level L1 or L2); it is older CPUs with AVX2 that affected all cores.

Independently thermal throttling can occur which would affect all cores, although presumably the CPU is generating heat per numeric operation, so AVX512 is neutral versus other instructions per numeric operation.


on intel cpus the license levels basically act as discrete thermal throttles, vs amd which doesn't do that and just constantly monitors thermals and adjusts the clock.

intels method makes benchmarking simpler! but may leave performance on the table.


i understand why downclocks are an issue, and i understand that on some intel cpus the whole chip downclocks with certain instructions. i was commenting on the supposed reason the downclocks happen, and mentioning that performance is spectacular despite them (assuming you schedule your workload appropriately).


One challenge with AVX-512 is that it can actually _slow down_ your code. It's so power hungry that if you're using it on more than one core it almost immediately incurs significant throttling. Now, if everything you're doing is 512 bits at a time, you're still winning. But if you're interleaving scalar and vector arithmetic, the drop in clock speeds could slow down the scalar code quite substantially.

https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...


Also see https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us...

* The processor does not immediately downclock when encountering heavy AVX512 instructions: it will first execute these instructions with reduced performance (say 4x slower), and only when there are many of them will the processor change its frequency. Light 512-bit instructions move the core to a slightly lower clock.

* Downclocking is per core, and lasts for a short time after you have used the particular instructions (e.g., ~2ms).

* The downclocking of a core is based on: the current license level of that core, and the total number of active cores on the same CPU socket (irrespective of the license level of the other cores).


The latest kernel has an API for knowing at runtime if AVX-512 causes throttling, allowing you to dynamically disable it when it decreases performance.


How fast does the CPU step up and down the throttling caused by AVX-512?

Basically, if you are interleaving like you suggest, does the processor detect this and reduce the throttling by the "duty cycle" of 512-bit operations?

If not, could there be a way to tell the CPU to do this?


>How fast does the CPU step up and down the throttling caused by AVX-512?

It's actually really, really slow. On newer CPUs (I think around Skylake-X, which is when AVX-512 was introduced) it takes up to 500 microseconds, i.e. millions of cycles, to activate AVX-512. This can't really be made faster because they actually need to give the voltage regulators time to adjust or the chip literally browns out. During this time AVX-512 instructions execute on the AVX-256 datapath¹.

Once AVX-512 is activated the clock of that core is reduced by about 25% and it starts a 2ms timer which is reset whenever another AVX-512 instruction is issued. AFAIK Intel doesn't say how long it takes to raise the frequency again once the timer expires.

(This is something of a simplification because there are actually two AVX power licenses, the first allowing AVX-256 and a limited set² of AVX-512 instructions, reducing clock by about 15%, and the second allowing everything. Also, executing a single AVX-512 instruction doesn't immediately request a higher power license, you have to execute a certain number of them.)

This is actually the better version. On Haswell executing any AVX-256 instruction would reduce the frequency of every core by about 15-20%. But hey, at least it only takes about 150k cycles to activate (not much of a consolation, I know). Beats me how long it stays throttled for.

(I don't know what exactly Broadwell did. I don't think it throttled all cores, but it didn't have the additional power license with reduced throttling that Skylake has.)

¹ Or the 128-bit datapath if the core is at the lowest power license (which still lets you use 128-bit SSE instructions, and basic AVX-256 instructions).

² Basically anything that doesn't execute on the floating-point unit, which means no floating point and no integer multiplication (which uses the FPU). This is actually kinda the saving grace of the whole thing, since it means you can vectorize things like memcmp and strlen without requiring the highest power license.


From what I've read on AVX-512, the big disadvantage is that the AVX-512 instructions are very power intensive, so the max clock speed is reduced if you heavily use them.

Another disadvantage is that you have to recompile your code to use AVX512, but it seems general enough that compilers will use the instructions (to an extent) without specialized code [1].

[1] https://www.phoronix.com/scan.php?page=news_item&px=GCC-8-AV...


The problem appears to be that whether you get a performance increase or a performance penalty depends on the duty cycle of your AVX512 instructions.

What is especially deceiving is profiling a function in a loop for more than say 50ms, when the normal function execution takes say 0.5ms. Long running functions get the most gain, while short running functions cause the most pain.

That is because AVX512 downclocking lasts 2ms (with a 0.5ms setup). Certain instruction mixes will cause a general slowdown (10% degradation measured by Cloudflare under actual usage) even though test profiling might predict a performance gain. Single AVX512 instructions issued while the CPU is running at full speed have a counterintuitive, perverse performance penalty - apparently running 4x slower than when the CPU has changed to the slower L1 or L2 clocks.

Sustained AVX512 usage has predictable performance.

“ Intel made more aggressive use of AVX-512 instructions in earlier versions of the icc compiler, but has since removed most use unless the user asks for it with a special command line option.” is a strong indication that you need to be very careful about where you use the instructions.

Running an encoder for 1 second - likely candidate. Occasional 1ms functions or single AVX512 instructions on a web server - likely penalty.


I've used it once, but on Intel the clock is throttled with AVX-512, so the overall program performance improved only 1% because the non-SIMD code was running at a slower clock.


Newer versions of the x265 encoder for h.265 / HEVC video standard get significant speed-ups. I believe Handbrake was recently updated for it.


I think RPCS3 also uses it in the JIT



