GPUs mixing graphics and compute

Comments

  • Quizzical Member Legendary Posts: 25,519
    edited September 2015
    It's important to distinguish between:
    1)  This particular implementation of an algorithm doesn't run well on Nvidia hardware, and
    2)  This particular algorithm cannot be made to run well on Nvidia hardware.

    The former is going to happen all the time, as code optimized for one architecture isn't necessarily good code for another.  That's less of a problem with graphics where what you do is more standardized and predictable, but far more of a problem if you move away from graphics and into compute tasks that the GPU vendors didn't anticipate.

    The latter will happen a lot, too, with code that just doesn't make sense to run on GPUs--whether AMD, Nvidia, or anything else.  (For example, anything single-threaded.)  And there are certainly some algorithms that run far better on one GPU vendor than the other.  If you're leaning heavily enough on local memory, for example, even an aging Radeon HD 7970 will have a pretty good shot at beating a GeForce GTX Titan X outright, and a Radeon R9 Fury X will be well over double that.
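    To make the local memory point concrete, here's a rough sketch in OpenCL of the sort of kernel I mean; the kernel and buffer names are made up for illustration.  Each work-group stages data in local memory (LDS on GCN, shared memory on Nvidia) and works on it there, with barriers in between, so local memory throughput is what you end up leaning on:

        // Minimal sketch: a per-work-group sum reduction that leans on local memory.
        // Assumes the work-group size is a power of two.
        __kernel void group_sum(__global const float *in,
                                __global float *out,
                                __local float *scratch)
        {
            size_t gid = get_global_id(0);
            size_t lid = get_local_id(0);

            scratch[lid] = in[gid];            // stage one element per work-item
            barrier(CLK_LOCAL_MEM_FENCE);

            // Tree reduction done entirely out of local memory.
            for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
                if (lid < stride)
                    scratch[lid] += scratch[lid + stride];
                barrier(CLK_LOCAL_MEM_FENCE);
            }

            if (lid == 0)
                out[get_group_id(0)] = scratch[0];   // one partial sum per group
        }

    The host hands the kernel its local scratch space by passing a size and a NULL pointer to clSetKernelArg for the third argument.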

    But a lot of times (1) happens without it being a case of (2).  For example, let's consider integer multiplication.  On Kepler, only 1/6 of the shaders can do integer multiplication.  If you need all integer multiplication all the time, Kepler is going to choke.  But if you only need a little bit of it here and there, Kepler will handle it well.

    Maxwell, meanwhile, doesn't have an integer multiplication operation at all.  Rather, it has to chain together multiple operations to do a simple integer multiply.  But if you need all integer multiplication all the time, all the shaders can do it, so Maxwell will beat out Kepler.

    Meanwhile, if you want 32-bit integer multiplication on GCN, that takes four passes, so essentially a chain of four operations.  That's going to lose badly to Nvidia.  Unless, that is, you happen to know that you only need 24-bit integer multiplication and not 32-bit.  You probably don't have a 24-bit data type, but if you happen to know that the high eight bits of your 32-bit integers are all zero, you can use mul24.  In that case, AMD has that as a full-speed operation in all of the shaders, and so GCN handily destroys both Kepler and Maxwell.

    But if what you really needed fit 24-bit integer multiplication and you ask for 32-bit instead, you make your multiplication slower on GCN by a factor of four.  There's no reason to use a mul24 operation on Nvidia GPUs, as that doesn't map to anything they have in silicon.  So if you take Nvidia-optimized code that is integer multiplication heavy and could have fit mul24 but doesn't ask for it and run it on AMD, it looks like Nvidia wins by a huge margin.  That flips around into a huge win for AMD as soon as you fix the code to use mul24.
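    In OpenCL terms, the difference is literally one built-in function.  This is just a toy sketch, but it shows the shape of it; the only thing you have to guarantee on the host side is that both operands really do fit in 24 bits:

        __kernel void mul_demo(__global const uint *a,
                               __global const uint *b,
                               __global uint *out)
        {
            size_t i = get_global_id(0);

            // 32-bit multiply: roughly a four-operation sequence on GCN.
            // out[i] = a[i] * b[i];

            // 24-bit multiply: full speed on GCN, assuming the host has
            // guaranteed that a[i] and b[i] are both below 2^24.
            out[i] = mul24(a[i], b[i]);
        }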

    This doesn't necessarily mean that you have to write completely independent code for every architecture, or even every GPU vendor.  A few #ifdef statements in sections where you know that different architectures strongly prefer that an algorithm be implemented in different ways can often suffice.
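    Continuing the multiplication example, it can look something like this; the macro name and build option are just whatever I'd pick, not anything standard:

        // Pick the multiply strategy at kernel build time.
        #ifdef USE_MUL24
        #define IMUL(a, b) mul24((a), (b))     /* AMD-friendly 24-bit path */
        #else
        #define IMUL(a, b) ((a) * (b))         /* plain 32-bit multiply */
        #endif

        __kernel void mix_step(__global uint *data)
        {
            size_t i = get_global_id(0);
            uint x = data[i] & 0x00FFFFFFu;             // operand kept within 24 bits
            data[i] = IMUL(x, 0x9E3779u) & 0x00FFFFFFu; // multiplier also fits in 24 bits
        }

    The host then passes "-DUSE_MUL24" in the clBuildProgram options when compiling for hardware where it helps and leaves it off otherwise, so there's still only one copy of the actual logic.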

    You might ask, why do different architectures handle integer multiplication so differently?  Because it's not used much for graphics, so the GPU vendors aren't saying, "we have to make this fast or else games won't perform well."  If you need floating-point multiplication rather than integer, for example, all shaders in all remotely modern GPU architectures can do 32-bit floating-point FMA as a full-speed operation.  If something isn't useful for graphics, whether GPU vendors will decide to put it in for compute purposes is less predictable.
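    For comparison, this is what the floating-point case looks like; kernel name is made up, but every shader on every remotely modern GPU will happily chew through one fused multiply-add like this per clock:

        // Sketch: y = a*x + y, one fused multiply-add per element.
        __kernel void axpy(const float a,
                           __global const float *x,
                           __global float *y)
        {
            size_t i = get_global_id(0);
            y[i] = fma(a, x[i], y[i]);
        }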