A graphics engine can be broken into three parts:
1) the CPU does some stuff,
2) the GPU does some stuff, and
3) the CPU passes stuff along to the GPU to tell it what to do.
When AMD Mantle was announced last year, the promise was that it would greatly reduce the overhead of (3). AMD later announced that they were going to port it to OpenGL. Microsoft now says that it's coming to DirectX 12. So naturally, that leaves it to Nvidia to explain what all the fuss is about:
http://www.slideshare.net/CassEveritt/beyond-porting
If you want to read the slide deck yourself, start at slide 23. Before that, it talks about how to fix a problem that basically amounts to "the programmer is an idiot" and isn't about new API capabilities. But if you've never used DirectX or OpenGL, it will probably be hard to follow, so I'd like to explain it in terms that someone who has done more traditional CPU programming (e.g., in C++ or Java) may be able to follow. If you're already familiar with DirectX or OpenGL, you may wish to skip to the ***** line below.
GPUs have a tremendous amount of computational power, but it has to be harnessed in certain ways. Accessing video memory is "expensive", in that if more than a tiny fraction of the data you need has to come from video memory, the game will choke for lack of video memory bandwidth.
A GPU is a "state machine". Various OpenGL commands can change its state. Once everything is in the state that you want it to be in, you send a draw command, and it draws stuff assuming the current state. There are a lot of options that can be changed that I won't go into; some things are set once at start-up and forever left alone.
Changing state is expensive, however. So one way to reduce CPU overhead is to not change state more often than you have to. But sometimes you do have to change some things; if you change nothing between one draw call and the next, then the second draw call will draw exactly the same thing in exactly the same place as the previous one. I'll give examples of what you change later.
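As a bare-bones sketch of that pattern (this is not code from my game; the uniform name, the attribute setup it assumes, and the 4-vertex draw are made up for illustration): set some state, draw, change only what needs to change, and draw again.

// Minimal sketch of the "change state, then draw" pattern, assuming a JOGL
// GL3 context, a linked program with a vec3 uniform named "moveVector",
// and a vertex array object that is already bound.
import com.jogamp.opengl.GL3;

public class StateChangeSketch {
    public static void drawTwoCopies(GL3 gl, int program, int moveVectorUnif) {
        gl.glUseProgram(program);                       // state: which shaders to run
        gl.glUniform3fv(moveVectorUnif, 1, new float[]{0f, 0f, 0f}, 0);
        gl.glDrawArrays(GL3.GL_TRIANGLE_STRIP, 0, 4);   // draw with the current state

        // If we drew again right now, we'd get an identical copy in the same
        // place.  Change one piece of state (the object's position) first:
        gl.glUniform3fv(moveVectorUnif, 1, new float[]{2f, 0f, 0f}, 0);
        gl.glDrawArrays(GL3.GL_TRIANGLE_STRIP, 0, 4);   // same geometry, new place
    }
}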
There is a standard graphics pipeline, which I'll describe here briefly. I'm going to skip the tessellation stages for simplicity, as they're not relevant to this thread and are always optional. Apart from that, there are three programmable shader stages: vertex shaders, geometry shaders, and fragment shaders. DirectX uses the same names, except "pixel" rather than "fragment" shaders. For simplicity, I'm going to assume that all primitives are triangles, even though OpenGL can also deal with points and lines--and geometry shaders can readily take in primitives of one dimension and output primitives of another.
Vertex shaders take in one vertex of your model at a time, do something or other to process it, and then give output to geometry shaders.
Geometry shaders take in one triangle at a time, and read in the output from each of the vertices of the triangle. They can then do something or other and output arbitrarily many (potentially zero!) vertices with some simple instructions on which sets of three consecutive vertices form a triangle.
Next comes rasterization, which is a fixed-function processing stage. Rasterization takes in a triangle, figures out where it would appear on the screen, and breaks it into fragments with one fragment for each pixel that the triangle would cover on the screen. A single triangle can easily produce hundreds of thousands of fragments or none at all. It gets data for each fragment by interpolating the data from the three vertices of the triangle.
Next comes the fragment shader. A fragment shader is run once for each fragment produced by rasterization, and takes in the data for its specified pixel. It does something or other and computes the color that that pixel should be, as well as the depth. (There is a "default" depth computed in rasterization which is often used here.)
After this comes some fixed-function stuff in which the GPU decides whether to change the color of the pixel based on the output of the fragment shader.
Vertex shaders, geometry shaders, and fragment shaders all had a "do something or other" portion. These are the programmable shader stages, and the engine programmer is responsible for writing a program (called a "shader") that tells that stage exactly what it is supposed to do. The programmer can then link a shader of each type into a "program".
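If you've never seen it, here's a rough sketch of what building such a program looks like in JOGL (the GLSL sources are toy examples of my own, not from any real game; I've skipped the geometry shader, which OpenGL lets you omit, and all error checking):

// Sketch of building a GLSL "program" in JOGL: one vertex shader plus one
// fragment shader, compiled and linked.
import com.jogamp.opengl.GL3;

public class ProgramSketch {
    // Toy vertex shader: pass the position through and hand a color to the
    // next stage.  "inPos" and "inColor" come from the vertex data.
    static final String VERT =
        "#version 150\n" +
        "in vec3 inPos;\n" +
        "in vec3 inColor;\n" +
        "out vec3 color;\n" +
        "void main() {\n" +
        "    color = inColor;\n" +
        "    gl_Position = vec4(inPos, 1.0);\n" +
        "}\n";

    // Toy fragment shader: the rasterizer interpolates "color" across the
    // triangle, and we simply output it as the pixel's color.
    static final String FRAG =
        "#version 150\n" +
        "in vec3 color;\n" +
        "out vec4 fragColor;\n" +
        "void main() {\n" +
        "    fragColor = vec4(color, 1.0);\n" +
        "}\n";

    public static int buildProgram(GL3 gl) {
        int vs = compile(gl, GL3.GL_VERTEX_SHADER, VERT);
        int fs = compile(gl, GL3.GL_FRAGMENT_SHADER, FRAG);
        int program = gl.glCreateProgram();
        gl.glAttachShader(program, vs);
        gl.glAttachShader(program, fs);
        gl.glLinkProgram(program);
        return program;
    }

    static int compile(GL3 gl, int type, String src) {
        int shader = gl.glCreateShader(type);
        gl.glShaderSource(shader, 1, new String[]{src}, new int[]{src.length()}, 0);
        gl.glCompileShader(shader);
        return shader;
    }
}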
But it is obviously unreasonable for a programmer to write a completely independent program for every single thing that he wants in his game. Rather, he writes the program to be able to take some options, with various ways to feed data into it. A shader can declare variables, write to them, and later read from them just like typical CPU-side programming languages.
Unlike typical CPU programming languages, however, the data that a shader can see is very restricted. Separate invocations of the same shader stage cannot see each other, as they have to be able to run concurrently. Data that is simultaneously visible to multiple shader stages or multiple invocations of the same shader stage has to be read-only, so that one shader's computations cannot depend on another that is running concurrently. These are both essential restrictions in order to make it possible to exploit the parallelism that a GPU offers; without them, if one invocation of a shader modified data, there could be no guarantees that another invocation of the same shader saw it before or after modification.
There are three types of data that a shader is able to read externally: vertex data, uniform data, and texture data. Vertex data is the input to vertex shaders. A "vertex" is an arbitrary collection of data, probably containing a position, likely a normal vector and texture coordinates, and perhaps other things. The programmer can decide upon the precise meaning of his vertex data, but has to tell the vertex shader how to read, for example, a sequence of 8 floats into the variables that he will use.
Different vertices in the same object must have the same amount and types of vertex data, but the data itself can be different. For example, one vertex can have position (1, 0, 0), while another has position (0, 1, 0), and so forth. This is what you rely on to make sure that different parts of a model are different right from the start. They'll start with different input values, then perform the same computations on all of them to get different output values. A model can have arbitrarily many vertices--potentially many thousands in a single draw call.
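As a rough sketch of what "telling the vertex shader how to read 8 floats" looks like (the attribute locations and the 3+3+2 split are just an example, not anything mandated):

// Sketch of describing a vertex layout of 8 floats per vertex to OpenGL:
// position (3), normal (3), texture coordinates (2).  Assumes the vertex
// data has already been copied into a buffer object bound to GL_ARRAY_BUFFER
// and that a vertex array object is bound to record this layout.
import com.jogamp.opengl.GL3;

public class VertexLayoutSketch {
    static final int FLOAT_BYTES = 4;
    static final int STRIDE = 8 * FLOAT_BYTES;   // 8 floats per vertex

    public static void describeLayout(GL3 gl, int posAttrib, int normAttrib, int texAttrib) {
        gl.glEnableVertexAttribArray(posAttrib);
        gl.glVertexAttribPointer(posAttrib, 3, GL3.GL_FLOAT, false, STRIDE, 0);

        gl.glEnableVertexAttribArray(normAttrib);
        gl.glVertexAttribPointer(normAttrib, 3, GL3.GL_FLOAT, false, STRIDE, 3 * FLOAT_BYTES);

        gl.glEnableVertexAttribArray(texAttrib);
        gl.glVertexAttribPointer(texAttrib, 2, GL3.GL_FLOAT, false, STRIDE, 6 * FLOAT_BYTES);
    }
}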
Later programmable shader stages do not get to see the initial vertex data. Rather, they see the output of the previous pipeline stage, much as a vertex shader sees the data for its particular vertex.
Uniform data is for very small amounts of data that are likely to need to be accessed frequently. In OpenGL, uniforms are scalars, vectors, or matrices of size at most 4x4--and usually smaller. Uniform values have to be set from the CPU side (part 3 of a graphics engine above), and are then visible, but read-only, to all invocations of all shader stages in the program. Because uniforms are very small, they can fit inside GPU cache and be accessed frequently without having to touch video memory.
One simple example of why you would want to use a uniform is to describe the position of an object as a whole relative to the camera. If an object isn't moving in the game world but the camera does move, you don't have to recompute all of your vertex data every single frame; you can just recompute the position of the object relative to the camera and change that one uniform. Vertex shaders can add this to the per-vertex positions given in the vertex data to get the position of each vertex relative to the camera--and it's vastly faster to do these computations on the GPU than on the CPU.
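As a rough sketch of that idea (the uniform name is made up; this is not the code from my game): the CPU does one small subtraction per object per frame, and the GPU does the per-vertex additions.

// Sketch of the camera-offset idea: the CPU computes one vec3 per frame
// (object position minus camera position) and hands it to the GPU as a
// uniform; the vertex shader then adds it to every vertex.
import com.jogamp.opengl.GL3;

public class CameraUniformSketch {
    public static void setPerFrameUniform(GL3 gl, int program,
                                          float[] objectPos, float[] cameraPos) {
        float[] objToCamera = {
            objectPos[0] - cameraPos[0],
            objectPos[1] - cameraPos[1],
            objectPos[2] - cameraPos[2]
        };
        gl.glUseProgram(program);
        int loc = gl.glGetUniformLocation(program, "objToCamera");
        gl.glUniform3fv(loc, 1, objToCamera, 0);
        // In the vertex shader, something like:
        //     uniform vec3 objToCamera;
        //     in vec3 inPos;               // per-vertex position within the model
        //     ... gl_Position uses (inPos + objToCamera) ...
        // runs on the GPU for every vertex, so the per-vertex additions never
        // touch the CPU.
    }
}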
Texture data is for when you need large amounts of data to be visible to shaders. You can think of a texture as a typically large, multi-dimensional array (usually 2D, but it can be 1D or 3D if desired) that accepts fractional indices. In a typical CPU programming language, if myArray is an array, myArray[7.2] is invalid. A GPU's texture unit will know to interpret that as either myArray[7] or 0.8 * myArray[7] + 0.2 * myArray[8], depending on your sampler settings. Actually, mipmaps make it more complicated than that, but let's not go there. With non-integer array indices already allowed, textures also normalize the domain to [0, 1), so that 0.5 is always in the middle, but that is immaterial to today's discussion.
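For the curious, the choice between those two interpretations is a one-line sampler setting. Here's a rough sketch, for a texture already created and bound to GL_TEXTURE_2D:

// Sketch of the sampler-setting choice described above.  GL_NEAREST
// corresponds to "just use myArray[7]"; GL_LINEAR corresponds to the
// weighted average of the two (or, in 2D, four) nearest texels.  The
// minification filter (not shown) is where mipmaps come into play.
import com.jogamp.opengl.GL3;

public class SamplerSketch {
    public static void useNearest(GL3 gl) {
        gl.glTexParameteri(GL3.GL_TEXTURE_2D, GL3.GL_TEXTURE_MAG_FILTER, GL3.GL_NEAREST);
    }

    public static void useLinear(GL3 gl) {
        gl.glTexParameteri(GL3.GL_TEXTURE_2D, GL3.GL_TEXTURE_MAG_FILTER, GL3.GL_LINEAR);
    }
}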
Like uniforms, textures have to be set by the CPU in order for a program to know which texture to use. Like vertex data and unlike uniforms, textures are often buffered, where a program will save a bunch of textures in video memory and then tell a draw call which particular texture to use. Passing a ton of data from the CPU to the GPU is expensive. Copying a bunch of textures over once and then subsequently saying "use that one" is much faster, though it still carries a considerable performance hit.
Textures are commonly used for the "pictures" that get stretched across objects within a game world, though this is not their only use. Textures are often too big for GPU cache, so they often have to be accessed from video memory. Still, video memory is there for a reason, and doing a bunch of computations together with a single texture call to get the final color for a pixel is often an appropriate use of them.
*****************************************************************************
The CPU overhead from part (3) of a game engine comes from changing states. Let me show an example of what code from a graphics engine can look like. Anything that starts with gla.(something) is an OpenGL call of some sort. I've in-lined some stuff from my game for simplicity (as I've actually implemented it, there are five subroutines involved), but here's a snippet of code that you should skip past for now and possibly glance back at later:
if (notEmpty[28]) {
    gla.glUseProgram(programs[28]);
    int bindNeeded = texProgram[28];
    if (texActive != bindNeeded && bindNeeded >= 0) {
        gla.glActiveTexture(GL3.GL_TEXTURE0 + bindNeeded);
        texActive = bindNeeded;
    }
    if (frameUpdate[28]) {
        // Per-frame uniforms: set at most once per frame, not once per draw call.
        gla.glUniform3fv(lightSourceUnif[28], 1, lightSourceBuffer);
        if (isometric && isoSpecUsed[28]) {
            gla.glUniform3fv(isoSpecUnif[28], 1, specularBuffer);
        }
        frameUpdate[28] = false;
        if (camMatrixUpdate[28]) {
            gla.glUniformMatrix3fv(camMatUnif[28], 1, false, camMatrixBuffer);
            camMatrixUpdate[28] = false;
        }
        if (persVectorUpdate[28]) {
            // Only changes when the player resizes the game window.
            gla.glUniform4fv(persVectorUnif[28], 1, persVectorBuffer);
            if (OGL4 && isometric) {
                gla.glUniform1f(tessUnif[28], curIsoTess);
            }
            persVectorUpdate[28] = false;
        }
    }
    if (firstPlane != vaoActive) {
        gla.glBindVertexArray(vao[firstPlane]);
        vaoActive = firstPlane;
    }
    for (RenderInputSurface theSurface : postSortQueue.take()) {
        // Per-draw state changes: this loop body runs once per tree branch.
        if (theSurface.texID != texCurrent[0]) {
            gla.glBindTexture(GL3.GL_TEXTURE_2D, textureList.get(theSurface.texID));
            texCurrent[0] = theSurface.texID;
        }
        gla.glUniform3fv(shinyUnif[28], 1, theSurface.shininess);
        gla.glUniform2fv(axesUnif[28], 1, theSurface.axes);
        gla.glUniform3fv(moveVecUnif[28], 1, theSurface.moveVector);
        gla.glUniformMatrix3fv(objMatUnif[28], 1, false, theSurface.rotMatrix);
        gla.glDrawArrays(GL3.GL_TRIANGLE_STRIP, 0, 4);
    }
}
*****************************************************************************
In case you've looked at the code more than I recommended, 28 is my ID number for the program.
The code above is the code to draw all of the tree branches in a frame--of which there could easily be several dozen. The most important part is the line that starts with "for": that's a standard Java for-each loop. The code above it will run at most once--and there are a lot of conditional statements to try to make it run zero times, if possible. For example, persVectorUpdate is nearly always false, and is only set to true if the player resizes the game window. But the code inside the for-each loop could run dozens or hundreds of times in a single frame.
Neither AMD nor Nvidia nor Microsoft nor anyone else has major ambitions about reducing the CPU overhead of the stuff outside of the for-each loop. That only runs once in a while, anyway. glUseProgram is slow (I time it at about 2 microseconds on a Core i5-750), but switching the active vertex array object and potentially setting up to five uniforms once per frame is no big deal.
But look at what's in the for-each loop: choosing a texture (glBindTexture), setting four uniforms (glUniform-anything), and then issuing a draw call (glDrawArrays). It tries to avoid the call to glBindTexture as much as possible; the tree branches are already sorted by texture, hence the queue's name, postSortQueue (which, in case you know Java and are trying to parse the code, is a LinkedBlockingQueue of arrays of RenderInputSurfaces, where the latter is a struct-like class of my own that holds the data needed to draw something).
This is where most of the CPU overhead is, and this is what everyone is trying to get rid of. But note that at the very start of the for-each loop, I already know exactly what data I'm going to pass along to the video card. The only computations done are to determine whether I have to change the texture or can re-use the previous texture.
So someone came up with the idea of saying, wouldn't it be great if, rather than having to do a state change and resynchronize the CPU with the GPU a whole bunch of times, I could load all of my stuff into big arrays CPU-side and pass it all to the GPU at once? That could reduce the CPU driver overhead for this part of the work by more than 90% in some cases. That's one of the main selling points of AMD Mantle, DirectX 12, and whatever OpenGL version will get the last of these optimizations in it.
For uniforms, that's very easy to do. It's very easy to write a subroutine to take an arbitrary collection of arrays of some fixed size and pack all of the data into a single, larger array. It's also a lot faster to do that CPU-side than for video drivers to have to worry about synchronizing things with the GPU.
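Here's a rough sketch of that packing step (not code from my game; the array uniform and the idea of each draw picking its entry by index are just for illustration): gather all of the per-draw vec3s into one float array CPU-side and hand them over with a single glUniform3fv call, assuming the shader declares a big enough array uniform.

// Sketch of packing per-draw uniform data CPU-side and uploading it in one
// call.  Assumes the shader declares something like
//     uniform vec3 moveVectors[256];
// that the number of draws stays within that size, and that each draw is
// told (by some per-draw index) which entry is "its" data.
import com.jogamp.opengl.GL3;
import java.util.List;

public class UniformPackingSketch {
    public static void uploadAllMoveVectors(GL3 gl, int moveVectorsUnif,
                                            List<float[]> perDrawMoveVectors) {
        int n = perDrawMoveVectors.size();
        float[] packed = new float[3 * n];
        for (int i = 0; i < n; i++) {
            float[] v = perDrawMoveVectors.get(i);   // one vec3 per draw call
            packed[3 * i]     = v[0];
            packed[3 * i + 1] = v[1];
            packed[3 * i + 2] = v[2];
        }
        // One driver call instead of n of them.
        gl.glUniform3fv(moveVectorsUnif, n, packed, 0);
    }
}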
For draw calls, it's again easy. In this particular case, all of my draw calls are exactly the same. That's not so for some other programs, for a variety of reasons, but it's still easy to make a list of what all of the draw calls would be.
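OpenGL has actually had a simple form of this for a long time in glMultiDrawArrays, which takes the list of what each individual glDrawArrays call would have been; the techniques in the slide deck go further (e.g., letting the GPU read that list out of a buffer itself), but here's a rough sketch of the basic shape, with made-up array contents:

// Sketch of replacing a loop of identical glDrawArrays calls with one
// glMultiDrawArrays call.  "first" and "count" list where each draw starts
// in the currently bound vertex arrays and how many vertices it uses.
import com.jogamp.opengl.GL3;
import java.nio.IntBuffer;

public class MultiDrawSketch {
    public static void drawManyQuads(GL3 gl, int numSurfaces) {
        int[] first = new int[numSurfaces];
        int[] count = new int[numSurfaces];
        for (int i = 0; i < numSurfaces; i++) {
            first[i] = 4 * i;   // each surface is 4 consecutive vertices...
            count[i] = 4;       // ...drawn as a triangle strip
        }
        gl.glMultiDrawArrays(GL3.GL_TRIANGLE_STRIP,
                             IntBuffer.wrap(first), IntBuffer.wrap(count), numSurfaces);
    }
}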
That leaves textures. The solution, at least as explained in the slide deck linked above, is to make an array of textures that are all the same size, and then make an array of references into that texture array, saying which texture to use on which draw call. This allows the GPU to set things up for a particular texture size just once; when switching to the next object, it only needs a different video memory pointer for where the next texture starts.
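Here's a rough sketch of that part (the sizes, format, and layer count are made up for illustration, and this uses plain GL3-era array textures rather than whatever the newest extensions add on top):

// Sketch of an array texture: many same-sized textures stored as "layers"
// of one GL_TEXTURE_2D_ARRAY, so individual draws select a layer index
// instead of binding a different texture object.  Mipmap levels beyond the
// base level are omitted for brevity.
import com.jogamp.opengl.GL3;
import java.nio.ByteBuffer;
import java.util.List;

public class TextureArraySketch {
    public static int buildTextureArray(GL3 gl, int width, int height,
                                        List<ByteBuffer> rgbaImages) {
        int[] tex = new int[1];
        gl.glGenTextures(1, tex, 0);
        gl.glBindTexture(GL3.GL_TEXTURE_2D_ARRAY, tex[0]);

        // Allocate storage for all layers at once (level 0 only).
        int layers = rgbaImages.size();
        gl.glTexImage3D(GL3.GL_TEXTURE_2D_ARRAY, 0, GL3.GL_RGBA8,
                        width, height, layers, 0,
                        GL3.GL_RGBA, GL3.GL_UNSIGNED_BYTE, null);

        // Copy each same-sized texture into its own layer.
        for (int layer = 0; layer < layers; layer++) {
            gl.glTexSubImage3D(GL3.GL_TEXTURE_2D_ARRAY, 0,
                               0, 0, layer, width, height, 1,
                               GL3.GL_RGBA, GL3.GL_UNSIGNED_BYTE,
                               rgbaImages.get(layer));
        }
        gl.glTexParameteri(GL3.GL_TEXTURE_2D_ARRAY, GL3.GL_TEXTURE_MIN_FILTER, GL3.GL_LINEAR);
        gl.glTexParameteri(GL3.GL_TEXTURE_2D_ARRAY, GL3.GL_TEXTURE_MAG_FILTER, GL3.GL_LINEAR);

        // In the shader, a sampler2DArray is sampled with a third coordinate
        // that picks the layer, so "which texture" becomes just a number.
        return tex[0];
    }
}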
Did you spot the catch in that last paragraph? Let me repeat it: an array of textures that are all the same size. See, textures can be big. And textures can be different sizes. Some objects are simply a lot bigger than others, and need much higher resolution textures because of it. Textures have mipmaps, and mipmaps can be different sizes. And that makes textures complicated.
This leads to the issue that you'd like to have one fixed array of textures of a given size, and stick all textures of that size in the same array. But how big should the array be? Make it too small and bad things happen. Make it too big and you waste massive amounts of memory. The promise is that you can use virtual memory, and specify which textures in the array are actually in video memory and which go in virtual memory--and the latter can be the empty slots so that you don't have to reallocate an array later. But how much virtual memory do you get? Is it okay to use 1 GB of virtual video memory? 10 GB? 100 GB? If you can use 100 GB of virtual video memory and nothing will care, then this isn't a problem. But is that the case?
Reducing the CPU overhead of video drivers by 90% is certainly an ambitious goal. And while it can surely be attained in some corner cases, and you could blow far beyond 90% in some demos, I'm not so sure that it's practical to make it into a typical result. Really, though, if a typical game programmer who wasn't willing to shoe-horn his game into doing mysterious things in search of the last drop of driver efficiency "only" reduced the CPU overhead of video drivers by 80%, would that be such a bad thing?
Comments
Umm, the funny part is that CPU overhead isn't really a problem... unless you're using an AMD CPU.
CPUs have outpaced every other bottleneck for the last 4-5 years, particularly Intel CPUs with their faster IPC. Sure, less overhead is better, but if that isn't your major bottleneck to begin with, you aren't adding a lot to the overall equation.
The "problem" isn't so much to do with CPU overhead, it's more to do with parallelization - which AMD would excel with greater core counts, but programming paradigms and tools haven't been able to really catch on to yet.
I don't mean to get into an AMD vs Intel argument here--but just take a look at the Mantle results... we see ~10% gains on high-end gaming rigs, and like 50-75% gains on low-end rigs using AMD APUs. That's great for the lower end of the spectrum, but you aren't exactly pushing the envelope, you're just raising the baseline.
Wait, so textures being textures (high-resolution ones especially), wouldn't a very large amount of video RAM on the GPU (as you've mentioned) enable the creation of a big enough array to accommodate all the necessary textures and allow full use of all the various ways to reduce the overhead?
And wouldn't that give an edge to GPUs with 8, 10, or more GB of RAM, compared to 3 or 4 GB? As in, developers could implement all the techniques for reducing driver overhead and, let's say hypothetically, achieve a 90% overhead reduction, but only on the GPU models with more RAM, thus giving the more expensive GPUs a pretty decent or even huge performance advantage over the cheaper GPUs with less VRAM?
Which would in turn drive better sales of the double-VRAM models we've seen the past few years, which don't sell as well--the 4GB 770/760, for instance, or the 6GB 780 Ti, etc.
Mantle gives as much benefit, and even more in some cases, with weaker Intel processors, especially 2-core ones, ...
The problem is that no matter how much video memory you give a video card, it's basically a trivial matter to make more and higher resolution textures to fill it up and run out. Check the installation size of any game you've got. Check how much video memory you have. Which is larger? Several years from now, you might have a card with more video memory. What do you think will happen to game installation sizes in the meantime?
For most games, by capacity, the game installation size is almost entirely stuff that gets buffered into video memory. The textures especially are going to be highly compressed on your hard drive and take a lot more space in video memory. Depending on particular methods, if textures can be modified CPU-side in a variety of ways before being passed to the GPU (e.g., armor dyes), then a single texture on your hard drive could need to correspond to many textures in video memory. Also, as a rule of thumb, you should add an extra 1/3 to the video memory capacity needed to account for mipmaps: each successive mipmap level is a quarter the size of the one before it, and 1/4 + 1/16 + 1/64 + ... = 1/3.
Not a problem to consumers of existing games doesn't mean the same thing as not a problem to people trying to code games. Games have to scale back their number of API calls from what they might like because, unlike some other graphical settings, if the problem is too much CPU overhead from too many graphics API calls, it means the game won't run smoothly at any settings, no matter what you pick.
Admittedly, this is arcane, insider stuff. That's why it took so long to give an explanation. For that matter, that's why so many months passed before I found anyone who tried to give a coherent explanation.
But to take a more traditional CPU programming analogy, suppose that humans had just invented object-oriented programming, but had implemented it really inefficiently. A rule of thumb came to be widely accepted that if you can write a procedural program to do something, and you can write an object-oriented program to do the same thing, the latter will only run 20% as fast.
So for anything that was remotely performance-sensitive, people wouldn't use object-oriented. People would use it for some things that are very complex but not performance-sensitive, and think it was nice for that. Kind of like how people use scripting languages for some things today.
Then people figured out how to implement object-oriented programming much better so that it typically ran about 95% as fast as procedural programming. Would that matter?
To people who aren't programmers, it would be hard to explain why it would be a big deal. After all, the commercially available software written in object-oriented programming languages seems to run just fine. But to those who are, there are a lot of situations where going object-oriented makes things a lot easier, but you just couldn't do it before because it would kill your performance.
This is kind of like that, though admittedly not nearly as big of a deal as object-oriented programming. Rather than having to do some convoluted optimizations to cram things into fewer API calls, you'll be able to spread things across more API calls, so that the code is structured more naturally, making it easier to write, easier to read, easier to understand, and easier to make tools for that your artists don't hate. Meanwhile, you batch many of the API calls so that it's equivalent to having twice as many API calls as before, but with half the performance hit. Surely that's a good thing all around.
Is it the sort of killer feature that you advertise on the box of your game? Of course not. But a decade from now, once even fairly obsolete hardware can handle this wave of new stuff quite well and game designers can assume that everyone has it, it will be very good that it happened. Except for those game engines still running DirectX 9.0c for their customers who are still running Windows XP and wondering why their system gets malware so easily.
You make a compelling argument, and end up coming around to pretty much exactly what I had just said.
It's not a bad thing, I agree. I just don't see it as earth-shattering, and I guess I'm just not as impressed or convinced of its ultimate importance in the grand scheme.
Pardon my ignorance, but I thought Mantle was aimed primarily at AMD's APU style chips where the GPU doesn't have its own ram soldered to it. Is the goal to improve everything, or to just push the APU style chips into mainstream gaming performance? Is this the kind of thing that would even be noticeable with Intel's Haswell chips, or the pretty standard CPU with a video card setup?
The goal of graphics APIs is always to improve things in any way they can.
Where I think this sort of thing could help most is in mobile devices. If you've got a tablet whose SoC has a TDP of 4 W (which Intel will call an SDP of 2 W or something stupid and meaningless like that), reducing CPU power consumption by 0.3 W by this sort of efficiency frees up power headroom for more performance elsewhere.
Mobile devices are also where you don't have high single-threaded performance to brute-force your way through problems. I don't know how much interest there is in 3D gaming on a tablet or cell phone (or perhaps rather, how much there would be if the games were there), but efficiency improvements like this will make it easier to bring such games to market.
That actually makes a lot of sense. I think that while mobiles are not a major 3D gaming market, they could be. As more people have phones that can play 3D games, even if they are just puzzles, more people will just expect it. For me, I was pretty happy with Deus Ex: The Fall and I'm looking forward to the next installment. If they can improve the performance on the same hardware, so much the better.
http://semiaccurate.com/2014/03/18/amd-eidos-launch-mantle-version-thief/
Nvidia and Intel probably won't agree with your assessment.
Sapphire Radeon TRI-X R9 290, 4GB
They're fairly cheap here and have some really good reviews, plus the 3-fan cooler, and they only run 3-7 fps slower than a GTX 780 Ti from what I was reading... so it seems like a pretty solid card, especially with the new framework coming in.
Edit - and judging from the link above me, with Mantle enabled it blows it out of the water!
http://semiaccurate.com/2014/03/18/amd-eidos-launch-mantle-version-thief/
Nvidia and Intel probably won't agree with your assessment.
Directly from your link:
That was exactly my assessment.
Mhmm. If you read the link you are referring to:
Let me know how that triple Xfire rig works out for you on Mantle