I actually didn't catch this until this weekend, but GP100 is really a very different architecture from GP104. The latter is basically a straight die shrink of Maxwell with only minor differences. Meanwhile, one could argue that GP100 has as much in common with AMD's GCN architecture as GP104. (And one could argue against that pretty convincingly, too, but still.)
Here's a compute unit from Maxwell:
http://images.anandtech.com/doci/8526/GeForce_GTX_980_SM_Diagram_FINAL.png?_ga=1.146626740.1434193510.1463784889

And here's one from the GP104 version of Pascal:
https://www.techpowerup.com/reviews/NVIDIA/GeForce_GTX_1080/images/arch2.jpg

See the difference? You might be tempted to cite the polymorph engine at the top of Maxwell, but that's there in Pascal, too. Nvidia just decided not to draw it on their diagram.
Now look at the GP100 version of Pascal:
https://devblogs.nvidia.com/parallelforall/wp-content/uploads/2016/04/gp100_SM_diagram-624x452.png

Lots of differences there. For starters, there are only half as many shaders and texture units. The register files are 128 KB rather than 64 KB; it's still 256 KB per compute unit, but dividing that among half as many shaders means you can have a lot more threads resident per shader. Double precision units have been added. The local memory capacity is 64 KB rather than 96 KB, but again, divided among half as many shaders, that's more local memory capacity per shader. It's probably double the local memory bandwidth per shader, too.
I'm guessing that most readers will have no clue what I'm talking about with register file sizes and local memory, so I'll explain further. On most CPUs, you have a few registers, then larger L1 caches, then larger yet L2, and possibly larger yet L3. That's not how GPUs work at all. Rather, most of your on-die cache is registers, and other caches are much smaller. It's common for registers to account for perhaps 2/3 of your on-die cache, while the other 1/3 is divided among L1, L2, constant memory, local memory, instruction cache, and possibly other things.
On a CPU, context switching between threads is very expensive. Hyperthreading lets you switch between two threads on a core easily, but not three. GPUs typically want to have on the order of a thousand threads resident per compute unit, and switch between them nearly every single time they schedule a new instruction. Each thread has to have its own register space all to itself. Everything is high latency, and having a ton of threads that you can switch between is how you cover up the latency. If you don't have enough register space, you can't have enough threads, and sometimes that means you can't cover up latency very well and take a huge performance hit.
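To make the register space/thread count trade-off concrete, here's a rough back-of-the-envelope sketch. The 32 and 64 registers-per-thread figures are illustrative assumptions, not numbers from any particular kernel:

```python
# Rough occupancy sketch: how many threads can stay resident at once
# if register capacity is the only limit. Figures are illustrative.
REG_BYTES = 4  # each register holds 32 bits

def max_resident_threads(regfile_kb, regs_per_thread):
    """Threads that fit when each needs its own private registers."""
    return (regfile_kb * 1024) // (regs_per_thread * REG_BYTES)

# A Maxwell or GP104 SM has 4 x 64 KB = 256 KB of registers total.
# Assume a kernel that needs 32 registers per thread:
print(max_resident_threads(256, 32))  # 2048 threads
# Double the per-thread register use, and only half as many threads
# can be resident, which means less latency to hide behind:
print(max_resident_threads(256, 64))  # 1024 threads
```

The point of the sketch: the same register file supports half as many resident threads when each thread needs twice the registers, which is why per-shader register capacity matters so much for latency hiding.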
So, let's look at register capacity per shader for recent architectures:
Kepler: 1 1/3 KB
Maxwell: 2 KB
GCN: 4 KB
GP104 Pascal: 2 KB
GP100 Pascal: 4 KB
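Those per-shader figures fall straight out of dividing each architecture's register capacity per compute unit by its shader count, as a quick sanity check shows. The shader counts per compute unit below (Kepler SMX 192, Maxwell/GP104 SM 128, GCN CU 64, GP100 SM 64) come from the public block diagrams; all five designs carry 256 KB of registers per compute unit:

```python
# Register KB per shader = register KB per compute unit / shaders per unit.
archs = {
    "Kepler":       (256, 192),
    "Maxwell":      (256, 128),
    "GCN":          (256, 64),
    "GP104 Pascal": (256, 128),
    "GP100 Pascal": (256, 64),
}
for name, (reg_kb, shaders) in archs.items():
    print(f"{name}: {reg_kb / shaders:.2f} KB per shader")
# Kepler: 1.33, Maxwell: 2.00, GCN: 4.00, GP104: 2.00, GP100: 4.00
```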
One problem with registers is that you can't dynamically index them. If your code has arrays that you dynamically index (i.e., foo[i] with i not known at compile time), then that data can't go in registers. It has to go somewhere else, ideally local memory. (Nvidia calls this shared memory, but I'm using the non-vendor-specific term.) The problem with local memory is that it's small and has limited bandwidth.
So let's look at local memory per shader on recent architectures:
Kepler: 1/4 KB
Maxwell: 3/4 KB
GCN: 1 KB
GP104 Pascal: 3/4 KB
GP100 Pascal: 1 KB
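Same sanity check as before: divide local memory capacity per compute unit by the shader count. The capacities below (Kepler SMX 48 KB, Maxwell and GP104 SM 96 KB, GCN CU 64 KB of LDS, GP100 SM 64 KB) are the published per-unit figures:

```python
# Local (shared) memory KB per shader = capacity per compute unit / shaders.
local = {
    "Kepler":       (48, 192),
    "Maxwell":      (96, 128),
    "GCN":          (64, 64),
    "GP104 Pascal": (96, 128),
    "GP100 Pascal": (64, 64),
}
for name, (kb, shaders) in local.items():
    print(f"{name}: {kb / shaders:g} KB per shader")
# Kepler: 0.25, Maxwell: 0.75, GCN: 1, GP104: 0.75, GP100: 1
```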
And local memory bandwidth, expressed as the number of local memory accesses you can do per computation while keeping both fully busy:
Kepler: 1/6
Maxwell: 1/4
GCN: 1/2
GP104 Pascal: 1/4
GP100 Pascal: 1/2
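One way to arrive at these ratios (an assumption on my part, but it reproduces every number in the list): each compute unit's local memory has 32 banks and can service roughly one access per bank per clock, while each shader can issue one computation per clock, so the ratio is 32 over the shader count:

```python
from fractions import Fraction

# Local memory accesses per computation ~= 32 banked accesses per clock
# divided by shaders issuing one operation per clock per compute unit.
BANKS = 32
shaders = {"Kepler": 192, "Maxwell": 128, "GCN": 64,
           "GP104 Pascal": 128, "GP100 Pascal": 64}
for name, n in shaders.items():
    print(f"{name}: {Fraction(BANKS, n)}")
# Kepler: 1/6, Maxwell: 1/4, GCN: 1/2, GP104: 1/4, GP100: 1/2
```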
You may have never heard of these things before, but they're a huge deal for many non-graphical compute purposes. Note that on these metrics GP100 is identical to GCN, GP104 is identical to Maxwell and markedly worse than GP100 and GCN, and Kepler is just terrible, in spite of actually coming later than the early GCN cards.
Meanwhile, GF110 Fermi also has 4 KB of registers per shader and 1.5 KB of local memory per shader, but those numbers should probably be halved to be comparable to newer architectures because of Fermi's doubled shader clock speed. That still leaves it in line with Maxwell and better than Kepler. Local memory accesses per computation were 1/2 for GF100 and GF110, or 1/3 for the lower-end Fermi chips, still better than Maxwell or GP104. So Nvidia did do a compute-heavy architecture way back in 2010, then abandoned that approach for Kepler.
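The Fermi figures, with the hot-clock halving, work out like this (a GF110 SM has 32 shaders with 128 KB of registers and 48 KB of shared memory, and its shaders ran at twice the core clock):

```python
# GF110 Fermi: per-SM resources divided by its 32 shaders.
regs_per_shader = 128 / 32    # 4 KB of registers per shader
local_per_shader = 48 / 32    # 1.5 KB of local memory per shader

# Fermi's shaders ran at double the core clock, so halve both numbers
# to compare against newer, non-hot-clocked architectures:
print(regs_per_shader / 2, local_per_shader / 2)  # 2.0 0.75
```

After the halving, Fermi lands exactly on Maxwell's 2 KB of registers and 3/4 KB of local memory per shader.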
Some talk here recently has been about asynchronous compute, and whether Maxwell or Pascal can do it. Relative to GCN, Maxwell performs much worse in games that use asynchronous compute than in those that don't, and Pascal seems to continue that trend.
But it's likely that "asynchronous" isn't the problem. GCN is simply a much more compute-heavy architecture than Maxwell, Kepler, or the GP104 version of Pascal. Do a lot of non-graphical compute in your game and it's likely to make your game look very favorable to AMD. GP100 reaches parity with the older GCN cards (or likely a little better, as Nvidia is presumably still more clever about scheduling), but GP104 is still way, way behind. The issue isn't DirectX 12 versus 11. It's not asynchronous compute as opposed to other compute. It's compute, period.
Nvidia hasn't been competitive with AMD at all in compute performance if your bottleneck is things happening on-chip (as opposed to global memory, PCI Express, etc.) since GCN arrived more than four years ago. GP100 may catch up, but the lower end Pascal chips won't. So you can see why AMD is pushing for more compute in games, while Nvidia is trying to prevent it, or at least push games to use CUDA, whose primary virtue is that it won't run on AMD.
Comments
A turtle doesn't move when it sticks its neck out.
OTOH, AMD is making quite a few architectural changes in Polaris/Vega, and what is most interesting is the PDA - Primitive Discard Accelerator - which should combat those "unseen objects and tessellation" problems by discarding unseen triangles (primitives) before they even enter the pipeline, which should considerably reduce the amount of work the GPU has to process. In-depth details are not yet known, but this is the kind of thing GPUs need: more efficiency.