A server with 8 of the new GPUs in it costs $149,000. Nvidia is taking pre-orders now and promising delivery starting in Q3 of this year. The cards won't be available other than in the $149k server until Q4, assuming no delays happen between now and then. Remember that Nvidia announced GP100 around this time last year, but the cards didn't go on sale until March of this year and still aren't that widely available.
It's apparently on a "new" process node that TSMC made custom for Nvidia. Nvidia is calling it 12 nm FFN. Transistor density is basically unchanged from 16 nm, which makes me wonder if it's just a modified 16 nm process that they call 12 nm for marketing reasons. Claimed energy efficiency is up slightly as compared to GP100, but not as compared to consumer Pascal chips.
It's an extremely ambitious chip with a die size of 815 mm^2. That's not merely the largest GPU ever made; it's about 1/3 larger than the second largest. And that's on a brand new process node. Yields are sure to be terrible, but with what Nvidia is charging for it, you can afford to throw most of the chips produced in the garbage as defective and still make a good profit. That's not profitable if you're hoping to sell GeForce cards for $700, however, so don't expect to see this show up in GeForce cards.
Whereas Pascal was basically a die shrink of Maxwell, this really is a new architecture. Nvidia put massive die space into tensor operations, which will let you do a 4x4 matrix multiply-add of 16-bit floating point values very efficiently. That, like 64-bit floating point operations, is pretty much useless for graphics. But that's kind of the point: while this is technically a GPU, it's really not for graphics.
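To make that concrete, here's what one of those tensor operations computes, written out as plain scalar code. The hardware does it on 16-bit floats as a single operation; the use of float and the function name here are just for illustration.

```
// D = A * B + C for 4x4 matrices -- the shape of work one tensor operation
// performs.  The real hardware operates on 16-bit floats in a single step;
// float is used here only to keep the sketch simple.
void matrix_mad_4x4(const float A[4][4], const float B[4][4],
                    const float C[4][4], float D[4][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];   // 64 multiplies and 64 adds in total
            D[i][j] = acc;
        }
}
```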
They beefed up some cache sizes as compared to the GP100 version of Pascal, but GP100 had already done the same as compared to Maxwell while the consumer Pascal chips didn't. It's not clear whether GeForce cards will get the larger cache sizes. Note that AMD has long had more register and local memory capacity in its consumer GPUs than Nvidia; increasing those cache sizes in GP100 really only brought the chip up to parity with the GCN cards (through Polaris) that AMD has been selling since 2012.
Nvidia is also creeping away from the pure SIMD approach. While previous GPUs had 32 threads in a warp that had to stay together, Nvidia now claims some degree of independent scheduling within a warp. It's not at all clear what that means in practice. It's hard to imagine it being purely a big-chip phenomenon, though, so I expect the modified scheduling to come to GeForce cards, too.
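Here's a minimal CUDA sketch of the kind of branch that 32-wide warps have always had to handle; the kernel name is made up for illustration.

```
#include <cuda_runtime.h>

// Both halves of a 32-thread warp take different branches here.  On older
// GPUs the warp runs the two paths one after the other with the inactive
// lanes masked off; Volta's claimed independent thread scheduling gives each
// thread its own program counter, even though the lanes still issue together.
__global__ void divergent_branch(float *out)
{
    int lane = threadIdx.x & 31;   // lane index within the warp
    if (lane < 16)
        out[threadIdx.x] = 1.0f;   // first half of the warp
    else
        out[threadIdx.x] = 2.0f;   // second half of the warp
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 32 * sizeof(float));
    divergent_branch<<<1, 32>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```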
Comments
"The surest way to corrupt a youth is to instruct him to hold in higher esteem those who think alike than those who think differently."
- Friedrich Nietzsche
The other complication is eight such cards at 300 W each in a single server. Add in a CPU, memory, power supply inefficiencies, and so forth, and you might be looking at a single server having to dissipate 3000 W. As hard as that might sound, remember that they build HPCs with hundreds or even thousands of such nodes.
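As a back-of-the-envelope sketch (the 300 W per card is the published TDP; the CPU/memory figure and the power supply efficiency are just guesses on my part):

```
#include <cstdio>

// Rough power budget for one such 8-GPU node.  Only the 300 W per card is a
// published number; the rest of the figures are assumptions.
int main()
{
    double gpus  = 8 * 300.0;   // eight accelerators at 300 W each
    double other = 300.0;       // CPUs, memory, fans, storage (assumed)
    double psu   = 0.92;        // assumed power supply efficiency
    printf("wall power: about %.0f W\n", (gpus + other) / psu);  // roughly 2900-3000 W
    return 0;
}
```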
If your HPC is going to put out somewhere north of 1 MW of heat, you can't just spray hot air into the room like you do with a desktop. You have to get it out of the room entirely. This is why some data centers locate themselves far to the north, where the ambient temperature outside the building is very cold, which helps with cooling.
Early 2018 for consumer Voltas is more likely. SK Hynix has just announced that they'll start shipping GDDR6 RAM to an undisclosed graphics card manufacturer in early 2018:
http://www.anandtech.com/show/11297/sk-hynix-announces-plans-to-ship-gddr6-memory-for-graphics-cards-in-early-2018
That's a fair point. I looked it up and it's a 300 W TDP just like before, so yes, I would agree: more surface area = easier to cool.
I've actually been in Oracle's secondary/backup data center here in CO; it's pretty impressive stuff. And yes, cooling is a whole different ballgame when you have that much equipment in one place.
"The surest way to corrupt a youth is to instruct him to hold in higher esteem those who think alike than those who think differently."
- Friedrich Nietzsche
"Modern" or not, they're still taking the right path. There are single digit games out there that use DX12/Vulkan. So why design a card around something that even a year from now might be in the low teens of games? As far as I can tell this is a good business strategy and its clearly working for them. AMD is/was looking for things to differentiate themselves, and that's great. And it is good futureproofing to an extent. But by the time DX12/Vulkan are really regularly available options in games, most "enthusiasts" will have upgraded their 8/9/10 series NVidia cards anyways.
Anyways, I do agree with you that they're definitely riding the line of dragging that out too long, though DX12 performance on the 9/10 series isn't nearly as bad as you're implying ;-).
"The surest way to corrupt a youth is to instruct him to hold in higher esteem those who think alike than those who think differently."
- Friedrich Nietzsche
Pretty sure that's exactly what I said ;-).
"The surest way to corrupt a youth is to instruct him to hold in higher esteem those who think alike than those who think differently."
- Friedrich Nietzsche
If you want to know how much total register bandwidth a GPU has, take the TFLOPS number, and multiply by 8 bytes/flop. For example, the Titan Xp is 12.5 TFLOPS, so it has 12.5 * 8 = 100 TB/s of register bandwidth. Yes, really, that's 100 trillion bytes per second of bandwidth. GPU vendors don't advertise that number much, but I think it's impressive when you think about it.
The reason for this is that each instruction needs to be able to read three 32-bit registers and write one 32-bit register. So that's 16 bytes per instruction. So why multiply by 8 rather than 16? Because people have somehow agreed that fma (a * b + c as floats) counts as 2 flops, not 1.
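Put as code, the rule of thumb looks like this (the 12.5 TFLOPS Titan Xp figure is the one I used above; the 15 TFLOPS figure for GV100 comes up below):

```
#include <cstdio>

// Register bandwidth rule of thumb: each fma reads three 4-byte registers and
// writes one, so 16 bytes of register traffic per instruction, and an fma is
// counted as 2 flops -- hence 8 bytes per flop.
int main()
{
    const char *name[]   = { "Titan Xp", "GV100" };
    double      tflops[] = { 12.5, 15.0 };          // single-precision figures
    for (int i = 0; i < 2; ++i)
        printf("%-8s: %.1f TFLOPS * 8 bytes/flop = %.0f TB/s of register bandwidth\n",
               name[i], tflops[i], tflops[i] * 8.0);
    return 0;
}
```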
I've never liked that convention, because what else should count as multiple instructions? Integer mads? Bitselect? Yes, having fma as a single instruction rather than multiple instructions can make some things (including graphics) much faster, but there are plenty of other cases where new instructions can make things much faster. What do the instructions in AES-NI count as?
So if GV100 has 15 TFLOPS of single-precision performance, that's 120 TB/s of register bandwidth. The tensor operations are on 16-bit "half" precision floating point numbers, but that's still a total of 6 bytes of reads and 2 bytes of writes per fma. How can they get 60 trillion of those per second (120 TFLOPS at 2 flops per fma) without having 480 TB/s of register bandwidth?
The answer is that for the matrix multiplication, they can read a value once and then use it four times. The tensor cores do A * B + C as 4x4 matrices of 16-bit floats. That takes 96 bytes of register reads and 32 bytes of writes, or what you'd expect from 8 instructions normally. But instead of reading them, doing eight instructions, and being done, the full matrix multiply and add (can we call this a matrix mad?) involves 64 multiply and 64 add operations. So they get 128 half-precision operations out of that 128 bytes of register bandwidth. Hence the claim of 120 TFLOPS of half-precision tensor operations.
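Spelled out with those numbers:

```
#include <cstdio>

// Byte and operation count for one 4x4 matrix multiply-add in half precision,
// and the tensor throughput implied by 120 TB/s of register bandwidth.
int main()
{
    int    read_bytes  = 3 * 16 * 2;   // A, B, C: 16 halves of 2 bytes each = 96
    int    write_bytes = 1 * 16 * 2;   // D: 16 halves of 2 bytes            = 32
    int    ops         = 64 + 64;      // 64 multiplies + 64 adds            = 128
    double reg_bw      = 15.0 * 8.0;   // 120 TB/s, from the FP32 figure above

    printf("%d bytes of register traffic for %d half-precision operations\n",
           read_bytes + write_bytes, ops);
    printf("implied throughput: %.0f trillion half-precision operations per second\n",
           reg_bw * ops / (read_bytes + write_bytes));
    return 0;
}
```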
If you love multiplying giant matrices of 16-bit floats, GV100 will be far and away the fastest GPU chip ever made. And I mean fastest by about a factor of 5 until some other chip does something like this. I gather that Nvidia thinks this is important for machine learning.
For most things--including graphics--the tensor cores are pure wasted silicon. While graphics does involve some matrix multiplication, it has to be done in 32-bit floats, as using half precision when rotating objects would cause awful artifacting. The tensor cores will be useless there. That underscores the point that while this is nominally a GPU, it's really not for graphics.
That's not to say that the Volta architecture isn't for graphics at all. It might still be a good graphics architecture if you strip out the tensor cores, double-precision arithmetic, ECC memory support, and some of the enlarged cache sizes. Such a chip might give the same graphics performance in 2/3 of the die space and at vastly less cost. But Nvidia isn't ready to announce the GeForce cards just yet.
But if you're running different benchmarks from everyone else and you can justify why yours are a reasonable way to evaluate video cards, getting different results doesn't mean you're "wrong". You'll get hate from the fans of the vendor that looks worse in your results, of course. But data on how good the GPUs are at things that other people aren't measuring can be valuable. Changing your results to match everyone else's is the review herding that I complained of here:
http://forums.mmorpg.com/discussion/463946/amd-ryzens-sensemi-defeats-the-cpu-z-benchmark#latest
That's why I've long valued HardOCP as a source of GPU reviews. They do things differently from everyone else, and give you different data. If you can only read one review of a GPU, I'd advise against making it HardOCP. But if the choice is 10 reviews that give you the same data, or 9 that give you the same data and one that tells you something useful and independent of the rest, the latter is superior. For a while, Tech Report was also irreplaceable as they were the first to make measuring frame rate consistency a part of their reviews.