For clarity, "high-end desktop" (HEDT) is Intel's terminology for the consumer versions of their platforms that tend to have more cores than their "normal" desktops but no integrated GPU. The cheapest CPUs in the HEDT lineup have generally been around $300 or more. The generations of it have been Nehalem, Gulftown, Sandy Bridge-E, Ivy Bridge-E, Haswell-E, and most recently, Broadwell-E.
Yesterday AMD announced Threadripper, which is basically a two-die Ryzen-based solution with up to 16 Zen cores. This is really AMD's first credible shot at the HEDT market since Intel split their lines to invent it in 2008.
One traditional problem with HEDT parts is that having so many cores means that the cores can't clock all that high. Thus, the HEDT parts typically trail behind the normal consumer quad core CPUs in single-threaded performance. Intel has commonly had the HEDT market use older CPU cores and older process nodes than the normal consumer market, in part because more cores mean larger dies, and that requires more mature process nodes to get acceptable yields.
There have at times been efforts at making a 2-socket desktop, in which you use two separate CPUs. I don't mean two cores; I mean two entirely separate chips that have their own separate CPU socket, memory pool, and so forth. The two CPUs can communicate over some bus; Intel has called it QPI in recent years. Spreading the cores among two sockets means that you can double the number of cores in the system without creating cooling problems or causing clock speed problems from so many cores so near each other.
The problem with the two-socket approach is that for many things, it just doesn't work very well. If a thread running on one CPU needs memory attached to the other CPU, it has to go across the QPI link to get it. For occasional accesses, that's fine, but if half of your memory accesses have to go over QPI, that can become a huge bottleneck in a hurry. Since most programs don't specify which memory pool to use in a two-socket system, if threads migrate from one CPU to the other a lot without releasing and reallocating memory, you can expect a whole lot of memory accesses to go over QPI.
Intel did push two-socket for high-end desktops as recently as Skulltrail, which was basically two Core 2 Quad CPUs. But because the two CPUs created bottlenecks on what was then the front-side bus, many programs performed worse on two CPUs than they did on one. After that, Intel relegated the multi-socket approach to servers and used single CPUs for their HEDT platforms.
AMD's proposal is to deliver the benefits of a two-socket HEDT system, including double the memory bus width and capacity. But instead of having a giant QPI (or HyperTransport, as AMD has traditionally used) bottleneck, it puts the two physical CPUs in the same socket, connected by an interposer. That way, you get enormous bandwidth connecting the two CPUs, rather than a big bottleneck.
So why can't they just add a ton of bandwidth to traditional two-socket systems and fix the problem that way? One issue is pin count. You've only got so much room for I/O coming out of your package. Pins take space, and if you want to add more pins, you have to have bigger, more expensive chips, with all of the attendant drawbacks. An interposer can allow massively smaller "pins", allowing far more of them and very wide bus widths to connect multiple dies in the same package. That doesn't let you get massive bandwidth to anything outside of the package, but it does let you connect two CPUs in the same package, or a CPU to a GPU in the same package, or as we've already seen with Fiji, a GPU to HBM.
You might ask, what's the difference between this and the multi-die server chips AMD has had in the past, such as Magny-Cours and Interlagos? The issue there was power: if one die is going to burn 130 W all by itself, what is a two-die package supposed to burn? Clock speeds for those server chips were far too low to be appropriate for consumer use. Ryzen 7, in contrast, tops out at 95 W, with a 65 W consumer version that has all eight cores active. That leaves plenty of room to add more cores without taking a huge hit to clock speeds. At minimum, they could probably at least match Ryzen 7 1700 clock speeds in 130 W, or go higher if they're willing to burn more power. Using an interposer also makes it at least possible to spread out the dies a little, which can help with cooling.
But why stop at two dies? Yesterday AMD also announced Epyc, their new server line. The top end part has four dies, for 32 cores total. That will mean considerable drops in clock speeds as compared to Ryzen 7, so it would be a stupid part for consumer use. AMD will offer a two-socket version of Epyc, but they're really trying to make a single-socket alternative to what would previously have been two-socket Xeon systems. Now that AMD finally has competitive CPU cores for the first time since Conroe arrived in 2006, the massive room available to undercut Xeon prices while still making a hefty profit means that Intel's server division should be scared.
Comments
You raise half of an interesting question that I've always wondered about: the absolute need for lower power consumption in the logic circuits. If the same power envelopes from a few years back (say, anywhere from 115 W up to 230 W) applied today, the processing power would at the very least have to be theoretically higher, right? Then again, there isn't that much need for extreme performance at the expense of energy consumption, and the more recent dies still outperform their predecessors as expected by Moore's Law, which in theory also covers power consumption.
I'm down for the multi-socket boards. Server boards have always had them, but usually without the PCIe connectivity, high-performance rendering chipsets, or the like unless you pay some serious extra cash, and even then they wouldn't translate well into a usable, say, gaming machine.
With the trends in streaming and multitasking commonly seen today, multi-CPU options would make a lot of difference to these users, more so than a bloated number of cores. Even with plenty of threads, information handled serially has to pass through a lot of encoders/decoders, and going parallel starts looking sexier on heavy workloads.
A point that I want to emphasize is that the multiple dies in Threadripper or Epyc don't just add more cores. They add more memory channels and PCI Express lanes, too. It's 32 PCI Express 3.0 lanes and two channels of DDR4 per die.
Thus, Ryzen with a single die has two channels of DDR4 and 32 lanes of PCI Express, Threadripper will have four channels of DDR4 and 64 lanes of PCI Express, and the big Epyc will have 8 channels of DDR4 and 128 lanes of PCI Express. The PCI Express is all of the connectivity coming off of the chip, so if you want SATA ports, ethernet, or whatever, that uses up some of the PCI Express connectivity.
But that does mean that as compared to Ryzen 5 and 7, Threadripper will have double the memory bus width (and hence possible capacity) and about double the PCI Express lanes. If you were to build a 2-socket Ryzen system with one die per socket for the sake of adding more memory capacity and bandwidth and more PCI Express connectivity, Threadripper will get you all of those benefits from a single socket. So this isn't like the Core 2 Quad, which also had two CPU dies per socket, but only added more CPU cores and that's it.
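To make the scaling concrete, here's a trivial C sketch of that per-die arithmetic. The sketch is just my illustration; the two-channel/32-lane per-die figures and the die counts per product are the ones described above.

#include <stdio.h>

/* Per-die resources as described above: 2 DDR4 channels, 32 PCIe 3.0 lanes. */
#define DDR4_CHANNELS_PER_DIE 2
#define PCIE_LANES_PER_DIE 32

int main(void) {
    struct { const char *name; int dies; } parts[] = {
        { "Ryzen", 1 },        /* single die */
        { "Threadripper", 2 }, /* two dies */
        { "Epyc", 4 },         /* four dies */
    };
    for (int i = 0; i < 3; i++) {
        printf("%-12s %d dies -> %d DDR4 channels, %3d PCIe lanes\n",
               parts[i].name, parts[i].dies,
               parts[i].dies * DDR4_CHANNELS_PER_DIE,
               parts[i].dies * PCIE_LANES_PER_DIE);
    }
    return 0;
}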
It is an elegant way to offer scalability. AMD is just making one die and packaging it in varying quantities. I wonder if AMD's APUs will do something similar: use the 8-core die plus a separate GPU.
I don't see OpenCL as being terribly relevant to Threadripper, either. OpenCL is built around the capabilities of GPUs, with a few things (mainly pipes) thrown in for the benefit of FPGAs. You can run OpenCL code on a CPU, and that especially makes sense for debugging FPGA code. But you don't need Threadripper for that.
If the concern is scaling CPU code to use many CPU cores, then I don't see OpenCL as useful there. It's built for a different threading paradigm entirely from what CPUs use, and some CPU programming languages have plenty of mature tools for threading your code--and tools that expose the greater versatility that CPUs have rather than the more restricted model OpenCL follows.
If the concern is using AVX to really take advantage of what CPUs can do with SIMD, then OpenCL is at least more plausible. That makes it possible to write your code for what happens in one AVX lane, then have the compiler automatically scale it. You can blow things up in a hurry if you use any instructions that don't have an AVX version, or if you need to constantly change data sizes (e.g., mixing floats and doubles). But while OpenCL isn't ideal for this, neither are any of the other options available (OpenMP, intrinsics, etc.).
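Just to make the OpenMP option concrete, here's a minimal sketch of what that route can look like, assuming a compiler with OpenMP SIMD support; the function and array names are my own placeholders.

#include <stddef.h>

/* Ask the compiler to vectorize this loop with SSE/AVX if it can.
   Each iteration is independent, so the adds can be packed into one
   vector instruction per register's worth of floats. */
void add_arrays(const float *foo, const float *bar, float *baz, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        baz[i] = foo[i] + bar[i];
    }
}

Compile with something like -fopenmp-simd (GCC/Clang); the pragma is a hint rather than a guarantee, and the compiler still decides whether vectorizing actually pays off.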
But it might be more of a hardware problem than a software one. It's not just that some instructions don't have an AVX version, and so they're problematic to use at all if you're trying to exploit AVX. It's also that CPUs simply don't have a good way to pass data back and forth among AVX lanes. GPUs can put some instructions in some but not all shaders, and then the threads in a warp can get automatically unpacked to route through shaders over the course of multiple clock cycles and repacked without it being all that bad. GPUs can use local memory to pass data back and forth among threads pretty efficiently, but CPUs simply don't have any analogous cache for that.
As for the original question of how Threadripper affects gaming, for most people the answer is: not very much. If you feel the need to have a 16-core desktop for non-gaming reasons, Threadripper will let you have it all in one socket, and with decent single-threaded performance. So such a desktop could double as a capable gaming rig. Today, if you want more than 10 cores, you have to either use multiple sockets (which often creates a QPI bottleneck) or else pay a fortune and still accept poor single-threaded performance. For example, you can get a 16-core Xeon E5-2697A v4 today, but it costs about $3000 and has a stock clock speed of only 2.6 GHz with a max single-core turbo of 3.6 GHz.
Guess my wait for the revolutionary next gaming chip continues...
As for work applications I'll certainly be checking this chip out, I do a ton of multi-tasking.
Can you clarify more about extended vector instructions? This is still relatively new tech to me, and I'm a bit confused about how word size relates to thread parallelism.
@Quizzical
Let's suppose that you have code that looks something like this:
float foo[4] = { /* something */ };
float bar[4] = { /* something */ };
float baz[4];
for (int i = 0; i < 4; i++) {
    baz[i] = foo[i] + bar[i];
}
What SSE and AVX can do is this: if you've packed foo and bar into vector registers, all four of the floating point adds can be done in a single instruction rather than taking four instructions. OpenMP will try to convert the above code to do exactly that. OpenCL would take an approach more like:
__kernel void floatadd(const __global float * restrict foo, const __global float * restrict bar, __global float * restrict baz) {
    int gid = get_global_id(0);
    baz[gid] = foo[gid] + bar[gid];
}
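For comparison, here's a rough sketch of my own (not something either toolchain emits verbatim) of those same four adds written explicitly with SSE intrinsics:

#include <xmmintrin.h>

void add4(const float *foo, const float *bar, float *baz) {
    __m128 vfoo = _mm_loadu_ps(foo);      /* pack four floats into a vector register */
    __m128 vbar = _mm_loadu_ps(bar);
    __m128 vsum = _mm_add_ps(vfoo, vbar); /* all four adds in one instruction */
    _mm_storeu_ps(baz, vsum);             /* unpack the results back to memory */
}

With contiguous arrays, a single load does the packing; if the four values are scattered around instead, each one has to be inserted into the register separately, and that per-element shuffling is the overhead described next.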
One problem with this is putting data into the vector registers. If all you want is four adds like above, then it takes four operations to put the four components of foo into a vector register, four more operations to do the same for bar, and if you need to unpack the components of baz later, four more operations to unpack them. That overwhelms the advantages you get by doing the four adds in one operation rather than four.
If, on the other hand, you have hundreds of consecutive operations where you're doing exactly the same thing to each component of the vector, sometimes you can pack the data into vector registers once, do all of the operations using AVX instructions so that it takes one instruction instead of four, and then unpack the data at the end. That can provide huge savings because it's one instruction instead of four for hundreds of consecutive things.
But what you can do with this is really restricted. If you need to take the cosine of something in the middle of your code, that breaks up the AVX instructions. The compiler would have to unpack the data from vector registers, take the cosine of one component at a time, and then repack it into vector registers. If that only happens once in a chain of hundreds of instructions, then maybe you can shrug and ignore it as inconsequential. But if it happens for a lot of instructions, the extra overhead of packing and unpacking the data can easily make using AVX slower than not using it.
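As a rough illustration of that cost (my own sketch; SSE has no vector cosine instruction, so the work has to drop back to scalar code something like this):

#include <math.h>
#include <xmmintrin.h>

/* Take the cosine of each lane of an SSE register. With no vector
   cosine available, the lanes get spilled to memory, processed one
   float at a time, and then repacked. */
static __m128 cos4(__m128 v) {
    float tmp[4];
    _mm_storeu_ps(tmp, v);        /* unpack the four lanes */
    for (int i = 0; i < 4; i++) {
        tmp[i] = cosf(tmp[i]);    /* scalar cosine, one lane at a time */
    }
    return _mm_loadu_ps(tmp);     /* repack into a vector register */
}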
Mixing data sizes will screw up AVX, too. If you've got 128-bit vector registers, then you can pack four floats or two doubles into a register. If you need to add floats to doubles at some point, then it has to unpack the floats, cast them to doubles, repack them as doubles, and then do the addition. You can get the same issue with different sizes of integer data types, too.
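And a similar sketch, again my own, for the mixed-size case: adding four floats to four doubles with 128-bit registers forces a conversion and splits the work across two double-wide registers.

#include <emmintrin.h>  /* SSE2 */

/* out[i] = f[i] + d[i] for i = 0..3. The four floats fit in one
   register, but as doubles they need two, so the floats get
   converted and handled in two halves. */
void add_mixed(const float *f, const double *d, double *out) {
    __m128 vf = _mm_loadu_ps(f);
    __m128d lo = _mm_cvtps_pd(vf);                    /* lanes 0-1 as doubles */
    __m128d hi = _mm_cvtps_pd(_mm_movehl_ps(vf, vf)); /* lanes 2-3 as doubles */
    _mm_storeu_pd(out,     _mm_add_pd(lo, _mm_loadu_pd(d)));
    _mm_storeu_pd(out + 2, _mm_add_pd(hi, _mm_loadu_pd(d + 2)));
}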
In CPU terms, this is all happening inside of a single thread. Scaling to use multiple threads is a different issue entirely. OpenCL will try to handle that scaling for you, so that if you have components that you might think of as being 1024 bits wide, it can be 8 threads that each process 128 bits on an architecture with only SSE, or 4 threads that each process 256 bits on an architecture with AVX, or whatever. So you can write the code once and have it automatically packed into the maximum width SSE or AVX instruction.
GPUs have a totally different cache hierarchy, so they're able to handle either of these situations cleanly. Let's consider Nvidia's Pascal architecture for concreteness. Pascal has threads in warps of 32, and its shaders also come in sets of 32, which are divided into two sets of 16. In order to have all of the threads in a warp execute some instruction, it grabs a set of 16 shaders and has half of the threads start on that instruction in one clock cycle and the other half in the next clock cycle.
The two different sets of shaders have some instructions in common, but also have some instructions that the other set doesn't have. If you want to do a bit shift, for example, only half of the shaders can do it, so the instruction will have to go on that half of the shaders. A floating point multiply is present in all of the shaders, so if that's what is called for, the scheduler can freely put it on either set of shaders as available. The scheduler will figure out which instructions have to go on which shaders and schedule them accordingly.
Nvidia also has special function units for things like trig functions, logarithms, exponentials, and some other things. That allows an instruction to be usable even though it's only laid out in eight shaders. What I'm pretty sure happens is that to use the special function units, it claims the set of shaders that includes them for four clock cycles instead of two, and puts 8 threads through per clock cycle. Thus, using cosine as in the above example is as bad as two instructions rather than one, but that sure beats taking a boatload of instructions to pack and unpack the data.
(continued in next post)
To cover up this latency, GPUs need to have an enormous number of threads resident. It's not just the 32 threads in a warp. You might want to have 8 or so warps handled by the same set of shaders to cover up latency. In a lot of situations, the scheduler will say, let's see if we can do the next instruction for this thread... nope, it's not ready. Well, how about this other thread? It will try to pick some warp that is ready to go, and bounces around between warps every single clock cycle. GPUs commonly need tens of thousands of threads resident simultaneously to properly exploit the hardware.
And it's not just tens of thousands of threads in total, but tens of thousands of threads, each processing the same instructions on their own data, and with very little data of their own loaded at a time. Modern GPU architectures tend to have a cap of around 256 32-bit registers per thread, and some less recent ones had a cap of around 64 registers per thread. Unlike CPUs, where you have a few registers, then larger L1 cache, then larger still L2 cache, then even larger L3 cache, on a GPU, registers are your main cache. Many GPUs have more register space than all other on-die caches added together. For example, AMD's Fiji chip had 16 MB of registers--and needed them, so that, for example, you could have 65536 threads resident simultaneously, each of which had 256 bytes all to itself.
GPUs are throughput optimized, not latency optimized. If all you care about is when some total amount of processing is done, you can play games like scheduling whatever warps are ready and bouncing back and forth between them without worrying that it will take a long time for any particular thread to finish its processing. That makes a ton of sense for graphics, where it doesn't matter when one particular pixel is finished, but only when the entire frame is done. And there are some non-graphical but embarrassingly parallel workloads where it also works well. But it completely fails if you can't scale to more than several hundred threads, or if you need different threads to execute completely independent code at the same time.
AMD can't hang with Intel on single-threaded performance. But Epyc will probably mean AMD can offer the fastest 1-socket or 2-socket servers for a whole lot of workloads. For things that scale well to many CPU cores but don't scale well to multiple sockets, it's likely that AMD will often be able to offer the fastest servers, period. And that's ignoring the price tag, even, where I'd expect AMD to undercut Intel's Xeon prices considerably.
SSE and AVX instructions only help when doing computations is the bottleneck, as they're simply a faster way to do computations in certain situations. They won't help you load data from memory or hard drives any faster. More threads can help to cover up latency when that's what is limiting your performance, as I expect would sometimes happen.
The issues of packing and unpacking data in the SSE or AVX vector registers only matter if you're actually using them. Most code doesn't, as either it's not of the constrained structure that can benefit or else performance is fast enough that further optimization doesn't matter. Or both. If you want to add a float to a double as ordinary scalars, it will have to cast the float to a double first, but that's extremely fast as it's just moving bits around in some fixed pattern.
And thanks again! It's been nice to have some good conversations today.
As a developer you need to develop for the widest audience, not for a specific platform. What that means today is supporting as many threads as you can. A 4c4t is still going to be able to process 16 threads; an 8c16t is just going to process them more efficiently. Given the range of systems you have to develop for, I do see developers taking a many-threaded approach from now on while keeping it workable on common systems. It really doesn't make sense to poorly utilize the hardware in a PS4 or Xbox One, and eight-core machines are poised to be incredibly common with ARM and AMD chips.
I'm not recommending anyone buy an i3; I'm just saying that for most people, it's still plenty of CPU. A lot of people buy a lot more CPU than they need, sometimes legitimately, but often not, citing either "futureproofing" or benchmarks of corner cases that they never run, or encounter only rarely at best.
It's all a matter of budget: of course you'd buy something else if you had more money, but that isn't always an option.
A desktop i3 is a 4-thread chip (2c4t), and for gaming purposes it seems to hold its own in non-extreme cases.
The only reason I haven't recommended an i3 over the FX-x3xx in those budget cases is that I think the AMD chips present a better value once you're in that budget range, even though for gaming purposes they tend to lag somewhat behind even the i3. I expect Ryzen 3 and the new APUs may change that calculation a good deal.
The Ryzen 5s aren't competitive there, but they aren't really meant to be. I would expect a Ryzen-based APU to be competitive, though: something with, say, a couple of Zen cores and a few Vega GPU units could be extremely competitive with Intel's inexpensive Pentiums, if you look at total system cost.