Are GPUs running out of parallelism for graphics?

Quizzical Member Legendary Posts: 25,531

Way back in 1998, some GPUs moved from having one of various components on a chip to having two of them.  Having two different computational units working independently on the same image at once worked well as it was easy to keep them apart and have plenty of work to keep them busy.

Then came chips with four pixel shaders, or eight, or twelve, or sixteen.  The numbers have steadily increased over the years, and we're now well into the thousands.

Graphics is sometimes cited as an example of an embarrassingly parallel problem that will scale well to arbitrarily many threads.  And it demonstrably has scaled well to thousands.  But it won't scale to infinitely many, or even close to that.  A draw call only needs so many shader invocations at each programmable pipeline stage, and you only have so many draw calls before you switch programs.

What makes me wonder about this is that the Radeon R9 Fury X performs about as well as AMD promised at 4K.  But at lower resolutions, it performs much worse relative to other hardware than it does at 4K.  It's not just relative to Nvidia; the same is true if you compare it to other GCN cards of the same architecture.

An obvious question to ask is, why would AMD run into this problem but not Nvidia?  But that's pretty simple:  GCN needs a lot more parallelism to really stretch its legs than Maxwell does.  For the entire unified shader era, AMD has offered more brute force computational power than Nvidia, while Nvidia has offered more sophisticated schedulers to better make use of the hardware.  That gap has narrowed some over the last eight years, especially with the introduction of Kepler and GCN in 2012, but it's still there.

For example, when GCN has a thread execute some instruction, it cannot execute another instruction for at least four clock cycles.  Maxwell will happily execute another instruction in the same thread in the very next clock cycle if it's ready and if there aren't enough other threads resident.  GPUs cover this up by having lots of threads resident so you've always got something ready to execute even if a given thread only has to do something every twelve clock cycles or so.  But GCN needs a lot more threads to do this than Maxwell.

If you look at the GCN architecture, it's pretty clear that AMD's plan to get performance was to assume that there were enormous numbers of threads available to keep the hardware busy.  And if you're doing something that would scale well to ridiculous numbers of threads--ridiculous as in billions of threads, not thousands--GCN is a far superior architecture to anything Nvidia has ever made.  Not only would Fiji blow away a Titan X, but a Radeon R9 290X would get quite a few wins over a Titan X, too.

But if you only have a few thousand threads, Fiji won't just lose to a Titan X.  It would also tend to lose badly to a GeForce GTX 750 (bottom of the line Maxwell).  And might not even beat a Radeon HD 7790.  All those extra compute units don't do you any good if you don't have any work to give them.  For similar reasons, a Core i3-4170 will often beat a $7000 Xeon E5-4669 V3 (18 cores, 36 threads, max turbo of 2.9 GHz) in single-threaded algorithms.

At some point, you run out of parallelism and can't scale well to more threads.  That's true of nearly all algorithms, and definitely includes graphics, even if graphics scales to more threads than most CPU algorithms.  GCN might well be approaching that problem for graphics and hitting diminishing returns on adding more hardware.  Which reduces AMD marketing to pushing anything that will yield more parallelism, such as 4K or virtual reality or Eyefinity.

That's not to say that this is the end of GPU improvements.  There's plenty of room for AMD to do what Nvidia has done with Maxwell and keep thousands of shaders busy with a lot fewer threads resident on the chip at a time.

Or it might just be that AMD hasn't figured out how to make HBM play nicely with graphics and a driver update in a month will fix the problem.  Which is why Betteridge's Law of Headlines suggests an answer of "no" to any thread like this one.

Comments

  • Hrimnir Member Rare Posts: 2,415

    I really love this post.  It highlights a lot of the "overview" stuff of graphics you've seen in the last few years.

    I always used to explain to people that AMD tended to focus on the "brute force" hardware method of gaining results.  Whereas Nvidia tended to use slightly more elegant solutions but not have quite as much "brute force".  This has panned out exactly as you stated and has now become an issue of parallelization.  Regardless, you pretty much said everything I would have said on the topic; I just wanted to say I completely agree with it.

    "The surest way to corrupt a youth is to instruct him to hold in higher esteem those who think alike than those who think differently."

    - Friedrich Nietzsche

  • Gdemami Member Epic Posts: 12,342

    Originally posted by Hrimnir
    I really love this post.  It highlights a lot of the "overview" stuff of graphics you've seen in the last few years.

    I always used to explain to people that AMD tended to focus on the "brute force" hardware method of gaining results.  Whereas Nvidia tended to use slightly more elegant solutions but not have quite as much "brute force".  This has panned out exactly as you stated and has now become an issue of parallelization.  Regardless, you pretty much said everything I would have said on the topic; I just wanted to say I completely agree with it.

    The problem is, there is no evidence for this part:


    Originally posted by Quizzical

    What makes me wonder about this is that the Radeon R9 Fury X performs about as well as AMD promised at 4K. But at lower resolutions, it performs much worse relative to other hardware than it does at 4K. It's not just relative to Nvidia; the same is true if you compare it to other GCN cards of the same architecture.

    thus making this funny wall of text moot since it is based on made up stuff. Typical...

  • Kiyoris Member Rare Posts: 2,130
    lighting and transformations take thousands of dot products, the most expensive equation a GPU has to make, it scales up just fine
  • 13lake Member Uncommon Posts: 719
    Originally posted by Gdemami

    Originally posted by Hrimnir
    I really love this post.  It highlights a lot of the "overview" stuff of graphics you've seen in the last few years.

    I always used to explain to people that AMD tended to focus on the "brute force" hardware method of gaining results.  Whereas Nvidia tended to use slightly more elegant solutions but not have quite as much "brute force".  This has panned out exactly as you stated and has now become an issue of parallelization.  Regardless, you pretty much said everything I would have said on the topic; I just wanted to say I completely agree with it.

    The problem is, there is no evidence for this part:

    Originally posted by Quizzical

    What makes me wonder about this is that the Radeon R9 Fury X performs about as well as AMD promised at 4K. But at lower resolutions, it performs much worse relative to other hardware than it does at 4K. It's not just relative to Nvidia; the same is true if you compare it to other GCN cards of the same architecture.

    thus making this funny wall of text moot since it is based on made up stuff. Typical...

    The Fury X's loss of relative performance at lower resolutions is based on empirical evidence; the only thing lacking empirical evidence is your assumption that the evidence is lacking.

  • centkin Member Rare Posts: 1,527

    Massively parallel only takes you so far, and yes GPUs are starting to hit the problem that CPUs have already hit.  As time passes improvements are becoming more and more incremental and there is only so much we will be able to wrangle with silicon.  The upshot is you are not going to be seeing CPUs that are 5 times more powerful than an enthusiast rig from silicon like ever.  You won't see a GPU that is about 20 times more powerful than the current best from silicon ever also.

    Unless optical or quantum computing manages to hit the mainstream, computers of the future will very much resemble computers of the current day except for cosmetic differences.

  • Kiyoris Member Rare Posts: 2,130
    Originally posted by centkin

    Massively parallel only takes you so far, and yes GPUs are starting to hit the problem that CPUs have already hit

     

    is there a link where you guys are getting this info from? this is the first I have heard about this

  • Ridelynn Member Epic Posts: 7,383

    Hmm, I think Quiz is wrong here, unless I'm misreading.

    To paraphrase: Fury's problem is that it's designed for massively parallel things, and when things aren't massively parallel, it isn't as efficient. Which is why it performs better at higher resolution: there's more parallel things going on, so Fury presents more of a performance advantage.

    Hmm, ok.

    But if the problem is that the GCN has to wait for 4 cycles after a thread before it can execute a new thread (which I don't even know if that's true, it may or may not be, but for the discussion I'll go with it), and the fix for that was to have an abundance of execution units so that you always have an idle unit available, then wouldn't that offset the problem of not enough parallelism? Seems like the more execution units you have tied up, the bigger that 4-cycle penalty becomes, not less. And that would be exactly the opposite of what Fury is showing now.

    And then to get to the title of the thread: If Fiji is designed to be embarrassingly parallel, how is it running out of parallelism? Is software just not able to branch out enough? Or is it that the hardware has hit the point where more threads just doesn't add to the efficiency any longer and we are hitting other bottlenecks? I don't think this premise has been justified by anything right now.

    Most processes in nature with regard to power follow a square curve: to double the output, it takes the square of the input. To double the output of a pump, it requires squaring the power input. To double the output of a generator, it requires squaring the fuel input to the prime mover. Electrical power is equal to the square of the current times the load resistance. Maybe parallelism follows a similar formula, where we need to square the compute units to get a real-world doubling of compute power.

  • Quizzical Member Legendary Posts: 25,531
    Originally posted by Gdemami

    Originally posted by Hrimnir
    I really love this post.  It highlights a lot of the "overview" stuff of graphics you've seen in the last few years.

    I always used to explain to people that AMD tended to focus on the "brute force" hardware method of gaining results.  Whereas Nvidia tended to use slightly more elegant solutions but not have quite as much "brute force".  This has panned out exactly as you stated and has now become an issue of parallelization.  Regardless, you pretty much said everything I would have said on the topic; I just wanted to say I completely agree with it.

    The problem is, there is no evidence for this part:

    Originally posted by Quizzical

    What makes me wonder about this is that the Radeon R9 Fury X performs about as well as AMD promised at 4K. But at lower resolutions, it performs much worse relative to other hardware than it does at 4K. It's not just relative to Nvidia; the same is true if you compare it to other GCN cards of the same architecture.

    thus making this funny wall of text moot since it is based on made up stuff. Typical...

    Pretty much any review that tested several resolutions would show you that result.  For example:

    http://www.techpowerup.com/reviews/AMD/R9_Fury_X/31.html

    Though since they didn't test with a Core i3 and a memory channel vacant, you might think that it doesn't count.

  • Quizzical Member Legendary Posts: 25,531
    Originally posted by Ridelynn

    Hmm, I think Quiz is wrong here, unless I'm misreading.

    To paraphrase: Fury's problem is that it's designed for massively parallel things, and when things aren't massively parallel, it isn't as efficient. Which is why it performs better at higher resolution: there's more parallel things going on, so Fury presents more of a performance advantage.

    Hmm, ok.

    But if the problem is that the GCN has to wait for 4 cycles after a thread before it can execute a new thread (which I don't even know if that's true, it may or may not be, but for the discussion I'll go with it), and the fix for that was to have an abundance of execution units so that you always have an idle unit available, then wouldn't that offset the problem of not enough parallelism? Seems like the more execution units you have tied up, the bigger that 4-cycle penalty becomes, not less. And that would be exactly the opposite of what Fury is showing now.

    And then to get to the title of the thread: If Fiji is designed to be embarrassingly parallel, how is it running out of parallelism? Is software just not able to branch out enough? Or is it that the hardware has hit the point where more threads just doesn't add to the efficiency any longer and we are hitting other bottlenecks? I don't think this premise has been justified by anything right now.

    Most processes in nature with regard to power follow a square curve: to double the output, it takes the square of the input. To double the output of a pump, it requires squaring the power input. To double the output of a generator, it requires squaring the fuel input to the prime mover. Electrical power is equal to the square of the current times the load resistance. Maybe parallelism follows a similar formula, where we need to square the compute units to get a real-world doubling of compute power.

    I'm not saying "this is definitely the problem!"  I'm speculating, and I thought this was more interesting than yet another "Developers are stupid because they won't make the niche game I want!" thread.

    GPU parallelism is not at all similar to CPU parallelism.  Everything on a GPU that could possibly be high latency probably is.  The GPU approach to parallelism is to have an enormous number of threads, and each clock cycle, pick some that are ready to execute an instruction and do so.  Recent Nvidia architectures schedule a batch of 32 threads called a "warp" simultaneously.  GCN schedules a batch of 64 threads called a "wavefront" on 16 shaders, with 16 threads going for each of four clock cycles.  There is no penalty at all for switching threads among those resident on the chip, and it's very much intended that different threads execute instructions every clock cycle.

    The enormous number of threads is far larger than most non-GPU programmers expect.  A GeForce GTX Titan X can have 49,152 threads resident at a time.  A Radeon R9 Fury X can have 163,840 threads resident at a time.  Yes, we're well into six figures, and yes, that's simultaneously resident on a single chip.  Now, neither of those chips needs the max number to perform well, but both need a substantial fraction of it.  If you can "only" break your workload into 20,000 threads resident simultaneously, a Titan X will be fine, but the Fury X will starve.
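The resident-thread figures can be reproduced with a little arithmetic. This is only a sketch: the unit counts and per-unit limits below (64 compute units with 40 wavefronts each for Fiji, 24 SMMs with 64 warps each for the Titan X) are assumptions taken from public specifications rather than from the post, and `fill_fraction` is a made-up helper that says nothing about real performance. It only shows how little of each chip a 20,000-thread workload could fill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    units: int             # compute units (GCN) or SMMs (Maxwell); assumed from public specs
    batches_per_unit: int  # max resident wavefronts/warps per unit; assumed from public specs
    batch_width: int       # threads per wavefront (64) or warp (32), as stated in the post

    def max_resident_threads(self) -> int:
        return self.units * self.batches_per_unit * self.batch_width


FURY_X = Gpu("Radeon R9 Fury X", units=64, batches_per_unit=40, batch_width=64)
TITAN_X = Gpu("GeForce GTX Titan X", units=24, batches_per_unit=64, batch_width=32)


def fill_fraction(gpu: Gpu, available_threads: int) -> float:
    """Fraction of the resident-thread capacity a workload can fill, capped at 1."""
    return min(1.0, available_threads / gpu.max_resident_threads())


if __name__ == "__main__":
    for gpu in (FURY_X, TITAN_X):
        print(f"{gpu.name}: {gpu.max_resident_threads():,} resident threads max, "
              f"{fill_fraction(gpu, 20_000):.0%} filled by 20,000 threads")
```

Under these assumptions, the same 20,000 threads fill a far smaller share of the Fury X than of the Titan X, which is the asymmetry the post is describing.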

    How many threads can you get resident simultaneously in graphics?  I don't know.  But I don't think it's immediately obvious that you can get into six figures, and find it plausible that higher resolutions can offer more parallelism.
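As a rough way to see why resolution matters here, one can count pixels. The sketch below assumes the crude simplification that a full-screen pass runs about one fragment shader invocation per pixel, ignoring overdraw, anti-aliasing, and early rejection. The `RESOLUTIONS` table and `pixel_count` helper are illustrative names, not anything from the thread.

```python
# Standard display resolutions (width, height) in pixels.
RESOLUTIONS = {
    "1080p": (1920, 1080),
    "1440p": (2560, 1440),
    "4K": (3840, 2160),
}


def pixel_count(name: str) -> int:
    """Pixels per frame, used as a rough proxy for fragment work in a full-screen pass."""
    width, height = RESOLUTIONS[name]
    return width * height


if __name__ == "__main__":
    base = pixel_count("1080p")
    for name in RESOLUTIONS:
        pixels = pixel_count(name)
        print(f"{name}: {pixels:,} pixels ({pixels / base:.2f}x 1080p)")
```

4K has four times the pixels of 1080p, so any draw call that covers a fixed share of the screen has roughly four times as many fragments to spread across the chip.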

    If you have some other explanation for why the Fury X fares so much better relative to the competition at higher resolutions than at lower resolutions, I'd like to hear it.

  • Righteous_Rock Member Rare Posts: 1,234
    We maxed out how many pipes we can lay down, we need to more efficiently use them now.
  • Gdemami Member Epic Posts: 12,342

    Originally posted by Quizzical
    Pretty much any review that tested several resolutions would show you that result.  For example:

    http://www.techpowerup.com/reviews/AMD/R9_Fury_X/31.html

    Rofl, that is an issue of techpowerup.com, not the card.

    http://www.overclock3d.net/reviews/gpu_displays/amd_r9_fury_x_review/21

    http://hexus.net/tech/reviews/graphics/84170-amd-radeon-r9-fury-x-4gb/?page=7

    http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/69682-amd-r9-fury-x-review-fiji-arrives-9.html

    Tons and tons of other tests denying your fantasies. But then again, you are not a person who ever bases his claims on evidence...

  • Quizzical Member Legendary Posts: 25,531
    Originally posted by Gdemami

    Originally posted by Quizzical
    Pretty much any review that tested several resolutions would show you that result.  For example:

    http://www.techpowerup.com/reviews/AMD/R9_Fury_X/31.html

    Rofl, that is an issue of techpowerup.com, not the card.

    http://www.overclock3d.net/reviews/gpu_displays/amd_r9_fury_x_review/21

    http://hexus.net/tech/reviews/graphics/84170-amd-radeon-r9-fury-x-4gb/?page=7

    http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/69682-amd-r9-fury-x-review-fiji-arrives-9.html

    Tons and tons of other tests denying your fantasies. But then again, you are not a person who ever bases his claims on evidence...

    How about if you read your own links before you post?  You just linked to a Fury X beating a GTX 980 Ti outright at 4K, but losing by more than 10% at 1080p.  Meanwhile, it beats the 290X by about 50% at 4K and 30% at 1080p.  That's exactly what I'm talking about:  the Fury X fares better compared to other cards at higher resolutions than lower.

  • Gdemami Member Epic Posts: 12,342


    Originally posted by Quizzical
    How about if you read your own links before you post?  You just linked to a Fury X beating a GTX 980 Ti outright at 4K, but losing by more than 10% at 1080p.  Meanwhile, it beats the 290X by about 50% at 4K and 30% at 1080p.  That's exactly what I'm talking about:  the Fury X fares better compared to other cards at higher resolutions than lower.

    ...aaand we are back to making up stuff. Nice one.

  • Kiyoris Member Rare Posts: 2,130
    Originally posted by Jean-Luc_Picard
    Originally posted by Kiyoris
    lighting and transformations take thousands of dot products, the most expensive equation a GPU has to make, it scales up just fine

    Guess you've read some graphic programming tutorials from 10 to 15 years ago...

    funny, I have a bachelor in computer science, studying for my masters, here's what's on my desk atm:

     

  • Ridelynn Member Epic Posts: 7,383

    Nope, I don't have any explanation for why Fiji isn't as competitive at lower resolutions. I also don't feel particularly compelled to speculate about one, although I do find it somewhat academically interesting should one actually come up.

    I just don't think "Starved for threads" makes a lot of intuitive sense... Sure you can somewhat relate it to a faster IPC Core CPU versus a higher core count FX, but as you say - we are well into the tens / breaking into the hundreds of thousands of threads by now - and if there is a severe latency issue with context switching, fewer threads should eliminate that bottleneck, as you'd have sufficient idle cores to be able to eliminate the switching overhead almost entirely - but as you saturate the die, you're going to start hitting that performance penalty more and more often.

    The Core i3 to Xeon E5 analogy - maybe, it's plausible if you think the API/driver can't generate enough threads to keep the cores busy on Fiji. But then we're kinda back to a driver issue, or if there just aren't any more threads to be made (as your title poses), an engineering issue (similar to the IPC versus core count debate on CPUs).

  • Hrimnir Member Rare Posts: 2,415

    Given everything that's been said in this thread, the only other speculation I can come up with is that the performance differences at high res can be attributed to the HBM.

    Memory bandwidth is kind of like a water pump: once you have enough to fill the pipes, more doesn't help.  I suspect that at higher resolutions the additional memory bandwidth HBM provides is allowing the card to post better numbers, and at lower resolutions the architecture of the card is what's causing the "slowness".

    I.e. if the Fury X had identical bandwidth to the 980 Ti, I would expect the % difference to be roughly the same across the board.

    "The surest way to corrupt a youth is to instruct him to hold in higher esteem those who think alike than those who think differently."

    - Friedrich Nietzsche

  • Quizzical Member Legendary Posts: 25,531

    Higher resolutions increase the amount of work to render a frame, certainly.  But does a higher resolution increase the ratio of memory bandwidth needed to computations needed?  Because if not, then it's irrelevant.

  • Hrimnir Member Rare Posts: 2,415
    Originally posted by Quizzical

    Higher resolutions increase the amount of work to render a frame, certainly.  But does a higher resolution increase the ratio of memory bandwidth needed to computations needed?  Because if not, then it's irrelevant.

    I can't claim to have enough knowledge of graphics processing to say.  I'll text a buddy of mine who works in the field and see what he has to say.

    Edit:  The other thing too: if the memory is becoming a bottleneck for the 980 Ti at 4K resolution, then regardless of the compute capabilities of the architecture, the memory bandwidth would be gimping it, and would then allow the AMD card to "catch up", so to speak.

    "The surest way to corrupt a youth is to instruct him to hold in higher esteem those who think alike than those who think differently."

    - Friedrich Nietzsche

  • Ridelynn Member Epic Posts: 7,383

    As nice as HBM seems on paper - I really wonder what the direct effect of HBM is on Fiji performance. It obviously has great physical characteristics, and has a power advantage, but a lot of people were expecting a giantslayer and pinning nearly everything on HBM.

    There is the very obvious 4G RAM limitation. HardOCP did find that you can hit that as a bottleneck on occasion, but most titles they tested did not.

    We can look at the much higher bandwidth, and think that because we see great performance gains from DDR3 to GDDR5, and on integrated graphics even just from going from slow DDR3 to fast DDR3, that we should see a similar performance jump from GDDR5 to HBM. But then again, do we really see a significant jump in performance in our computers going from DDR3 to DDR4? Or from DDR3 1333 to 2133 or higher? Maybe bandwidth has hit the point of diminishing returns, or rather, it wasn't the bottleneck in the first place, and something else is still the primary bottleneck. And if you look at the differences in clock speeds - a 512-bit bus at an effective 6 GHz, compared to a 4096-bit bus at 1 GHz - the difference in transfer rates isn't nearly as much as it would seem. We go from a theoretical 384 GB/s on Hawaii XT to 512 GB/s on Fiji - a nice increase, but it's not an order of magnitude change (like HDD to SSD was), it's not even a doubling. So I think the effect of HBM has been way overstated.
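The bandwidth comparison is simple arithmetic: peak bandwidth is the bus width times the effective per-pin data rate, divided by 8 to convert bits to bytes. Here is a minimal sketch using only the bus widths and effective rates quoted in the post; the function name is made up for illustration.

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, data_rate_gbps_per_pin: float) -> float:
    """Theoretical peak memory bandwidth in GB/s (bits per second divided by 8)."""
    return bus_width_bits * data_rate_gbps_per_pin / 8


# Figures as quoted in the post: 512-bit GDDR5 at an effective 6 Gbps per pin,
# versus 4096-bit HBM at an effective 1 Gbps per pin.
gddr5 = peak_bandwidth_gb_per_s(512, 6.0)
hbm = peak_bandwidth_gb_per_s(4096, 1.0)

if __name__ == "__main__":
    print(f"GDDR5: {gddr5:.0f} GB/s, HBM: {hbm:.0f} GB/s, ratio: {hbm / gddr5:.2f}x")
```

With those inputs, the eight-times-wider bus gives only about a one-third increase in peak bandwidth, because each pin runs at one sixth of the rate.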

    I think more of this has to do with changes in GCN1.3. I can find shockingly little about 1.3, apart from the fact that it uses HBM. Maybe it is just 1.2 with HBM support and there isn't much more to it than that. Benchmarks show better tessellation performance. HardOCP has this graphic, and this was the best I was able to find: http://www.hardocp.com/image.html?image=MTQzNTEwODU5MTlTMEhPT1prR0FfMV82X2wuZ2lm

    I lean more in the driver camp. Changing a driver that was optimized for a narrow fast bus, to go to a wide slow bus can't be easy. AMD is also working with a decent memory limitation, particularly on a product that's billed for 4k.

    I think performance on Fury will come up over time - we saw that when GCN1.0 came out, and we are still seeing decent performance jumps on 1.1 mostly driven by driver tweaks. But I also don't think the thought of performance going up over time is enough to get someone to buy one of these cards.

    I agree with what another poster said, this feels more like a tech demo for HBM. And it very well could be setting up AMD for a nice second-gen product, but we aren't at their second generation yet. It feels similar to Fermi - I think Nvidia was in a much worse position because of Fermi than AMD is because of Fiji - but I can see a lot of parallels. And Nvidia pulled through fine - Kepler was good, and Maxwell is very impressive. Time to see what AMD does with this, as it doesn't have nearly as big a legion of devotees who will buy anything Team Red puts out just because of the brand name to pull it through a rough patch.

  • Gdemami Member Epic Posts: 12,342

    Originally posted by Ridelynn
    Maybe it is just 1.2 with HBM support and there isn't much more to it than that.

    Indeed, Fury is very likely 1.2 GCN, no "next gen" card, just recycled and tweaked tech with new name, like AMD loves to do for past years. HBM is no news either, last year Nvidia announced HBM cards for 2016 release.

    At 1080p R9 390x is gaining 10% on R9 290x and Fury is gaining the same on R9 390x.
    At 4k, Fury gains additional 6%, gaining about 16% on R9 390x.

    Completely reasonable scaling.

    There is nothing wrong, it is just unrealistic expectations and fanboism.

  • 13lake Member Uncommon Posts: 719
    Originally posted by Gdemami

     


    Originally posted by Quizzical
    How about if you read your own links before you post?  You just linked to a Fury X beating a GTX 980 Ti outright at 4K, but losing by more than 10% at 1080p.  Meanwhile, it beats the 290X by about 50% at 4K and 30% at 1080p.  That's exactly what I'm talking about:  the Fury X fares better compared to other cards at higher resolutions than lower.

     

    ...aaand we are back to making up stuff. Nice one.

    I like how you use multiple logical fallacies and political-grade misdirection:

    So is guru3d at fault? I can continue providing you proof 15 times over; there are at least 15 websites that did testing, and not to mention that even the links you provided prove the same thing, only their pictures are so hard to read that you would need to cut and resize, pull out the numbers and calculate % to show it.

    And I applaud you for wasting your time to find the 3 websites on the complete opposite side of the spectrum to prove your false point, not like there's a huge driver and testing settings discrepancy between the results (that provides a small % wise fps difference) which you use to reinforce a logical fallacy.

  • Gdemami Member Epic Posts: 12,342

    Originally posted by 13lake

    I like how you use multiple logical fallacies and political-grade misdirection:

    So is guru3d at fault? I can continue providing you proof 15 times over; there are at least 15 websites that did testing, and not to mention that even the links you provided prove the same thing, only their pictures are so hard to read that you would need to cut and resize, pull out the numbers and calculate % to show it.

    And I applaud you for wasting your time to find the 3 websites on the complete opposite side of the spectrum to prove your false point, not like there's a huge driver and testing settings discrepancy between the results (that provides a small % wise fps difference) which you use to reinforce a logical fallacy.

    Oh dear...

    OK, so let's take a look at some other website, shall we? Guru3D is ruled out because they do not provide 1080p results.

    How about this site?
    http://www.bit-tech.net/hardware/graphics/2015/06/24/amd-radeon-r9-fury-x-review/7

    Simple test, 4 cards, 3 latest most demanding games, 3 resolutions.

    Sum of FPS at 1080p / 2160p

    GTX 980ti - 136 / 56
    GTX 980 - 121 / 44
    Fury X - 115 / 54
    R9 290x - 93 / 37

    So now, what fraction of its 1080p frame rate does each card keep when you go from 1080p to 2160p?

    GTX 980ti - 41%
    GTX 980 - 36%

    Fury X - 47%
    R9 290x - 40%


    So as you can see, Nvidia "suffers" same way as AMD, despite trololol theorycrafting above.

    This scaling is natural, higher resolutions simply need more relative performance.
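The percentages above can be recomputed from the FPS sums quoted in this post. This sketch assumes the sums were transcribed correctly from the linked review and adds one more comparison from the same numbers: each card's frame rate relative to the GTX 980 Ti at each resolution. The dictionary and function names are made up for illustration.

```python
# Sum of FPS over three games at (1080p, 2160p), as quoted in the post.
FPS_SUMS = {
    "GTX 980 Ti": (136, 56),
    "GTX 980": (121, 44),
    "Fury X": (115, 54),
    "R9 290X": (93, 37),
}


def retained_fraction(card: str) -> float:
    """Share of the 1080p frame rate a card keeps at 2160p."""
    fps_1080p, fps_2160p = FPS_SUMS[card]
    return fps_2160p / fps_1080p


def relative_to(card: str, baseline: str) -> tuple[float, float]:
    """Frame rate of `card` divided by `baseline` at (1080p, 2160p)."""
    card_1080p, card_2160p = FPS_SUMS[card]
    base_1080p, base_2160p = FPS_SUMS[baseline]
    return card_1080p / base_1080p, card_2160p / base_2160p


if __name__ == "__main__":
    for card in FPS_SUMS:
        low, high = relative_to(card, "GTX 980 Ti")
        print(f"{card}: keeps {retained_fraction(card):.0%} at 2160p; "
              f"vs 980 Ti: {low:.0%} at 1080p, {high:.0%} at 2160p")
```

Printing both views side by side makes it easier to judge whether the four cards really scale the same way between the two resolutions.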


    There is another site:
    http://www.pcgamer.com/amd-radeon-r9-fury-x-tested-not-quite-a-980-ti-killer/


    The results are pretty much the same. What is particularly noteworthy, though, is the comparison of the GTX 980 and GTX 970. Despite the GTX 980 having more CUDA cores, they experience an identical FPS drop at 2160p, hinting that it is the memory interface that gives the right kick for performance at higher resolutions.

    Just like the one HBM provides for that recycled Tonga, or whatever Fury is based on.
