I have no idea what's wrong with my GPU

Dystopia

Ars Tribunus Angusticlavius
7,674
My specs are below. The original GPU was a 10GB 3080. I have replaced it with two second hand 3090s. I realise that might seem a bit insane, but it's actually the perf/$ sweet spot for running LLMs locally. I benchmarked the 3080 before I took it out.

I have tested the 3090s and they are not performing as expected. In fact they're not even performing the same as each other. In Furmark one of them does what you expect: It runs full tilt, ~300FPS @ 1080p, draws ~350W, and spins the fans moderately. The second one runs at ~70%, ~120FPS @ 1080p, draws ~250W, and spins the fans up to jet engine at takeoff levels. The recalcitrant one is in the top slot and has my display plugged into it. In game benchmarks it actually performs worse than the 3080 it replaced. If I plug the display into the other card instead, the picture flickers like crazy, if the display will even accept the input. My motherboard splits the PCIe lanes x8/x8. The cards are not in SLI. I have updated the Nvidia drivers. Performance in Geekbench ML is much closer, ~1-2% difference.

I got these cards as a pair from the same seller. They're made by PNY.

I am at a loss as to how to proceed.

Ryzen 5600X
Asrock B550 Taichi
32GB DDR4
3090 x2
Windows 10
Seasonic Focus GX 1000W
 

IceStorm

Ars Legatus Legionis
24,871
Moderator
In Furmark one of them does what you expect: It runs full tilt, ~300FPS @ 1080p, draws ~350W, and spins the fans moderately.
This card is fine.

The second one runs at ~70%, ~120FPS @ 1080p, draws ~250W, and spins the fans up to jet engine at takeoff levels.
This card's memory is overheating.

I am at a loss as to how to proceed.
Remove the "good" card. With just the "bad" card installed, re-run some tests and check the memory hot spot temps. My guess is you're hitting 110C, as that will cause the fans to go to 100% even if the GPU core is cool (60-70C). nVidia got a doctor's note from Micron allowing them to run the GDDR6X up to 115C, but at 110C the fans will go to 100%.
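
For logging each card during a test like this, a minimal Python sketch that polls nvidia-smi might look like the following. It assumes nvidia-smi is on the PATH and that the driver reports all of the queried fields. The function names are made up for illustration. As far as I know nvidia-smi does not expose the GDDR6X hot spot temperature on these cards, so that reading still has to come from a tool like HWMon. This only logs core temperature, fan speed, power draw and SM clock.

```python
import csv
import subprocess
import time

# Fields requested from nvidia-smi, in the order they are parsed below.
FIELDS = ["index", "name", "temperature.gpu", "fan.speed", "power.draw", "clocks.sm"]


def to_float(text):
    """Return a float, or None for values nvidia-smi reports as N/A."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_smi_csv(output):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    lines = output.strip().splitlines()
    for record in csv.reader(lines, skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or unexpected lines
        index, name, temp, fan, power, clock = (cell.strip() for cell in record)
        rows.append({
            "index": int(index),
            "name": name,
            "core_temp_c": to_float(temp),
            "fan_pct": to_float(fan),
            "power_w": to_float(power),
            "sm_clock_mhz": to_float(clock),
        })
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_smi_csv(result.stdout)


def log_gpus(interval_s=5, samples=60):
    """Print one line per GPU every `interval_s` seconds."""
    for _ in range(samples):
        for gpu in read_gpus():
            print(gpu)
        time.sleep(interval_s)
```

Running `log_gpus()` in a terminal while Furmark is going should make it obvious if one card's power draw and clock drop while its fan percentage climbs.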

You're looking at a repad/repaste of the card. I had to do the same for my 3090 FE as the pads used at the factory had terrible thermal conductivity. Repadding/repasting dropped the memory temps down into the high 80s under load. Core temps went up around 5C. It's a balancing act: thicker pads are better for memory temps, but they reduce pressure on the core, which isn't good for core temps.
 

Dystopia

Ars Tribunus Angusticlavius
7,674
I have now tested the cards individually. I started by removing the lower card, thus leaving the suspected bad card alone in the top slot. I reran Furmark. Performance was massively improved. ~300FPS, drawing ~350W, fan noise not as bad, even though I left the side of the case off. Had I only received this one card I would have chalked it up to just being how a 3090 is and wouldn't have suspected something was wrong. I let it run ~15mins until I was pretty confident fan speeds had stabilised. HWMon screenshot while running Furmark:
suspected-bad-card-furmark.png

I didn't run all my game benchmarks because I'm not actually benchmarking, just sanity checking. Performance is now ~12% better than the 3080. MSI Afterburner overlay (sorry about the photo rather than screenshot, but I didn't want to interrupt the benchmark):
bad-card.png

Then I removed the suspect card and installed the other one (originally the bottom card) alone in the top slot. I ran Furmark. It didn't take nearly as long for fan speeds to stabilise, and the final speed was much quieter than the other card. There is definitely a noticeable temperature difference. HWMon:
Suspected-good-card-furmark.png

Game performance was ~11% better than the 3080. There was a significant time gap between the Furmark test and the game test because I was interrupted by dinner. MSI Afterburner (note that this is not the same time as the other):
good-card.png

I also wonder if it's partly an airflow problem. My case is a Lian Li Lancool II, which ranked very near the top of the old GamersNexus case chart, so it's probably not that. On the other hand the cards are triple slot, triple fan, and the end fan is a flow through config. There's very little space between the two cards, and the bottom card exhausts straight into the top one.

How difficult/risky would a repaste be? I haven't actually taken a GPU apart before. Is it basically like doing the same thing with a CPU? I'm mostly trying to gauge whether I should just do it, or whether I should complain to the seller.
 

IceStorm

Ars Legatus Legionis
24,871
Moderator
Your bad card is hitting a memory hotspot temp ~20C greater than your good card - 109C vs 89C. At least one RAM chip is not properly cooled. That's your problem. Sure, you can run it as a single card and it'll mostly be ok, but adding another ~350W of heat to the case is pushing it past the point where it can avoid throttling.
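
To put rough numbers on that, here is a small sketch of the headroom arithmetic. It uses the 110C fan trigger and 115C limit quoted earlier in the thread and the two hotspot readings above. The names are hypothetical.

```python
FAN_MAX_TRIGGER_C = 110  # point where the fans reportedly go to 100%
MEMORY_LIMIT_C = 115     # GDDR6X limit quoted earlier in the thread


def headroom_c(hotspot_c, threshold_c=FAN_MAX_TRIGGER_C):
    """Degrees left before the memory hotspot reaches the threshold."""
    return threshold_c - hotspot_c


# Hotspot readings from the single-card Furmark runs in this thread.
readings = {"bad card": 109, "good card": 89}

for name, temp in readings.items():
    print(
        f"{name}: {headroom_c(temp)}C below the fan trigger, "
        f"{headroom_c(temp, MEMORY_LIMIT_C)}C below the limit"
    )
```

With so little margin when the bad card runs alone, it doesn't take much extra heat from a second card to push it over.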

The 3090, performance-wise, was only 10-15% faster than the 3080. Its biggest benefits were the extra bandwidth, extra 14GB of VRAM, and staying in stock long enough to buy. The only reason I own a 3090 FE is because it managed to stay in stock for more than 15 seconds at Best Buy, unlike the 3080s when they dropped.

As for repadding/repasting, look around online for a teardown guide and pad thickness details. The FE cards are pretty bad, as they have very thin ribbon cables and fragile connectors. The process for me took several hours, including cutting the pads by hand to fit. I had a template and even then it took a while to cut out all the pads. Don't bother with the pre-made pad kits, like the ones from Kritical. I tried one for one of my 3080 FE cards. They're incomplete and the wrong thickness. I ended up just using Gelid GP-Extreme again, cutting the pads by hand. There's at least one teardown video showing that PNY used more traditional connectors for their fans, so getting things apart should be less nerve-wracking.

PNY has several different models. What specific model do you have? If it's a Revel, there's a reddit thread on it that says the EK Waterblocks guide for thermal pad placement was applicable.
 

Dystopia

Ars Tribunus Angusticlavius
7,674
PNY has several different models. What specific model do you have? If it's a Revel, there's a reddit thread on it that says the EK Waterblocks guide for thermal pad placement was applicable.

This might sound silly, but I don't know. I've turned the box over several times and I'm buggered if I can figure out which part of it is supposed to be a model name. See photo below. Is "XLR8 Gaming" the model name? It's in the place where you expect the model name, but it feels more like marketing fluff than a model name. "EPIC-X RGB" seems even less promising. I didn't worry about this when buying it because it was the cheapest available to me, and things like factory overclocks are irrelevant to my application. It's all about the VRAM. Two 3090s is substantially cheaper than a single 4090 and has twice the VRAM. For LLMs the 4090 is a little faster right up until it runs out of VRAM, at which point two 3090s will leave it in the dust.
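
A back-of-envelope sketch of the VRAM argument, in Python. It only counts the model weights, so it ignores the KV cache, activations and runtime overhead, and it assumes the inference software can split a model across both cards. The model sizes and bit widths are just illustrative picks, and the function names are hypothetical.

```python
GIB = 2**30


def weights_gib(params_billion, bits_per_weight):
    """Approximate size of the model weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def fits(params_billion, bits_per_weight, vram_gib):
    """True if the weights alone fit; real usage needs extra room on top."""
    return weights_gib(params_billion, bits_per_weight) <= vram_gib


# One 24GB card (a single 3090 or 4090) versus two 3090s pooled together.
SETUPS = {"one 24GB card": 24, "two 24GB cards": 48}

for params, bits in [(13, 4), (70, 4)]:
    size = weights_gib(params, bits)
    for label, vram in SETUPS.items():
        verdict = "fits" if fits(params, bits, vram) else "does not fit"
        print(f"{params}B @ {bits}-bit = {size:.1f} GiB: {verdict} on {label}")
```

Anything that lands between 24 and 48 GiB is the range where the second 3090 earns its keep.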

3090-pny.png

EDIT: On googling, it appears "XLR8" is indeed the model name. On PNY's site it looks like what I have. The only other PNY 3090 I could find has a fan shroud that does not look like mine. So mine must be the Revel version. I can't find the word "Revel" anywhere on the box (though I did discover that it'll give you cancer if you live in California, I'll have to remember that if I ever go to America with my 3090). Whoever designed this box needs to go back to marketing school. They really didn't need to try at the peak of the pandemic, did they?

I will have to watch that video when I have some time.
 

MadMac_5

Ars Praefectus
3,700
Subscriptor
I also wonder if it's partly an airflow problem. My case is a Lian Li Lancool II, which ranked very near the top of the old GamersNexus case chart, so it's probably not that. On the other hand the cards are triple slot, triple fan, and the end fan is a flow through config. There's very little space between the two cards, and the bottom card exhausts straight into the top one.
This is a BIG part of your problem, especially since removing the "good" card improves performance so much. I had a very similar setup in an otherwise good case for a workstation here in the office, but the two triple-slot 2080 Ti cards were choked for airflow. We eventually solved that problem by scavenging a pair of dual-slot 2080 Ti cards from a Lambda workstation that another researcher upgraded to A6000s.
 

Dystopia

Ars Tribunus Angusticlavius
7,674
Okay, I've watched that video, doesn't look too hard. Unfortunately it turns out there are two revisions of this card. There doesn't appear to be any way to tell which one I have without pulling it apart, and it matters because it changes what thickness of thermal pad is needed.

As for thermal paste for the GPU die, is there any reason Noctua NT-H1 wouldn't be suitable? It's what I have lying around.

This is a BIG part of your problem, especially since removing the "good" card improves performance so much. I had a very similar setup in an otherwise good case for a workstation here in the office, but the two triple-slot 2080 Ti cards were choked for airflow. We eventually solved that problem by scavenging a pair of dual-slot 2080 Ti cards from a Lambda workstation that another researcher upgraded to A6000s.

I think I'm going to have to find a case with a vertical mount that doesn't suck, and use a riser to separate them.
 

Dystopia

Ars Tribunus Angusticlavius
7,674
In thinking about this some more, I reckon if I get a riser I could prop the second card up vertically. There's no actual mount for it, but I did a little proof of concept and the motherboard power cable actually does a decent job of holding the card in place. Depending on how much taller the riser makes the card, it might even be possible to close the door. If I'm really lucky it may even be possible to have enough clearance from the door to get respectable cooling. If only the power connectors attached to the end of the card, like in the old days, instead of the side, then those maybes would be a definite yes.

Or another crazy idea: With a long enough riser, and if I replace my CPU cooler with an AIO, I could cable tie the second card to the roof of the case. It would definitely get plenty of fresh air there. A riser and an AIO probably wouldn't cost any more than a case with a decent vertical mount. Come to think of it, I could pull the AIO out of my NAS (it's only there because it didn't fit in the SFF PC it was meant for), and swap it for the air cooler out of this PC.

Yes, I do all my best thinking late at night, after whisky. How could you tell?

Anyone got any recommendations for risers? Or ones to avoid?
 

IceStorm

Ars Legatus Legionis
24,871
Moderator
I'd be surprised if there isn't a sticker on the backplate with the specific model number. You could also ask PNY support which revision it is.

Also, you may not have an issue with ALL the RAM chips. It's probably just one of them. Should be safe enough to open up the card and check for contact on all the RAM chips. You'd just need a bit of thermal paste to reapply to the core.

Gelid pads aren't too expensive, around $10-12 per "sheet". Could just buy one of each possible thickness.
 

whoisit

Ars Tribunus Angusticlavius
6,565
Subscriptor
It might even be a design flaw on the cooler. MSI had several cards around the 5700 XT era where one RAM chip wasn't getting full coverage/pressure from the heatsink. I think some Dell OEM nVidia cards had issues where thermal pads were just missing.

It might be worth either opening it up, or RMA'ing it to make sure everything under the heatsink is installed correctly.
 

Dystopia

Ars Tribunus Angusticlavius
7,674
I set up a couple of NF-F12s like this to ram some air into the gap between the GPUs. It works...when only one GPU is under load. With the top card (the good one) running Furmark, the fans stabilise at ~2970RPM. If I run Furmark on the second one also, the fans race up to max speed and then the GPU starts throttling. Oh well, it was worth a try. Not like it cost me anything. Riser cable it is then.

fans.png
 