How do I properly debug these NVIDIA crashes?

Some days ago, I experienced some odd lockups in Doom Eternal in Super Gore Nest. I googled it, and lots of people apparently had a similar issue. I worked around the problem by approaching a specific map area from another angle and by disabling the intro video and the Bethesda login. Didn't give it much further thought.

Then I wanted to try Portal RTX, and the game would lock up and crash a few seconds after the menus became accessible. Having crashed Portal like 10 times, I decided to try Cyberpunk, and now that crashes every time too. I can get into the menus, but starting the in-game benchmark results in a crash maybe 10-20 seconds into the benchmark.

One odd detail: the benchmark seemed to be running way better than it should. On these same settings, normal would be 68-75 fps in the early parts, but right now I am seeing 90-100 fps, which is suspiciously good for Path Tracing at 4K on a 4070 Ti SUPER.

Another possibly related VERY odd detail: sometimes MSI Afterburner flashes core and memory clock values that make no sense whatsoever, such as a 10000 MHz memory clock, which should be outright impossible for any GPU that has ever existed. This is without any overclocking or tuning of any sort applied.
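
One way to cross-check Afterburner's readings is to log the clocks with nvidia-smi (it ships with the NVIDIA driver) and scan the log for outliers. The following is only a rough sketch: the helper names `parse_samples` and `flag_outliers` are made up for this example, and the ceiling values you pass in are your own assumptions about what your card can plausibly report, not figures taken from this thread.

```python
import csv
import io

# Standard nvidia-smi --query-gpu properties.
FIELDS = ["timestamp", "clocks.gr", "clocks.mem", "temperature.gpu", "power.draw"]

# Command to run in a terminal while the game is running (one sample per second).
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=" + ",".join(FIELDS),
    "--format=csv,noheader,nounits",
    "-l", "1",
]


def parse_samples(csv_text):
    """Parse nvidia-smi CSV output into dicts with numeric clock values."""
    samples = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != len(FIELDS):
            continue  # skip blank or truncated lines
        sample = dict(zip(FIELDS, row))
        try:
            sample["clocks.gr"] = float(sample["clocks.gr"])
            sample["clocks.mem"] = float(sample["clocks.mem"])
        except ValueError:
            continue  # skip readings that are not numbers
        samples.append(sample)
    return samples


def flag_outliers(samples, max_core_mhz, max_mem_mhz):
    """Return samples whose clocks exceed the caller-supplied ceilings."""
    return [
        s for s in samples
        if s["clocks.gr"] > max_core_mhz or s["clocks.mem"] > max_mem_mhz
    ]
```

To use it, redirect the nvidia-smi output to a file during a benchmark run (for example a hypothetical `gpu_log.csv`), read that file into a string, and pass it to `parse_samples`. If the outliers show up in this log as well, the odd values are probably not just an Afterburner display glitch.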

Display - Event 4101
Display driver nvlddmkm stopped responding and has successfully recovered.

nvlddmkm - Event 0
The description for Event ID 0 from source nvlddmkm cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

If the event originated on another computer, the display information had to be saved with the event.

The following information was included with the event:

\Device\Video5
Error occurred on GPUID: 100
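
To see how often these events fire and whether they line up with the crashes, the System log can be queried from a script instead of clicking through Event Viewer. This is a minimal sketch that wraps the built-in `wevtutil` tool; the function names are invented for this example, and it assumes the events land in the System log under the `nvlddmkm` and `Display` sources shown above.

```python
import subprocess


def build_event_query(providers, max_events=20):
    """Build a wevtutil command listing recent System-log events from the given sources."""
    name_filter = " or ".join("@Name='{}'".format(p) for p in providers)
    xpath = "*[System[Provider[{}]]]".format(name_filter)
    return [
        "wevtutil", "qe", "System",
        "/q:" + xpath,            # XPath filter on the event source
        "/c:" + str(max_events),  # maximum number of events to return
        "/rd:true",               # newest events first
        "/f:text",                # human-readable output
    ]


def recent_events(providers, max_events=20):
    """Run the query (Windows only) and return wevtutil's text output."""
    cmd = build_event_query(providers, max_events)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Calling `recent_events(["nvlddmkm", "Display"])` on the affected machine should print the most recent matching entries with their timestamps, which can then be compared against when each game crashed.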

Tried Quake Remaster, this doesn't crash, so apparently it's only the heavier modern games that have a problem.

Checked CPU and GPU temperatures, no issues there; the GPU reaches 75°C at worst.
Tried disabling Hardware-accelerated GPU scheduling - no effect.
Tried mildly downclocking both the core and the memory of my 4070 Ti SUPER and slightly decreasing the power limit - no effect.
Tried a whole bunch of different driver versions, doing a DDU clean every time:

551.86 - Current latest driver, seemed to work for a few weeks, problem started yesterday
551.61 - Driver version I previously used without issues, problem persists

Tried a few older versions that seemingly fix similar issues for a lot of people:
546.65 - same problem as above
537.58 - doesn't recognize the GPU, probably too old for the 4070 Ti SUPER

Reinstalled Windows 11 and disabled every possible overlay (Steam, MSI Afterburner, Discord, etc.), no effect. Then all of a sudden the problem seemingly went away for a week or so, only to return today; currently Cyberpunk often crashes before I even get to the main game menu. Tried "Prefer Maximum Performance" in the NVIDIA 3D settings and tried enabling NVIDIA debug mode, but neither had an effect.

How do I properly debug this?
 
If you've got an Xbox controller, update it. Long shot I know, but that was the root cause when I had "Display driver nvlddmkm stopped responding and has successfully recovered." crashes that only occurred on certain games.

Not such a long shot, in the sense that I actually do have an Xbox Elite Series 2 connected using Microsoft's dongle. The crashes happen even with the controller turned off, but for good measure I am going to do some testing with the dongle disconnected entirely.

EDIT: disconnecting the dongle didn't stop the crashes, but downclocking both GPU and memory clocks by 125 MHz seemingly has... (still had crashes at -50 and at -80, but the crashing has seemingly stopped for now after I pushed the downclock to -125).
 

hobold

Ars Tribunus Militum
2,657
This long shot is unlikely to be a hit: Igor's Lab recently disassembled a faulty Asus TUF model, where thermal paste had been applied badly by the factory.

If there were something faulty with your GPU cooler, then overheating is only one possible symptom. Conversely, if a thermal sensor isn't making proper contact, then reported temperatures would be too low and the GPU could try to boost beyond its limits.

Nowadays, most thermal measurements are integrated into the silicon itself, so that kind of failure mode will not occur in a "pure" way. But there might be other ways to trigger it, perhaps by a faulty calibration, or a mismatched reference circuit somewhere outside the GPU silicon.
 

Apteris

Ars Tribunus Angusticlavius
8,938
Subscriptor
How do I properly debug this?
Well, it's hard and I don't know exactly, but I'm going to try to provide some pointers anyway.

Step 1 will be to enable crash dumps for your program (it wasn't clear to me from your OP whether you've done this). Let's focus on Doom Eternal: according to this forum post the procedure is:

Bethesda Support said:
You can enable crash dump file creation by entering the following into Launch Commands "+rgl_captureCrashes 1".

To submit a crash dump, click on the following link: Bethesda Customer Support Crash Dump Submission

On Steam: right click on DOOM Eternal -> go to Properties, and under GENERAL -> enter the following under LAUNCH COMMANDS: +rgl_captureCrashes 1

Note: PC crash dumps will automatically be written as a .zip file to the following folder: C:\Users\username\Saved Games\id Software\DOOMEternal\base\crashes
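
Once crash capture is enabled, picking out the latest dump after each crash can be scripted. A small sketch, assuming the dumps really do land as .zip files in the folder quoted above; `newest_crash_dump` is a name made up for this example.

```python
from pathlib import Path


def newest_crash_dump(crash_dir):
    """Return the most recently modified .zip in crash_dir, or None if there is none."""
    zips = sorted(Path(crash_dir).glob("*.zip"), key=lambda p: p.stat().st_mtime)
    return zips[-1] if zips else None
```

On a default install the folder would be something like `Path.home() / "Saved Games" / "id Software" / "DOOMEternal" / "base" / "crashes"`, but check the actual location on your machine before relying on it.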

Step 2 will be to look at the produced crash (if you don't have one already, you'll have to make it happen again) and try to understand it. Start here: Using the !analyze Extension.
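
As a rough illustration of that step, `!analyze -v` can be run non-interactively through `cdb`, the console debugger from Microsoft's Debugging Tools for Windows. This sketch assumes `cdb` is installed and on your PATH, and that you have already extracted a dump file from the .zip; I haven't checked what the Doom Eternal archive actually contains, so the dump file name is hypothetical, as are the function names.

```python
import subprocess


def build_analyze_command(dump_path, debugger="cdb"):
    """Build a command that opens a dump, runs !analyze -v, then quits the debugger."""
    return [debugger, "-z", str(dump_path), "-c", "!analyze -v; q"]


def analyze_dump(dump_path, debugger="cdb"):
    """Run the debugger on the dump (Windows only) and return its text output."""
    result = subprocess.run(
        build_analyze_command(dump_path, debugger),
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Something like `print(analyze_dump("crash.dmp"))` would then dump the analysis to the console, where the faulting module and stack are the first things to look at.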

Then, depending on what you find in the crash dump, you may or may not need to debug the issue further, load and use GPU driver symbols, and so on. Can't predict what else you might need at this point.

If you like jumping down deep and dark rabbit holes, you'll love chasing this down.
 