Memory testing on ECC RAM - are ECC errors "ok"?

ZPrime

Ars Legatus Legionis
18,725
Subscriptor++
Hey C&MT, it's been a while for me.

I'm putting together a passively-cooled Qotom system (Atom C3758R) to become a new OPNsense router... More accurately, it will probably run Proxmox, with the router virtualized and getting NICs passed through to it. This is for ease of backup, and so I can run some other utility VMs on the same system (pihole, unifi controller, etc) since I don't need 8 cores of C3758 to route 1Gb at home.

I have 2x32GB of ECC DDR4 - yes, overkill for a router, but see above re: virtualization/etc.

Because I'm not a savage, I'm running it through bench testing before doing more with it. Started with the FOSS memtest86+; it made it through a whole "pass" and a bit more, with zero errors. MT86+ did not seem to be reading any ECC information from the board though.

Grabbed Passmark's free edition of Memtest, which does support ECC. Both sticks together were throwing ECC errors less than 2 hours in. Let's do the reductive testing dance... Tested them one by one in the first slot.
One stick made it through 4 whole rounds of Passmark Memtest (8.5 hours) and had one single ECC error during that run.
The other stick throws multiple ECC errors within the first 30-60 minutes; I have it sitting to complete a full run now and will check it in 9 hours to see a real total.

But these are "only" correctable ECC errors... should I care? I feel like the stick with a bunch of them probably should be RMA'd? The one that made it 8.5 hours with a single error, I could potentially believe that's a fluke / legit random bit flip.

TL;DR: Should I RMA the "more bad" stick with a lot of ECC errors, or should I RMA them both? Or don't worry about it at all, because ECC is doing its job?
 
ECC is supposed to be an additional layer of protection atop DRAM that has the same infinitesimally small error rate as unprotected DRAM. What you are seeing is an extremely high error rate, which IMHO justifies an RMA.

The fact that you can observe very different error rates between the two DIMMs indicates that it isn't your CPU or motherboard that is causing the failures. And the fact that both sticks are unreliable IMHO speaks against the maker of those sticks. One is bad luck, two is a suspicious correlation.
 

ZPrime

Ars Legatus Legionis
18,725
Subscriptor++
That was kind of my gut feeling too, but I've been out of the DIY realm so long I wanted to check. :(

These are both from a "brand" on Newegg called "Nemix RAM"; the chips appear to be Micron, although the SPD information is mostly blank. Seemed like a reputable vendor, but maybe not... It's somewhat difficult to find ECC DDR4 in the SO-DIMM form factor.
 

w00key

Ars Praefectus
5,907
Subscriptor
I see a ton of Kingston and Micron modules, for example this Micron one: MTA18ASF4G72HZ-3G2R. RAM is one thing I would never buy from no-name vendors.

Micron is tier 1 and fabs its own chips. Kingston is tier 2: they are okay on reliability, but they sometimes switch chips while keeping the stick's model number the same. So buying more memory later may leave you with mismatched sticks. Should work but may have compatibility issues when running 2 DIMM/channel.
 

theevilsharpie

Ars Scholae Palatinae
1,199
Subscriptor++
In addition to the high error rate, correctable ECC errors aren't "free", even if the memory controller is able to correct them.

Having to correct an ECC error incurs a significant latency penalty, and a system running with an ECC DIMM that's constantly throwing correctable errors can be noticeably slow.
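
For keeping an eye on corrected errors once the box is actually running Linux (Proxmox, for example), here is a minimal sketch that reads the kernel's EDAC counters from sysfs. It assumes an EDAC driver for the platform's memory controller is loaded and that the counters live under the usual `/sys/devices/system/edac/mc` layout; the function name `read_edac_counts` is made up for illustration, and whether the C3758 board exposes these counters at all is something to verify on the actual hardware.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs.

    Assumption: the standard EDAC layout, where each memory controller
    directory (mc0, mc1, ...) holds ce_count and ue_count files.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        # EDAC driver not loaded, or the platform is not supported.
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected count that keeps climbing between checks is the same symptom the memory tester reported, just seen from the running OS.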
 

ZPrime

Ars Legatus Legionis
18,725
Subscriptor++
w00key said: "Should work but may have compatibility issues when running 2 DIMM/channel."
In this case, the board only has 2 slots so that's not a huge concern. ;)

The CPU maxes out at DDR4-2400... I know that RAM is supposed to be able to clock down to a lower speed, but does that always work?

I just ordered two of the Micron part you listed, directly from the Crucial/Micron webstore (I didn't realize they did direct-to-customer sales). Normally Micron is my go-to for RAM, but Qotom had a very specific "tested memory list" for this system, and the only ECC listed was SK Hynix or Samsung, IIRC. They implied that the board simply wouldn't boot with untested RAM, which is obviously not the case since these "Nemix" sticks booted fine, ECC errors aside...
 

w00key

Ars Praefectus
5,907
Subscriptor
ZPrime said: "The CPU maxes out at DDR4-2400... I know that RAM is supposed to be able to clock down to a lower speed, but does that always work?"
It's annoying that manufacturers don't list the speeds in the SPD. But usually it has something like this, borrowed from a [H] thread:

[Attached image: 634083_1649302537493.png]

It would be weird if this one came with only a 1600 MHz (x2 = DDR4-3200) profile.
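
To spell out the "x2" arithmetic and the down-clocking question, here is a rough sketch. It is a deliberately simplified model, not how firmware actually parses SPD: it just assumes the module advertises a list of clock speeds and the board picks the fastest one whose data rate does not exceed the CPU's limit. The function names and the example speed lists are hypothetical and are not read from any real module.

```python
def data_rate_mts(clock_mhz):
    """DDR transfers data twice per clock, so 1600 MHz gives 3200 MT/s."""
    return clock_mhz * 2


def pick_speed(module_clocks_mhz, controller_max_mts):
    """Return the fastest advertised data rate the controller can run, or None.

    Simplified assumption: the module offers a fixed list of clock speeds
    and the board chooses the highest one within the controller's limit.
    """
    usable = [
        data_rate_mts(clock)
        for clock in module_clocks_mhz
        if data_rate_mts(clock) <= controller_max_mts
    ]
    return max(usable) if usable else None


if __name__ == "__main__":
    # Hypothetical module that lists several speeds: the 1200 MHz entry
    # fits a controller limited to DDR4-2400.
    print(pick_speed([1600, 1333, 1200], 2400))
    # Hypothetical module with only a 1600 MHz entry: nothing fits.
    print(pick_speed([1600], 2400))
```

Under this model, the "weird" case is the second one, where the module offers nothing at or below what the CPU can run.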
 