You had one job GRUB

tb12939

Ars Tribunus Militum
1,797
I have what can only be described as a very odd GRUB issue (technically GRUB 2).

A few months ago, Ubuntu 20.04 LTS refused to boot (UEFI mode) on one of our ZFS servers (90 SAS 3.5" bays for bulk storage, booting off a small internal M.2 NVMe drive), throwing the famous 'alloc magic is broken' error. It had been working fine and had just been rebooted after a regular kernel update. Normally this indicates a broken GRUB installation, but no amount of update-grub or other obvious fixes helped, and a complete re-installation did nothing useful either. Much gnashing of teeth later, we managed to get these to boot off a spare SATA drive installed in legacy mode (the SATA drive in UEFI mode also failed) - I think it would probably also work on NVMe in legacy mode, but since they are live servers with ~800TB of data on them, room for experimentation was limited.
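
For reference, the 'obvious fixes' were the usual UEFI repair steps, something along these lines (package names and the ESP mount point are the stock Ubuntu defaults - adjust if your layout differs):

sudo apt install --reinstall grub-efi-amd64-signed shim-signed
sudo grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu --recheck
sudo update-grub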

As circumstances would randomly have it, another similar machine, yet to be commissioned, seems to have shed some light on this. Installed with Ubuntu 22.04 LTS, it is happy to boot in UEFI mode off NVMe most of the time, but it shows the same 'alloc magic is broken' error if specific combinations of SAS drives (recycled from one of the other ZFS servers) are inserted, and boots happily with other drives.

Of the 6 recycled drives tested, booting breaks whenever any one of #1, #2 or #3 is present together with #4, with or without other drives alongside. All combinations lacking that key pairing seem fine (e.g. #1, #2 and #3 together; #4, #5 and #6 together; any of #1, #2 or #3 with #5 and #6), while adding further drives on top of a bad combination still fails to boot.

Re-installation with 20.04 LTS or 24.04 LTS shows the same symptoms - happy to boot unless those specific drive combinations are present.

WTF?

The 'alloc magic is broken' message comes from GRUB's memory allocator when it finds a corrupted allocation header - effectively a buffer overflow somewhere in GRUB has scribbled over the heap. The bug has been there since at least Ubuntu 20.04.1 (the oldest release tried so far), which uses some variant of GRUB 2.04, and it is not fixed even in 24.04, which uses GRUB 2.12. And it seems strangely dependent on the exact contents of the other drives in the system.
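
As far as I understand it, the default Ubuntu grub.cfg does a 'search --fs-uuid' for the boot filesystem, which means GRUB probes the partition tables and filesystem signatures of every disk it can see - so stale metadata on a non-boot drive is at least a plausible trigger. One read-only way to compare what the 'bad' drives carry versus the 'good' ones (device names are placeholders for however the SAS disks enumerate on your box):

for d in /dev/sd{a..f}; do
    echo "== $d"
    sudo wipefs -n "$d"    # list filesystem / RAID / zfs_member signatures (no-act, nothing written)
    sudo sgdisk -p "$d"    # print the GPT; warns if the table looks damaged
done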

At this point I'm tempted to duplicate some of these disks to see if the 'clones' act the same as the originals - then I can at least ablate the contents and see what the key to triggering this is.
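
A rough sketch of that experiment (device names are placeholders - triple-check if=/of= before letting dd loose anywhere near ~800TB of production data):

# clone a suspect SAS drive onto a scratch drive of at least the same size
sudo dd if=/dev/sdX of=/dev/sdY bs=4M conv=fsync status=progress

# then ablate the clone piece by piece and retest: clear the old ZFS labels first
# (use the partition, e.g. /dev/sdY1, if the pool lived on a partition rather than the whole disk)...
sudo zpool labelclear -f /dev/sdY
# ...and if that isn't it, nuke the partition table entirely
sudo sgdisk --zap-all /dev/sdY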

Suggestions welcome...
 

tb12939

Ars Tribunus Militum
1,797
Is /boot full or doesn't it have enough space? That's the first place I'd check.
Is the server reading the boot drive as the primary boot device?
If it's UEFI, I wonder if the boot partition is slightly off the allocation table. Maybe make sure it's aligned properly (I believe the GParted live CD can tell you that).
Space on /boot is fine - these are fresh installs (of 20.04 / 22.04 / 24.04) on a ~1TB system drive.

The NVMe drive is set to be the only boot target in UEFI, and it boots fine on its own, or with any given SAS disk.

It's only combinations of 2+ SAS disks like those listed above that throw things off, and I honestly have no idea why GRUB even cares that much about them.

None of these disks have ever been used to boot - they're all former ZFS data disks which showed some problems and got replaced.
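
Another thing on the to-try list, assuming the machine gets far enough to show a menu before falling over with a bad combination inserted: drop to the GRUB console (press 'c' at the menu) and watch how far its own disk probing gets:

grub> set pager=1
grub> set debug=all    # very noisy - shows disk and filesystem probing
grub> ls               # list the drives/partitions GRUB can enumerate
grub> ls (hd1)         # poke at an individual SAS disk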