I have what can only be described as a very odd GRUB issue (technically GRUB 2).
A few months ago, Ubuntu 20.04 LTS refused to boot (UEFI mode) on one of our ZFS servers (90 SAS 3.5" bays for bulk storage, booting off a small internal M.2 NVMe drive), throwing the famous 'alloc magic is broken' error. It had been working fine and had just been rebooted after a regular kernel update. Normally this indicates a broken GRUB installation, but no amount of update-grub or other obvious fixes helped, and a complete re-installation did nothing useful either. Much gnashing of teeth later, we managed to get these to boot off a spare SATA drive, installed in legacy mode (the SATA drive in UEFI mode also failed). I think it would probably also work on NVMe in legacy mode, but since they are live servers with ~800TB of data on there, room for experimentation was limited.
As luck would have it, another similar machine, yet to be commissioned, seems to have thrown some light on this. Installed with Ubuntu 22.04 LTS, it is happy to boot in UEFI mode off NVMe most of the time, but shows the same 'alloc magic is broken' error if specific combinations of SAS drives (recycled from one of the other ZFS servers) are inserted. With other drives it boots happily.
Of the 6 recycled drives tested, booting breaks whenever #4 is present together with any one of #1, #2 or #3, with or without other drives. Every combination lacking that pairing seems fine (e.g. #1, #2 and #3 together; #4, #5 and #6 together; any of #1, #2, #3 with #5 and #6), while adding further drives on top of a failing pairing still fails.
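To make the pattern above easier to check against further test boots, here is a minimal Python sketch that writes the observed rule as a predicate and lists every combination of the six drives with its predicted outcome. The names (`TRIGGER_GROUP`, `TRIGGER_PARTNER`, `predicted_to_fail`) are made up for illustration, and the rule is only what has been observed so far, not anything GRUB itself defines.

```python
from itertools import combinations

# Observed rule (an assumption based on the tests so far):
# boot fails when drive #4 is present together with any of #1, #2, #3.
TRIGGER_GROUP = {1, 2, 3}
TRIGGER_PARTNER = 4


def predicted_to_fail(inserted):
    """Return True if this set of drive numbers matches the failing pattern."""
    drives = set(inserted)
    return TRIGGER_PARTNER in drives and bool(drives & TRIGGER_GROUP)


def all_combinations(drives=range(1, 7)):
    """Yield every non-empty combination of the given drive numbers."""
    drives = list(drives)
    for size in range(1, len(drives) + 1):
        for combo in combinations(drives, size):
            yield combo


if __name__ == "__main__":
    # Print a checklist to compare against real boot attempts.
    for combo in all_combinations():
        print(combo, "FAIL" if predicted_to_fail(combo) else "ok")
```

Any real boot result that disagrees with this table would mean the rule is incomplete, which is useful information in itself.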
Re-installation with 20.04 LTS or 24.04 LTS shows the same symptoms - happy to boot unless those specific drive combinations are present.
WTF?
The 'alloc magic is broken' message effectively indicates a buffer overflow somewhere in GRUB (something has overwritten the header of a heap allocation) - but it's been there since at least Ubuntu 20.04.1 (the oldest one tried so far), which uses some variant of GRUB 2.04, and is not fixed even in 24.04, which uses GRUB 2.12. And it seems to be strangely dependent on the exact content of other drives in the system.
At this point I'm tempted to duplicate some of these disks to see if the 'clones' act the same as the originals - then I can at least ablate the contents and see what's the key to triggering this.
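If the clones do reproduce the failure, the ablation could be done by zeroing progressively smaller regions of a clone and re-testing. A rough Python sketch of that idea follows; it assumes the clone is an ordinary image file (not a raw block device, where `os.path.getsize` does not report the device size), and `zero_range` and `halves` are hypothetical helper names, not part of any existing tool.

```python
import os


def zero_range(image_path, offset, length, chunk_size=1024 * 1024):
    """Overwrite `length` bytes starting at `offset` with zeros, in place.

    Intended for a CLONE image file only; never point this at an original.
    """
    size = os.path.getsize(image_path)
    if offset < 0 or length < 0 or offset + length > size:
        raise ValueError("range falls outside the image")
    zeros = bytes(chunk_size)
    with open(image_path, "r+b") as image:
        image.seek(offset)
        remaining = length
        while remaining > 0:
            step = min(chunk_size, remaining)
            image.write(zeros[:step])
            remaining -= step


def halves(offset, length):
    """Split a byte range into two (offset, length) halves for bisection."""
    mid = length // 2
    return (offset, mid), (offset + mid, length - mid)
```

The intended loop is manual: zero one half of a fresh clone, write it back to a test drive, try to boot, and keep bisecting whichever half makes the error go away when wiped. Partition tables and ZFS labels at the start and end of the disk are probably the first regions worth trying.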
Suggestions welcome...