Linux as small server

crosslink

Ars Scholae Palatinae
882
Subscriptor
I'm refurbishing a system I've been using as a NAS (FreeNAS-based, installed when it was still called that) for the last decade or so. Hardware-wise, it's based on a Supermicro X9SCM motherboard. I picked up a Xeon E3-1240 to replace the Pentium I had been using, and an inexpensive SATA SSD for the OS (I had been using a USB drive).

Since I've had good experience using Debian-based distros as my personal computer for well over a decade, I wanted to base the new system on Ubuntu Server (or Debian Standard) along the lines described by Jim Salter.


The OS (Ubuntu or Debian; I've installed both several times now) installs and runs fine. The ZFS pool and dataset(s) also set up with no issues. But when I try to transfer my music collection (about 600 folders containing, collectively, about 10x that number of files) to a ZFS dataset via SMB, the system freezes after a small fraction of the data gets transferred. Every. Time. (And I've tried lots of times!)

This of course screams out "hardware". I have physically swapped out, one at a time, these:

  • Power supply
  • CPU (Pentium G2020)
  • Motherboard (Supermicro X9SCM-F)
  • Boot storage device (SATA, live USB).

No culprit identified.

I didn't have any memory to swap (heh), but the 2x8GB of Kingston ECC DIMMs in the system passed MemTest86 and an hour of stressapptest on Linux.

Out of exasperation, I tried Xigmanas 13.2.0.5. Lo and behold it sets up, runs, and transfers my music collection over SMB with no issues at all, on the same hardware, regardless of how much data I transfer at one go. A test setup has been running for several days now, stable as can be.

I'm gobsmacked. Does anyone have an idea of why I might be seeing such a big difference between Xigmanas and the Linux distros I was trying?

While Xigmanas is running great, I really would like to implement encryption on the dataset level as described in Jim's guide. My reading of the Xigmanas docs indicates this isn't really supported, so I'm not completely convinced that would be the best path, especially if I can understand what's happening (so far) with Ubuntu and Debian.

I'm also open to other recos (my NAS demands are pretty basic, so there may be many good options.)
 
You can run the samba process in the foreground to see what is going on, and maybe find the problem...
Code:
smbd -i -d 3
This runs the process in the foreground with a higher logging level; it will answer a single transaction and then exit when done. Also check /var/log/samba/* for any errors in the logs.

Another protocol to consider is NFS, which would work with *nix and Mac systems, but not Windows.
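For completeness, a minimal sketch of what serving a dataset over NFS looks like on the Debian/Ubuntu side. The dataset path and subnet are placeholders for this setup, and it assumes the nfs-kernel-server package is already installed:

```shell
# Minimal NFS export of the music dataset.
# /tank/music and 192.168.1.0/24 are placeholders -- substitute the real
# dataset mountpoint and local subnet.
echo '/tank/music 192.168.1.0/24(rw,sync,no_subtree_check)' | \
    sudo tee -a /etc/exports

sudo exportfs -ra        # re-read /etc/exports
showmount -e localhost   # confirm the export is live
```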
 
  • Like
Reactions: crosslink

crosslink

Ars Scholae Palatinae
882
Subscriptor
Thanks koala and brendan_kearney.

I have rebuilt the server software using Ubuntu Server 22.04 and have the data transfer running now over NFS (using rsync). So far it's going smoothly--on Beethoven and going. It's already much farther along than the transfers over SMB ever got.

Will let it continue doing its thing and report back when it finishes one way or the other (probably tomorrow).
 

koala

Ars Tribunus Angusticlavius
7,579
How much data is this?

Does it work if you retry? I mean, what's that small fraction? How long is the transfer? How long before it fails?

If you have an unreliable network, copies can fail at any point. And maybe the other system was "lucky". Maybe you retry with the other system and it fails again...

I assume this is a local network? Is wireless involved?
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
Total amount of data is about 400 GB, in about 600 folders (mostly flac files) in a "flat" structure (all in 1 folder). This is the music portion of what I want to live on the NAS. Extrapolating from what copied yesterday, it would take roughly 3 hours to transfer.

Yes this is a local network, here at home. Router/DHCP server is an Intel Atom machine with dual ethernet ports running Opnsense, which assigns static IP addresses to server and client(s). Wifi is handled by a Ubiquiti access point. I'm using a Cisco 24 port ethernet switch connecting everything over ethernet, except of course whatever connects over wifi.

Good point about that... The client I used yesterday was my Thinkpad T460 (Ubuntu 22.04), connected over, here it comes, WiFi. While we're at it, the data source was a 3.5" SATA (spinning rust) hard drive, connected to the laptop by a Plugable USB adapter.

Put this way, I guess I can see two weak links. The wifi works quite well (as wifi goes) but an ethernet connection would be a different thing entirely. Same for the USB/SATA adapter? I tried to get a reasonably good one (Plugable) but maybe a direct SATA connection would be more reliable?

I'll plug the laptop into ethernet and try the transfer again.
Another comparison I could make would be to connect the source hard drive directly to the motherboard of my HTPC, which is always connected to the network by ethernet.

Will try them in that order.

I guess I'm puzzled by the hard crashes, which I wouldn't necessarily expect even in the case of, say, wifi issues.
 

koala

Ars Tribunus Angusticlavius
7,579
Hmmm, I'm getting worse at reading carefully. So you say the server freezes? Do you stop being able to reach the server via SSH or via physical access?

I mean, if the T460 got a networking issue, the transfer would stop... and if it's a wifi connection issue, you might not be able to reach the same server through ssh (or an ssh connection would drop).

If you are on the physical console of the server and it freezes, then it can't be a client problem.
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
The server does indeed crash hard. ssh becomes unresponsive. Right now it's set up with a monitor and keyboard. The cursor stops flashing and it won't respond in any way without a hard power cycle. This just happened with the copy over nfs I started with the laptop plugged into ethernet, failing in only about 5 minutes.

Yeah, it is these crashes (which happen every time on the server side, and in the same way regardless of whether I'm using Ubuntu Server or Debian Standard) that have me puzzled. My Linux admin chops aren't terribly strong, so I could use some hints about which logs to look at (though in the case of a hard crash I can imagine they might not get written).
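For later reference, the usual first stops after an unclean reboot can be sketched like this. The `check_log` helper and its list of signatures are my own illustration, not anything standard:

```shell
# check_log: scan a log file for the usual hard-crash signatures
# (OOM killer, kernel panic, machine-check errors, hung-task warnings).
# Function name and pattern list are illustrative, not standard tooling.
check_log() {
    grep -iE 'out of memory|oom-killer|kernel panic|mce:|hung task' "$1" || true
}

# In real use, point it at the system logs, and (with persistent journaling
# enabled in journald.conf) at kernel messages from the boot before the crash:
#   journalctl -k -b -1 > /tmp/prev-boot.log && check_log /tmp/prev-boot.log
#   check_log /var/log/syslog
```

As noted above, a truly hard lockup may leave nothing in these logs at all, which is itself diagnostic.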

Timing: I have to go out of town from now till late Sunday, so I need to pause this till probably Monday morning. I really appreciate your ideas! I'll check back in here then.
 

malor

Ars Legatus Legionis
16,093
CPU (Pentium G2020)
You're probably choking on encryption speed. ZFS needs the AES New Instructions (AES-NI) to stay performant. There's been some work on improving the software fallback, but Debian stable probably doesn't have those fixes, and Ubuntu might not either; they're probably still nowhere near as fast as the hardware-accelerated path.

Get any processor with AES-NI and it'll likely be fine. Or don't use encryption.

I was forced to retire a 2600K I had intended to use, for this same reason. It didn't crash, probably because I wasn't loading it over SMB but through regular file copies, but it was incredibly slow. A replacement 4790K was super quick by comparison.

edit: corrected CPU model.
 
Last edited:

theevilsharpie

Ars Scholae Palatinae
1,199
Subscriptor++
The server does indeed crash hard. ssh becomes unresponsive. Right now it's set up with a monitor and keyboard. The cursor stops flashing and it won't respond in any way without a hard power cycle. This just happened with the copy over nfs I started with the laptop plugged into ethernet, failing in only about 5 minutes.

If something as basic as a VGA console or a keyboard becomes unresponsive, that's a sign that the CPU has locked up. That can be an issue with the CPU, or something it depends on to function (e.g., memory, PCH, etc.)

Linux is unlikely to be able to give you any direct information on the problem, as Linux would be relying on the very CPU that has just locked up to detect and log the source. Your motherboard's BMC could potentially provide more information since it's independent of the CPU. The BMC may also be able to issue an NMI to the CPU, which could interrupt the CPU enough to generate a crash dump (if the CPU is deadlocked and not truly crashed).

Also, according to your motherboard's manual, it's equipped with a Cougar Point PCH. Cougar Point has a known issue with the 3 Gbps SATA ports malfunctioning over time, something that is inherent to the hardware and could only be fixed with a replacement. It's unclear if your motherboard has the fixed or defective PCH. Malfunctions on the bus for local storage can also cause the machine to become unresponsive without any obvious clue as to why (since logs are probably being written to local storage), and the kernel itself can lock up if the storage that's malfunctioning contains an active swap file/partition.
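If anyone wants to try the NMI route, a sketch of the mechanics (the BMC address and username are placeholders; `chassis power diag` pulses a diagnostic interrupt on boards that support it):

```shell
# On the server, make an unexpected NMI panic the kernel (and produce a
# crash dump if kdump is configured). This is a sysctl, so the setting
# can also be persisted in /etc/sysctl.d/.
sudo sysctl kernel.unknown_nmi_panic=1

# From another machine, ask the BMC to pulse a diagnostic interrupt (NMI)
# at the stuck host. 192.0.2.10 and ADMIN are placeholders.
ipmitool -I lanplus -H 192.0.2.10 -U ADMIN chassis power diag
```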
 
  • Like
Reactions: crosslink

crosslink

Ars Scholae Palatinae
882
Subscriptor
You're probably choking on encryption speed. ZFS needs the AES New Instructions (AES-NI) to stay performant. There's been some work on improving the software fallback, but Debian stable probably doesn't have those fixes, and Ubuntu might not either; they're probably still nowhere near as fast as the hardware-accelerated path.

Get any processor with AES-NI and it'll likely be fine. Or don't use encryption.

I was forced to retire a 2600K I had intended to use, for this same reason. It didn't crash, probably because I wasn't loading it over SMB but through regular file copies, but it was incredibly slow. A replacement 4790K was super quick by comparison.

edit: corrected CPU model.
The G2020 was just for troubleshooting purposes. I'm using a Xeon E3-1240.
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
If something as basic as a VGA console or a keyboard becomes unresponsive, that's a sign that the CPU has locked up. That can be an issue with the CPU, or something it depends on to function (e.g., memory, PCH, etc.)

Linux is unlikely to be able to give you any direct information on the problem, as Linux would be relying on the very CPU that has just locked up to detect and log the source. Your motherboard's BMC could potentially provide more information since it's independent of the CPU. The BMC may also be able to issue an NMI to the CPU, which could interrupt the CPU enough to generate a crash dump (if the CPU is deadlocked and not truly crashed).

Also, according to your motherboard's manual, it's equipped with a Cougar Point PCH. Cougar Point has a known issue with the 3 Gbps SATA ports malfunctioning over time, something that is inherent to the hardware and could only be fixed with a replacement. It's unclear if your motherboard has the fixed or defective PCH. Malfunctions on the bus for local storage can also cause the machine to become unresponsive without any obvious clue as to why (since logs are probably being written to local storage), and the kernel itself can lock up if the storage that's malfunctioning contains an active swap file/partition.
That would explain a lot. Thanks for that!

The link indicates this PCH has a total of 6 SATA connections:
  • 2x 6Gbps (not affected by the issue)
  • 4x 3Gbps (affected by the issue)

Both motherboards I've been trying have these, and I've always connected the SSD containing the OS to one of the former, and the 4 WD Reds (non-shingled type) forming the ZFS pool to the latter.

This suggests a test: remove the WD Reds, connect a known-good SATA drive to the other 6Gbps port, and try the data transfer again. Will give this a spin, so to speak.

edit: corrected a typo
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
My travel plans changed a bit, so made it home earlier than expected. I ran the test as described in the post above.

The most convenient way to do it was to temporarily house the disk containing the data in my Win10 box and set up a Samba share to the server, then use robocopy to do the copying from Win10. The data transfer finished with no errors. It's all on the server now and seems to be working (music plays fine for all the files I've checked).

So I'm leaning pretty strongly toward theevilsharpie's suggestion being the correct one in this case. Of course I'll put some more hours and miles on the server in this configuration, just to accumulate more data and confirm the earlier failures don't recur.

Meanwhile it's time to consider some options for server hardware. I would like to keep using as much of my existing hardware as is practical, including the ability to use 2 or more drives mirrored in a zfs pool.

Would a PCIe-based SATA controller make sense in the server? My performance requirements for the server aren't terribly high, but I do want the system to be reasonably reliable.
 

Whittey

Ars Tribunus Militum
1,849
My guess is that it's not fundamentally a hardware issue, as Xigmanas worked. Are you running out of memory? If it's able to write to syslog before/when it freezes, check for OOM killer. Or perhaps keeping 'top' up on the server while transferring to see if it's approaching full before lockup. Maybe try locally writing a couple hundred GB to the pool to segregate disk/fs from external protocol/etc. Also, when the server is frozen, can you still ping it?
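The local-write test suggested above can be sketched as follows. TARGET and COUNT are placeholders, and the default size is deliberately tiny:

```shell
# Local write test: stream data straight to the pool to take SMB/NFS out
# of the picture. In real use, set TARGET to the pool's mountpoint and
# COUNT high enough to match the transfer sizes that triggered the hangs
# (e.g. COUNT=204800 for 200 GiB).
TARGET=${TARGET:-/tmp}
COUNT=${COUNT:-64}    # 64 MiB by default, just to demonstrate the mechanics
dd if=/dev/zero of="$TARGET/writetest.bin" bs=1M count="$COUNT" conv=fsync
```

`conv=fsync` forces the data to actually hit the disks before dd exits, so the test exercises the storage path rather than just the page cache.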
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
Putting some more "miles" on the server after successfully transferring all my music to it via Samba (recall we're on Ubuntu Server, using only the two SATA ports unaffected by the Cougar Point bug), I started a new copy job (about 800 GB from my personal laptop, simulating a backup) over NFS and went out for a few hours. Came back to a frozen-solid server.

when the server is frozen, can you still ping it?
I just got a good chance to check that...
Code:
crosslink@T460:~$ ping 192.168.1.98
PING 192.168.1.98 (192.168.1.98) 56(84) bytes of data.
From 192.168.1.10 icmp_seq=1 Destination Host Unreachable
...

Are you running out of memory?
This system has 16 GB of ECC RAM. During the years it ran as a FreeNAS machine, it never once had any issues related to this.

If it's able to write to syslog before/when it freezes, check for OOM killer. Or perhaps keeping 'top' up on the server while transferring to see if it's approaching full before lockup. Maybe try locally writing a couple hundred GB to the pool to segregate disk/fs from external protocol/etc.
I appreciate the suggestions. I need to take some time to figure out how to do these.


My guess is that it's not fundamentally a hardware issue, as Xigmanas worked.
True that, about Xigmanas. Given that Xigmanas is based on FreeBSD, I can't help but wonder if there's some difference "under the hood" that makes it more tolerant of errors than the Linux setups I've tried. A long time ago, I remember reading a thread I can't find now (not on Ars Technica, IIRC) along the lines of "dedicated NAS distro vs. roll-your-own Linux". One of the answers recommended the former, mentioning "tuning" as a reason, though no particulars were given. I've always wondered what that "tuning" might mean. Regardless of how this particular episode ends, I'd love to hear any perspective folks here might have.
 

theevilsharpie

Ars Scholae Palatinae
1,199
Subscriptor++
Barring some type of hardware failure that simply crashes the machine, one area I'd expect Linux to be more "aggressive" than FreeBSD is when utilizing more advanced features of the hardware, particularly power management.

Power management is also something I'd expect to be set differently between a general-purpose distribution, and one tuned specifically for servers.

It's a bit of a reach, but it's the main thing that jumps to mind.

The other thing worth trying is to get a console and kernel output going over a serial port (either standard serial, or serial-over-LAN via your BMC). Serial is very simple and can often provide error information when other avenues fail (in particular, whatever is recording the serial console output can persist it even with the host machine unresponsive). If even a serial console (with kernel output sent to it) doesn't provide diagnostic information, then I suspect the CPU is faulty and is just locking up.
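A sketch of the GRUB side of that on Ubuntu/Debian. The serial unit is an assumption (Supermicro BMCs usually expose serial-over-LAN on COM2, i.e. ttyS1); adjust if yours differs:

```shell
# /etc/default/grub -- mirror kernel output to a serial port (115200 8N1)
# in addition to the VGA console.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=1 --word=8 --parity=no --stop=1"

# Then run `sudo update-grub`, and capture the console from another box:
#   ipmitool -I lanplus -H <bmc-address> -U <user> sol activate
```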
 

teubbist

Ars Scholae Palatinae
823
I'd be more inclined to suspect an issue on the SATA side than general power management, especially with such a relatively old platform. NCQ, and maybe drive power states (although those should not trigger during a transfer).

If you're still experimenting, NCQ can be disabled by setting the drive's queue depth to 1. The sysfs path is /sys/block/<device>/device/queue_depth, so echo 1 > /sys/block/sda/device/queue_depth, etc.
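Applied to every SATA disk at once, that could look like the sketch below. The sysroot parameter exists only so the loop can be exercised outside a real /sys; in real use, run it as root with no argument:

```shell
# disable_ncq: force queue depth to 1 on every sd* disk, disabling NCQ.
# The optional sysroot argument is purely for testing against a fake
# directory tree; it defaults to the real /sys.
disable_ncq() {
    local sysroot=${1:-/sys}
    local f
    for f in "$sysroot"/block/sd*/device/queue_depth; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        echo 1 > "$f"
        echo "set queue_depth=1 via $f"
    done
}
```

Note this doesn't survive a reboot; a udev rule would be needed to make it permanent.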

One other random thought: if you have VT-d enabled, it's possible Linux is doing something with the IOMMU regions that FreeBSD isn't, which could lead to hard lockups as well. If you're not planning hardware passthrough to a VM, that's something else you could disable. Note, this is VT-d (called "Directed I/O" in some BIOSes), not VT-x, which should always be safe to leave enabled.
 
Last edited:

crosslink

Ars Scholae Palatinae
882
Subscriptor
I'd be more inclined to suspect and issue on the SATA side than general power management, especially with such a relatively old platform.
I'm coming to feel similarly. One thing I've noticed (cumulatively, over all of the troubleshooting) is that the crashes have always occurred during data transfers. The system passes MemTest86 and an hour of stressapptest with no issues.

edit: and also does fine idling overnight.
 
Last edited:

crosslink

Ars Scholae Palatinae
882
Subscriptor
Synopsis of most recent crash (recall we're on Ubuntu Server, using only the two SATA ports not affected by the known Cougar Point bug, running a simulated backup of my personal machine to the server over NFS)...

perhaps keeping 'top' up on the server while transferring to see if it's approaching full before lockup.

I wrote a script to do this--record the top 4 entries in top every few seconds. Here's the relevant portion showing the last 10-20 seconds before the crash (during a data transfer to the server).

Code:
  PID USER       PR  NI    VIRT   RES   SHR S  %CPU  %MEM     TIME+ COMMAND
<snip>
3291.2 seconds
 1189 root       20   0       0     0     0 S  50.0   0.0  42:49.77 nfsd
130063 zcrosslnk 20   0   10608  3828  3224 R   6.2   0.0   0:00.01 top
    1 root       20   0  166536 11984  8448 S   0.0   0.1   0:06.63 systemd
    2 root       20   0       0     0     0 S   0.0   0.0   0:03.47 kthreadd
3294.4 seconds
 1189 root       20   0       0     0     0 S  46.7   0.0  42:51.15 nfsd
130141 zcrosslnk 20   0   10608  3928  3324 R   6.7   0.0   0:00.01 top
    1 root       20   0  166536 11984  8448 S   0.0   0.1   0:06.63 systemd
    2 root       20   0       0     0     0 S   0.0   0.0   0:03.47 kthreadd
3297.5 seconds
 1189 root       20   0       0     0     0 S  20.0   0.0  42:52.39 nfsd
    1 root       20   0  166536 11984  8448 S   0.0   0.1   0:06.63 systemd
    2 root       20   0       0     0     0 S   0.0   0.0   0:03.48 kthreadd
    3 root        0 -20       0     0     0 I   0.0   0.0   0:00.00 rcu_gp
<crash>

If I'm interpreting this correctly, we have adequate memory for the operation.

edit: the seconds entries in this log are only a rough approximation; I re-launched the script once or twice and only roughly adjusted the elapsed-time entries shown here.
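For anyone curious, the logging loop can be sketched like this. It's a guess at the approach, using ps rather than top (easier to capture non-interactively), and all the variables are illustrative defaults:

```shell
# Periodically log a timestamp plus the busiest processes.
# LOGFILE/INTERVAL/ITERATIONS are placeholders for illustration.
LOGFILE=${LOGFILE:-/tmp/cpu-log.txt}
INTERVAL=${INTERVAL:-3}      # seconds between samples
ITERATIONS=${ITERATIONS:-1}  # set high (or loop forever) in real use

i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    date '+%s seconds' >> "$LOGFILE"
    # header line plus the 4 busiest processes by CPU
    ps -eo pid,user,pcpu,pmem,comm --sort=-pcpu | head -n 5 >> "$LOGFILE"
    i=$((i + 1))
    if [ "$i" -lt "$ITERATIONS" ]; then sleep "$INTERVAL"; fi
done
```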
 
Last edited:

Paul Bartz

Ars Praefectus
5,626
Subscriptor++
I've got a similar situation here.

SuperMicro X9DR3-LN4F+, two Intel Xeon E5-2608L v3 CPUs, 8x 32GB PC3-14900R ECC DIMMs for 256GB. ZFS RAIDZ2 on 5x 12TiB disks. SATA RAID card flashed to IT mode for JBOD. Intel 10GbE NIC between the old Windows server and the new ZFS server.

I was getting random lockups during file transfers. I'm about to revisit this system with Ubuntu 23.10. Will advise on my findings once my 10GbE switch comes back and I can find time during holiday preparations.

I hadn't heard about the Cougar Point SATA bug until now. Thanks for the heads-up.
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
If those boards are new enough to have at least a first-gen NVMe slot with at least two PCIe lanes connected (which is exactly what my Haswell board has), this might be a way to work around the problem:


https://www.amazon.com/dp/B07T3RMFFT/


That's a converter from an M.2 slot, 2 PCIe lanes, to 5 SATA ports.

The X9 family doesn't have one (or at least my X9SCM boards don't), but the X11 does. Hopefully the onboard SATA will work properly.
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
Recall that I retired the Supermicro X9SCM-F motherboard and got a Supermicro X11SSH-F. I set it up with Ubuntu Server 22.04, and it is handling all data transfers with no issues. It's nice to have a working server again.

Although I can't prove it, the issue seems consistent with the Cougar Point SATA bug described by theevilsharpie. Thanks all!