Linux as small server

crosslink

Ars Scholae Palatinae
882
Subscriptor
I'm refurbishing a system I've been using as a NAS (FreeNAS-based, installed when it was still called that) for the last decade or so. Hardware-wise, it's based on a Supermicro X9SCM motherboard. I picked up a Xeon E3-1240 to replace the Pentium I had been using, and an inexpensive SATA SSD for the OS (I had been using a USB drive).

Since I've had good experience using Debian-based distros as my personal computer for well over a decade, I wanted to base the new system on Ubuntu Server (or Debian Standard) along the lines described by Jim Salter.


The OS (Ubuntu or Debian; I've installed both several times now) installs and runs fine. The ZFS pool and dataset(s) also set up with no issues. But when I try to transfer my music collection (about 600 folders containing, collectively, about 10x that number of files) to a ZFS dataset via SMB, the system freezes after a small fraction of the data gets transferred. Every. Time. (And I've tried lots of times!)

This of course screams out "hardware". I have physically swapped out, one at a time, these:

  • Power supply
  • CPU (Pentium G2020)
  • Motherboard (Supermicro X9SCM-F)
  • Boot storage device (SATA, live USB).

No culprit identified.

I didn't have any memory to swap (heh), but the 2x8GB of Kingston ECC DIMMs in the system passed MemTest86 and an hour of stressapptest on Linux.

Out of exasperation, I tried Xigmanas 13.2.0.5. Lo and behold it sets up, runs, and transfers my music collection over SMB with no issues at all, on the same hardware, regardless of how much data I transfer at one go. A test setup has been running for several days now, stable as can be.

I'm gobsmacked. Does anyone have an idea of why I might be seeing such a big difference between Xigmanas and the Linux distros I was trying?

While Xigmanas is running great, I really would like to implement encryption on the dataset level as described in Jim's guide. My reading of the Xigmanas docs indicates this isn't really supported, so I'm not completely convinced that would be the best path, especially if I can understand what's happening (so far) with Ubuntu and Debian.

I'm also open to other recos (my NAS demands are pretty basic, so there may be many good options.)
 
You can run the samba process in the foreground to see what is going on, and maybe find the problem...
Code:
smbd -i -d 3
This runs the process in the foreground with a higher logging level; it will answer a single transaction and then exit when done. Also check /var/log/samba/* for any errors in the logs.

Another protocol to consider is NFS, which would work with *nix and Mac systems, but not Windows.
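For completeness, a minimal sketch of what serving a dataset over NFS looks like on the Debian/Ubuntu side. The dataset path and subnet are placeholders for this setup, and it assumes the nfs-kernel-server package is already installed:

```shell
# Minimal NFS export of the music dataset.
# /tank/music and 192.168.1.0/24 are placeholders -- substitute the real
# dataset mountpoint and local subnet.
echo '/tank/music 192.168.1.0/24(rw,sync,no_subtree_check)' | \
    sudo tee -a /etc/exports

sudo exportfs -ra        # re-read /etc/exports
showmount -e localhost   # confirm the export is live
```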
 
  • Like
Reactions: crosslink

crosslink

Ars Scholae Palatinae
882
Subscriptor
Thanks koala and brendan_kearney.

I have rebuilt the server software using Ubuntu Server 22.04 and have the data transfer running now over NFS (using rsync). So far it's going smoothly--on Beethoven and going. It's already much farther along than the transfers over SMB ever got.

Will let it continue doing its thing and report back when it finishes one way or the other (probably tomorrow).
 

koala

Ars Tribunus Angusticlavius
7,579
How much data is this?

Does it work if you retry? I mean, what's that small fraction? How long is the transfer? How long before it fails?

If you have an unreliable network, copies can fail at any point. And maybe the other system was "lucky". Maybe you retry with the other system and it fails again...

I assume this is a local network? Is wireless involved?
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
Total amount of data is about 400 GB, in about 600 folders (mostly flac files) in a "flat" structure (all in 1 folder). This is the music portion of what I want to live on the NAS. Extrapolating from what copied yesterday, it would take roughly 3 hours to transfer.

Yes this is a local network, here at home. Router/DHCP server is an Intel Atom machine with dual ethernet ports running Opnsense, which assigns static IP addresses to server and client(s). Wifi is handled by a Ubiquiti access point. I'm using a Cisco 24 port ethernet switch connecting everything over ethernet, except of course whatever connects over wifi.

Good point about that... The client I used yesterday was my Thinkpad T460 (Ubuntu 22.04), connected over, here it comes, WiFi. While we're at it, the data source was a 3.5" SATA (spinning rust) hard drive, connected to the laptop by a Plugable USB adapter.

Put this way, I guess I can see two weak links. The wifi works quite well (as wifi goes) but an ethernet connection would be a different thing entirely. Same for the USB/SATA adapter? I tried to get a reasonably good one (Plugable) but maybe a direct SATA connection would be more reliable?

I'll plug the laptop into ethernet and try the transfer again.
Another comparison I could make would be to connect the source hard drive directly to the motherboard of my HTPC, which is always connected to the network by ethernet.

Will try them in that order.

I guess I'm puzzled by the hard crashes, which I wouldn't necessarily expect even in the case of, say, wifi issues.
 

koala

Ars Tribunus Angusticlavius
7,579
Hmmm, I'm getting worse at reading carefully. So you say the server freezes? Do you stop being able to reach the server via SSH or via physical access?

I mean, if the T460 got a networking issue, the transfer would stop... and if it's a wifi connection issue, you might not be able to reach the same server through ssh (or an ssh connection would drop).

If you are on the physical console of the server and it freezes, then it can't be a client problem.
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
The server does indeed crash hard. ssh becomes unresponsive. Right now it's set up with a monitor and keyboard. The cursor stops flashing and it won't respond in any way without a hard power cycle. This just happened with the copy over nfs I started with the laptop plugged into ethernet, failing in only about 5 minutes.

Yeah, it is these crashes (which happen every time on the server side, and in the same way regardless of whether I'm using Ubuntu Server or Debian Standard) that have me puzzled. My Linux admin chops aren't terribly strong, so I could use some hints about which logs to look at (though in the case of a hard crash I can imagine they might not get written).
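For later reference, the usual first stops after an unclean reboot can be sketched like this. The `check_log` helper and its list of signatures are my own illustration, not anything standard:

```shell
# check_log: scan a log file for the usual hard-crash signatures
# (OOM killer, kernel panic, machine-check errors, hung-task warnings).
# Function name and pattern list are illustrative, not standard tooling.
check_log() {
    grep -iE 'out of memory|oom-killer|kernel panic|mce:|hung task' "$1" || true
}

# In real use, point it at the system logs, and (with persistent journaling
# enabled in journald.conf) at kernel messages from the boot before the crash:
#   journalctl -k -b -1 > /tmp/prev-boot.log && check_log /tmp/prev-boot.log
#   check_log /var/log/syslog
```

As noted above, a truly hard lockup may leave nothing in these logs at all, which is itself diagnostic.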

Timing: I have to go out of town from now till late Sunday, so I need to pause this till probably Monday morning. I really appreciate your ideas! I'll check back in here then.
 

malor

Ars Legatus Legionis
16,093
CPU (Pentium G2020)
You're probably choking on encryption speed. ZFS needs the AES New Instructions (AES-NI) to stay performant. There's been some work on improving the software fallback, but Debian stable probably doesn't have those fixes, and Ubuntu might not either; they're probably still nowhere near as fast as the hardware-accelerated path.

Get any processor with AES-NI and it'll likely be fine. Or don't use encryption.

I was forced to retire a 2600K I had intended to use, for this same reason. It didn't crash, probably because I wasn't loading it over SMB but through regular file copies, but it was incredibly slow. A replacement 4790K was super quick by comparison.

edit: corrected CPU model.
 
Last edited:

theevilsharpie

Ars Scholae Palatinae
1,199
Subscriptor++
The server does indeed crash hard. ssh becomes unresponsive. Right now it's set up with a monitor and keyboard. The cursor stops flashing and it won't respond in any way without a hard power cycle. This just happened with the copy over nfs I started with the laptop plugged into ethernet, failing in only about 5 minutes.

If something as basic as a VGA console or a keyboard becomes unresponsive, that's a sign that the CPU has locked up. That can be an issue with the CPU, or something it depends on to function (e.g., memory, PCH, etc.)

Linux is unlikely to be able to give you any direct information on the problem, as Linux would be relying on the very CPU that has just locked up to detect and log the source. Your motherboard's BMC could potentially provide more information since it's independent of the CPU. The BMC may also be able to issue an NMI to the CPU, which could interrupt the CPU enough to generate a crash dump (if the CPU is deadlocked and not truly crashed).

Also, according to your motherboard's manual, it's equipped with a Cougar Point PCH. Cougar Point has a known issue with the 3 Gbps SATA ports malfunctioning over time, something that is inherent to the hardware and could only be fixed with a replacement. It's unclear if your motherboard has the fixed or defective PCH. Malfunctions on the bus for local storage can also cause the machine to become unresponsive without any obvious clue as to why (since logs are probably being written to local storage), and the kernel itself can lock up if the storage that's malfunctioning contains an active swap file/partition.
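If anyone wants to try the NMI route, a sketch of the mechanics (the BMC address and username are placeholders; `chassis power diag` pulses a diagnostic interrupt on boards that support it):

```shell
# On the server, make an unexpected NMI panic the kernel (and produce a
# crash dump if kdump is configured). This is a sysctl, so the setting
# can also be persisted in /etc/sysctl.d/.
sudo sysctl kernel.unknown_nmi_panic=1

# From another machine, ask the BMC to pulse a diagnostic interrupt (NMI)
# at the stuck host. 192.0.2.10 and ADMIN are placeholders.
ipmitool -I lanplus -H 192.0.2.10 -U ADMIN chassis power diag
```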
 
  • Like
Reactions: crosslink

crosslink

Ars Scholae Palatinae
882
Subscriptor
You're probably choking on encryption speed. ZFS needs the AES New Instructions (AES-NI) to stay performant. There's been some work on improving the software fallback, but Debian stable probably doesn't have those fixes, and Ubuntu might not either; they're probably still nowhere near as fast as the hardware-accelerated path.

Get any processor with AES-NI and it'll likely be fine. Or don't use encryption.

I was forced to retire a 2600K I had intended to use, for this same reason. It didn't crash, probably because I wasn't loading it over SMB but through regular file copies, but it was incredibly slow. A replacement 4790K was super quick by comparison.

edit: corrected CPU model.
The G2020 was just for troubleshooting purposes. I'm using a Xeon E3-1240.
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
If something as basic as a VGA console or a keyboard becomes unresponsive, that's a sign that the CPU has locked up. That can be an issue with the CPU, or something it depends on to function (e.g., memory, PCH, etc.)

Linux is unlikely to be able to give you any direct information on the problem, as Linux would be relying on the very CPU that has just locked up to detect and log the source. Your motherboard's BMC could potentially provide more information since it's independent of the CPU. The BMC may also be able to issue an NMI to the CPU, which could interrupt the CPU enough to generate a crash dump (if the CPU is deadlocked and not truly crashed).

Also, according to your motherboard's manual, it's equipped with a Cougar Point PCH. Cougar Point has a known issue with the 3 Gbps SATA ports malfunctioning over time, something that is inherent to the hardware and could only be fixed with a replacement. It's unclear if your motherboard has the fixed or defective PCH. Malfunctions on the bus for local storage can also cause the machine to become unresponsive without any obvious clue as to why (since logs are probably being written to local storage), and the kernel itself can lock up if the storage that's malfunctioning contains an active swap file/partition.
That would explain a lot. Thanks for that!

The link indicates this PCH has a total of 6 SATA connections:
  • 2x 6Gbps (not affected by the issue)
  • 4x 3Gbps (affected by the issue)

Both motherboards I've been trying have these, and I've always connected the SSD containing the OS to one of the former, and the 4 WD Reds (non-shingled type) forming the ZFS pool to the latter.

This suggests a test: remove the WD Reds, connect a known-good SATA drive to the other 6Gbps port, and try the data transfer again. Will give this a spin, so to speak.

edit: corrected a typo
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
My travel plans changed a bit, so made it home earlier than expected. I ran the test as described in the post above.

The most convenient way to do it was to temporarily house the disk containing the data in my Win10 box and set up a Samba share to the server, then use robocopy to do the copying from Win10. The data transfer finished with no errors. It's all on the server now and seems to be working (music plays fine for all the files I've checked).

So I'm leaning pretty strongly toward theevilsharpie's suggestion being the correct one in this case. Of course I'll put some more hours and miles on the server in this configuration, just to accumulate more data and confirm the earlier failures don't recur.

Meanwhile it's time to consider some options for server hardware. I would like to keep using as much of my existing hardware as is practical, including the ability to use 2 or more drives mirrored in a zfs pool.

Would a PCIe-based SATA controller make sense in the server? My performance requirements for the server aren't terribly high, but I do want the system to be reasonably reliable.
 

Whittey

Ars Tribunus Militum
1,849
My guess is that it's not fundamentally a hardware issue, as Xigmanas worked. Are you running out of memory? If it's able to write to syslog before/when it freezes, check for OOM killer. Or perhaps keeping 'top' up on the server while transferring to see if it's approaching full before lockup. Maybe try locally writing a couple hundred GB to the pool to segregate disk/fs from external protocol/etc. Also, when the server is frozen, can you still ping it?
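The local-write test suggested above can be sketched as follows. TARGET and COUNT are placeholders, and the default size is deliberately tiny:

```shell
# Local write test: stream data straight to the pool to take SMB/NFS out
# of the picture. In real use, set TARGET to the pool's mountpoint and
# COUNT high enough to match the transfer sizes that triggered the hangs
# (e.g. COUNT=204800 for 200 GiB).
TARGET=${TARGET:-/tmp}
COUNT=${COUNT:-64}    # 64 MiB by default, just to demonstrate the mechanics
dd if=/dev/zero of="$TARGET/writetest.bin" bs=1M count="$COUNT" conv=fsync
```

`conv=fsync` forces the data to actually hit the disks before dd exits, so the test exercises the storage path rather than just the page cache.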
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
Putting some more "miles" on the server after successfully transferring all my music to it via Samba (recall we're on Ubuntu Server, using only the two SATA ports unaffected by the Cougar Point bug), I started a new copy job (about 800 GB from my personal laptop, simulating a backup) over NFS and went out for a few hours. Came back to a frozen-solid server.

when the server is frozen, can you still ping it?
I just got a good chance to check that...
Code:
crosslink@T460:~$ ping 192.168.1.98
PING 192.168.1.98 (192.168.1.98) 56(84) bytes of data.
From 192.168.1.10 icmp_seq=1 Destination Host Unreachable
...

Are you running out of memory?
This system has 16 GB of ECC RAM. During the years it ran as a FreeNAS machine, it never once had any issues related to this.

If it's able to write to syslog before/when it freezes, check for OOM killer. Or perhaps keeping 'top' up on the server while transferring to see if it's approaching full before lockup. Maybe try locally writing a couple hundred GB to the pool to segregate disk/fs from external protocol/etc.
I appreciate the suggestions. I need to take some time to figure out how to do these.


My guess is that it's not fundamentally a hardware issue, as Xigmanas worked.
True that, about Xigmanas. Given that Xigmanas is based on FreeBSD, I can't help but wonder if there's some difference "under the hood" that makes it more tolerant of errors than the Linux setups I've tried. A long time ago, I remember reading a thread I can't find now (not on Ars Technica, IIRC) along the lines of "dedicated NAS distro vs. roll-your-own Linux". One of the answers recommended the former, mentioning "tuning" as a reason, though no particulars were given. I've always wondered what that "tuning" might mean. Regardless of how this particular episode ends, I'd love to hear any perspective folks here might have.
 

theevilsharpie

Ars Scholae Palatinae
1,199
Subscriptor++
Barring some type of hardware failure that simply crashes the machine, one area I'd expect Linux to be more "aggressive" than FreeBSD is when utilizing more advanced features of the hardware, particularly power management.

Power management is also something I'd expect to be set differently between a general-purpose distribution, and one tuned specifically for servers.

It's a bit of a reach, but it's the main thing that jumps to mind.

The other thing worth trying is to get a console and kernel output going over a serial port (either standard serial, or serial-over-LAN via your BMC). Serial is very simple and can often provide error information when other avenues fail (in particular, whatever is recording the serial console output can persist it even with the host machine unresponsive). If even a serial console (with kernel output sent to it) doesn't provide diagnostic information, then I suspect the CPU is faulty and is just locking up.
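A sketch of the GRUB side of that on Ubuntu/Debian. The serial unit is an assumption (Supermicro BMCs usually expose serial-over-LAN on COM2, i.e. ttyS1); adjust if yours differs:

```shell
# /etc/default/grub -- mirror kernel output to a serial port (115200 8N1)
# in addition to the VGA console.
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS1,115200n8"
GRUB_TERMINAL="console serial"
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=1 --word=8 --parity=no --stop=1"

# Then run `sudo update-grub`, and capture the console from another box:
#   ipmitool -I lanplus -H <bmc-address> -U <user> sol activate
```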
 

teubbist

Ars Scholae Palatinae
823
I'd be more inclined to suspect an issue on the SATA side than general power management, especially with such a relatively old platform. NCQ, and maybe drive power states (although those should not trigger during a transfer).

If you're still experimenting, NCQ can be disabled by setting the drive's queue depth to 1. The sysfs path is /sys/block/<device>/device/queue_depth, so echo 1 > /sys/block/sda/device/queue_depth, etc.
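Applied to every SATA disk at once, that could look like the sketch below. The sysroot parameter exists only so the loop can be exercised outside a real /sys; in real use, run it as root with no argument:

```shell
# disable_ncq: force queue depth to 1 on every sd* disk, disabling NCQ.
# The optional sysroot argument is purely for testing against a fake
# directory tree; it defaults to the real /sys.
disable_ncq() {
    local sysroot=${1:-/sys}
    local f
    for f in "$sysroot"/block/sd*/device/queue_depth; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        echo 1 > "$f"
        echo "set queue_depth=1 via $f"
    done
}
```

Note this doesn't survive a reboot; a udev rule would be needed to make it permanent.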

One other random thought: if you have VT-d enabled, it's possible Linux is doing something with the IOMMU regions that FreeBSD isn't, which could lead to hard lockups as well. If you're not planning hardware passthrough to a VM, that's something else you could disable. Note, this is VT-d (called "Directed I/O" in some BIOSes), not VT-x, which should always be safe to leave enabled.
 
Last edited:

crosslink

Ars Scholae Palatinae
882
Subscriptor
I'd be more inclined to suspect and issue on the SATA side than general power management, especially with such a relatively old platform.
I'm coming to feel similarly. One thing I've noticed (cumulatively, over all of the troubleshooting) is that the crashes have always occurred during data transfers. The system passes MemTest86 and an hour of stressapptest with no issues.

edit: and also does fine idling overnight.
 
Last edited:

crosslink

Ars Scholae Palatinae
882
Subscriptor
Synopsis of most recent crash (recall we're on Ubuntu Server, using only the two SATA ports not affected by the known Cougar Point bug, running a simulated backup of my personal machine to the server over NFS)...

perhaps keeping 'top' up on the server while transferring to see if it's approaching full before lockup.

I wrote a script to do this--record the top 4 entries in top every few seconds. Here's the relevant portion showing the last 10-20 seconds before the crash (during a data transfer to the server).

Code:
  PID USER       PR  NI    VIRT   RES   SHR S  %CPU  %MEM     TIME+ COMMAND
<snip>
3291.2 seconds
 1189 root       20   0       0     0     0 S  50.0   0.0  42:49.77 nfsd
130063 zcrosslnk 20   0   10608  3828  3224 R   6.2   0.0   0:00.01 top
    1 root       20   0  166536 11984  8448 S   0.0   0.1   0:06.63 systemd
    2 root       20   0       0     0     0 S   0.0   0.0   0:03.47 kthreadd
3294.4 seconds
 1189 root       20   0       0     0     0 S  46.7   0.0  42:51.15 nfsd
130141 zcrosslnk 20   0   10608  3928  3324 R   6.7   0.0   0:00.01 top
    1 root       20   0  166536 11984  8448 S   0.0   0.1   0:06.63 systemd
    2 root       20   0       0     0     0 S   0.0   0.0   0:03.47 kthreadd
3297.5 seconds
 1189 root       20   0       0     0     0 S  20.0   0.0  42:52.39 nfsd
    1 root       20   0  166536 11984  8448 S   0.0   0.1   0:06.63 systemd
    2 root       20   0       0     0     0 S   0.0   0.0   0:03.48 kthreadd
    3 root        0 -20       0     0     0 I   0.0   0.0   0:00.00 rcu_gp
<crash>

If I'm interpreting this correctly, we have adequate memory for the operation.

edit: the seconds entries in this log are only a rough approximation; I re-launched the script once or twice and only roughly adjusted the elapsed-time entries shown here.
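For anyone curious, the logging loop can be sketched like this. It's a guess at the approach, using ps rather than top (easier to capture non-interactively), and all the variables are illustrative defaults:

```shell
# Periodically log a timestamp plus the busiest processes.
# LOGFILE/INTERVAL/ITERATIONS are placeholders for illustration.
LOGFILE=${LOGFILE:-/tmp/cpu-log.txt}
INTERVAL=${INTERVAL:-3}      # seconds between samples
ITERATIONS=${ITERATIONS:-1}  # set high (or loop forever) in real use

i=0
while [ "$i" -lt "$ITERATIONS" ]; do
    date '+%s seconds' >> "$LOGFILE"
    # header line plus the 4 busiest processes by CPU
    ps -eo pid,user,pcpu,pmem,comm --sort=-pcpu | head -n 5 >> "$LOGFILE"
    i=$((i + 1))
    if [ "$i" -lt "$ITERATIONS" ]; then sleep "$INTERVAL"; fi
done
```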
 
Last edited:

Paul Bartz

Ars Praefectus
5,626
Subscriptor++
I've got a similar situation here.

SuperMicro X9DR3-LN4F+, two Intel Xeon E5-2608L v3 CPUs, 8x 32GB PC3-14900R ECC DIMMs for 256GB. ZFS RAIDZ2 on 5x 12TiB disks. SATA RAID card flashed to IT mode for JBOD. Intel 10GbE NIC between the old Windows server and the new ZFS server.

I was getting random lockups during file transfers. I'm about to revisit this system with Ubuntu 23.10. Will advise on my findings once my 10GbE switch comes back and I can find time during holiday preparations.

I hadn't heard about the Cougar Point SATA bug until now. Thanks for the heads-up.
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
If those boards are new enough to have at least a first-gen NVMe slot with at least two PCIe lanes connected (which is exactly what my Haswell board has), this might be a way to work around the problem:


https://www.amazon.com/dp/B07T3RMFFT/


That's a converter from an M.2 slot, 2 PCIe lanes, to 5 SATA ports.

The X9 family doesn't have one (or at least my X9SCM boards don't), but the X11 does. Hopefully the onboard SATA will work properly.
 

crosslink

Ars Scholae Palatinae
882
Subscriptor
Recall that I retired the Supermicro X9SCM-F motherboard and got a Supermicro X11SSH-F. I set it up with Ubuntu Server 22.04, and it is handling all data transfers with no issues. It's nice to have a working server again.

Although I can't prove it, the issue seems consistent with the Cougar Point SATA bug described by theevilsharpie. Thanks all!