Distributed Computing takes on Coronavirus COVID-19


Deleted member 32907

Guest
Yo, dawg, I heard you liked CPUs!

[attachment: xeons.png]


Code:
rgraves@xeon-boinc:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 48 bits virtual
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Model name: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
Stepping: 2
Frequency boost: enabled
CPU MHz: 1599.576
CPU max MHz: 2268.0000
CPU min MHz: 1600.0000
BogoMIPS: 4533.36
Virtualization: VT-x
L1d cache: 384 KiB
L1i cache: 384 KiB
L2 cache: 3 MiB
L3 cache: 24 MiB
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and _user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36
clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx pdpe1gb rdtscp lm
constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid
aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16
xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm epb pti ssbd ibrs ibpb
stibp tpr_shadow vnmi flexpriority ept vpid dtherm ida arat flush_l1d

Xeons are online, as long as I've got sun. The only real problem is that they don't sleep for shit - they pull 500W running, and about 150W "sleeping." So, not an option there. I'll play with suspend/resume, though last time I messed with Linux kernel hibernation there were some weird issues after resume - process priority inversion and the like until the system clock caught up with how long the machine had actually been down. Shutting it down means resuming tasks from the last checkpoint, which isn't a huge deal for most tasks, but for some of the longer-running stuff I'd lose a bunch of work. I'll experiment.

On the plus side, I've figured out how to get it to shut down ('shutdown -h now' apparently behaves differently from plain 'halt' - only the former reliably powers the box off), and despite the BIOS not exposing any Wake on LAN options, it does, in fact, wake on LAN. So I can remotely power it on and off!
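If you ever want to script the wake side, a WoL magic packet is trivial to build by hand. A sketch (the MAC address here is a placeholder, not the actual box's):

```python
import socket

def magic_packet(mac: str) -> bytes:
    """A WoL magic packet is 6 bytes of 0xFF followed by the MAC repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("expected a 48-bit MAC address")
    return b"\xff" * 6 + mac_bytes * 16

def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the packet; the NIC listens for it even while the host is powered off."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(magic_packet(mac), (broadcast, port))

# wake("00:11:22:33:44:55")  # placeholder MAC -- substitute the Xeon box's NIC address
```

Same thing the `wakeonlan` and `etherwake` packages do, just without the dependency.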

It's a respectable amount of CPU power, though with all 24 threads running, fewer instructions are getting retired than I'd hoped - only 40G or so across all the cores, which a more modern system can manage on far fewer cores. Then again, newer systems cost more money, and this heater was a gift for BOINC use. :) And man does it have RAM.
 

Deleted member 32907

Guest
On the minus side, both Folding@Home and Rosetta have been pulling their workunit deadlines in - hard. F@H's GPU units only have a one-day window, and Rosetta expects 24h of compute in 3 days, which is doable on grid power but harder on solar, when I can only run during good sun. Oh well - I can still turn in a lot of useful work, even if some of it ends up as stale sanity checks. It's still cloudy here, but it should be full sun soon.

I need to experiment with moving rigs outside, though. Getting the heat out of my office would be hugely useful. I've waffled between a well-vented server case in the corner of my office and just building an external enclosure and putting them in there. The main question is how I handle them for winter heat: having them outside is nice in the summer but annoying in the winter, and having them inside is nice in the winter (though they don't run as much) and annoying in the summer.

Of course, I'm adding 50% more panel to my office, so that ought to help power them in darker conditions.

[attachment: IMG_8424.JPG]


[attachment: boinctui.png]
 

MadMac_5

Ars Praefectus
3,700
Subscriptor
That's one heck of a lot of CPU cores to throw at Rosetta! I haven't done any testing with Rosetta under Linux yet on my Ryzen 2600X, but in your experience does going beyond the number of physical cores start to give rapidly diminishing returns? I know that for Folding@Home and Geant4 (which I use for some actual simulation work), there's a pretty steep drop-off in work done per additional thread once I go beyond the number of actual CPUs and start using HyperThreading/SMT.
 

Deleted member 32907

Guest
That's one heck of a lot of CPU cores to throw at Rosetta! I haven't done any testing with Rosetta under Linux yet on my Ryzen 2600X, but in your experience does going beyond the number of physical cores start to give rapidly diminishing returns? I know that for Folding@Home and Geant4 (which I use for some actual simulation work), there's a pretty steep drop-off in work done per additional thread once I go beyond the number of actual CPUs and start using HyperThreading/SMT.

Maybe? I've debated writing some monitoring code that tracks the instructions retired over various windows and tries to optimize tasks for total instructions completed. With some workloads, on some chips (the N216 climateprediction stuff is especially bad), throwing more tasks at the cores actually reduces total throughput - not just per task, but overall system instructions retired. They're just busy blowing the cache. But other workloads fit in nicely around it. I'm not sure how much effort it's worth to really optimize it vs just letting them run and hoping for good outcomes.

At least on Rosetta tasks which I've got running now, I don't see a regression in total instructions retired with 24 tasks vs about 20 - it's flat, but not slower.
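If I do write that monitoring code, the decision part is tiny. Assuming aggregate instructions-retired samples collected externally (e.g. with `perf stat -a -e instructions` over a fixed window at each task count - numbers below are made up to mirror the flat-at-the-top behavior), something like:

```python
def best_task_count(samples: dict) -> int:
    """Given {running task count: aggregate instructions retired per second},
    return the count that maximized whole-machine throughput.

    Ties go to the smaller count, which is what you want anyway: the same
    throughput for less cache pressure and power.
    """
    return min(samples, key=lambda n: (-samples[n], n))

# Made-up numbers mirroring the flat-at-the-top behavior described above:
observed = {18: 4.1e10, 20: 4.2e10, 24: 4.2e10}
print(best_task_count(observed))  # -> 20
```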
 

theevilsharpie

Ars Scholae Palatinae
1,199
Subscriptor++
Xeons are online, as long as I've got sun. The only real problem is that they don't sleep for shit - they pull 500W running, and about 150W "sleeping." So, not an option there.

If by "sleeping" you're referring to an ACPI sleep state, you're probably getting killed by the amount of physical memory you have installed.

72 GB is overkill for Rosetta@Home. If you pulled one stick per channel, you'd reduce your power usage, and would still have 48 GB available. Your memory frequency may also increase with fewer DIMMs per channel. Dropping down to 24 GB (1 DIMM per channel) would reduce power usage further, although you'd probably start swapping with some of the heavier WU's.

Also on the topic of power efficiency, if your goal is performance/watt rather than raw performance, I'd suggest restricting your maximum clock speed to below your base clock. This will effectively disable Turbo Boost, but should increase your performance/watt. I'm guessing setting the max frequency 1-2 ticks below the max base clock will produce the best efficiency, without dropping throughput too much.
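Something like this does it through the standard Linux cpufreq sysfs interface (untested sketch; the kHz figure is an assumption based on the X5650's 2.67 GHz base clock - check `cpuinfo_max_freq` on the actual box):

```python
from pathlib import Path

# X5650 base clock is 2.67 GHz; target ~10% below it. The exact kHz value is an
# assumption -- read cpuinfo_max_freq on the real box rather than trusting this.
BASE_KHZ = 2_668_000
TARGET_KHZ = BASE_KHZ * 90 // 100

def cap_all_cores(target_khz: int = TARGET_KHZ) -> int:
    """Write a max-frequency cap into the standard cpufreq sysfs files.

    Needs root; returns how many logical CPUs were updated.
    """
    updated = 0
    for f in Path("/sys/devices/system/cpu").glob("cpu[0-9]*/cpufreq/scaling_max_freq"):
        f.write_text(str(target_khz))
        updated += 1
    return updated
```

`cpupower frequency-set -u <freq>` accomplishes the same thing if the cpupower tool is installed.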

Lastly, if you haven't already, I'd suggest running "powertop" and following most of its suggestions (CPU performance excepted). Enabling some of the ancillary system power controls (e.g., USB suspend, PCIe link power management, etc.) can further drop power usage by 10-15 watts.

EDIT: You may also want to consider joining Ibercivis, which is currently running a project that tests whether some existing drugs are effective at treating COVID-19. In addition to spreading your compute power around more COVID-related projects, Ibercivis WU's are memory-light (all of my WU's have peaked under 100 MB), which makes it easier to run with only 24 GB of RAM.
 

Deleted member 32907

Guest
If by "sleeping" you're referring to an ACPI sleep state, you're probably getting killed by the amount of physical memory you have installed.

That's probably true. It's fully populated, so keeping all that memory alive is likely consuming a good bit of the sleep power. The fans don't power down either, though. That's fine - I just won't run the super-long-time-between-checkpoint units on it very much (cpdn has some of those for Linux hosts; I like them, but I also have a core or two of my 24/7 home server throwing cycles at them, which may change when I get home solar online, depending on the energy produced vs. used balance). Those WUs tend to be cache-brutal as well, so only having one or two is fine. They're a workload where you can watch total machine throughput (instructions retired per second across all cores) drop as you add more of them, because of cache pressure.

72 GB is overkill for Rosetta@Home. If you pulled one stick per channel, you'd reduce your power usage, and would still have 48 GB available. Your memory frequency may also increase with fewer DIMMs per channel. Dropping down to 24 GB (1 DIMM per channel) would reduce power usage further, although you'd probably start swapping with some of the heavier WU's.

Going to 48 is worth considering; going to 24, no. I get plenty of >1GB WUs, and I've been seeing some 2GB units - I'd rather be able to fit them. Total system throughput under different workunits is an interesting question, though: I've seen this system "flatline" around 18 running threads of Rosetta and such, but I'm not seeing performance regressions on this set of WUs. Also, being a server board, I'm not sure what the memory frequency would do if I pulled sticks out, and I'm not sure I particularly care - having 72GB of RAM is just sort of fun. It's the most I've ever had in a personal system by a good margin; I upgraded a box to 48GB recently before donating it for an indefinite period of video crunching work, and previously 32GB was my high water mark.

Also on the topic of power efficiency, if your goal is performance/watt rather than raw performance, I'd suggest restricting your maximum clock speed to below your base clock. This will effectively disable Turbo Boost, but should increase your performance/watt. I'm guessing setting the max frequency 1-2 ticks below the max base clock will produce the best efficiency, without dropping throughput too much.

Power efficiency is gained by turning it off during the hours the sun isn't up, and when it's cloudy. :p On a sunny day, I can't use all the power my panels generate as-is, and I'm working on adding 50% more to give me better winter running capability (I'd like to get off my generator for the most part in the winter). At some point, I might start charging the EV out there, but I'll see.

Power efficiency just isn't a concern of mine, though. This is an old, power-hungry box that's retired to a place that is power-consumption limited, to do some good before getting fully retired. My panels spend most of the day in use-it-or-lose-it mode, where I just can't consume everything they could potentially generate. This is pretty standard for off-grid systems: if you're paneled for winter, with short, cloudy days, you have more power than you know what to do with during the long, sunny summer days. This is my way of attempting to do something useful with it - it's either that or cryptocurrency mining, and this seemed an awful lot more "globally useful."

I can also radically improve power efficiency by moving the computer outside, and not having to pump hot air out of my office. I just haven't gotten around to building a sheltered plywood enclosure for servers and then figuring out how to mount it somewhere.

EDIT: You may also want to consider joining Ibercivis, which is currently running a project that tests whether some existing drugs are effective at treating COVID-19. In addition to spreading your compute power around more COVID-related projects, Ibercivis WU's are memory-light (all of my WU's have peaked under 100 MB), which makes it easier to run with only 24 GB of RAM.

Sure, I'll throw a few cycles at them. The Rosetta workunits have been coming with tighter deadlines than I can reliably hit, though they also tend to finish earlier than their estimates. WCG is soaking up the spares, but I'll toss Ibercivis in and let it fight for some cycles.
 

JimboPalmer

Ars Tribunus Angusticlavius
9,402
Subscriptor
so I've been running F@H for a few days now and joined Team Eggroll (as far as I know) but I don't see any of my stats or even my user name on the team 14 stats page. What could I be missing? yes I'm doing work :)
I see two MarkL volunteers, neither on Team Eggroll. I think it's just that, of the project's many servers, the stats server is the oldest and weakest.
 

nimro

Ars Tribunus Militum
2,097
Subscriptor++
Power efficiency is gained by turning it off during the hours the sun isn't up, and when it's cloudy. :p

I'm curious, do you have any automation set up to trigger a shutdown when available solar power drops below a certain threshold? Or do you just manually boot it in the morning then turn it off at night?
 

Deleted member 32907

Guest
I'm curious, do you have any automation set up to trigger a shutdown when available solar power drops below a certain threshold? Or do you just manually boot it in the morning then turn it off at night?

The process is manual. I do have the machines set up for wake on LAN, so I can power them on from the house if I want, but powering them on is purely manual, powering them off is either manual or a time based thing (I have a scheduled event that I move throughout the year to shut down in the evening if I've not shut them down earlier).

I've debated doing the sort of automated controls that would look at output, look at weather forecasts, and do the control automatically, but it just doesn't gain me much, and I'd have to rig something for air conditioner control as well (I'll let it run in eco mode overnight if I've got good sun during the days, but if it's cloudy, I shut that down too to save on power). It's not a big deal to walk out there and toggle stuff, and I also factor in a couple days ahead of both weather forecasts and my intended use. If it's a Friday afternoon, with a sunny Saturday forecast and I'm not going to be out there, I'll let it run later in the evening than if I'm going into a few cloudy workdays.

I just figured the time I'd spend writing, troubleshooting, second guessing, and dealing with corner cases is far more than I actually spend toggling stuff on and off daily.
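For the record, the decision logic itself would be tiny - the corner cases are the real cost. A sketch, with the solar-output reading left out since it depends entirely on the charge controller, and with made-up threshold numbers:

```python
import subprocess

LOW_WATTS = 400      # below this the box can't sustain itself -- made-up threshold
GRACE_SAMPLES = 10   # require sustained low output so one passing cloud doesn't kill WUs

def should_shut_down(recent_watts, low=LOW_WATTS, grace=GRACE_SAMPLES):
    """True only if the last `grace` solar-output samples were all below threshold."""
    return len(recent_watts) >= grace and all(w < low for w in recent_watts[-grace:])

def shut_down():
    """-h powers the box off, matching the manual routine; WoL brings it back later."""
    subprocess.run(["sudo", "shutdown", "-h", "now"], check=True)
```

Feeding it output samples from the charge controller, plus folding in weather forecasts, is where all the actual work (and second-guessing) would go.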
 

Drizzt321

Ars Legatus Legionis
28,408
Subscriptor++
I'm curious, do you have any automation set up to trigger a shutdown when available solar power drops below a certain threshold? Or do you just manually boot it in the morning then turn it off at night?

The process is manual. I do have the machines set up for wake on LAN, so I can power them on from the house if I want, but powering them on is purely manual, powering them off is either manual or a time based thing (I have a scheduled event that I move throughout the year to shut down in the evening if I've not shut them down earlier).

I've debated doing the sort of automated controls that would look at output, look at weather forecasts, and do the control automatically, but it just doesn't gain me much, and I'd have to rig something for air conditioner control as well (I'll let it run in eco mode overnight if I've got good sun during the days, but if it's cloudy, I shut that down too to save on power). It's not a big deal to walk out there and toggle stuff, and I also factor in a couple days ahead of both weather forecasts and my intended use. If it's a Friday afternoon, with a sunny Saturday forecast and I'm not going to be out there, I'll let it run later in the evening than if I'm going into a few cloudy workdays.

I just figured the time I'd spend writing, troubleshooting, second guessing, and dealing with corner cases is far more than I actually spend toggling stuff on and off daily.

Sure. But wouldn't it be fun and interesting? :cool: hehe
 

Deleted member 32907

Guest
Sure. But wouldn't it be fun and interesting? :cool: hehe

I have no shortage of fun and interesting projects that have the side perk of being useful as well. :)

It's really interesting with these Xeons - some of the tasks are actually tight enough that they're running almost entirely out of L3 cache, with basically no memory accesses. Some thrash memory and struggle with IPC.
 

MadMac_5

Ars Praefectus
3,700
Subscriptor
Update: CPU units are coming in much more frequently now for Folding@Home, and are a mix of COVID-related and other work. GPU work units that I've been getting are all COVID-19 related.

Also, the most recent Moonshot "sprint" finished up yesterday; it started November 5th, and finished some time yesterday afternoon. We're making a difference!