Experimental build: Broadwell, eDRAM-based compute node


Deleted member 32907

Guest
I've been playing around with various distributed compute projects in my office, and one thing I've noticed is that some of them really like their memory bandwidth and cache - and run poorly without them. Plenty run purely in cache or nearly so, but CPDN is one that's really hard on memory bandwidth - it runs global climate models, and I cut my Linux admin teeth partly maintaining a cluster doing that sort of stuff years ago.

For reasons I fail to understand, the whole "big L4 cache" thing hasn't really taken off - Intel has only ever released a handful of eDRAM chips, mostly for laptops. A few did come out as desktop parts, though, and the late-Broadwell-era one looks particularly promising for memory-heavy workloads: onboard GPU (so no need for a discrete card on the board), 4C/8T, 6MB L3, and 128MB of eDRAM acting effectively as L4.

Fine, they're cheap on the used market.

I have one of these beasties on order (used) with associated hardware to make it run. I'm going to set it up as a little BOINCbox and do some testing of performance on the gnarly workunits I like to run, as well as some others.

One thing I've noticed is that on a lot of systems, especially with hyperthreading, full utilization gets no more instructions retired per second than reduced utilization - so I'm debating writing some sort of automatic system that fiddles with the workunit count to optimize for total system throughput (a sketch follows). In a rather entertaining case, my old Xeon box, running some workunits that fit entirely in L3 cache, retires nearly the same number of instructions per second with 12 threads (60G) as with 24 threads - doubling the thread count buys only a 10% boost, to 66G.
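
A rough sketch of what that auto-tuner could look like - assuming Linux perf is available, and hand-waving the BOINC side (set_max_concurrent is a placeholder; in practice it'd rewrite app_config.xml and poke the client to re-read it):

Code:
#!/usr/bin/env python3
# Sketch: find the concurrency that maximizes system-wide instructions
# retired per second. Assumes Linux 'perf'; the BOINC hook is a stub.
import re
import subprocess
import time

def instructions_per_second(seconds=30):
    """Count system-wide instructions retired over an interval via perf."""
    out = subprocess.run(
        ["perf", "stat", "-a", "-e", "instructions", "sleep", str(seconds)],
        capture_output=True, text=True, check=True,
    ).stderr
    count = re.search(r"([\d,]+)\s+instructions", out).group(1)
    return int(count.replace(",", "")) / seconds

def set_max_concurrent(n):
    """Placeholder: cap the number of running tasks, e.g. by writing
    <project_max_concurrent> into app_config.xml and re-reading config."""
    ...

best = None
for threads in (4, 6, 8, 12, 16, 24):
    set_max_concurrent(threads)
    time.sleep(60)  # let the task mix settle before sampling
    ips = instructions_per_second()
    print(f"{threads:>2} threads: {ips / 1e9:5.1f}G instructions/sec")
    if best is None or ips > best[1]:
        best = (threads, ips)
print(f"best total throughput at {best[0]} threads")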

My office has an obscene amount of power now (I added another 8 panels in the fall, so I now have 5.2kW of solar on my shed), and I'm working towards being able to make better use of this power this summer. I'll probably be putting some of the compute outside as well - I've got a few things I want to try on that front, though I keep waffling between "adding vents to my office and venting indoor computers outside" vs "just putting the computers outside." I do like the heat in the winter, but I really just need to get the heat outside. If I could move all my compute outside in the summer, this would be really convenient for me.

Anyway, any experiences with the eDRAM-based chips for compute? They seem like they should be really good at it.
 

redleader

Ars Legatus Legionis
35,019
Intel's eDRAM didn't take off because it's a whole second die, fabbed on the same node as the main CPU and occupying almost as much wafer area, for maybe 10-20% more performance. For that to make business sense, Intel would have to charge almost double for the CPU, and as Intel found out, very few people took them up on that offer. But yeah, if you can find them second hand, that's Intel's problem and not yours :)
 

MadMac_5

Ars Praefectus
3,700
Subscriptor
Deleted member 32907 said:
One thing I've noticed is that on a lot of systems, especially with hyperthreading, full utilization gets no more instructions retired per second than reduced utilization... my old Xeon box, running some workunits that fit entirely in L3 cache, retires nearly the same number of instructions per second with 12 threads (60G) as with 24 (66G).
I have seen similar results with Geant4-based Monte Carlo simulations for medical imaging/radiation therapy use, where we simulate photon interactions along paths in tissue. The software LOVES having L3 cache (I saw a massive boost going from Zen+ to Zen 2), and it's embarrassingly parallel since each particle's track is independent, so more threads = MOAR BETTER. But I've found that as soon as I split my workload beyond the number of physical cores in a system, the performance gain drops off substantially, even where there's no I/O bottlenecking. Based on what I understand about SMT and Hyper-Threading, that makes sense: SMT pays off when one thread can use execution units the other threads leave idle, but since all of the threads here are doing similar math as they crunch away, there's no hardware left over to run the extra threads for most of the runtime.

If the climate modelling in CPDN is similar, that could explain why you don't see a substantial increase in calculation rate when you go beyond 12 threads.
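
If you want to A/B that without touching BOINC settings, one option on Linux is to read the SMT topology from sysfs and pin the job to one hardware thread per physical core - a quick sketch, assuming the usual x86 sysfs layout:

Code:
#!/usr/bin/env python3
# Sketch: collect one hardware thread per physical core from Linux
# sysfs, then pin this process (and its children) to that set.
import glob
import os

physical = set()
for path in glob.glob(
        "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"):
    with open(path) as f:
        # File looks like "0,12" or "0-1"; keep the lowest sibling.
        first = f.read().strip().replace("-", ",").split(",")[0]
        physical.add(int(first))

print(f"one thread per core: {sorted(physical)}")
os.sched_setaffinity(0, physical)  # children inherit this affinity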
 

Deleted member 32907

Guest
The stuff running in Xeon L3 cache is actually World Community Grid Mapping Cancer Markers tasks. I think they're all running roughly the same code, so they're sharing a ton of L3. Fun with Intel PCM monitoring...

The CPDN stuff is known to want about 4MB of L3 per thread to run well. I'm interested in what it'll do with less than 4MB/thread but a ton of eDRAM to work with. Practically, though, the right mix is probably one or two of those and then lighter weight stuff for the rest.
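
For that mix, a minimal app_config.xml sketch (dropped in the CPDN project directory; I'd double-check that a config re-read picks it up without a client restart):

Code:
<!-- Cap CPDN at two concurrent tasks so they don't fight over cache;
     lighter workunits can fill the remaining threads. -->
<app_config>
    <project_max_concurrent>2</project_max_concurrent>
</app_config>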

Playing with my NUC, though, I'm having fun with thermals... this is an eDRAM chip of a later generation (i7-6775, IIRC).

[attached image]
 

Deleted member 32907

Guest
One of the other things I've been considering here is the performance benefits (or lack thereof) of running only one type of task at a time.

I typically monitor L3 cache hit rate over time (I really need to automate some of this - sketch below), and it's substantially higher when working only one type of workunit at a time - which makes a lot of sense, since WUs of the same type run the same core code over different data. Having the same code sitting read-only in cache for all the threads improves cache utilization: if you've got 200KB of code and 8 threads, having them all work on the same binary instead of eight different ones saves up to 1.4MB of cache, which is a big deal.
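
Automating the monitoring half is pretty simple - a sketch, assuming the generic LLC-loads / LLC-load-misses perf events are exposed on the box:

Code:
#!/usr/bin/env python3
# Sketch: log system-wide L3 (LLC) hit rate once a minute, using
# perf's CSV output mode (-x ,). Event availability varies by CPU.
import subprocess
import time

def llc_hit_rate(seconds=10):
    out = subprocess.run(
        ["perf", "stat", "-a", "-x", ",",
         "-e", "LLC-loads,LLC-load-misses", "sleep", str(seconds)],
        capture_output=True, text=True, check=True,
    ).stderr
    counts = {}
    for line in out.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[0].isdigit():
            counts[fields[2]] = int(fields[0])
    return 1.0 - counts["LLC-load-misses"] / counts["LLC-loads"]

while True:
    print(f"{time.strftime('%H:%M:%S')}  LLC hit rate: {llc_hit_rate():.1%}")
    time.sleep(60)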

For projects that support it, it might make sense to process only one WU type at a time to optimize throughput.

Has this sort of work been done anywhere? I know I'm pretty far behind the curve on distributed compute optimization...
 

continuum

Ars Legatus Legionis
94,897
Moderator
Well, I'm curious where this takes you. Anandtech did a Broadwell retrospective not too long ago, but it doesn't cover F@H or anything super specific for you.

A more modern take -- which I'm pretty sure you're aware of (but I might as well paste here for posterity :p ) might be AMD Zen 2 vs. Zen 3 and the CCX changes.

On Zen 2, the four cores inside each CCX only have direct access to 16MB of L3 cache, whereas on Zen 3, all eight cores within the CCX share the same 32MB of L3 cache.
https://www.tomshardware.com/news/ryzen ... r-workings
 

Deleted member 32907

Guest
continuum said:
Well, I'm curious where this takes you. Anandtech did a Broadwell retrospective not too long ago, but it doesn't cover F@H or anything super specific for you.

I believe their conclusion was that it still held up pretty well. It's mostly the CPDN workunits I'm interested in improving performance on - they're just gross, memory-bandwidth-wise. My supercooled Haswell NUC is doing a tolerable job, holding almost 1 IPC with 8 of them running, and showing "more instructions retired with more tasks running" behavior instead of the inversions I sometimes see (more tasks, fewer instructions retired). It'll be interesting to see what an uncorked Broadwell can do - I plan to raise the thermal limits and add cooling as needed. Heatpipe heatsinks are nice.

continuum said:
A more modern take -- which I'm pretty sure you're aware of (but I might as well paste here for posterity :p ) might be AMD Zen 2 vs. Zen 3 and the CCX changes.

Yes, I'm just trying to do this set of experiments on used hardware instead of buying new or nearly-new stuff.
 

Deleted member 32907

Guest
Parts arrived and assembled.

Still working on the clocks, but... it's happy at 3.8GHz for now and I'll mess with it a bit more later.

Chewing on some trivial workunits for testing (Mapping Cancer Markers - they run heavily out of L3 cache), I'm retiring 43G instructions per second on 4C/8T. For comparison, my Xeon heater (12C/24T of X5650s) retires 61G instructions per second - on 24 threads. Though admittedly, 12 threads mostly pegs that one out...
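
Per hardware thread, the gap is bigger than the totals make it look:

Code:
# Per-thread retirement rates from the numbers above:
broadwell = 43e9 / 8     # ~5.4G instructions/sec per thread (4C/8T)
xeon      = 61e9 / 24    # ~2.5G instructions/sec per thread (12C/24T)
print(f"Broadwell: {broadwell / 1e9:.1f}G, Xeon: {xeon / 1e9:.1f}G per thread")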

I'll get some harder WUs on it later today to chew on. And I'll keep messing with clocks - I can probably get it up to 4GHz stable, I'm just not sure there's much to gain there. It depends on the workunits, and I don't care to really hammer it - I don't mind pushing it a bit, but I don't want to run it on the edge for compute tasks.
 

Deleted member 32907

Guest
Quote:
Reminds me I read this a few months ago.

That's a ripped version of an Anandtech article... o_O

Quote:
Takeaway: the Broadwell eDRAM has a bandwidth of 50 GB/s, about 2x the DDR3 of the day. 3200MHz DDR4 has 51.2 GB/s of bandwidth. The main system memory of my Ryzen is faster than the eDRAM on that Broadwell.

Sure, but what did your new build cost? I built this out of used parts for about $300, with 16GB of DDR3 RAM, SSD, etc.
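
For anyone following along, the 51.2 figure is just the standard dual-channel math:

Code:
# Where DDR4-3200's 51.2 GB/s comes from:
transfers_per_sec = 3200e6   # 3200 MT/s
bytes_per_transfer = 8       # 64-bit channel
channels = 2
print(transfers_per_sec * bytes_per_transfer * channels / 1e9)  # 51.2 GB/s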

On the other hand, it doesn't seem to be chewing through CPDN units terribly faster than my NUC. Faster than the Xeon, certainly. I should find some benchmarks to mess with.