The Lunar Lake Thread

Anonymous Chicken

Ars Scholae Palatinae
1,134
Subscriptor
Seems to me that Intel has now sacrificed everything in the name of power efficiency. Nothing is sacred; they want to win.
  • everything fabbed at TSMC
  • RAM moves on package
  • HT dropped
  • cores dropped
  • seems like the E-cores are excluded from the L3 ring bus
If this doesn't win the power crown, it's not happening. Might as well start designing that ARM product.
 
  • Like
Reactions: rodalpho

hobold

Ars Tribunus Militum
2,657
The reporters at c't magazine speculate that development of Lunar Lake started back when Apple ditched Intel CPUs (assuming 5 to 6 years of development time for a completely new, from-scratch design). They reported this hypothesis not because of any evidence, but because it nicely explains Intel's odd design decisions: external fabbing, low number of cores, low number of chiplets, high-performance AI accelerator, integrated DRAM.

If that theory is correct, then Lunar Lake was really meant for devices like the MacBook Air: extremely thin and light, no fans, extremely low power. It is unclear if Intel will be able to position it there, when Apple doesn't actually buy the chips. In the current competitive landscape, a fanless laptop might be regarded as underpowered compared to Arrow Lake and Strix, while still not offering the battery life of an Air or some other ARM device.

Intel might have to abandon the fanless path and "overclock" beyond their original design target.
 
Last edited:

Anonymous Chicken

Ars Scholae Palatinae
1,134
Subscriptor
Intel's odd design decisions
I regard them as unusually rational choices given Intel's problem with power usage and the pressure from all sides by competitors who do better there. Several of the changes also look like responses to efficiency problems in relatively recent products, although given the time it takes to design these things, perhaps Intel foresaw its upcoming issues with chiplet efficiency, with E-cores being unable to save power because of everything else around them, and (I speculate) with the low-power island being too limited.

It's amazing to see them launch a product that can probably only win one benchmark: power. Yet this product probably delivers all the performance anyone particularly needs. While AMD & Qualcomm saw fit to put 12 full-feature cores into their latest & greatest mobile products, Intel went with essentially a pair of quads. AMD is raking in PR on a part that sports 24 threads; Intel, at the same time, launches something with 8. As far as I can guess, half of Intel's cores don't even have access to an L3 cache. (!)

I'm thinking Intel is going to show better real-world runtimes for thin laptops than AMD or Qualcomm in this new product cycle.
 

hobold

Ars Tribunus Militum
2,657
I'm thinking Intel is going to show better real-world runtimes for thin laptops than AMD or Qualcomm in this new product cycle.
Intel certainly has a good chance to win battery benchmarks. In actual usage, though, the other processors are quicker to complete any interactive tasks asked of them. So it'll be a battle of two philosophies: 1. don't ever power up into the high wattages, and 2. "race to sleep" and then power down again sooner.

I don't know who will win. There'll be lots of opportunity to game such benchmarks. And different users with different usage habits will see very different results.
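
A back-of-envelope sketch of the two philosophies (every wattage and duration below is invented purely for illustration, not measured from any of these chips):

```python
# Energy to finish one task inside a fixed interaction window:
# run at active_w for active_s seconds, then idle for the remainder.
def task_energy(active_w: float, active_s: float,
                idle_w: float, window_s: float) -> float:
    return active_w * active_s + idle_w * (window_s - active_s)

WINDOW = 10.0  # seconds between user interactions (assumed)

# Philosophy 1: never boost into high wattages; slower task completion.
never_boost = task_energy(active_w=4.0, active_s=6.0, idle_w=0.3, window_s=WINDOW)

# Philosophy 2: "race to sleep"; boost hard, finish fast, idle sooner.
race_to_sleep = task_energy(active_w=15.0, active_s=2.0, idle_w=0.3, window_s=WINDOW)

print(f"never-boost:   {never_boost:.1f} J")    # 4*6  + 0.3*4 = 25.2 J
print(f"race-to-sleep: {race_to_sleep:.1f} J")  # 15*2 + 0.3*8 = 32.4 J
```

Which strategy wins depends entirely on the ratio of boost power to speedup and on the idle floor, which is exactly why such benchmarks are so easy to game.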
 

Anonymous Chicken

Ars Scholae Palatinae
1,134
Subscriptor
I don't think Intel has sacrificed any single-thread performance on the P-cores, except perhaps via the mere 12MB of L3. It is a bit unclear what everyone is doing exactly, but if Intel is the only one who can run most tasks while most of the system sleeps, they are going to have excellent idle power numbers. (Perhaps Apple does the same without drawing attention to it.) A cluster of 4 updated E-cores isn't particularly slow, either.

I wonder if Qualcomm (or AMD) has done anything similar with their 12 cores.
 

Aeonsim

Ars Scholae Palatinae
1,057
Subscriptor++
I was reading they've also entirely removed AVX-512 support, along with HT, from the cores (no longer merely disabled; it's restricted to server variants or possibly high-end desktop). They seem to have a very different direction of travel from AMD, which has now added full-speed AVX-512 support to all their Zen 5 cores (up from half speed).

From the point of view of a system designed for long battery life with good-enough to decent performance (+14% IPC isn't terrible), these seem like decent options for 90% of workloads. It will likely make for a perfectly decent laptop for office work, web browsing, etc. Then they have the GPU and NPU to make up for the loss of the AVX-512 functions on the main cores if you need to do heavy vector workloads like ML. It's likely not a good core design for HPC-style compute, though they will likely be focusing the server-class variants of these cores on that style of workload.

Also interesting that apart from talking about the architecture, Intel didn't actually launch any CPUs or provide any real performance measures apart from the somewhat meaningless TOPS numbers.
 
Last edited:

w00key

Ars Praefectus
5,907
Subscriptor
It seems like a good answer to Apple's M series SoC; on-package memory should give the GPU more bandwidth compared to DIMMs. The biggest difference I notice between an Intel laptop and a new M3 MBP is how the M3 can build the release variant of an Android app while barely spinning its fans, whereas the old one goes to takeoff speed. The Zen 3 / 5800X desktop with a big air cooler also spins up all its fans doing that, since the fan curve ramps up rapidly above 50°C. It's crazy how the M3 does it in half the time of a previous-gen desktop at a minuscule fraction of the power.

If Lunar Lake comes close to even the M1 it would be a huge improvement over the current offerings.

The NPU also looks useful. Currently you need a pretty big GPU for some workloads due to VRAM requirements; on Lunar Lake, you can just use that 32GB of on-package memory @ 8.5 GT/s × 8 bytes = 68 GB/s per channel, 136 GB/s total (close to the M3 Pro's 153 GB/s). It should be fine for normal users running inference tasks. Heavy workloads and training will still run on $2k+ cards anyway; this is no replacement for that.
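
For anyone checking that arithmetic, a minimal sketch (assuming LPDDR5X-8533 on two 64-bit channels; these are peak figures, and sustained bandwidth will be lower):

```python
# Peak-bandwidth arithmetic for the figures above (assumes LPDDR5X-8533
# on two 64-bit channels; real sustained bandwidth comes in lower).
transfer_rate_gt = 8.533      # giga-transfers per second
bytes_per_transfer = 8        # 64-bit channel = 8 bytes per transfer
channels = 2

per_channel = transfer_rate_gt * bytes_per_transfer  # ~68.3 GB/s
total = per_channel * channels                       # ~136.5 GB/s
print(f"{per_channel:.1f} GB/s per channel, {total:.1f} GB/s total")
```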
 

Anonymous Chicken

Ars Scholae Palatinae
1,134
Subscriptor
The NPU also looks useful. Currently you need a pretty big GPU for some workloads due to VRAM requirements; on Lunar Lake, you can just use that 32GB of on-package memory @ 8.5 GT/s × 8 bytes = 68 GB/s per channel, 136 GB/s total (close to the M3 Pro's 153 GB/s). It should be fine for normal users running inference tasks.
I was thinking about that with all the first-gen "AI machines" and 16GB of RAM. Surely that will seem a bit restrictive as AI models develop? I imagine the trend will be towards larger models, and the rest of the desktop experience has to fit there, too.

I was reading they've also entirely removed AVX-512 support, along with HT, from the cores (no longer merely disabled; it's restricted to server variants or possibly high-end desktop).
They went all-in. Everything not meeting an actual need was thrown overboard. Compare that to Qualcomm with 12 cores, swinging wildly at those benchmark wins. Has AMD or Qualcomm split their 12 cores into 4 & 8? Who will take the efficiency crown? I can't recall so much ever happening at once before.
 
  • Like
Reactions: rodalpho

Aeonsim

Ars Scholae Palatinae
1,057
Subscriptor++
I was thinking about that with all the first-gen "AI machines" and 16GB of RAM. Surely that will seem a bit restrictive as AI models develop?
Agreed, even at the moment most of the better LLMs need more RAM than that. Mixtral is around 48GB. 32GB is okay, but really there should be 48GB and 64GB options for future-proofing, especially considering the GPU will be eating a reasonable amount of the RAM.
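
Rough napkin math on where figures like that come from (Mixtral 8x7B totalling ~47B parameters is roughly right; the bytes-per-parameter choices are assumptions, and KV cache and runtime overhead are ignored):

```python
# Napkin math for LLM memory footprints. Weights only; ignores KV cache,
# activations, and runtime overhead.
def model_ram_gib(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 2**30

# Mixtral 8x7B has roughly 47B total parameters.
print(f"fp16:  {model_ram_gib(47, 2):.0f} GiB")    # ~88 GiB
print(f"8-bit: {model_ram_gib(47, 1):.0f} GiB")    # ~44 GiB
print(f"4-bit: {model_ram_gib(47, 0.5):.0f} GiB")  # ~22 GiB
```

The ~48GB figure lines up with running it at 8-bit with a little headroom, and everything else on the machine still needs RAM on top of that.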

For high-end use cases, these machines don't seem like they'll age well.


Also, for multithreaded CPU tasks, I wouldn't be surprised if this is a drop in performance compared to the previous generation or two. A max of 8 cores vs. what, 16 cores plus multithreading? 14% extra IPC doesn't necessarily beat 50% more P-cores plus hyperthreading. The 38-60% extra IPC on the E-cores, compared to the LP E-cores (not the standard E-cores) from Meteor Lake, may mean they can equal the 8 E-cores with L3 access plus the 2 LP E-cores. It will be interesting to see...
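
A crude way to frame that comparison (every factor below is an invented illustration, not a measurement; the ~25% SMT uplift and the E-core-at-60%-of-a-P-core weighting are assumptions, and real multithreaded scaling is nothing like this linear):

```python
# Crude relative-throughput framing, ignoring clocks, memory, and the
# 2 LP E-cores. Assumed: E core ~60% of a P core, SMT worth ~25%,
# +14% P-core IPC and ~+50% E-core IPC (midpoint of the quoted 38-60%).
P, E = 1.0, 0.6

meteor_lake = 6 * P * 1.25 + 8 * E         # 6P with SMT + 8E, old IPC
lunar_lake  = 4 * P * 1.14 + 4 * E * 1.50  # 4P + 4E, new IPC

print(f"6P(+HT) + 8E, old IPC: ~{meteor_lake:.1f}")  # ~12.3
print(f"4P + 4E, new IPC:      ~{lunar_lake:.1f}")   # ~8.2
```

On numbers like these the smaller part simply has less multithreaded grunt, which is rather the point: Lunar Lake is not trying to win that benchmark.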
 
Last edited:

w00key

Ars Praefectus
5,907
Subscriptor
Agreed, even at the moment most of the better LLMs need more RAM than that. Mixtral is around 48GB. 32GB is okay, but really there should be 48GB and 64GB options for future-proofing, especially considering the GPU will be eating a reasonable amount of the RAM.

For high-end use cases, these machines don't seem like they'll age well.
It's a 4-performance-core laptop part with on-package RAM; you didn't expect it to replace a 4090, right?

The 4080 "only" has 16 GB VRAM and that's the limit for model sizes usable on consumer devices. Any larger needs a significant $ investment and will run remotely instead.
 

fitten

Ars Legatus Legionis
52,250
Subscriptor++
Quick rundown by Gamers Nexus...

Some interesting bits...
  • P-cores drop HT to be wider.
  • P-cores get finer clock control... current is 100MHz increments, new ones 16.67MHz increments
  • E-cores are faster in several areas (the claim is that the E-cores make up for the loss of HT slack)
  • E-cores will come in 4 core clusters (current is in 2 core clusters)

There's a bunch of stuff, but the 2nd half is basically talking about the CPUs.

 
Last edited:
  • Like
Reactions: continuum

redleader

Ars Legatus Legionis
35,019
Everyone invested in CAMM2 is probably @#$Ing livid right now.

Curious about what higher end mobile parts will look like. 16 GB is absolutely too restrictive for a lot of workloads, not just AI models.
CAMM2 will be for Arrow Lake. Lunar Lake is low-power mobile to compete with Qualcomm's Snapdragon. 16-32GB of RAM is plenty for that, and there's no point in expandable memory for compact devices; the main point of the on-package RAM is to save space that can be used for bigger batteries.
 

Anonymous Chicken

Ars Scholae Palatinae
1,134
Subscriptor
Your phone might also have too much memory.
 

Aeonsim

Ars Scholae Palatinae
1,057
Subscriptor++
It's a 4-performance-core laptop part with on-package RAM; you didn't expect it to replace a 4090, right?

The 4080 "only" has 16 GB VRAM and that's the limit for model sizes usable on consumer devices. Any larger needs a significant $ investment and will run remotely instead.
It's RAM that has to cover both GPU and CPU; I personally consider 32GB about the minimum for serious work. A 4080 with 16GB is likely in a system with at least 32GB of RAM, giving 48GB total; even on less balanced builds it'll have at least 16GB for the CPU and 16GB for the GPU.

Most high-end cellphones have 16GB now, with some at 24GB. When my cellphone has the same amount of RAM as Intel's base spec, the laptop has insufficient memory.
 

continuum

Ars Legatus Legionis
94,897
Moderator
Intel apparently intends to make "P" cores both with and without hyperthreading, depending on the market. So that answers the speculation about security and simplicity and all that. Intel took out HT because Lunar Lake is dedicated 100% to the task at hand: low power.

Oooo, nice find on Chips and Cheese's Lion Cove preview. I had seen a bunch of others (including Anandtech's apparently AI-written tripe; so sad to see them fall :( ), and Chips and Cheese's definitely looks the most thorough of what I've seen so far.

I'm much more excited/curious now to see how Lion Cove will perform, especially given it's on TSMC N3B in Lunar Lake. A straight comparison to Lion Cove on Intel 20A (via Arrow Lake) is obviously going to have a bunch of asterisks, but I'm now doubly curious to see any comparisons between two very similar Lion Cove implementations on different processes.
 

Paladin

Ars Legatus Legionis
32,552
Subscriptor
Your phone might also have too much memory.
Yeah, unless your phone is running some virtual machines/containers or an LLM locally, it is probably more than really needed. 16GB seems like a ton for a phone. 24GB seems silly. I'm sure it can fill it all up with something, but having a bunch of junk cached that you will never need isn't really effective in a lot of cases; you're just burning battery for nothing.

For a laptop/tablet, 16GB is probably plenty unless you are looking at something for gaming/development/workstation type work and 32GB is pretty good for that. Maybe 64GB will be on the table next year if things go well with these products.
 

w00key

Ars Praefectus
5,907
Subscriptor
It's RAM that has to cover both GPU and CPU; I personally consider 32GB about the minimum for serious work. A 4080 with 16GB is likely in a system with at least 32GB of RAM, giving 48GB total; even on less balanced builds it'll have at least 16GB for the CPU and 16GB for the GPU.

Most high-end cellphones have 16GB now, with some at 24GB. When my cellphone has the same amount of RAM as Intel's base spec, the laptop has insufficient memory.
Again, this is a low-power part with 4 big cores plus a bunch of E-cores. If it comes anywhere near the power of a desktop, it's mission accomplished.

Also, I'm typing this on the flagship Sony compact of 2023: 8GB RAM. Same as the best-selling Samsung S23. Pixel 8, same. There are phones with more, but they are rare. This 8GB is enough to feed the OS, apps, and whatever tensor shit Google is pushing.
 

redleader

Ars Legatus Legionis
35,019
Followup question: do they also make versions with and without the transistors for AVX512?

That video from Level1Techs made it sound like Intel really shook up their design tools, supporting different fabs and features.
This is customized for mobile only and ported to TSMC, so they likely removed as much of the AVX-512 hardware as they could to reduce costs. Arrow Lake is the version that will be shared with Xeon servers, so that would be the one (if any) with AVX-512 present but disabled in the consumer hardware.
 

redleader

Ars Legatus Legionis
35,019
For laptops that need "a lot" of RAM, could they use the on-package RAM and a CAMM2 module? OK, the CAMM2 will be slower, but numerous cache levels have been a thing for a long time. It's slower than the package stuff, faster than NVMe storage.
I feel like I say this at least once a month, but putting RAM on package does not make it faster; the speed of DDR5 and LPDDR5X is not determined by the package it is put in, but by the internal timings and rise times of the DRAM itself. You might save a bit of power by keeping traces short, but the memory doesn't actually get faster just because it is closer. There is a maximum distance you can route DDR memories due to signal integrity, but it is pretty long and not a real limitation. That is why we are going to DDR6 at 12800 MT/s on DIMMs with JEDEC timings, and overclocked (socketed!) modules closer to 16000-20000 MT/s. Physically, you can send data at those frequencies over PCB traces.
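
To put numbers on that, a sketch (assuming two 64-bit channels in both configurations; peak figures only):

```python
# Peak bandwidth follows transfer rate and bus width, regardless of whether
# the DRAM sits on package or on a module (assumes two 64-bit channels).
def peak_gb_s(mt_per_s: int, channels: int = 2, bytes_per_transfer: int = 8) -> float:
    return mt_per_s * bytes_per_transfer * channels / 1000

print(f"LPDDR5X-8533 on package: {peak_gb_s(8533):.0f} GB/s")   # ~137 GB/s
print(f"DDR6-12800 on DIMMs:     {peak_gb_s(12800):.0f} GB/s")  # ~205 GB/s
```

The socketed configuration comes out ahead on raw peak bandwidth, which is the whole point: packaging is about space and power, not speed.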

With that in mind, the reason you put memory on package is to save space and simplify layout, and maybe also to save a tiny bit of power. But if you have a big enough device to fit a memory socket, you don't care about saving a few square millimeters of board, and so you would not bother with on-package memory in the first place. Or at least not for a CPU; a GPU or AI accelerator with HBM combined with external DRAM has been done before.
 
Intel made similar promises with Meteor Lake, so grain of salt. It's possible Intel pulls victory from the jaws of defeat, but I find it far less interesting than Qualcomm's product.

Hopefully Arrow Lake won't have SMT either. Heck, same for Xeons. We can pack tons of real cores in there now, it isn't 2002, and SMT is a hack that presents a continuously exploited security risk.
 
  • Like
Reactions: charliebird

redleader

Ars Legatus Legionis
35,019
Hopefully Arrow Lake won't have SMT either. Heck, same for Xeons. We can pack tons of real cores in there now, it isn't 2002, and SMT is a hack that presents a continuously exploited security risk.
Xeons will almost certainly still have it due to the perf/watt gains for multithreaded code. Removing it only makes sense if you want to optimize the P cores for single thread performance exclusively, which is not what 60 core Xeon servers are about.
 
Xeons will almost certainly still have it due to the perf/watt gains for multithreaded code. Removing it only makes sense if you want to optimize the P cores for single thread performance exclusively, which is not what 60 core Xeon servers are about.
Xeons will definitely have it, don't get me wrong. My proposal is this: drop SMT, use all that saved space on E-cores, and see how that measures up.
 

fitten

Ars Legatus Legionis
52,250
Subscriptor++
Xeons will definitely have it, don't get me wrong. My proposal is this: drop SMT, use all that saved space on E-cores, and see how that measures up.

Back in the day, I think Intel said it cost something like 5% of the Pentium 4 die to add HT. I don't know what it is now, but I'm not sure removing HT from four P-cores would free enough space for one additional E-core. For the P-core in Lunar Lake, they removed HT but also made the core wider with the space they reclaimed, so I'm not sure what the net core shrinkage was. From the slides, it seems like the widened core without HT is still smaller than the core with HT, though I'm not sure if that was intentional. I didn't see where the Intel speaker specifically called that out, but I did see some YouTubers who made that inference.
 

hobold

Ars Tribunus Militum
2,657
drop SMT, use all that saved space on E-cores, and see how that measures up.
Judging by the evidence we already have, adding SMT2 to a wide & fast P-type core yields >20% more peak throughput for <10% larger silicon area. In more concrete numbers: if 5 P cores (without SMT2) are our reference point, then adding SMT2 to them requires the die space of 5.5 original P cores, but yields the peak throughput of 6 P cores. The downside is that the software in question must spawn 10 rather than 6 threads to reap that peak throughput.

An E core gets us ~60% of a P core's throughput, and requires only ~30% of the die space. So 10 E cores are as large as 3 P cores, but have the throughput of 6 P cores. In other words, E cores deliver roughly twice as much peak throughput per silicon area, but require at least 50% more threads (or much more if the P cores didn't have SMT to begin with). Seems like a no-brainer, until you remember Amdahl's law, which states that single thread performance is a very real bottleneck even for heavily multithreaded workloads.
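
A toy calculation with those ratios (the figures are the ballpark ones above, not measurements):

```python
# Toy peak-throughput-per-area comparison using the ballpark ratios above:
# SMT2 adds ~20% throughput for ~10% area; an E core delivers ~60% of a
# P core's throughput in ~30% of its area.
p_area, p_tput = 1.0, 1.0
smt_area, smt_tput = 1.10, 1.20   # P core with SMT2 enabled
e_area, e_tput = 0.30, 0.60       # E core

print(f"P core:        {p_tput / p_area:.2f} throughput per area")
print(f"P core + SMT2: {smt_tput / smt_area:.2f}")  # ~1.09
print(f"E core:        {e_tput / e_area:.2f}")      # 2.00

# Amdahl's law: with serial fraction s, speedup on n cores is
# 1 / (s + (1 - s) / n), so single-thread speed still bites.
def amdahl(s: float, n: int) -> float:
    return 1 / (s + (1 - s) / n)

print(f"10% serial work on 10 cores: {amdahl(0.10, 10):.1f}x")  # ~5.3x
```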

So we know how it measures up. There are a few workloads where the sea of E cores wins by a factor of ~two. And then there are workloads where the P cores win by a factor of ~two. Which processor is better for you is determined solely by the software you run, and there is no clear cut winner that could be declared once and for all.

Which is why both types of CPUs exist for datacenter use nowadays, and neither type has caused the other to go extinct.
 

w00key

Ars Praefectus
5,907
Subscriptor
There are a few workloads where the sea of E cores wins by a factor of ~two. And then there are workloads where the P cores win by a factor of ~two. Which processor is better for you is determined solely by the software you run, and there is no clear cut winner that could be declared once and for all.
My own worst workload, going from clicking build to running in ~4 minutes on the M3 Pro and double that on a 5800X, is both together: some parallel parts use all the cores, and some lightly threaded parts run on just one or two threads.

big.LITTLE has easily won this race. Apple/Arm does it well, and I think Alder Lake and later, or Zen 5 with Zen 5c cores, would do okay too. The only way to find out is another desktop build, but I'm not sure there will be one. A CalDigit dock for the MBP is a better use of money, or a big 200+ ppi monitor with a KVM built in.