If it’s for building models wouldn’t it be more likely to offer as cloud service?
Why?

That’s like asking whether NVIDIA would be more likely to offer GPUs as a cloud service or to sell them to users to build their own infrastructure. Obviously they do both.

So why wouldn’t Apple sell HW if they can do so profitably?
 

wrylachlan

Ars Legatus Legionis
12,769
Subscriptor
If it’s for building models wouldn’t it be more likely to offer as cloud service?
It really depends on what they look like. The problem to be solved by Private Cloud is inference. My assumption is that what you would really want for inference is a chip that’s a tiny bit CPU and mostly NPU. That, by itself - while it’s incredibly efficient and fast for inference - isn’t a great chip for training a model. So they’re not going to be monetizing Private Cloud for model training.

But couple one of these chips with an MX Ultra and you might have a nice training machine. So either Apple could make a second cloud system just for training or they could sell them in the Mac Pro. My assumption would be the latter.

All of this is wildly speculative. It may be that the SoC is just a bog standard MX Max chip and they’re already selling them in desktops and laptops. Or it’s a slight variant on an MX Max chip to allow for off-package DIMMs for shitloads of relatively less expensive memory. Who knows.
 

iPilot05

Ars Tribunus Militum
2,780
Subscriptor
What I’m driving at here is that to the extent that they are making custom hardware for private cloud I think there’s a decent chance that that hardware eventually sees the light of day on the retail market.
Perhaps the weak-sauce Mac Pro update last year was just a prelude to whatever Apple has up its sleeve for the next one. Perhaps Afterburner-esque daughter cards with NPUs for AI and/or VR development.

They probably figured that while the "real" Mac Pro is still a few years off they needed to get off Intel, and so we got the Mac Pro of today. It'll keep pro users with PCIe needs happy and keep the form factor alive until whatever Apple Silicon insanity they have being put into the datacenter is ready for prime time.
 
It really depends on what they look like. The problem to be solved by Private Cloud is inference. My assumption is that what you would really want for inference is a chip that’s a tiny bit CPU and mostly NPU. That, by itself - while it’s incredibly efficient and fast for inference - isn’t a great chip for training a model. So they’re not going to be monetizing Private Cloud for model training.

But couple one of these chips with an MX Ultra and you might have a nice training machine. So either Apple could make a second cloud system just for training or they could sell them in the Mac Pro. My assumption would be the latter.

All of this is wildly speculative. It may be that the SoC is just a bog standard MX Max chip and they’re already selling them in desktops and laptops. Or it’s a slight variant on an MX Max chip to allow for off-package DIMMs for shitloads of relatively less expensive memory. Who knows.

No existing M-series chip dedicates much die space to the ANE, so I don't think any known chip is attractive for running inference; compared to dedicated accelerators the performance is low. So unless Apple intends to use the GPU, which would be faster but also probably not very efficient, I think Apple must have a custom ANE-filled chip to make this work at all. But then how do you attach it? M-series chips have relatively weak PCIe I/O. So do you go D2D (die-to-die) and connect directly to an M2 Max? And in this giant Private Cloud, where are the RAS features? Do we even have ECC on the memory?

The simplest approach would be to design an ANE chip from scratch with PCIe and shove it in some AMD EPYC servers for a host. Or maybe Apple has built an AWS Graviton-like chip as a host to connect a bunch of ANE accelerators. Just filling racks with some existing M-series silicon doesn't make a lot of sense to me.
 
  • Like
Reactions: Bonusround

wrylachlan

Ars Legatus Legionis
12,769
Subscriptor
The simplest approach would be to design an ANE chip from scratch with PCIe and shove it in some AMD EPYC servers for a host. Or maybe Apple has built an AWS Graviton-like chip as a host to connect a bunch of ANE accelerators. Just filling racks with some existing M-series silicon doesn't make a lot of sense to me.
Their on-device models are going to grow over time, which means that scenarios that currently need the cloud will migrate on-device. That gets a lot easier when you’re using essentially the same code both locally and on the server. That means Metal as the backbone that orchestrates the neural engines, and it likely means that anything other than Apple Silicon was never in serious consideration.

Having followed Apple’s push on Swift for Servers and distributed actors for the last couple of years, I wouldn’t be surprised if, for some layers of the Apple Intelligence stack, the difference between on-device and in-the-cloud were totally abstracted away.
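If you want a feel for what that kind of location transparency looks like at the call site, here’s a toy sketch (in Python rather than Swift, and every name in it is invented for illustration, not an Apple API): the caller asks for a completion and the choice of on-device versus Private Cloud is hidden behind a single interface.

```python
# Hypothetical sketch only: the caller doesn't know (or care) whether the
# request runs on the local Neural Engine or in a private cloud.
from typing import Protocol


class InferenceBackend(Protocol):
    def complete(self, prompt: str) -> str: ...


class OnDeviceBackend:
    def complete(self, prompt: str) -> str:
        return f"[on-device completion for: {prompt!r}]"


class PrivateCloudBackend:
    def complete(self, prompt: str) -> str:
        return f"[cloud completion for: {prompt!r}]"


def answer(prompt: str, fits_on_device: bool) -> str:
    # The routing decision is hidden behind one call site; the rest of the
    # stack is identical either way.
    backend: InferenceBackend = OnDeviceBackend() if fits_on_device else PrivateCloudBackend()
    return backend.complete(prompt)


print(answer("Summarize my unread mail", fits_on_device=False))
```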
 
It really depends on what they look like. The problem to be solved by Private Cloud is inference. My assumption is that what you would really want for inference is a chip that’s a tiny bit CPU and mostly NPU. That, by itself - while it’s incredibly efficient and fast for inference - isn’t a great chip for training a model. So they’re not going to be monetizing Private Cloud for model training.
I'm not sure I understand why? Isn't the model fundamentally the same when being trained vs when used for inference? And aren't the tensor/matrix HW units necessary for both training and inference, the difference being that during training you need to run through the model two or more times per pass, as opposed to only running it in the forward direction during inference? Obviously the training data itself also needs multiple passes during training, while using the model is a single pass per use. In other words, you need more tensor HW to perform more calculations per pass, and across multiple passes, during training compared to inference.
But couple one of these chips with an MX Ultra and you might have a nice training machine. So either Apple could make a second cloud system just for training or they could sell them in the Mac Pro. My assumption would be the latter.
Looking at a die shot of an M3 Max, you see that the chip is roughly 50% GPU, with as much cache as CPU, I/O taking up as much space as the CPU plus cache, and the NPU barely larger than the efficiency cores plus their cache (this assumes the annotation is correct, obviously):
[Annotated M3 Max die shot]

All of this is wildly speculative. It may be that the SoC is just a bog standard MX Max chip and they’re already selling them in desktops and laptops. Or it’s a slight variant on an MX Max chip to allow for off-package DIMMs for shitloads of relatively less expensive memory. Who knows.
Given how tiny the NPU is (and how relatively inefficient it is to use the GPU), it would make sense to have a larger NPU for training. Take a look at an NVIDIA TU102 chip (a bigger sibling of the TU104 used in the RTX 2080, with 12 SMs per GPC where the TU104 only has 8):
[NVIDIA TU102 die shot]

The above chip, abstracted away (and rotated):
[TU102 block diagram]

And an abstraction of the SM:
[Turing SM block diagram]


Less than 1/4th of each SM is Tensor, and about 1/4th is RT. So an NPU that matched the TU102 would be far, far smaller and more energy efficient than relying on the GPU.
 
  • Like
Reactions: gabemaroz

wrylachlan

Ars Legatus Legionis
12,769
Subscriptor
Isn't the model fundamentally the same when being trained vs when used for inference?
No. My understanding is that the math for a forward inference pass is different from the matrix math needed for back propagation (training). The Neural Engine isn’t really suited for training whereas the AMX unit (SME now apparently with the M4) is great for training. Would love others to weigh in here.
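For what it’s worth, here’s a minimal NumPy sketch of a single linear layer that makes the difference concrete: inference is one matmul, while a training step adds two more matmuls with transposed operands (the weight and input gradients) and has to keep the forward activations around.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 8, 512, 512
x = rng.standard_normal((batch, d_in), dtype=np.float32)
W = rng.standard_normal((d_in, d_out), dtype=np.float32) * 0.02

# Inference: a single forward matmul per layer. Weights can be quantized
# and activations discarded as soon as the next layer has consumed them.
y = x @ W

# Training: the same forward matmul, plus two more matmuls per layer for
# backprop (gradient w.r.t. the weights and w.r.t. the layer input), and
# the forward activations must be kept until the backward pass.
dy = rng.standard_normal(y.shape, dtype=np.float32)  # gradient from the layer above
dW = x.T @ dy        # needed to update the weights
dx = dy @ W.T        # needed to keep propagating the gradient downward
W -= 1e-3 * dW       # optimizer step (plain SGD here; Adam also keeps extra state)
```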
 
  • Like
Reactions: The Limey
No. My understanding is that the math for a forward inference pass is different from the matrix math needed for back propagation (training). The Neural Engine isn’t really suited for training whereas the AMX unit (SME now apparently with the M4) is great for training. Would love others to weigh in here.
That’s interesting. Maybe that’s why NVIDIA GPUs work so well since they have so much parallel compute as well as matrix units.

So maybe it would be good if Apple added NE cores to its GPU.
 
That’s interesting. Maybe that’s why NVIDIA GPUs work so well since they have so much parallel compute as well as matrix units.

So maybe it would be good if Apple added NE cores to its GPU.

You have to be very wary of reading too much into diagrams of Nvidia GPUs, particularly assuming that eyeballing the area gives a measure of their performance. The Tensor cores do far more operations than the CUDA (SIMD) shader cores: it's about 8:1, and 16:1 with sparsity, for the same precision types when doing matrix operations*. So it's more complicated than it looks, and besides that, the thing that makes them fly is the sheer amount of memory bandwidth throughout the chip when used optimally. You should never really be compute bound on an Nvidia GPU if you are using it well.

The reason Nvidia doesn't allocate even more area to Tensor cores is that there's little point: they couldn't keep them fed; even 3 TB/s from HBM isn't enough.
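To put rough numbers on the "couldn't keep them fed" point, here's a back-of-envelope roofline check; the throughput and bandwidth figures are illustrative assumptions, not any official spec.

```python
# Rough roofline arithmetic with assumed numbers: ~1000 TFLOP/s of dense
# FP16 tensor throughput and ~3 TB/s of HBM bandwidth.
peak_flops = 1000e12   # FLOP/s from the tensor cores (assumed)
hbm_bw     = 3e12      # bytes/s (assumed)

# To stay compute bound, each byte fetched must feed this many FLOPs:
balance = peak_flops / hbm_bw
print(f"machine balance ~ {balance:.0f} FLOP per byte")   # ~333

# Batch-1 LLM decoding with FP16 weights does ~2 FLOPs (one multiply-add)
# per 2-byte weight read, i.e. ~1 FLOP per byte -- two orders of magnitude
# short, so the tensor cores mostly sit idle waiting on memory.
decode_intensity = 2 / 2
print(f"batch-1 decode ~ {decode_intensity:.0f} FLOP per byte")
```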

Of course you might argue that Nvidia could make a smaller simpler Tensor focused chip, but Nvidia didn't get to be incredibly rich by giving their customers cheaper options.

* Yes, just as AMX/SME clobbers the P-cores when it comes to matrix math.
 
  • Haha
Reactions: Bonusround

wrylachlan

Ars Legatus Legionis
12,769
Subscriptor
AnandTech’s first overview of Qualcomm’s new chips is out: https://www.anandtech.com/show/21445/qualcomm-snapdragon-x-architecture-deep-dive/4

It’s curious that they’ve totally eschewed E-cores, bucking the common design used by literally everyone else in the industry at the TDPs they’re targeting with these chips. It certainly feels like - as nice a step up over x86 as these chips promise to be - they’re still leaving some efficiency on the table by not implementing E-cores.
 

Megalodon

Ars Legatus Legionis
34,201
Subscriptor++
No. My understanding is that the math for a forward inference pass is different from the matrix math needed for back propagation (training). The Neural Engine isn’t really suited for training whereas the AMX unit (SME now apparently with the M4) is great for training. Would love others to weigh in here.
I believe training also wants higher precision than inference, and client inference accelerators are going extremely low precision (int4 etc).
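As a toy illustration of how coarse int4 is, here's a deliberately simplified symmetric quantizer (real schemes use per-group scales and zero points); the step size it produces is fine for a forward pass but is far larger than a typical per-step weight update, which is one reason training stays in higher-precision formats.

```python
import numpy as np

# Toy symmetric int4 quantization of a weight vector (range -8..7).
rng = np.random.default_rng(0)
w = rng.standard_normal(8).astype(np.float32) * 0.1

scale = np.abs(w).max() / 7                              # map the largest weight to +/-7
q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # 4-bit integer codes
w_hat = q * scale                                        # values actually used at inference

print(np.round(w, 4))      # original weights
print(q)                   # int4 codes
print(np.round(w_hat, 4))  # dequantized weights; error is tolerable for a forward pass
```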
 

Megalodon

Ars Legatus Legionis
34,201
Subscriptor++
It’s curious that they’ve totally eschewed E-cores, bucking the common design used by literally everyone else in the industry at the TDPs they’re targeting with these chips. It certainly feels like - as nice a step up over x86 as these chips promise to be - they’re still leaving some efficiency on the table by not implementing E-cores.
My guess would be there's significant compatibility challenges combining ARMH-licensed cores and custom cores in the same SoC, stuff that's not apparent at the ISA level like internal cache coherency protocols. Not that it's impossible (I think Samsung did it?) but likely a significant headache that they can get away without addressing in a PC chip. Will be interesting to see if they make the attempt for a mobile SoC (or try to market the Nuvia core in mobile at all).
 
AnandTech’s first overview of Qualcomm’s new chips is out: https://www.anandtech.com/show/21445/qualcomm-snapdragon-x-architecture-deep-dive/4

It’s curious that they’ve totally eschewed E-cores, bucking the common design used by literally everyone else in the industry at the TDPs they’re targeting with these chips. It certainly feels like - as nice a step up over x86 as these chips promise to be - they’re still leaving some efficiency on the table by not implementing E-cores.
I assume it’s a lack of time and resources. I figure in a year or two they will have developed an Oryon Lite to pair with Oryon v2: a 2.5 GHz part, 4-6 wide, with smaller buffers and caches, at a quarter of the size. It just makes no sense to skip it, as a smaller, lower-performance core can take on many tasks and save a lot of power.
 
AnandTech’s first overview of Qualcomm’s new chips is out: https://www.anandtech.com/show/21445/qualcomm-snapdragon-x-architecture-deep-dive/4

It’s curious that they’ve totally eschewed E-cores, bucking the common design used by literally everyone else in the industry at the TDPs they’re targeting with these chips. It certainly feels like - as nice a step up over x86 as these chips promise to be - they’re still leaving some efficiency on the table by not implementing E-cores.

I think it's probably that this is Phoenix, and there didn't need to be an E-core Phoenix for servers. If they'd shipped this on time it would have been extremely competitive with anything going. It's a bit of a monster: big fast caches, 8-wide, big ROB, large PRFs, decent int/vec/LSU resources, and lots of bandwidth. Have no doubt, the Nuvia team know what they are doing.

My guess would be there's significant compatibility challenges combining ARMH-licensed cores and custom cores in the same SoC, stuff that's not apparent at the ISA level like internal cache coherency protocols. Not that it's impossible (I think Samsung did it?) but likely a significant headache that they can get away without addressing in a PC chip. Will be interesting to see if they make the attempt for a mobile SoC (or try to market the Nuvia core in mobile at all).

They are doing some sort of E-cores for the Snapdragon 8 Gen 4.

So far there are now five Oryon based chips known. 12 core X Elite, a smaller 8 core X Plus, 8 Gen 4, 8 Gen 5, and another X for even cheaper PCs. Oryon V2 and V3 are on the roadmap for 2025 and 2027.
 

wrylachlan

Ars Legatus Legionis
12,769
Subscriptor
I think it's probably that this is Phoenix, and there didn't need to be an E-core Phoenix for servers. If they'd shipped this on time it would have been extremely competitive with anything going. It's a bit of a monster: big fast caches, 8-wide, big ROB, large PRFs, decent int/vec/LSU resources, and lots of bandwidth. Have no doubt, the Nuvia team know what they are doing.
But what they’re releasing isn’t a server, it’s a laptop chip. And it’s not like they would need to design a compatible E-core from scratch - Qualcomm has been integrating ARM designs forever.
 

Chris FOM

Ars Legatus Legionis
10,001
Subscriptor
Right, but Nuvia started out designing server chips. They shifted to desktop/mobile after the Qualcomm acquihire, but Oryon v1 is still basically that original server core repurposed. That’s also the major component of Arm’s lawsuit with Qualcomm: Arm says custom chip designs using an Arm license aren’t transferable in a purchase, so QC would need to throw out everything from before the acquisition and start from scratch.
 
Right, but Nuvia started out designing server chips. They shifted to desktop/mobile after the Qualcomm acquihire, but Oryon v1 is still basically that original server core repurposed. That’s also the major component of Arm’s lawsuit with Qualcomm: Arm says custom chip designs using an Arm license aren’t transferable in a purchase, so QC would need to throw out everything from before the acquisition and start from scratch.

Apparently (I think I read this on the AnandTech forums) Qualcomm redesigned the IP in the Phoenix core that was covered under the Nuvia licence from Arm, using essentially the same IP licenced by Qualcomm from Arm. That is, they've designed some bits twice, using the same IP but covered by two different licences, for essentially the same result. So Qualcomm now claims there is no Nuvia/Arm IP left, only original Nuvia IP and Qualcomm/Arm IP.
 

wrylachlan

Ars Legatus Legionis
12,769
Subscriptor
Right, but Nuvia started out designing server chips. They shifted to desktop/mobile after the Qualcomm acquihire, but Oryon v1 is still basically that original server core repurposed. That’s also the major component of Arm’s lawsuit with Qualcomm: Arm says custom chip designs using an Arm license aren’t transferable in a purchase, so QC would need to throw out everything from before the acquisition and start from scratch.
I’m not talking about redesigning the Oryon core. I’m talking about pairing it with an E core. Or is your contention that the team spent so much time redesigning Oryon that they didn’t have time to integrate it with an E core?
 

Chris FOM

Ars Legatus Legionis
10,001
Subscriptor
Apparently (I think I read this on the AnandTech forums) Qualcomm redesigned the IP in the Phoenix core that was covered under the Nuvia licence from Arm, using essentially the same IP licenced by Qualcomm from Arm. That is, they've designed some bits twice, using the same IP but covered by two different licences, for essentially the same result. So Qualcomm now claims there is no Nuvia/Arm IP left, only original Nuvia IP and Qualcomm/Arm IP.
Thanks for that. I haven’t followed this case closely and only know the major details so didn’t know that part.

I’m not talking about redesigning the Oryon core. I’m talking about pairing it with an E core. Or is your contention that the team spent so much time redesigning Oryon that they didn’t have time to integrate it with an E core?
I suspect this initial release was doing the bare minimum needed to get the work Nuvia had already done into a shipping product ASAP. Since the original design was for servers I’m guessing they didn’t have a system in place for managing heterogeneous cores.
 
I’m not talking about redesigning the Oryon core. I’m talking about pairing it with an E core. Or is your contention that the team spent so much time redesigning Oryon that they didn’t have time to integrate it with an E core?
What would you suggest is a suitable eCore? An ARM A715?

The performance of Apple’s eCores approaches that of older pCores. I would imagine Windows would need something in that performance range to be useful as an eCore.

I still think they didn’t have the time or resources to design or integrate a third-party efficiency core; otherwise they would have. The architecture seems too good to pass up on.
 
What would you suggest is a suitable eCore? An ARM A715?

The performance of Apple’s eCores approaches that of older pCores. I would imagine Windows would need something in that performance range to be useful as an eCore.

I still think they didn’t have the time or resources to design or integrate a third-party efficiency core; otherwise they would have. The architecture seems too good to pass up on.

Snapdragon 8 Gen 4 is meant to be two big Oryon cores and six smaller Oryon cores. What the differences are is not known.

Besides that, you can't assume that Qualcomm can drop an Arm CPU into their design. There is a whole bunch of I/O and protocols they'd need to licence and implement. You can't just mix and match AMD and Intel CPUs even if they are all x86; an instruction set is not all there is to a CPU, and a CPU is not all there is to a computer architecture.
 

Megalodon

Ars Legatus Legionis
34,201
Subscriptor++
Besides that, you can't assume that Qualcomm can drop an Arm CPU into their design. There is a whole bunch of I/O and protocols they'd need to licence and implement. You can't just mix and match AMD and Intel CPUs even if they are all x86; an instruction set is not all there is to a CPU, and a CPU is not all there is to a computer architecture.
Yes. The compatibility issues can be incredibly touchy. One bug I read about had to do with different cache line sizes between E-cores and P-cores, and in that case both the cores were compatible implementations of the same ISA. Very, very subtle differences become problems when you need to migrate running code seamlessly vs just having code that can run on different devices.

This is doable when it's known at design and validation time, which is manageable when it's all the same company, but mixing vendors is, imo, asking for trouble. High likelihood of crippling errata that can only be fixed by permanently disabling some of the cores.

Snapdragon 8 Gen 4 is meant to be two big Oryon cores and six smaller Oryon cores. What the differences are is not known.
Seems like they can probably do something like what AMD did and do a synthesizable RTL with different optimizations and libraries. There might also be reasonably modular stuff they can take out, like SIMD units and cache. Though I question the efficiency of E-cores made that way; it's been pointed out that ARMH's E-cores aren't actually all that efficient, so it's potentially more of a die area cost than an efficiency cost.
 
Seems like they can probably do something like what AMD did and do a synthesizable RTL with different optimizations and libraries. There might also be reasonably modular stuff they can take out, like SIMD units and cache. Though I question the efficiency of E-cores made that way; it's been pointed out that ARMH's E-cores aren't actually all that efficient, so it's potentially more of a die area cost than an efficiency cost.

That seems likely as the rumoured names were Phoenix L and Phoenix M.

It could well be a mix of not enough time, and a belief that their P-core power efficiency is good enough, or close enough to Intel's E-cores, that it wasn't a priority.

Obviously it doesn't matter as much for laptops, so they can afford to do without for now.

Overall Oryon looks like a really well-balanced CPU; everything they have disclosed says they intend to do what I hoped they were doing, that is, release Apple-like CPUs for the rest of the computing market.

For anyone who wants M-series-like performance and efficiency on operating systems other than macOS, there are so many great chips coming out this year and next. Even if you don't care about that choice, it should spur Apple on, which is good as well.

One other thought. Looking at the Oryon specs with the big caches, any combination of 4 loads and stores, big ROB, 6 ALUs, and lots of prefetch tricks I hope someone slaps Linux on one of these systems and runs some server benchmarks. Just to see what it might have been capable of if Nuvia hadn't been bought by Qualcomm.
 
Last edited:
  • Like
Reactions: Bonusround

eas

Ars Scholae Palatinae
1,254
It really depends on what they look like. The problem to be solved by Private Cloud is inference. My assumption is that what you would really want for inference is a chip that’s a tiny bit CPU and mostly NPU. That, by itself - while it’s incredibly efficient and fast for inference - isn’t a great chip for training a model. So they’re not going to be monetizing Private Cloud for model training.

With LLMs, inference for a single request is bound by memory bandwidth, not compute. When generating text, the entire model needs to be read from memory for each token generated in a request. The memory accesses can be amortized by batching up multiple requests, loading part of the model into cache, processing all the requests, loading the next part of the model, etc. Batching shifts the bottleneck to compute. However, this batching may not be compatible with their Private Cloud Server approach.

The processing of the prompt (including any supplied context retrieved from an index) is compute limited though, so more compute reduces time-to-first-token latency. OTOH, the per-token processing times they've quoted for on-device inference already look pretty good.
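To put a rough number on that bandwidth bound, here's a back-of-envelope estimate; the model size, weight precision, and bandwidth figures are illustrative assumptions, not anything Apple has published.

```python
# Back-of-envelope token rate for batch-1 decoding, where every generated
# token has to stream the whole model through memory. Assumed figures: a
# ~3B-parameter model at ~4 bits/weight on ~100 GB/s of memory bandwidth.
params      = 3e9
bytes_per_w = 0.5            # ~4-bit weights
mem_bw      = 100e9          # bytes/s

model_bytes = params * bytes_per_w
tokens_per_s = mem_bw / model_bytes
print(f"batch-1 upper bound ~ {tokens_per_s:.0f} tokens/s")   # ~67

# Batching B requests reuses each weight fetch B times, so the same bandwidth
# supports roughly B times the aggregate token rate until compute becomes
# the limit.
for b in (1, 8, 32):
    print(b, f"~ {tokens_per_s * b:.0f} tokens/s aggregate")
```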
 
  • Like
Reactions: Bonusround

belazeebub

Wise, Aged Ars Veteran
157
There are rumours of poor Oryon performance in the new Qualcomm-based laptops coming out this week. It seems as if this must be a bug or misunderstanding; I can't see the PC manufacturers releasing these after all the Qualcomm performance hype. Maybe the laptops will need a Windows update or manufacturer driver update to bring the heat?
 
Oryon is a bit of an odd processor. It's quite clearly the Phoenix design announced by Nuvia in 2020, ported to Qualcomm's design methodology. It's Armv8.7-A, so missing things like SVE2 and MTE, but with enough useful extensions to make x86 emulation more effective. Gerard Williams has said in the past that there is a lot of low-hanging fruit to work on, which should proceed steadily now that they have silicon to test with.

In terms of where Oryon falls, it's a bigger* CPU than the M3's Everest core. In fact in some areas it beats the M4 P-core (ITLB, load queues, 4 loads per cycle, L2 TLB). It really does look like the sort of CPU you would get if the guy who led most of Apple's CPU development was designing an Arm server CPU for release around 2022/2023. I think it would likely clearly beat the Neoverse V2, as was probably the original intention. If they had shipped this last year as hoped, I think most people would have been impressed. I really hope some Linux server benchmarks are run on it, as those might be revealing.

A lot of people are going to look at the benchmarks and conclude that x86 has nothing to worry about. I think they are wrong. For the Nuvia team's first effort, and given all the problems they have faced, I'm impressed. If Qualcomm commits to further development and products (this is happening, as best anyone can tell), I expect them to be very competitive. They now clearly have a team capable of designing leading-edge CPUs.

* As far as can be deduced from looking at architectural features. This is not a claim about performance.
 
  • Like
Reactions: Semi On

Demento

Ars Legatus Legionis
13,754
Subscriptor
Can any of the Qualcomm folk here advise as to what might explain these reports?
Not a Qualcomm folk, but everything I've read around those points out that there was no boosting enabled on the poor results. Which is probably just immature firmware, and if they can hit the speeds they're reputed to hit then the performance lines up and is quite pretty.
 

wrylachlan

Ars Legatus Legionis
12,769
Subscriptor
Not a Qualcomm folk, but everything I've read around those points out that there was no boosting enabled on the poor results. Which is probably just immature firmware, and if they can hit the speeds they're reputed to hit then the performance lines up and is quite pretty.
Slower also tends to be more efficient, so if they’re running artificially slow because of a bug their performance tests look poorer than they should but their battery tests may look better than they should. If/when they update firmware I’ll be interested to see what happens to the battery life tests.
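Rough arithmetic on why slower can win on energy (purely illustrative numbers): dynamic power scales roughly with C·V²·f, and lower clocks permit lower voltage, so energy per unit of work drops even though the run takes longer.

```python
# Why running slower can look better in battery tests: dynamic power is
# roughly C * V^2 * f, and energy per cycle is power / frequency.
def dynamic_power(cap, volts, freq_hz):
    return cap * volts**2 * freq_hz

cap = 1e-9  # effective switched capacitance (arbitrary illustrative value)

boost = dynamic_power(cap, volts=1.00, freq_hz=4.0e9)   # boosted operating point
slow  = dynamic_power(cap, volts=0.75, freq_hz=3.0e9)   # un-boosted point

print(f"boost: {boost:.2f} W, {boost / 4.0e9 * 1e9:.2f} nJ/cycle")
print(f"slow : {slow:.2f} W, {slow / 3.0e9 * 1e9:.2f} nJ/cycle")
```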
 
  • Like
Reactions: Demento
Tom’s makes it sound like a solid first release:
Of course, we’ll wait for the official benchmark results from our lab and the full suite of Copilot+ PC features to give it a score. But all signs point to this being a milestone. This is the start of something big.

Now we wait for Lunar Lake. Exciting times.

Yes. Some of the criticism I've seen has been along the lines of "Zen 5 or Lunar Lake will be better". The mere fact that there is an Arm SoC for Windows where unreleased x86 processors have to be invoked as superior tells us a lot.

Even if Qualcomm screw this up, and I don't think they will, we have MediaTek/Nvidia, MediaTek alone, and AMD SoCs all in development that should be in the same ballpark as Oryon, and quite possibly better. Those are just the ones Reuters has reported; there are rumours as well of Microsoft and Samsung developing SoCs.
 

Bonusround

Ars Scholae Palatinae
1,060
Subscriptor
Hmmm...

Re: "This thing toasts the M3 MacBook Air" ... this seems a conveniently selective comparison. How does this compare to chips with a comparable core count, like the M3 Pro? Why do these benchmarks include CPU comparisons to Apple Silicon, yet omit a similar comparison for graphics?

Hats off to Qualcomm Marketing.