Bonusround

Ars Scholae Palatinae
1,060
Subscriptor
Hey, if Apple can continue to compare Apple Silicon Macs to random, ill-defined "PCs," then turnabout is fair play. ;)
Apple and Qualcomm can (and will, and do) play whatever games they choose. This is to be expected from first-party performance claims.

What we all should expect from external media and reviewers is to take a skeptical eye toward those claims, offering their readers/viewers a deeper, more considered, and independent look. Ideally one that goes beyond the first-party briefing materials and talking points.
 

byrningman

Ars Tribunus Militum
2,023
Subscriptor
Hmmm...

Re: "This thing toasts the M3 MacBook Air" ... this seems a conveniently selective comparison. How does this compare to chips with a comparable core count, like the M3 Pro? Who do these benchmarks include CPU comparisons to Apple Silicon, yet omit a similar compare re: graphics?

Hats off to Qualcomm Marketing.
Yeah it’s also a clickbait framing by the reviewer and headline writer. Just going by his own testing results that he provides, the M3 is significantly faster (25%) in single thread and a bit slower (about 16%) in multi thread, with a lower performance core count. “Smokes” the M3… that’s just dishonest.
 

wrylachlan

Ars Legatus Legionis
12,769
Subscriptor
Hmmm...

Re: "This thing toasts the M3 MacBook Air" ... this seems a conveniently selective comparison. How does this compare to chips with a comparable core count, like the M3 Pro? Who do these benchmarks include CPU comparisons to Apple Silicon, yet omit a similar compare re: graphics?

Hats off to Qualcomm Marketing.
I dunno. It’s a 15” laptop that costs less than a 15” MBA, and has comparable battery life and weight. That seems like the correct compare.

That said, it only wins in multicore while the MBA is well ahead on single core performance, so ‘toasts’ is… what’s the word I’m looking for???… oh… bullshit.
 

cateye

Ars Legatus Legionis
11,760
Moderator
What we all should expect from external media and reviewers is to take a skeptical eye toward those claims, offering their readers/viewers a deeper, more considered, and independent look. Ideally one that goes beyond the first-party briefing materials and talking points.

Agree 100 percent. But again: Who does Apple seed review hardware to, and how long do we generally have to wait in order for that hardware to instead land in the hands of someone who will objectively and quantitatively review it, versus squealing about the packaging in their "exclusive, first look"?

But yes, I expect better of Tom's Guide, much better. They're not iJustine. But the media landscape is what it is, I guess. Anandtech is barely surviving running mostly reviews of power supplies at this point.
 
  • Love
Reactions: Bonusround

Bonusround

Ars Scholae Palatinae
1,060
Subscriptor
Yeah it’s also a clickbait framing by the reviewer and headline writer. Just going by his own testing results that he provides, the M3 is significantly faster (25%) in single thread and a bit slower (about 16%) in multi thread, with a lower performance core count. “Smokes” the M3… that’s just dishonest.
I dunno. It’s a 15” laptop that costs less than a 15” MBA, and has comparable battery life and weight. That seems like the correct compare.

That said, it only wins in multicore while the MBA is well ahead on single core performance, so ‘toasts’ is… what’s the word I’m looking for???… oh… bullshit.

Bullshit indeed. The most egregious part is the omission of a GPU compare – switching from M3 to an i7 mid-stream... how about testing all three machines on each benchmark and sharing the complete results with your readers? I know, unheard of.

I hope the duo at Hardware Unboxed get their hands on one of these Snapdragon laptops – those two know how to benchmark and are beholden to no one. They're also probably at the very bottom of Qualcomm's list of early reviewers, heheh.
 
  • Like
Reactions: Sharps97
Hmmm...

Re: "This thing toasts the M3 MacBook Air" ... this seems a conveniently selective comparison. How does this compare to chips with a comparable core count, like the M3 Pro? Who do these benchmarks include CPU comparisons to Apple Silicon, yet omit a similar compare re: graphics?

Hats off to Qualcomm Marketing.
The hiker doesn’t have to outrun the bear, just the other hikers.

The competition is really other Windows OEMs using x86 parts. Apple may be the benchmark (aka bear), but so long as Qualcomm can beat AMD and Intel in this category then this counts as a win.

So far it seems like better temperature, battery life, and performance altogether; Intel or AMD can eke out more performance but temperature and battery life suffer. Models that match battery life seem to have lower performance.
 

Megalodon

Ars Legatus Legionis
34,201
Subscriptor++
So far it seems like better temperature, battery life, and performance altogether; Intel or AMD can eke out more performance but temperature and battery life suffer. Models that match battery life seem to have lower performance.

I think it'll probably hinge to quite a significant degree on ARM software availability. Electron apps that still require x86 sound painful.
 

Megalodon

Ars Legatus Legionis
34,201
Subscriptor++
I've read that AMD are considering ARM-based laptop offerings as well.

They've been threatening to do that on and off for years, and it hasn't ever really gone anywhere. It might be the case, though, that Qualcomm, and probably Nvidia if they ever need revenue streams other than the gravy train du jour, take the hits to improve compatibility, allowing AMD to slide in relatively easily with the hard work already done. And if AMD wants to get the ball rolling, I think they can probably do it quickly simply by licensing the ARM cores for the first generation.
 

Bonusround

Ars Scholae Palatinae
1,060
Subscriptor
They've been threatening to do that on and off for years, and it hasn't ever really gone anywhere. It might be the case, though, that Qualcomm, and probably Nvidia if they ever need revenue streams other than the gravy train du jour, take the hits to improve compatibility, allowing AMD to slide in relatively easily with the hard work already done. And if AMD wants to get the ball rolling, I think they can probably do it quickly simply by licensing the ARM cores for the first generation.
The majority of AMD's laptop SoCs, certainly the low-power variants, are monolithic dies – so this likely would not be a chiplet play. I'd be curious about the extent to which AMD could simply 'drop in' ARM cores and keep the bulk of the remaining blocks (GPU, NPU, decode, etc.) relatively untouched.
 
The majority of AMD's laptop SoCs, certainly the low-power variants, are monolithic dies – so this likely would not be a chiplet play. I'd be curious about the extent to which AMD could simply 'drop in' ARM cores and keep the bulk of the remaining blocks (GPU, NPU, decode, etc.) relatively untouched.
If they were to use chiplets they could replace an existing CCD with an ARM CCD.

In other words I’m not sure why you think this wouldn’t be a chiplet. The issue really is whether they have a superior in-house core, or whether they will use stock ARM (assuming this happens).
 

Bonusround

Ars Scholae Palatinae
1,060
Subscriptor
If they were to use chiplets they could replace an existing CCD with an ARM CCD.

In other words I’m not sure why you think this wouldn’t be a chiplet. The issue really is whether they have a superior in-house core, or whether they will use stock ARM (assuming this happens).
This thought exercise is about an ARM-based laptop chip, and AMD has traditionally chosen monolithic dies for its low-power laptop offerings.
 

Megalodon

Ars Legatus Legionis
34,201
Subscriptor++
The majority of AMD's laptop SoCs, certainly the low-power variants, are monolithic dies – so this likely would not be a chiplet play. I'd be curious about the extent to which AMD could simply 'drop in' ARM cores and keep the bulk of the remaining blocks (GPU, NPU, decode, etc.) relatively untouched.

I mean it would take a new SoC but most of those things should be reusable across architectures.

In other words I’m not sure why you think this wouldn’t be a chiplet.

Because, as Bonusround correctly notes, AMD's generally gone with monolithic SoCs for laptops, which would be the main driver of a move to ARM. The reasoning is sound.
 

Bonusround

Ars Scholae Palatinae
1,060
Subscriptor
I mean it would take a new SoC but those other things would be reusable.
Totally. Just wondering what the effort might be, accounting for endianness, changes to the fabric between blocks, etc. Maybe it's minimal. How much would the memory controllers need to change? Stuff like that.

Wouldn't it be especially fun to see benchmarks between two otherwise-(nearly)-identical SoCs, save for the architecture of the CPU cores?
 
This thought exercise is about an ARM-based laptop chip, and AMD has traditionally chosen monolithic dies for its low-power laptop offerings.
Ah, I had assumed they were going to target the 10W-60W range, not only the low power designs.

That said, sure, an ARM Strix Point variant on a monolithic die absolutely makes sense.
 
If they were to use chiplets they could replace an existing CCD with an ARM CCD.

In other words I’m not sure why you think this wouldn’t be a chiplet. The issue really is whether they have a superior in-house core, or whether they will use stock ARM (assuming this happens).

I'm 99% sure it will be Cortex-X925, which will be competitive with anything else around.

The Nvidia/MediaTek chip is meant to be Cortex-X925 with the Nvidia Blackwell GPU on a TSMC 3nm process. MediaTek and Nvidia also have some automotive SoCs with supposedly the same CPU and GPU, but presumably a lot more I/O for displays and cameras, plus lock-step realtime CPU subsystems for safety.

MediaTek's own Windows SoC is probably like their Kompanio SoCs that are used in Chromebooks but for Windows. I would guess it will be a low-end to mid-range part with a Mali GPU.

AMD's SoC might be something they are co-developing with Microsoft, but Reuters didn't mention that, naming only AMD itself.
 
I thought 10 wide was crazy a decade ago. Man how technology progresses.

It's not just the width though, it's a big CPU just like the M-series and Oryon. What's really striking is that you have to look at the very latest x86 CPUs to find something "bigger". These latest Arm CPUs are a clear step above something like AMD's Zen 4.

Obviously everyone here gets it, but there is still a big chunk of the computing world who think Arm CPUs are small toys and real work needs big old x86. Too many people are not paying enough attention, as Apple, Qualcomm, and Arm itself are all designing big real-work CPUs now.
 
  • Like
Reactions: Scotttheking

byrningman

Ars Tribunus Militum
2,023
Subscriptor
Do we know how the Qualcomm chips compare on price to current AMD and Intel offerings? I would expect ARM to start replacing x86 in laptops, but obviously pricing would be a big factor in whether that happens, and how quickly. Would AMD be able to make ARM chips more cheaply than they make x86 CPUs? I assume just licensing ARM designs would be cheaper than rolling your own, but if you’re AMD and you’re selling standard designs, then you’re also sliding further down the value chain, at risk from any outfit that manages to bang out the same designs a bit cheaper.
 

Megalodon

Ars Legatus Legionis
34,201
Subscriptor++
Do we know how the Qualcomm chips compare on price to current AMD and Intel offerings? I would expect ARM to start replacing x86 in laptops, but obviously pricing would be a big factor in whether that happens, and how quickly. Would AMD be able to make ARM chips more cheaply than they make x86 CPUs? I assume just licensing ARM designs would be cheaper than rolling your own, but if you’re AMD and you’re selling standard designs, then you’re also sliding further down the value chain, at risk from any outfit that manages to bang out the same designs a bit cheaper.

Most likely AMD would approach that by using ARMH designs for the first generation to prove out the market, then rolling their own in subsequent generations.
 

Bonusround

Ars Scholae Palatinae
1,060
Subscriptor
Well now there’s this…

A few highlights:
In the paper, the researchers mention BitNet (the so-called "1-bit" transformer technique that made the rounds as a preprint in October) as an important precursor to their work. According to the authors, BitNet demonstrated the viability of using binary and ternary weights in language models, successfully scaling up to 3 billion parameters while maintaining competitive performance.
The researchers' approach involves two main innovations: first, they created a custom LLM and constrained it to use only ternary values (-1, 0, 1) instead of traditional floating-point numbers, which allows for simpler computations. Second, the researchers redesigned the computationally expensive self-attention mechanism in traditional language models with a simpler, more efficient unit (that they called a MatMul-free Linear Gated Recurrent Unit—or MLGRU) that processes words sequentially using basic arithmetic operations instead of matrix multiplications.
These changes, combined with a custom hardware implementation to accelerate ternary operations through the aforementioned FPGA chip, allowed the researchers to achieve what they claim is performance comparable to state-of-the-art models while reducing energy use.
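
For a feel of why those ternary weights matter, here's a toy sketch (mine, not the paper's code; the real MLGRU involves far more than this): with weights restricted to -1, 0, and +1, a dot product needs no multiplications, only adds and subtracts.

Code:
import numpy as np

# Weights restricted to {-1, 0, +1}: the "multiply" in multiply-accumulate
# is only a sign flip or a skip, so the dot product is adds/subs only.
x = np.array([0.5, -1.2, 3.0, 0.7])   # activations (still real-valued)
w = np.array([1, 0, -1, 1])           # one row of ternary weights

dot_with_mul = float(np.dot(w, x))                        # conventional MAC
dot_addsub = float(x[w == 1].sum() - x[w == -1].sum())    # no multiplies
assert np.isclose(dot_with_mul, dot_addsub)               # both give -1.8

Scale that up and you can see why a custom accelerator for ternary operations pays off.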
 

wrylachlan

Ars Legatus Legionis
12,769
Subscriptor
As I’ve said before, the ability to do on-device AI is being incrementally improved along two axes. First, the hardware is getting better as node shrinks allow, and second, the designs of the models themselves are getting ever more efficient. And these are cross-reinforcing properties - new hardware allows for new model designs and new theoretical model designs influence new hardware designs.

At the end of the day it’s looking increasingly as though total memory and memory bandwidth to chip will be the durable bottlenecks to on device AI. Apple started with CPU design, then when that team was rock solid they branched out into GPU design. I fully expect Apple to veer into custom memory as a differentiator sooner rather than later.
 

Bonusround

Ars Scholae Palatinae
1,060
Subscriptor
Hear, hear. Bandwidth and execution space have posed constraints for all processors, whether prefixed C or G or N. A bold move by Apple here will be welcome.

I find this model research especially fascinating now that the PC industry is on the verge of rolling out NPUs to the Windows space… just as strong indicators emerge that current NPU architectures may soon be rendered obsolete.

This feels to me like the growth spurt in PC GPUs where, in the span of just a few years, we jumped from fixed-function texture & lighting only to a fully-programmable pipeline including geometry. But in that case the evolution was merely reimplementing what workstations had long done, not responding to new fundamental research.
 
Last edited:

wrylachlan

Ars Legatus Legionis
12,769
Subscriptor
I find the model research especially fascinating now that the PC industry is on the verge of rolling out NPUs to the Windows space… just as strong indicators emerge that current NPU architectures could soon be rendered obsolete.
I don’t think obsolete is the right way to think about it. Over time new computing modalities tend to be additive. If trinary LLMs take off, those matmul units aren’t going to be obsolete - they’re just not going to be used for that type of LLM. Someone will find something else they’re optimal for (matmul is, fundamentally, a pretty useful thing to be able to do). And if I were a betting man I’d bet that someone will come up with a mixed model that uses both matmul and trinary to advantage.
 

Bonusround

Ars Scholae Palatinae
1,060
Subscriptor
I don’t think obsolete is the right way to think about it. Over time new computing modalities tend to be additive. If trinary LLMs take off, those matmul units aren’t going to be obsolete - they’re just not going to be used for that type of LLM. Someone will find something else they’re optimal for (matmul is, fundamentally, a pretty useful thing to be able to do). And if I were a betting man I’d bet that someone will come up with a mixed model that uses both matmul and trinary to advantage.
Additive, sure. Obsolete in the sense that they’ll be missing optimizations to run these newer, wildly more efficient models well – e.g. what the Santa Cruz team currently has implemented in FPGA. Hybrid approaches may exist, but if the all-ternary approach proves applicable to many types of models (that’s a giant IF), I could easily imagine an Apple jumping ship completely. The power and memory savings would be too dramatic for them to ignore.
 
Additive, sure. Obsolete in the sense that they’ll be missing optimizations to run these newer, wildly more efficient models well – e.g. what the Santa Cruz team currently has implemented in FPGA. Hybrid approaches may exist, but if the all-ternary approach proves applicable to many types of models (that’s a giant IF), I could easily imagine an Apple jumping ship completely. The power and memory savings would be too dramatic for them to ignore.
That’s like saying a 20xx GPU is obsolete because of a 30xx GPU.

Ternary logic doesn’t mean abandoning old HW. Notice in the linked research paper how they were able to run their ternary neural net on an NVIDIA GPU?

You can perform ternary logic using a binary system, the same way you can do decimal notation or floating point. Rather than -1, 0, and 1 you use two bits and map the values to 00, 01, and 11. In other words INT2 is a valid substitute for true ternary HW.

And if there is a gap until INT2 is HW accelerated, Apple and Nvidia both already have INT4 in their latest designs.
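
Rough sketch of the two-bit packing I mean (a toy encoding of my own, nothing official):

Code:
# Toy 2-bit encoding of ternary weights: 0 -> 0b00, +1 -> 0b01, -1 -> 0b11.
# Four weights pack into one byte; decode is a small lookup.
ENCODE = {0: 0b00, 1: 0b01, -1: 0b11}
DECODE = {v: k for k, v in ENCODE.items()}

def pack(weights):
    """Pack ternary weights (len a multiple of 4) into bytes."""
    out = bytearray()
    for i in range(0, len(weights), 4):
        b = 0
        for j, w in enumerate(weights[i:i + 4]):
            b |= ENCODE[w] << (2 * j)
        out.append(b)
    return bytes(out)

def unpack(data, n):
    """Recover the first n ternary weights from the packed bytes."""
    return [DECODE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]

ws = [1, 0, -1, 1, -1, -1, 0, 1]
assert unpack(pack(ws), len(ws)) == ws   # 8 weights in 2 bytes instead of 8 floats

Real INT2 hardware would do the decode for free, but the storage math is the point: a quarter of INT8, an eighth of FP16.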
 
That’s like saying a 20xx GPU is obsolete because of a 30xx GPU.

Ternary logic doesn’t mean abandoning old HW. Notice in the linked research paper how they were able to run their ternary neural net on an NVIDIA GPU?

You can perform ternary logic using a binary system, the same way you can do decimal notation or floating point. Rather than -1, 0, and 1 you use two bits and map the values to 00, 01, and 11. In other words INT2 is a valid substitute for true ternary HW.

And if there is a gap until INT2 is HW accelerated, Apple and Nvidia both already have INT4 in their latest designs.

Nvidia's next big thing will probably be a logarithmic number system for their GPUs training LLMs: logarithmic multiplies become adds, which is good; logarithmic addition is less good, but with some fancy shifts and accumulators it can be quite efficient, as fewer bits switch*; and the precision and accuracy are good, perhaps the next best thing to a symbolic (coded) representation.

Vector-scaled quantisation will also be a big deal for inference, as once you know where you want your precision you can pick the scale, and doing it per vector means you don't have to be conservative as it can change frequently.

Floating point and integer will become legacy types for machine learning accelerators, left there for HPC, more general compute, and older code.

Bill Dally, Nvidia's chief scientist, has talked about this enough — and has his name on the patents and papers — that it looks like Nvidia have made up their minds about how to build a more efficient next-generation Nvidia GPU (an obsolete name when graphics barely gets a second thought now).

* About 5:1 more efficient than FP of the same type size IIRC.
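
To make the multiplies-become-adds point concrete, here's a toy sketch (mine, and it dodges the hard part, log-domain addition, entirely):

Code:
import math

# Logarithmic number system: store (sign, log2|x|). Multiplication is then
# just adding the exponents; addition is the awkward part that needs the
# "fancy shifts and accumulators" mentioned above.
def to_lns(x):
    return (1 if x >= 0 else -1, math.log2(abs(x)))

def lns_mul(a, b):
    sa, la = a
    sb, lb = b
    return (sa * sb, la + lb)          # multiply = one add in the log domain

def from_lns(v):
    s, l = v
    return s * 2.0 ** l

a, b = 6.0, -0.75
assert math.isclose(from_lns(lns_mul(to_lns(a), to_lns(b))), a * b)   # -4.5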
 
… it looks like Nvidia have made up their minds about how to build a more efficient next-generation Nvidia GPU (an obsolete name when graphics barely gets a second thought now).

Imagine if we could find a way to transform Game Rage into sustainable energy to power some of these new server farms …
 

Bonusround

Ars Scholae Palatinae
1,060
Subscriptor
That’s like saying a 20xx GPU is obsolete because of a 30xx GPU.
No. It’s like saying a 10xx GPU is obsolete because you want to use raytracing.

Ternary logic doesn’t mean abandoning old HW. Notice in the linked research paper how they were able to run their ternary neural net on an NVIDIA GPU?
Of course. That’s a backwards compatibility observation/argument. Apple ’abandons’ old HW all the time, they just do so on a 6-7 year timetable.

If Apple found they needed to migrate their NPU architecture, they would just do so. The Neural Engine is a black box, and Apple controls every mechanism that interfaces with it. Their own system models would be migrated immediately, and the only compatibility ‘required’ would be for third-party apps with their own CoreML models. With deprecation and developer outreach these could be moved in as few as two OS revisions. IMO
 

wrylachlan

Ars Legatus Legionis
12,769
Subscriptor
No. It’s like saying a 10xx GPU is obsolete because you want to use raytracing.


Of course. That’s a backwards compatibility observation/argument. Apple ’abandons’ old HW all the time, they just do so on a 6-7 year timetable.

If Apple found they needed to migrate their NPU architecture, they would just do so. The Neural Engine is a black box, and Apple controls every mechanism that interfaces with it. Their own system models would be migrated immediately, and the only compatibility ‘required’ would be for third-party apps with their own CoreML models. With deprecation and developer outreach these could be moved in as few as two OS revisions. IMO
The NPU accelerates all kinds of models for all kinds of uses - of which LLMs are just one. We’re deep into diminishing returns of throwing transistors at single-threaded performance. There’s no pressing need to radically increase the transistor count in the GPU. And process node improvements are giving Apple (and everyone else) more and more transistors to play with.

All of this adds up to replacing matmul with ternary, instead of just adding it on, being extremely unlikely.

Could they? Absolutely. Apple is nothing if not the master of the pivot when it comes to chip technology. But I wouldn’t bet on it at all.
 
No. It’s like saying a 10xx GPU is obsolete because you want to use raytracing.
Why do you compare BitLinear to raytracing?

With raytracing the 10xx GPU isn’t powerful enough to even attempt it.

With BitLinear, however, an A100 (with an accompanying FPGA for ternary logic) was able to execute the BitLinear network. It’s not even obvious that the FPGA is necessary if they could rewrite the algorithm to utilize binary logic (they claim three bits for -1, 0, and 1, when a binary encoding means that -1, 0, and 1 can be implemented with two bits).
Of course. That’s a backwards compatibility observation/argument. Apple ’abandons’ old HW all the time, they just do so on a 6-7 year timetable.
That seems a non sequitur here. Any implementation of BitLinear is likely to be on vanilla NVIDIA HW because that’s the HW that exists today, and that leaves a window open for future HW to improve its performance and energy efficiency.

Meaning nothing is abandoned nor obsoleted.

Just to be super clear, it’s like saying the release of Windows 11 obsoleted all existing x86 computers, which is patently false. Even if newer computers would be necessary for the newest features, capabilities, and performance, a large portion of the existing install base could still run Windows 11.
If Apple found they needed to migrate their NPU architecture, they would just do so. The Neural Engine is a black box, and Apple controls every mechanism that interfaces with it. Their own system models would be migrated immediately, and the only compatibility ‘required’ would be for third-party apps with their own CoreML models. With deprecation and developer outreach these could be moved in as few as two OS revisions. IMO
I’m saying that the whole point of NVIDIA’s GPGPU/CUDA focus for the past 17 years is that they would create a new CUDA API and library first to support it, and then release new HW in two years to accelerate it even more. Nothing is obsoleted, because the existing HW is in fact capable of running a BitLinear network.
 
All of this adds up to replacing matmul with ternary, instead of just adding it on, being extremely unlikely.

Exactly. People don't get that GPUs, TPUs and other accelerators aimed at machine learning have already been through multiple generations of specialisation. They are not simply doing the same thing every generation and making it faster.

Nvidia say that the biggest contribution to the speed-up of inference performance over generations of their GPUs came from number representation (~16x), then complex instructions (~12.5x, to save fetch and decode costs), with process contributing only ~2.5x. So even if you took old designs to the latest process node they would be much slower because they lack the architectural improvements.

As Nvidia's GPUs have evolved they have gained new capabilities. The Tensor Cores* (MMA) are much faster than the Streaming Multiprocessors (SIMT), but they didn't replace them, they are additions. As are the Transformer Engines, and the Tensor Memory Accelerator. So new GPUs aren't just faster than the old, they do things the older parts can't.

* These "co-processors" aren't fixed either, and evolve as well.
 
  • Like
Reactions: Bonusround
To reinforce what ev9_tarantula just said, when announcing Blackwell, NVIDIA used INT4 as its performance metric to show how much faster than Hopper it was. This was because Hopper only goes down to INT8. This ternary approach is basically INT2 (yeah, yeah, oversimplification I'm sure), so that could be NVIDIA's next evolution.
 
To reinforce what ev9_tarantula just said, when announcing Blackwell, NVIDIA used INT4 as its performance metric to show how much faster than Hopper it was. This was because Hopper only goes down to INT8. This ternary approach is basically INT2 (yeah, yeah, oversimplification I'm sure), so that could be NVIDIA's next evolution.
It's kind of wild that the researchers allocated three bits here to represent -1, 0, and 1.

In other words they mapped three values into a space that could represent 8. NVIDIA already has sub-byte precision baked into Tensor Cores:
For 4 bit precision, the APIs available remain the same, but you must specify experimental::precision::u4 or experimental::precision::s4 as the fragment data type.

NOTE: Today I learned INT4 means 4 bytes and INT1 means 1 byte, so really we need a different term than INT2 to specify 2 bits.

In fact they already have a 1 bit type, which can't be a coincidence.
the major difference of BNN is that it uses a single bit to represent each entry of the input and weight matrices. BNN evolved from DNN through binarized-weight-network (BWN) [4]. It was firstly observed that if the weight matrix can be binarized to +1 and −1, the floating-point (FP) multiplications can be degraded to addition (i.e., mul +1) and subtraction (i.e., mul −1). Later, it was further observed that if the input matrix can be binarized as well, then even the floating-point additions and subtractions in BWN can be degraded to logical operations (i.e., xnor for bit dot-product and popc for bit accumulation)

So if this 1 bit transformer/bitlinear/bitnet becomes the norm, I am sure NVIDIA is well positioned to take advantage of that future.
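
For anyone who hasn't seen the xnor/popcount trick in action, here's a toy version (my sketch, not the paper's code):

Code:
# 1-bit (BNN-style) dot product: weights and activations are all +1/-1,
# stored as bits (+1 -> 1, -1 -> 0). XNOR counts the positions that agree,
# popcount sums them, and dot = 2*agreements - n. No multiplies anywhere.
def encode(vals):
    bits = 0
    for i, v in enumerate(vals):
        if v == 1:
            bits |= 1 << i
    return bits

def bnn_dot(a_bits, w_bits, n):
    agreements = bin(~(a_bits ^ w_bits) & ((1 << n) - 1)).count("1")
    return 2 * agreements - n

a = [1, -1, -1, 1, 1]
w = [1, 1, -1, -1, 1]
assert bnn_dot(encode(a), encode(w), len(a)) == sum(x * y for x, y in zip(a, w))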
 

Bonusround

Ars Scholae Palatinae
1,060
Subscriptor
NOTE: Today I learned INT4 means 4 bytes and INT1 means 1 byte, so really we need a different term than INT2 to specify 2 bits.

Don’t confuse CUDA’s INT4 and INT8 with the int4 and int8 used in other contexts.

The numerals in INT4 and INT8 indicate the number of bits that represent an integer.

By contrast in, say, a database like PostgreSQL the final numerals do refer to bytes, with int4 and int8 indicating 32-bit and 64-bit integers respectively.
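
To put numbers on the difference (a quick illustration of my own):

Code:
# CUDA-style s4: 4 bits, signed           -> range -8 .. 7
# PostgreSQL int4: 4 bytes (32 bits), signed -> range -2**31 .. 2**31 - 1
print(-(2 ** 3), 2 ** 3 - 1)      # -8 7
print(-(2 ** 31), 2 ** 31 - 1)    # -2147483648 2147483647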