It really depends on what they look like. The problem to be solved by Private Cloud is inference. My assumption is that what you would really want for inference is a chip that’s a tiny bit CPU and mostly NPU. That, by itself - while it’s incredibly efficient and fast for inference - isn’t a great chip for training a model. So they’re not going to be monetizing Private Cloud for model training.
I'm not sure I understand why? Isn't the model fundamentally the same when being trained vs when used for inference? And aren't the same tensor/matrix HW units needed for both training and inference, with the difference being that you need to run them twice or more per pass during training, as opposed to only needing to run through the model in the forward direction during inference? Obviously the training data itself also needs multiple passes during training, while using the model is a single pass per use. In other words you need more tensor HW to perform more calculations per pass, and across multiple passes, during training compared to inference.
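To make that concrete, here's a minimal PyTorch-style sketch (the toy model and sizes are made up purely for illustration) of the difference between the two workloads. The usual rule of thumb is that the backward pass costs roughly twice the forward pass, so a training step is around 3x the work of an inference pass, and training then repeats that over the whole dataset for many epochs:

```python
import torch
import torch.nn as nn

# Toy model and sizes are hypothetical; this only illustrates where the extra work is.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
x = torch.randn(32, 512)                 # a batch of 32 inputs
target = torch.randint(0, 10, (32,))     # labels, only needed for training

# Inference: one forward pass per request, no gradients tracked.
with torch.no_grad():
    logits = model(x)                    # forward direction only

# Training: forward + backward + weight update, repeated over the dataset for many epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(10):                  # in reality, many passes over the full dataset
    optimizer.zero_grad()
    loss = loss_fn(model(x), target)     # forward pass (activations kept for backprop)
    loss.backward()                      # backward pass: roughly 2x the forward FLOPs
    optimizer.step()                     # weight update
```

The same matrix units do the math in both loops; the difference is how many times you have to run them.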
But couple one of these chips with an MX Ultra and you might have a nice training machine. So either Apple could make a second cloud system just for training or they could sell them in the Mac Pro. My assumption would be the latter.
Look at a die shot of an M3 Max and you see that the chip is roughly 50% GPU, with about as much cache as CPU, I/O taking up as much space as the CPU + cache, and the NPU barely larger than the efficiency cores + their cache (this assumes the annotation is correct, obviously):
All of this is wildly speculative. It may be that the SoC is just a bog standard MX Max chip and they’re already selling them in desktops and laptops. Or it’s a slight variant on an MX Max chip to allow for off-package DIMMs for shitloads of relatively less expensive memory. Who knows.
Given how tiny the NPU is (and how relatively inefficient it is to use the GPU), it would make sense to have a larger NPU for training. Take a look at an NVIDIA TU102 chip (similar to the TU104 used in the RTX 2080, except the TU102 has 12 SMs per GPC where the TU104 only has 8):
The above chip, abstracted away (and rotated):
And an abstraction of the SM:
Less than 1/4th of each SM is Tensor, and about another 1/4th is RT. So an NPU that matched TU102's tensor throughput would be far, far smaller and more energy efficient than relying on the GPU.
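As a rough back-of-the-envelope check (the die size is the published figure; the area fractions are just eyeballed from the annotations above, so treat the numbers as illustrative):

```python
# Rough estimate of how much TU102 silicon is actually doing tensor math.
die_area_mm2 = 754            # published TU102 die size (TSMC 12 nm)
sm_fraction_of_die = 0.5      # assumption: the SM array is on the order of half the die
tensor_fraction_of_sm = 0.25  # upper bound, per the SM abstraction above

tensor_area_mm2 = die_area_mm2 * sm_fraction_of_die * tensor_fraction_of_sm
print(f"Tensor silicon: under ~{tensor_area_mm2:.0f} mm^2 of a {die_area_mm2} mm^2 die")
# => under ~94 mm^2, i.e. on the order of an eighth of the full GPU die,
#    before counting anything an NPU could strip out (RT, graphics pipeline, etc.).
```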