>There is a reason Nvidia can sell those NVLink switching ASICs for top dollar,
No one would argue that Intel's B-series is competitive with Nvidia's AI products in the pro market. It's a product for a different market, priced at thousands of dollars versus tens or hundreds of thousands.
There's a groundswell of enthusiast and startup interest in client-side LLMs/AIs. As of now there's really no product that serves that segment, as vendors (mostly Nvidia) are busy catering to the high end. So it's not a question of high vs. low, but of something vs. nothing.
I put your concern to a Perplexity query, which follows (excuse the verbiage). I don't pretend to be familiar with the minutiae of the processes mentioned, but it's fairly evident that the speed issues can be mitigated to some degree, and that what's offered is better than what's available now.
=====
GPUs do not scale as easily as CPUs due to bandwidth and communication bottlenecks--especially when using PCIe instead of high-bandwidth interconnects like NVLink. Simply combining multiple entry-level GPUs rarely matches the performance of a single high-end GPU, because the inter-GPU communication can become a major bottleneck, negating much of the potential speedup.
How vLLM Addresses This Challenge
vLLM implements several strategies that can help mitigate these scaling limitations:
1. Optimized Parallelism Strategies
2. Memory and Batching Optimizations
3. Super-Linear Scaling Effects
4. Practical Recommendations
- For best scaling, use high-bandwidth GPU interconnects (NVLink, InfiniBand) if available.
- On systems limited to PCIe, prefer pipeline parallelism across nodes and tensor parallelism within nodes to reduce communication overhead.
- Tune batch sizes and memory allocation parameters to maximize utilization without overloading communication channels.
- Accept that some inefficiency is unavoidable with entry-level hardware and slow interconnects, but vLLM's optimizations will help you get closer to optimal performance than naive multi-GPU setups.
=====
A 1% improvement per GPU is already "closer", just not enough return on the investment: at this point the AIs are resorting to tautologies.
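For what it's worth, the "tuning" those recommendations amount to is mostly a handful of launch parameters. Below is a rough sketch using vLLM's Python API; the model, the numbers and the assumption that any of this runs well on Arc/Battlematrix-class cards are my placeholders rather than anything Perplexity or Intel said, and parameter names can shift between vLLM releases.

```python
# Rough sketch: how the parallelism/batching knobs above map onto vLLM's Python API.
# Model and values are placeholders; treat every number as something to benchmark.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # any open-weights model that fits
    tensor_parallel_size=4,        # split each layer's weights across 4 GPUs in one box
    pipeline_parallel_size=1,      # raise this across nodes instead, where PCIe/network latency dominates
    gpu_memory_utilization=0.90,   # the "memory allocation parameter" to tune
    max_num_seqs=64,               # cap on concurrently batched sequences
)

outputs = llm.generate(
    ["Turn off the kitchen lights, and explain why you believe you just did."],
    SamplingParams(temperature=0.2, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

None of those knobs changes the interconnect arithmetic; they only decide where the communication happens.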
I have been very much into evaluating the potential of AIs for home use as part of my day job, and for a long time I've been excited by the fact that a home assistant can afford to be rather less intelligent than an AI that's supposed to replace lawyers, doctors, scientists or programmers: most servants came from modest backgrounds and were only expected to perform a limited range of activities, so even very small LLMs that fit into a gaming GPU might reasonably be able to follow orders over the limited domain of your home.
But the hallucinations remain a constant no matter the model size, and very basic facts about life and the planet are ignored to the point where I wouldn't trust these AIs to control my light switches.
The idea that newer and bigger models would heal those basic flaws has been proven wrong for several generations now, and across the whole 1-70B parameter range, which hints at underlying systematic issues.
Perhaps there could be a change at 500B or 2000B parameters, but I have no use for the "smarts" a model like that would provide; the expense certainly wouldn't offset the value gained, and that's only if the hallucinations could indeed be managed. Without a change in approach there is no solution in sight: reasoning and mixture-of-experts models aren't really doing better, and they walk off a hallucination cliff with invented assumptions.
Note that all of the above approaches described by Perplexity address model design and model training, which no end-user can afford to do: that would be like doing genetic engineering to create perfect kids for doing chores.
The best you can realistically do for your private AI servant is to take open-source models and then provide them with all the context they need to serve you, via RAG or whatever.
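To be clear about the scale of effort I mean by that: the "context plumbing" itself is nothing exotic. Here is a toy sketch, with made-up household notes and an embedding model chosen purely for illustration; the real work is curating and maintaining the notes, not the code.

```python
# Toy retrieval sketch: embed your own notes, pull the most relevant ones,
# and prepend them to the prompt of whatever local model you run.
import numpy as np
from sentence_transformers import SentenceTransformer

household_notes = [
    "The living-room lights are on Zigbee group 3.",
    "Never run the dishwasher after 22:00, it wakes the baby.",
    "The heat pump gets serviced every March by the local contractor.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
note_vecs = embedder.encode(household_notes, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k notes most similar to the question (cosine similarity)."""
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = note_vecs @ q_vec
    return [household_notes[i] for i in np.argsort(scores)[::-1][:k]]

question = "Can I start the dishwasher now?"
prompt = "Context:\n" + "\n".join(retrieve(question)) + f"\n\nQuestion: {question}"
# `prompt` then goes to the local model (vLLM, llama.cpp, whatever you run).
print(prompt)
```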
And those models won't come tailor-made for Intel Battlematrix; at best Intel might publish one or two demo variants as "proof". What would pay for the effort of just keeping a well-known open-source model even compatible with this niche and complex hardware base, let alone something new at the level of a Mistral, Phi, Llama or DeepSeek?
>>This tastes much more like a desperate attempt to moderate the fall of stock prices than honest delusions, of which Intel had aplenty.
>Wow. Now you are veering into conspiracy theory and juvenile fanboy territory. So much for hopes of a productive talk.
With 45 years as an IT professional, 20 of those in technical architecture, and the last 10 years in technical architecture for AI work in a corporate research lab, I'm not quite juvenile any more, nor am I much of a fan, or boy, or fanboy: don't mistake my harshness here for ignorance.
I've followed Intel's iAPX 432, the first graphics processor they licensed from NEC back in 80286 times, the "Cray on a chip" i860, Itanium and Xeon Phi, working in publicly funded HPC research institutes during my thesis and later at a company that manufactured HPC computers. I've met the guys who designed them, and have known them for 15 years.
I've had the privilege of putting technology I was enthusiastic about to the test, and of getting paid for it.
Surprises do happen, but with regard to the potential of AMD's Zen revival, for example, my prediction (of success) actually proved some of those people wrong.
Most importantly, I've run and benchmarked AI models for performance and scalability myself, and supported a much larger team of AI researchers and model designers doing the same across many AI domains, not just LLMs.
But I've also done significant LLM testing over the last two years, again with a focus on scalability, but also on the quality impact of different numerical formats for weight representation, of quantization, and of model size.
That experience has made me very much a sceptic, and that's not the result I was hoping for: I really want my AI servants! But I want them to be loyal, valuable, and not to kill me before I tell them to.
So I feel rather comfortable in my prediction... or rather, quite a lot of discomfort at how far off any plausible path to success Intel is straying here: desperation becomes a more likely explanation than sound engineering.
But let's just revisit this "product" in 1/2/5 years and see who projected its success better, ok?