Do you believe the full precision, full parameter, unmodified R1 would have provided a better answer to your Marie Antoinette question? Where I cry foul is in the idea the answer you got was a fair representation of R1's quality. It is also possible to "dumb" down llama3 far enough to make it useless, but that doesn't mean llama3 is a bad model.
Actually, I prefer knowledge over belief any day, but without the full ability to validate, that's hard.
I'm very sceptical, however; let me explain why:
The first question to answer is what DeepSeek themselves are running for inference via their API and app interface.
Unless I am mistaken, even if they trained with 600B+ parameters, for inference (and that includes the refining) they run with only about 37B active per token; the remaining ~630B are never all engaged at once for that "full potential", as you call it.
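To make the "37B active" point concrete, here is a rough toy sketch of how mixture-of-experts routing works: a router picks a few experts per token and only those are computed. The expert count and sizes below are invented for illustration, not DeepSeek's actual configuration, and note that all experts still have to be loaded; only the compute per token shrinks.

```python
# Toy mixture-of-experts routing: only TOP_K of N_EXPERTS do any work per token.
# All numbers here are illustrative, not DeepSeek's real configuration.
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS = 64     # hypothetical expert count
TOP_K = 4          # experts actually evaluated per token
D_MODEL = 64       # hypothetical hidden size

# Each "expert" is a tiny two-layer MLP (a pair of weight matrices).
experts = [(rng.standard_normal((D_MODEL, 4 * D_MODEL)) * 0.02,
            rng.standard_normal((4 * D_MODEL, D_MODEL)) * 0.02)
           for _ in range(N_EXPERTS)]
router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def moe_forward(token_vec):
    """Route one token: score all experts, but run only the top-k of them."""
    scores = token_vec @ router                    # one routing score per expert
    top = np.argsort(scores)[-TOP_K:]              # indices of the chosen experts
    w = np.exp(scores[top]); w /= w.sum()          # softmax weights over the chosen ones
    out = np.zeros(D_MODEL)
    for weight, idx in zip(w, top):
        w1, w2 = experts[idx]
        out += weight * (np.maximum(token_vec @ w1, 0.0) @ w2)  # ReLU MLP expert
    return out, top

_, used = moe_forward(rng.standard_normal(D_MODEL))
print(f"experts used for this token: {sorted(used.tolist())} "
      f"({TOP_K}/{N_EXPERTS} = {TOP_K / N_EXPERTS:.0%} of expert parameters)")
```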
That's great, because it means single-GPU inference is possible, which is not true for any of those cloud-only giant US models, whose local variants are offered mostly as appetizers or for propaganda.
Now, they can't run the base model; anyone who tried that recently quickly had to take it down again, because base models tend to be real "free thinkers".
They have to use the refined or tuned variants, and the common approach is simply to use other, competing models to do that, to avoid duplicating cost. That's what they did with all of these refined models.
Now I don't know exactly how they do that, but if refinement is some type of cross-product intersect, then the result can't contain more meaningful data than any of its sources. Whether you mix a 670B or a 37B model with a 7B one shouldn't matter; only 7B's worth can be meaningful in any way.
If you know any better, I'd love to know.
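For what it's worth, "distillation" as the term is normally used is not an intersect of two models: reportedly the R1 distills are smaller base models fine-tuned on outputs generated by the big model, and the textbook variant trains the small model to match the big model's token probabilities. Either way the student's own capacity is the ceiling, which is roughly the point above. A toy sketch of the probability-matching variant (all sizes and the projection layer are invented for illustration):

```python
# Toy knowledge distillation: a small "student" is trained to imitate the
# token distribution of a frozen "teacher". Purely illustrative sizes.
import torch
import torch.nn.functional as F

vocab, d_big, d_small = 100, 256, 32
teacher = torch.nn.Linear(d_big, vocab)     # stand-in for the big frozen model
student = torch.nn.Linear(d_small, vocab)   # stand-in for the small model
proj = torch.nn.Linear(d_big, d_small)      # toy feature projection, just for shapes

opt = torch.optim.Adam(list(student.parameters()) + list(proj.parameters()), lr=1e-3)
T = 2.0  # distillation temperature softens the teacher's distribution

for step in range(200):
    x = torch.randn(64, d_big)                            # toy "inputs"
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / T, dim=-1)  # teacher's token probabilities
    log_probs = F.log_softmax(student(proj(x)) / T, dim=-1)
    # KL divergence pulls the student's distribution toward the teacher's.
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final imitation loss: {loss.item():.4f}")
```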
Next, of these models they can't really use anything in China that doesn't have government approval, which pretty much narrows it down to one of the four Qwen variants for online use (that may soon be the same in the US and the EU, and it is one of the motivations why the UAE started doing JAIS).
Perhaps the one they crossed with Llama 3 70B is good for 67B-class "precision", but I could only run that at either 2-bit quantization on the GPU (not tested yet) or at 4/8-bit on the CPU, which I'll reserve for some very basic testing, given the "inhuman" token rates this would yield. I tried that a year ago with Llama on a 768GB machine and it was so little fun I stopped quickly.
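For reference, the weight-only footprint math behind those quantization choices (KV cache and activations come on top, so treat these as lower bounds):

```python
# Back-of-the-envelope memory needed just to hold a 70B model's weights.
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # bytes -> GB

for bits in (16, 8, 4, 2):
    print(f"70B @ {bits:>2}-bit ≈ {weight_footprint_gb(70, bits):6.1f} GB")
# 70B @ 16-bit ≈  140.0 GB
# 70B @  8-bit ≈   70.0 GB
# 70B @  4-bit ≈   35.0 GB
# 70B @  2-bit ≈   17.5 GB  -> the only level that fits a 24 GB consumer GPU
```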
I still wouldn't call this "belief", but rather the hypothesis with the highest likelihood, and that puts DeepSeek-R1-Distill-Qwen-32B not only in the front spot for what they are running themselves, but also makes it the fullest currently attainable potential of DeepSeek and the yardstick by which to measure it.
And the main attraction there is that, unlike those giant OpenAI or Google cloud models, which are not only closed but also practically out of reach for anyone who isn't a giant, it is attainable and locally usable by nearly anyone with a professional budget, including individuals.
That includes DeepSeek themselves, who clearly can't afford Microsoft- or Google-class infrastructure quite yet, which for me is another clear hint (still a hypothesis, not a belief) that they don't run anything with more potential in publicly accessible inference.
Which leads me to your other question: would that, or any other variant of DeepSeek, answer the question on Marie Antoinette better?
Well, I only discovered the "thought process log" rather late, and that has basically confirmed my error hypothesis.
I don't have the log that supported the "biological delusion" from the smaller model and the 4070 machine, nor have I tried to replicate it exactly. I'm very sure something similar will crop up soon, because DeepSeek doesn't do a basic human biology fact check on every answer... unless perhaps you ask for it: that's one of its (so far) unique advantages.
I've since run quite a few queries with that 32B variant on my bigger 4090, and while its thought logs make mistakes much easier to track, so far that mostly just erodes its value proposition: the uncanny valley only grows wider.
LLMs are basically all associative in their original design, so their knowledge can't properly differentiate between hard facts and likelihoods. And they will hallucinate when their training data gets too thin.
Even worse, a knowledge tidbit might be answered correctly in one moment, but approach it from a different angle or context and it is no longer used. Or the model suddenly hallucinates things...
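That behaviour falls straight out of how generation works: the model samples from a probability distribution over continuations and has no separate flag for "I actually know this", so thin coverage still yields a fluent answer. A toy illustration (candidate strings and logits invented for the example):

```python
# Toy picture of confident-sounding fabrication: sampling always returns
# *some* continuation, whether the distribution is peaked (well-known fact)
# or nearly flat (thin training data). Numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
candidates = ["died in infancy", "married Prince Ferdinand",
              "survived the Revolution", "became a nun"]

def sample(logits, temperature=0.8):
    p = np.exp(np.array(logits) / temperature)
    p /= p.sum()
    return rng.choice(candidates, p=p), p

for label, logits in [("well-covered fact", [6.0, 1.0, 0.5, 0.2]),
                      ("thin coverage",     [1.1, 1.0, 0.9, 0.8])]:
    choice, p = sample(logits)
    print(f"{label}: sampled '{choice}' (top probability {p.max():.2f})")
```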
I've just tested using the Qwen 2.5 refined 32B model, and things aren't improving:
With a longer chat traversing the family tree, the fourth child of Louis XVI and Marie Antoinette, a daughter called Sophie who died as an infant, suddenly became "Charlotte Luise Philippine de France (1786–1839): The youngest child, she survived the Revolution and later married Prince Ferdinand of Saxe-Coburg-Gotha".
No such person ever existed, while Prince Ferdinand did, but he in fact married a Hungarian princess.
When asked directly from a clean chat ("Please give me a full list of Marie Antoinette's children"), Maria-Theresa Charlotte, their first child and the longest-lived, suddenly married the Duke of Parma, while in fact she was moved around as a marriage pawn until she could no longer serve her function as a royal birther and remained childless: the female line of Marie Antoinette ended with her daughters.
...
Now all that is basically fringe knowledge with next to zero significance for today's problems. But the main issue is that these LLMs have very little understanding of what they "know" or how they "learned" it. They are essentially instantiated into inference like a well-schooled amnesiac, which severely limits their ability for self-reflection.
DeepSeek may improve that significantly with its open "thought process", but again that mostly widens the uncanny valley, and it will still go straight off a cliff in its conclusions when the data gets thin.
I haven't yet checked what happens if you put corrections into the prompt, or whether there is an easy way to add facts and probabilities, but all of these actions significantly raise the effort of using these models, while their value-creation potential (beyond bubbles) remains very unclear.
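If I get around to the "corrections in the prompt" experiment, the obvious first attempt would look something like this, assuming a local runtime that exposes an OpenAI-compatible endpoint; the URL and model name are placeholders I have not tested:

```python
# Sketch: prepend verified facts as a system message and see whether the
# model stops improvising. Endpoint URL and model id are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

facts = (
    "Verified facts you must not contradict:\n"
    "- Marie Antoinette and Louis XVI had four children.\n"
    "- Sophie, the fourth child, died in infancy (1786-1787).\n"
    "- Marie-Therese Charlotte, the eldest, left no surviving descendants."
)

resp = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-32b",   # placeholder model id
    messages=[
        {"role": "system", "content": facts},
        {"role": "user",
         "content": "Please give me a full list of Marie Antoinette's children."},
    ],
    temperature=0.2,  # lower temperature discourages creative gap-filling
)
print(resp.choices[0].message.content)
```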
And one of the few things even Elon Musk gets right is that "model fusion" via experts is likely to be as difficult as "sensor fusion".
Where he went delusional is pretty much everything else, including the idea that self-driving could be solved (at all, let alone more economically) using AI vision data only.