If you ask a GPT model "Who painted Mona Lisa?" it will correctly answer "Leonardo Da Vinci".
It is able to do so because the words "Mona Lisa" and the words "Leonardo Da Vinci", which are internally represented as sequences of numeric tokens, appear right next to each other in the dataset used to train said model.
Since this pairing occurs more than once in the whole dataset, it carries high significance, just as with humans (many people said the same thing, so it is taken as ground truth). As such, the model has memorized that tokens 44, 6863, 29656 (which represent "Mona Lisa") are extremely likely to be followed by tokens 73004, 21106, 14569, 97866 (which represent "Leonardo Da Vinci").
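For what it's worth, the exact IDs depend on the tokenizer, so the numbers above should be read as illustrative. Here's a minimal sketch of the tokenization step using the open-source tiktoken library:

```python
# Minimal sketch of how a prompt becomes token IDs before the model sees it.
# The IDs printed depend on the chosen encoding; the numbers quoted above
# should be read as illustrative rather than literal.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Who painted Mona Lisa?"
token_ids = enc.encode(prompt)
print(token_ids)                              # a short list of integers
print([enc.decode([t]) for t in token_ids])   # the text fragment behind each ID
```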
Um, no. You're not describing how GPT actually solves that problem, but rather how you imagine it's solved.
Now comes the fun part: if you ask a question it has never seen before, it will turn it into tokens and try to answer based on next-token probabilities. If the temperature is high and you have a good set of guardrails injected into the context window, you can force the model to give a truthful "I don't know" answer. Otherwise, you will get what's called a "hallucination", which will most likely be factually incorrect and possibly harmful.
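To make the temperature bit concrete: temperature just rescales the next-token distribution before sampling. Lower values sharpen it toward the most likely token, higher values flatten it and make sampling more random. A toy sketch with made-up logits (not real model output):

```python
# Toy illustration of temperature scaling at sampling time.
# The logits below are invented; a real model produces one logit per vocabulary token.
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0):
    """Turn raw logits into a probability distribution and draw one token ID."""
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())     # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

logits = [2.0, 1.0, 0.2]                      # pretend vocabulary of three tokens
for t in (0.2, 1.0, 2.0):
    token, probs = sample_next_token(logits, temperature=t)
    print(f"T={t}: probs={np.round(probs, 3)}, picked token {token}")
```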
Cool story. Now tell us how it composes sonnets, haikus, or limericks about whatever subject you like. In order to perform that trick, it needs to separately understand the form in which it's being asked to answer as well as the subject matter. Styles, and their distinction from substance, are things it learns implicitly.
As an aside, here's a fun fact about parrots: they can reproduce things they've heard people say, but always in the voice of the speaker. Because they don't understand the words, and don't know which aspects of the speech are the words and which are the voice, they can't repeat one person's words in the voice of another. Here's where the "parrot" analogy people like to apply to LLMs really falls short.
How do you do that with code for say... Mars Rover?
Robotics is actually one of the more straightforward tasks to teach AI, since you can just use sophisticated simulators. Tesla's self-driving algorithm is one giant neural network, from what I've heard. I'm not actually a big fan of that approach, but I digress.
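By "use sophisticated simulators" I mean a loop shaped roughly like the sketch below, here using the open-source gymnasium package with a placeholder random policy standing in for whatever network you'd actually train. A real robotics pipeline would use a physics simulator and a proper learning algorithm, so treat this purely as the shape of the idea:

```python
# Rough shape of "learning in a simulator": roll out a policy, collect reward,
# and feed that signal back into training. The random policy is a placeholder
# for the network being optimized.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()        # placeholder policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"episode return: {total_reward}")      # the quantity a learner would maximize
env.close()
```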
Yes, but those programmers can create novel output, while the model can't.
I don't believe it. What really is creativity, at some level, other than filtered noise? I expect it should be possible to train a GAN to produce more novel output, with a suitably designed and trained adversarial network. It's a harder problem than simply imitating human works, but probably not insurmountable with today's technology.
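To be concrete about what a GAN setup involves: a generator maps noise to samples, a discriminator tries to tell generated samples from real ones, and the two are trained against each other. A bare-bones PyTorch sketch with a made-up 1-D "data" distribution, purely for illustration:

```python
# Bare-bones GAN skeleton: generator turns noise into samples, discriminator
# scores them, and each network is trained against the other. Toy 1-D data.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # noise -> sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # sample -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0     # stand-in "real" data
    fake = G(torch.randn(64, 8))

    # Discriminator: push real toward 1, generated toward 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: fool the discriminator into calling fakes real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

The point of the adversarial signal is exactly the "filtered noise" framing: the generator starts from random noise, and the discriminator filters out what doesn't pass for the target distribution.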
Who is going to provide that data if in a couple of decades nobody knows how to write code anymore?
In some ways, code is an easier problem because it must adhere to well-defined rules and achieve specific objectives. These are all quantifiable things.
Furthermore, there are already billions of lines of code out there, much of it in high-quality, open source software. Much of it even has automated tests! That's a vast reserve of training data, right there.
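And the automated tests are exactly what makes it quantifiable: pass/fail is a score you can compute mechanically. A toy sketch of turning a test into a score for a generated snippet (the candidate code and the test here are invented for illustration):

```python
# Toy illustration of why tests make code a quantifiable target:
# run a candidate snippet against a test case and turn the outcome into a score.
# Both the candidate and the test are made up for illustration.

candidate_code = """
def add(a, b):
    return a + b
"""

def score_candidate(src: str) -> float:
    namespace = {}
    try:
        exec(src, namespace)                  # load the candidate definition
        assert namespace["add"](2, 3) == 5    # the automated test
        return 1.0                            # pass -> full score
    except Exception:
        return 0.0                            # any failure -> no score

print(score_candidate(candidate_code))        # 1.0
```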