The language model they tested, too, completely broke down. The program at first fluently finished a sentence about English Gothic architecture, but after nine generations of learning from AI-generated data, it responded to the same prompt by spewing gibberish: “architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.” For a machine to create a functional map of a language and its meanings, it must plot every possible word, regardless of how common it is. “In language, you have to model the distribution of
all possible words that may make up a sentence,” Papernot said. “Because there is a failure [to do so] over multiple generations of models, it converges to outputting nonsensical sequences.”
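The failure Papernot describes can be illustrated with a toy simulation. This is a minimal sketch, not the paper's actual experiment: it models a "language" as a simple word-frequency distribution with a long tail of rare words, then repeatedly resamples a corpus from each generation's output and re-estimates the distribution from it alone. Because a rare word that happens not to be sampled in one generation can never reappear in later ones, the tail erodes and the distribution collapses toward the most common words.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical vocabulary: a few common words plus a long tail of rare ones.
vocab = ["the", "of", "and"] + [f"rare{i}" for i in range(50)]
weights = [300, 200, 100] + [1] * 50  # rare words form the distribution's tail

def sample_corpus(weights, n=2000):
    """Draw a synthetic 'corpus' of n words from the current distribution."""
    return random.choices(vocab, weights=weights, k=n)

def reestimate(corpus):
    """Fit the next generation's distribution purely from the previous output."""
    counts = Counter(corpus)
    return [counts.get(w, 0) for w in vocab]

gen_weights = weights
for gen in range(10):
    corpus = sample_corpus(gen_weights)
    gen_weights = reestimate(corpus)  # each generation trains only on the last

# Count how many word types still have nonzero probability.
surviving = sum(1 for w in gen_weights if w > 0)
print(f"word types surviving after 10 generations: {surviving} of {len(vocab)}")
```

Run repeatedly with different seeds, the pattern is the same: the common words persist while rare words silently drop to zero probability, one generation at a time, which is the toy-scale analogue of the model "converging to outputting nonsensical sequences" once the tails of the real distribution are gone.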
In other words, the programs could only spit back out a meaningless average—like a cassette that, after being copied enough times on a tape deck, sounds like static. As the science-fiction author Ted Chiang has
written, if ChatGPT is a condensed version of the internet, akin to how a JPEG file compresses a photograph, then training future chatbots on ChatGPT’s output is “the digital equivalent of repeatedly making photocopies of photocopies in the old days. The image quality only gets worse.”