We'll have to settle for agreeing to disagree on whether LLMs are machine learning or a new class of AI.
The main problem with debating whether something is really "AI" is the continual goalpost movement by non-experts in the field. All of the long-time AI academics and practitioners seem to agree that generative AI fits the definition - it's only the general public that seems to object.
The word "artificial" is there to indicate that it shouldn't be seen as truly intelligent, but merely something capable of intelligent behavior. Yet, the popular sentiment seems to be pretending as if the qualifier isn't there and holding AI up to the benchmark of a human-level intelligence, before we're willing to accept that it's actually behaving in intelligent ways.
I think it's informative to compare with how we talk about animals long regarded as intelligent. Take crows, for instance. They can perform numerous cognitive tasks not widely seen elsewhere in the animal kingdom and are generally regarded as comparatively intelligent animals. Such claims aren't considered controversial: people accept that crows have some advanced cognitive skills and are willing to characterize that as intelligence, without requiring them to demonstrate the full range and magnitude of human cognition.
I think the way to move forward is simply to define a set of cognitive capabilities and measure how well leading AI models perform them. On that note, here's a recent paper that measures how well leading models handle logical reasoning:
| Dataset | LogiQA 2.0 test | LogiQA 2.0 zh test | ReClor dev | AR-LSAT test | LogiQA 2.0 ood |
|---|---|---|---|---|---|
| Size | 1572 | 1594 | 500 | 230 | 1354 |
| Human avg. | 86 | 88 | 63 | 56 | 83 |
| Human ceiling | 95 | 96 | 100 | 91 | 99 |
| RoBERTa | 48.76 | 35.64 | 55.01 | 23.14 | 33.22 |
| ChatGPT (API) | 52.37 | 53.18 | 57.38 | 20.42 | 38.44 |
| GPT-4 (Chat UI) | 75.26 (73/97) | 51.76 (44/85) | 92 (92/100) | 18.27 (19/104) | 48.21 (54/112) |
| GPT-4 (API) | 72.25 | 70.56 | 87.2 | 33.48 | 58.49 |
Table 1: ChatGPT and GPT-4 performance on the logical multi-choice machine reading comprehension task (accuracy %). "LogiQA 2.0 zh test" refers to the test set of the LogiQA 2.0 Chinese version. "LogiQA 2.0 ood" represents the out-of-distribution data of LogiQA 2.0.
Source: Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4
There are more test results in the paper, but it's too much trouble to copy & paste them here.
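To make the "accuracy %" numbers above concrete, here's a minimal sketch of how this kind of multiple-choice evaluation works. The `Item` structure, the toy example, and the `model_answer()` stub are placeholders of my own, not the paper's actual data or evaluation harness; a real run would prompt the model with each question and parse its chosen option.

```python
# Minimal sketch of multiple-choice accuracy scoring in the style of
# LogiQA / ReClor benchmarks. Dataset items and model_answer() are
# placeholders, not the paper's actual harness.

from dataclasses import dataclass


@dataclass
class Item:
    context: str        # passage the question reasons over
    question: str
    options: list[str]  # answer choices (typically A-D)
    gold: int           # index of the correct option


def model_answer(item: Item) -> int:
    """Stand-in for an LLM call; a real harness would send the context,
    question, and lettered options to the model and parse its choice."""
    # Placeholder heuristic: always pick the first option.
    return 0


def accuracy(items: list[Item]) -> float:
    """Fraction of items where the model picks the gold option."""
    correct = sum(model_answer(it) == it.gold for it in items)
    return correct / len(items)


if __name__ == "__main__":
    toy_set = [
        Item(
            context="All crows in the study solved the puzzle. "
                    "Rex is a crow in the study.",
            question="Which statement must be true?",
            options=["Rex solved the puzzle.",
                     "Rex failed the puzzle.",
                     "No crow solved the puzzle.",
                     "Rex was not in the study."],
            gold=0,
        ),
    ]
    print(f"accuracy: {accuracy(toy_set):.2%}")
```

The human averages and ceilings in the table come from the same kind of per-item scoring, which is what makes the model-vs-human comparison straightforward.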
From the abstract:
... Experiment results show that ChatGPT performs significantly better than the RoBERTa fine-tuning method on most logical reasoning benchmarks. ... The results show GPT-4 yields even higher performance on most logical reasoning datasets. Among benchmarks, ChatGPT and GPT-4 do relatively well on well-known datasets like LogiQA and ReClor. However, the performance drops significantly when handling newly released and out-of-distribution datasets. Logical reasoning remains challenging for ChatGPT and GPT-4, especially on out-of-distribution and natural language inference datasets.
Note that they don't deny that these models are capable of reasoning; rather, they simply conclude that the models aren't yet very good at it.