Large language models are capable of answering a wide range of questions – but not always accurately. Image credit: Jamie Jin/Shutterstock
Large language models (LLMs) seem to get less reliable at answering simple questions when they get bigger and learn from human feedback.
AI developers try to improve the power of LLMs in two main ways: scaling up (giving them more training data and more computational power) and shaping up, or fine-tuning them in response to human feedback.
José Hernández-Orallo at the Polytechnic University of Valencia, Spain, and his colleagues examined the performance of LLMs as they scaled up and shaped up. They looked at OpenAI’s GPT series of chatbots, Meta’s LLaMA AI models, and BLOOM, developed by a group of researchers called BigScience.
The researchers tested the AIs by posing five types of task: arithmetic problems, solving anagrams, geographical questions, scientific challenges and pulling out information from disorganised lists.
They found that scaling up and shaping up can make LLMs better at answering tricky questions, such as rearranging the anagram “yoiirtsrphaepmdhray” into “hyperparathyroidism”. But this isn’t matched by improvement on basic questions, such as “what do you get when you add together 24427 and 7120”, which the LLMs continue to get wrong.
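Both of the quoted examples have answers that are simple to check mechanically. The short Python sketch below is only an illustration of that check, not part of the researchers’ method; the helper name `is_anagram` is a hypothetical one chosen here.

```python
from collections import Counter


def is_anagram(scrambled: str, word: str) -> bool:
    """Return True if both strings contain exactly the same letters."""
    # Hypothetical helper for illustration; compares letter counts, ignoring case.
    return Counter(scrambled.lower()) == Counter(word.lower())


# The two examples quoted in the article.
anagram_ok = is_anagram("yoiirtsrphaepmdhray", "hyperparathyroidism")
total = 24427 + 7120

print(anagram_ok)  # True
print(total)       # 31547
```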
While their performance on difficult questions got better, the likelihood that an AI system would avoid answering any one question because it couldn’t do so dropped. As a result, the likelihood of an incorrect answer rose.
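The reasoning rests on a simple accounting identity: each response is either correct, avoided or incorrect, so fewer avoided answers must show up as more correct or more incorrect ones. The Python sketch below illustrates this with made-up counts; the function name `incorrect_count` and every number in it are assumptions for illustration, not figures from the study.

```python
def incorrect_count(total: int, correct: int, avoided: int) -> int:
    """Responses that are neither correct nor avoided are counted as incorrect."""
    if min(total, correct, avoided) < 0 or correct + avoided > total:
        raise ValueError("counts must be non-negative and fit within total")
    return total - correct - avoided


# Illustrative numbers only, NOT results from the study.
before = incorrect_count(total=100, correct=40, avoided=30)
after = incorrect_count(total=100, correct=50, avoided=5)

print(before)  # 30
print(after)   # 45
```

In this invented case, correct answers rise by 10 while avoided answers fall by 25, so incorrect answers rise by 15.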
The results highlight the dangers of presenting AIs as omniscient, as their creators often do and as some users are too ready to believe, says Hernández-Orallo. “We have an overreliance on these systems,” he says. “We rely on and we trust them more than we should.”
That is a problem because AI models aren’t honest about the extent of their knowledge. “Part of what makes human beings super smart is that sometimes we don’t realise that we don’t know something that we don’t know, but compared to large language models, we are quite good at realising that,” says a researcher at the University of Oxford. “Large language models do not know the limits of their own knowledge.”
OpenAI, Meta and BigScience didn’t respond to a request for comment.
Journal reference:
Nature