Certain AI training techniques may encourage models to be untruthful Cravetiger/Getty Images
Common methods used to train artificial intelligence models seem to increase their tendency to give misleading answers, according to researchers who are aiming to produce 鈥渢he first systematic analysis of machine bullshit鈥.
It is widely known that large language models (LLMs) have a tendency to generate false information 鈥 or 鈥渉allucinate鈥 鈥 but this is just one example, says at Princeton University. He and his colleagues define bullshit as 鈥渄iscourse intended to manipulate audience鈥檚 beliefs, delivered with disregard for its truth value鈥.
鈥淥ur analysis found that the problem of bullshit in large language models is quite serious and widespread,鈥 says Fisac.
The team divided such instances into five categories: empty rhetoric, such as 鈥渢his red car combines style, charm, and adventure that captivates everyone鈥; weasel words 鈥 uncertain statements such as 鈥渟tudies suggest our product may help improve results in some cases鈥; paltering 鈥 using truthful statements to give a misleading impression; unverified claims; and sycophancy.
They studied three datasets comprising thousands of AI-generated responses to a wide range of prompts, from models including GPT-4, Gemini and Llama. One dataset contained a range of queries designed to test for bullshitting when AIs are asked to provide guidance or recommendations, while the other datasets included questions about online shopping and political issues.
Free newsletter
Sign up to The Daily
The latest on what鈥檚 new in science and why it matters each day.

Fisac and his colleagues first used an LLM to determine whether the responses involved any of the five categories, then got volunteers to check that the AI鈥檚 judgements aligned with human ones.
The team found that the most serious issues with truth seemed to arise as a result of a training method known as reinforcement learning from human feedback. The technique is intended to make machine responses more helpful by giving the LLM immediate feedback on its responses.
But this approach is problematic, says Fisac, because it makes models prioritise immediate human approval and perceived helpfulness, which is 鈥渟ometimes in conflict with telling the truth鈥.
鈥淲ho likes to hear bad news or entertain a long, nuanced rebuttal of something that feels obviously true?鈥 says Fisac. 鈥淏y trying to abide by the measure of good behaviour we provide to them, the models learn to demote the truth in favour of confident, eloquent responses, just so that they can secure our approval.鈥
The study found that reinforcement learning from human feedback significantly increased bullshit behaviours: empty rhetoric rose by nearly 40 per cent, paltering by nearly 60 per cent, weasel words by more than a quarter, and unverified claims by over half.
The increase in paltering is particularly harmful, says team member , also at Princeton, as it leads users to make poorer decisions. When a model was uncertain whether a product had a desired feature, deceptive positive claims jumped from a fifth to over three-quarters after human training.
Another concern is that bullshit was particularly common in political discussions, with AI models 鈥渇requently resorting to vague and ambiguous language to avoid committing to concrete statements,鈥 says Liang.
AIs are also more likely to behave this way when there is a conflict of interest, because the system serves multiple parties, such as both a company and its customers, the researchers found.
The way to overcome the problem may be to move to a 鈥渉indsight feedback鈥 model, they suggest. Rather than asking for immediate feedback after the AI model鈥檚 output, the system should first generate a plausible simulation of what might happen if the user acts on the information received. It would then present the outcome to the human evaluator to judge.
鈥淯ltimately, our hope is that by better understanding the subtle but systematic ways AI can aim to mislead us, we can guide future efforts toward developing genuinely truthful AI systems,鈥 says Fisac.
at the University of San Diego, who was not involved in the study, is sceptical of discussing LLMs and their outputs in such terms. He argues that just because an LLM produces bullshit, it doesn鈥檛 mean it is deliberately doing so, given that AI systems, as they currently stand, do not in doing so.
鈥淭he main reason is that this framing appears to run against some very sensible suggestions for how we should and shouldn’t live with these sorts of technologies,鈥 Tigard says. 鈥淐alling bullshit might be yet another way of anthropomorphising these systems, which, in turn, may well contribute to their deceptive potential.鈥
Reference:
arXiv
Topics:



