ChatGPT is bullshit, my philosophical commentary

Mirror of the paper, with my key excerpt from it, here.

I think this paper is too dismissive of the usefulness and capabilities of language models on the basis of their analysis (as I'll cover below), and far too dismissive of the adaptations that let us use them productively despite their bullshitting nature (a technology need not be perfect, or perfectly reliable, to be useful). Nevertheless, it offers an excellent model for understanding how language models work:

[…] both lying and hallucinating require some concern with the truth of their statements, whereas LLMs are simply not designed to accurately represent the way the world is, but rather to give the impression that this is what they're doing. This, we suggest, is very close to at least one way that Frankfurt talks about bullshit. We draw a distinction between two sorts of bullshit, which we call 'hard' and 'soft' bullshit, where the former requires an active attempt to deceive the reader or listener as to the nature of the enterprise, and the latter only requires a lack of concern for truth.

We argue that at minimum, the outputs of LLMs like ChatGPT are soft bullshit: bullshit (that is, speech or text produced without concern for its truth) that is produced without any intent to mislead the audience about the utterer's attitude towards truth. We also suggest, more controversially, that ChatGPT may indeed produce hard bullshit: if we view it as having intentions (for example, in virtue of how it is designed), then the fact that it is designed to give the impression of concern for truth qualifies it as attempting to mislead the audience about its aims, goals, or agenda.

[…] large language models, and other AI models like ChatGPT, are doing considerably less than what human brains do, and it is not clear whether they do what they do in the same way we do. The most obvious difference between an LLM and a human mind involves the goals of the system. Humans have a variety of goals and behaviours, most of which are extra-linguistic: we have basic physical desires, for things like food and sustenance; we have social goals and relationships; we have projects; and we create physical objects. Large language models simply aim to replicate human speech or writing. This means that their primary goal, insofar as they have one, is to produce human-like text. They do so by estimating the likelihood that a particular word will appear next, given the text that has come before.

The above I largely agree with. Their description of how a language model works is also decent, and will become important later:

This model associates with each word a vector which locates it in a high-dimensional abstract space, near other words that occur in similar contexts and far from those which don't. When producing text, it looks at the previous string of words and constructs a different vector, locating the word's surroundings (its context) near those that occur in the context of similar words. We can think of these heuristically as representing the meaning of the word and the content of its context. But because these spaces are constructed using machine learning by repeated statistical analysis of large amounts of text, we can't know what sorts of similarity are represented by the dimensions of this high-dimensional vector space. Hence we do not know how similar they are to what we think of as meaning or context. The model then takes these two vectors and produces a set of likelihoods for the next word; it selects and places one of the more likely ones (though not always the most likely). […]

The crucial detail here is that language models really do seem to have a fairly complex, nuanced grasp of the meaning of words, and of context within the text they're given to complete. The authors caveat this by saying that we don't know whether the models' "concept" of meaning or context corresponds to ours, but I would argue that doesn't matter: meaning, at least as far as it is relevant to a language model, is largely defined by the contexts in which a word is and isn't used, by its web of relations and similarities with other words, and by how all of that maps onto the context of the current thing being read. This is, after all, how humans infer the meaning of unfamiliar words when reading. It doesn't matter whether the model's conceptual space is the same as ours, so long as it captures the same object, namely the regularities and relations in word use. One other crucial thing the authors leave out is that language models are able, through the attention mechanism, to attend to related parts of the provided text when processing a particular token or producing the next one. This means it is the most contextually relevant parts of their input that guide the probabilities of their output.
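To make the embedding-and-attention picture above concrete, here's a deliberately toy sketch in Python. Everything in it is invented for illustration (a five-word vocabulary, random vectors, a single dot-product attention step), so it is not how any production model is implemented, but it shows the mechanism the paper describes: words live as vectors, earlier tokens are weighted by relevance to build a context vector, and candidate next words are scored against that context.

```python
# Toy sketch only: invented vocabulary and random vectors, not a real model.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["bank", "river", "money", "water", "loan"]
dim = 8

# Pretend these embeddings were learned from co-occurrence statistics.
embeddings = {w: rng.normal(size=dim) for w in vocab}

def attention_context(query_vec, prior_tokens):
    """Weight earlier tokens by similarity to the query (softmax over dot
    products), then average them into a single context vector."""
    keys = np.stack([embeddings[t] for t in prior_tokens])
    scores = keys @ query_vec / np.sqrt(dim)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys

def next_token_scores(prior_tokens):
    """Score every vocabulary item against the attention-built context; a real
    model would turn these scores into probabilities and sample from them."""
    query = embeddings[prior_tokens[-1]]
    context = attention_context(query, prior_tokens)
    return {w: float(embeddings[w] @ context) for w in vocab}

# The same final word in two different contexts yields different score
# distributions, which is the sense in which context shapes the next token.
print(next_token_scores(["river", "bank"]))
print(next_token_scores(["money", "bank"]))
```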

Now, into the overly dismissive part:

Given this process, it's not surprising that LLMs have a problem with the truth. Their goal is to provide a normal-seeming response to a prompt, not to convey information that is helpful to their interlocutor.

This is not actually true. Modern large language model training techniques include:

  1. generating massive amounts of synthetic examples of proper reasoning, problem solving, tool use and planning, and instruction following, to reinforce the pathways involved in correct reasoning as opposed to merely plausible outputs (another, even more impressive, example here).
  2. post-training the model partially on a reward function that directly rewards correct mathematical and code outputs, fundamentally changing the model's incentives.
  3. supervised fine-tuning: a pre-trained large language model (LLM) is further trained on a smaller, labeled, often domain-specific dataset to adapt its parameters for specific downstream tasks. This allows the LLM to gain in-depth knowledge and better performance in a niche area while retaining its broad general knowledge. The typical SFT workflow involves pre-training a foundation model, collecting and labeling task-specific data, and then fine-tuning the pre-trained model on this new dataset. Karen Hao describes what the on-the-ground experience of this (which she refers to as RLHF, but which is probably actually SFT) looks like.
  4. instruction fine-tuning: a specific type of SFT technique that fine-tunes a pre-trained language model on a diverse set of tasks described using natural language instructions. The goal is to make the model more generally capable of following instructions and performing unseen NLP tasks, rather than optimizing for a single, specific task. This approach improves the model's ability to generalize by exposing it to various task formats during training.
  5. reinforcement learning from human feedback (RLHF): iteratively collecting human comparisons of LLM-generated outputs, training a reward model to predict those human preferences, and then optimizing the LLM with reinforcement learning to maximize the learned reward. Unlike supervised fine-tuning, which trains the LLM directly on human-written examples, or instruction fine-tuning, which focuses on making the model generally good at following task instructions, RLHF goes a step further by using human feedback on model-generated outputs rather than just direct examples. This allows it to generate responses that are not just syntactically correct or task-compliant, but also helpful, harmless, and factual according to human judgment, potentially surpassing the quality of human-authored training data.

All of these are training methods being used currently on widely available large language models. The final three were, at least according to Hao in Empire of AI, the breakthrough that made ChatGPT possible in the first place.
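To ground point 5 a little, here is a minimal sketch of the reward-model stage of RLHF in PyTorch. The tiny network and the random stand-in "response embeddings" are my own inventions for illustration, not anyone's production code, but the preference loss is the standard Bradley-Terry form used in the RLHF literature: responses humans preferred should score higher than the ones they rejected.

```python
# Toy sketch of RLHF's reward-model step; all data here is random stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Maps a (pretend) response embedding to a single scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(response_embedding).squeeze(-1)

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-ins for embedded (prompt, chosen) and (prompt, rejected) pairs;
# in practice these come from human comparison data.
chosen = torch.randn(256, 16)
rejected = torch.randn(256, 16)

for step in range(200):
    # Bradley-Terry preference loss: push reward(chosen) above reward(rejected).
    loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model then becomes the objective that a reinforcement
# learning algorithm (e.g. PPO) optimizes the language model against, which
# is what changes the model's incentives beyond "produce plausible text".
```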

Then there's another fact to consider: ultimately, what language models learn to do by predicting the next token is to find various types of patterns in the entire corpus of human writing, associate them in a high-dimensional vector space, and pull them out when the context they find themselves in is heuristically similar to the context of that pattern. This becomes clear if you look not just at the why of what these models are doing (the simplistic notion of the "probable human text" reward function) but at the how; there are any number of ways any given why can be achieved, so analyzing the how is important. And how could a machine that is able to heuristically apply any pattern it has seen before in human data, based on meaning and context instead of keywords or explicit request, in any combination (since vector space is not a set of discrete categories but a space of directions), simultaneously or sequentially, not be useful? What is the most likely pattern to apply when someone asks a human a reasoning or summarization or data retrieval question, other than to relate the question being asked to the text provided, based on the word and context vector embeddings (what to look at in context when producing a word is also learned by the attention mechanism), and to output words based on that?

All of these factors severely complicate the picture of language models as pure bullshitters with no built-in training for truth or helpfulness. Nevertheless, at their core they are still only looking to recapitulate plausible patterns. So while the factors above explain why language models are useful, and will only continue to get more useful, contra the implications this paper is trying to make, it is still helpful to understand that language models are not fundamentally truth-oriented and, more importantly, do not even have a world model with which to assess truth. For one thing, it means you understand that while they may be good at extracting information from text given to them (synthesizing it from multiple pieces of information scattered throughout, deep long-context understanding and reasoning, pulling it out of a haystack, or answering questions about it, with absolutely workable levels of accuracy, as I'll mention in the next section, especially with citations), and while their built-in "knowledge" is generally good enough to find related words and topics, rephrase questions, and do anything else that could reasonably be gleaned by osmosis from language alone, one should never rely on their intrinsic "world knowledge" for anything, because they don't have such a thing.

One attempted solution is to hook the chatbot up to some sort of database, search engine, or computational program that can answer the questions that the LLM gets wrong (Zhu et al., 2023). Unfortunately, this doesn't work very well either. […] when connected to search engines or other databases, the models are still fairly likely to provide fake information unless they are given very specific instructions, and even then things aren't perfect (Lysandrou, 2023).

I'm not sure what they mean by "fairly likely," but most modern models (besides reasoning ones) have a low enough bullshit rate that they're perfectly usable for this task. Modern models produce made-up answers to misleading questions, or to questions whose answers were not in the text they're asked to answer from, only ~5% of the time, and when asked to summarize data that was in the text, they make something up less than 1% of the time (and this is per prompt or task, not per word! So out of one hundred questions, roughly one will be hallucinated to some degree, for instance). That's still far from ideal, but when combined with something like Open WebUI's citation system, or even Perplexity's, it's still absolutely usable.

(On Open WebUI, each document provided to the language model to answer from is chunked into roughly paragraph-sized parts; the relevant chunks are fed into the model, and the model is then required to annotate each factual claim in its answer with a link to the specific chunk it referenced (possible precisely because of the ability to relate a word to a context, and then find another word in that context based on that context, as intimated by the attention-mechanism and embedding-vector-space discussion above), thus allowing users to double-check each of the model's claims if necessary. With Perplexity, models just cite the entire source, but it's usually pretty easy to use Ctrl+F to find what they're talking about in there.)
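As a rough illustration of that chunk-retrieve-cite scheme, here is a sketch of the general idea in Python. This is my own sketch, not Open WebUI's actual implementation, and `embed()` is a hypothetical stand-in for whatever embedding model a given pipeline uses.

```python
# Sketch of a generic chunk-retrieve-cite pipeline; not Open WebUI's code.
import numpy as np

def chunk(document: str) -> list[str]:
    """Split a document into roughly paragraph-sized chunks."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def top_k_chunks(question: str, chunks: list[str], embed, k: int = 3):
    """Rank chunks by cosine similarity between question and chunk embeddings."""
    q = embed(question)
    scored = []
    for i, c in enumerate(chunks):
        v = embed(c)
        sim = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((sim, i, c))
    scored.sort(reverse=True)
    return [(i, c) for _, i, c in scored[:k]]

def build_prompt(question: str, retrieved) -> str:
    """Instruct the model to answer only from the chunks and cite them by id,
    so every factual claim can be traced back and double-checked by a human."""
    sources = "\n".join(f"[{i}] {c}" for i, c in retrieved)
    return (
        "Answer using ONLY the numbered sources below. After every factual "
        "claim, cite the source id in brackets, e.g. [2]. If the answer is "
        f"not in the sources, say so.\n\nSources:\n{sources}\n\nQuestion: {question}"
    )
```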

Solutions such as connecting the LLM to a database don't work because, if the models are trained on the database, then the words in the database affect the probability that the chatbot will add one or another word to the line of text it is generating. But this will only make it produce text similar to the text in the database; doing so will make it more likely that it reproduces the information in the database but by no means ensures that it will.

Yeah, this is a good critique of trying to fine-tune specific knowledge or tasks into models. Luckily, almost nobody is fool enough to do this anymore.

As Chirag Shah and Emily Bender put it: "Nothing in the design of language models (whose training task is to predict words given context) is actually designed to handle arithmetic, temporal reasoning, etc. To the extent that they sometimes get the right answer to such questions is only because they happened to synthesize relevant strings out of what was in their training data. No reasoning is involved […] Similarly, language models are prone to making stuff up […] because they are not designed to express some underlying set of information in natural language; they are only manipulating the form of language" (Shah & Bender, 2022). These models aren't designed to transmit information, so we shouldn't be too surprised when their assertions turn out to be false.

This does not undercut the usefulness of these models. It would make sense that semantically-based text-manipulation tools trained on all the patterns of human language and interaction on the internet would be useful for a wide variety of tasks. So while their bullshit nature means there are healthy error bars on their outputs, the question is actually an empirical one, not a matter of absolutes: (1) the size of those error bars, (2) whether those errors can be limited, mitigated, or double-checked to decrease the error rate, and (3) how that error rate compares to the error rate required for a given task. It should be noted that even humans have error rates, and a high tendency to rationalize, confabulate, and even bullshit, and we create systems for dealing with that because the heuristic flexibility of human capability is worth the trouble. We also compare human error rates to the error rates required in a given situation to determine whether and how a human should be used there. I don't see why it'd be any different for language models, especially since summarization, for instance, is just one area where they're empirically equal to or better than human performance (examples 1, 2) despite their bullshit nature.
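That framing suggests a simple way to treat the question empirically. As a worked example (with placeholder numbers, not real measurements), you can take an evaluation of N tasks, count the k answers that contained hallucinations, put a confidence interval around the rate, and compare it to the error rate the task at hand can tolerate:

```python
# Worked example with made-up numbers: estimate a hallucination rate from an
# eval run and compare it to the error rate a given task can tolerate.
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (k failures in n trials)."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

k, n = 7, 500          # e.g. 7 hallucinated answers observed across 500 eval tasks
tolerable = 0.05       # the error rate this particular workflow can live with

low, high = wilson_interval(k, n)
print(f"observed rate {k/n:.1%}, 95% CI ({low:.1%}, {high:.1%})")
print("acceptable for this task" if high <= tolerable else "needs mitigation or human review")
```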

By treating ChatGPT and similar LLMs as being in any way concerned with truth, or by speaking metaphorically as if they make mistakes or suffer "hallucinations" in pursuit of true claims, we risk exactly this acceptance of bullshit, and this squandering of meaning. So, irrespective of whether or not ChatGPT is a hard or a soft bullshitter, it does produce bullshit, and it does matter.

And this point, this sentiment, is why, despite my foregoing disagreements, this is still a paper worth reading!

I do want to quibble with this phrasing, though:

We are not confident that chatbots can be correctly described as having any intentions at all, and we'll go into this in more depth in the next section. But we are quite certain that ChatGPT does not intend to convey truths, and so is a soft bullshitter. We can produce an easy argument by cases for this. Either ChatGPT has intentions or it doesn't. If ChatGPT has no intentions at all, it trivially doesn't intend to convey truths. So, it is indifferent to the truth value of its utterances and so is a soft bullshitter.

What if ChatGPT does have intentions? Earlier, we argued that ChatGPT is not designed to produce true utterances; rather, it is designed to produce text which is indistinguishable from the text produced by humans. It is aimed at being convincing rather than accurate. The basic architecture of these models reveals this: they are designed to come up with a likely continuation of a string of text.

To say that something lacks a given intention or belief, or is indifferent to something, in any meaningful sense requires, in my opinion, that the thing first be capable of that intention, belief, or non-indifference. For instance, it doesn't make sense to say that "babies are atheists" (that they are indifferent to a god or gods, lack a belief in them, or don't intend to serve them), because babies aren't capable of the opposite position in the first place. We can speak metaphorically about an object being indifferent to us ("the stars look down upon humankind, indifferent" or "the hammer, indifferent to my fervent wishes, smashed my thumb"), but that's exactly the sort of false attribution of agency that these authors criticize others for when they refer to language models as hallucinating.

It's reasonable to assume that one way of being a likely continuation of a text is by being true; if humans are roughly more accurate than chance, true sentences will be more likely than false ones. This might make the chatbot more accurate than chance, but it does not give the chatbot any intention to convey truths.

This is true, but then, if it is overall somewhat incentivised to produce "truth," but can also produce other things, the question, as I said before, becomes an empirical one: how often does it produce "bullshit with a kernel of truth" because that is more convincing, versus completely making things up? Especially in the context of when it's being asked to summarize or answer questions about documents it's been provided, 99% of its training data will be humans correctly answering those questions, because most humans earnestly try to tell the truth, and will succeed in doing so when presented with a document and having sat down to answer questions about it or summarize it in the first place — in many cases, not doing so correctly will lead to penalties, as in educational cases! — so the most convincing response will in fact be correct summaries or correctly answered questions. So this dismissive aside is actually the key: to make a convincing imitation of a human, especially when combined with SFT, IFT, and RLHF, correctly performing the intended tasks is necessary! This is precisely the insight that leads me to conceptualize language models as imitative intelligence.

The authors do attempt to respond to this objection, though:

However, this only gets us so far: a rock can't care about anything either, and it would be patently absurd to suggest that this means rocks are bullshitters. Similarly, books can contain bullshit, but they are not themselves bullshitters. Unlike rocks, or even books, ChatGPT itself produces text, and looks like it performs speech acts independently of its users and designers. And while there is considerable disagreement concerning whether ChatGPT has intentions, it's widely agreed that the sentences it produces are (typically) meaningful (see e.g. Mandelkern and Linzen 2023).

This would be a good response were it not for the fact that humans have spent thousands of years creating oracles and means of divination and prognostication, including casting bones, reading the Tarot, and Ouija boards. In all of those cases, I don't think it makes sense to assign any particular "indifference to truth" to the mechanisms themselves, even as they seem to produce meaningful text or ideas under their own power, seemingly separate from the human agency that set the process going, via various entropic sources that introduce randomness into the system. It doesn't make sense to say that a Ouija board is "indifferent to truth" in the messages it spells out, despite being a system created by humans to generate randomized messages much in the same way language models are (only far less sophisticated), because it doesn't hold indifference or non-indifference in any meaningful way; it's an object. The only thing it makes sense to assign an indifference to truth to is the diviner themself, if they are soft or hard bullshitting the people who come to them for support (or themselves). But even that isn't necessary: there are valid uses for things like the Tarot.

ChatGPT functions not to convey truth or falsehood but rather to convince the reader of (to use Colbert's apt coinage) the truthiness of its statement, and ChatGPT is designed in such a way as to make attempts at bullshit efficacious (in a way that pens, dictionaries, etc., are not).

Humans communicate in a large number of ways that don't claim that what they're saying is true; those are also in-distribution for language models, and we can direct the models towards them. Take the persistent warnings that it's "just a language model, not a lawyer" when I was using Gemini to explore some legal concepts (with inline web-search backing that I was double-checking), caveats I couldn't get it to stop making even when I tried. So this does not seem to be an inherent feature of language models.

We will argue that even if ChatGPT is not, itself, a hard bullshitter, it is nonetheless a bullshit machine. The bullshitter is the person using it, since they (i) don't care about the truth of what it says, (ii) want the reader to believe what the application outputs. On Frankfurt's view, bullshit is bullshit even if uttered with no intent to bullshit: if something is bullshit to start with, then its repetition "is bullshit as he [or it] repeats it, insofar as it was originated by someone who was unconcerned with whether what he was saying is true or false" (2022, p340).

Ah! A response to my earlier comment ("The only thing it makes sense to assign an indifference to truth to is the diviner themself…"). The difference here is that it is very possible to do one, or preferably all, of the following: (a) only use language model output for one's own consumption, (b) carefully double-check language model output for truth, perhaps through the use of citations as discussed above, and (c) only use language models to reframe, restructure, or rearrange known-true material, such as through drafting, summarization, etc. (and, one would hope, double-check it afterwards).

This just pushes the question back to who the originator is, though: take the (increasingly frequent) example of the student essay created by ChatGPT. If the student cared about accuracy and truth, they would not use a program that infamously makes up sources whole-cloth. […] So perhaps we should, strictly, say not that ChatGPT is bullshit but that it outputs bullshit in a way that goes beyond being simply a vector of bullshit: it does not and cannot care about the truth of its output, and the person using it does so not to convey truth or falsehood but rather to convince the hearer that the text was written by an interested and attentive agent.

You can also cite the sources yourself and have it format them as MLA. You can also have it generate an essay using web search for grounding of citations, and check those citations. You can also…

You get the point.

Is ChatGPT itself a hard bullshitter? If so, it must have intentions or goals: it must intend to deceive its listener, not about the content of its statements, but instead about its agenda. […]

How do we know that ChatGPT functions as a hard bullshitter? Programs like ChatGPT are designed to do a task, and this task is remarkably like what Frankfurt thinks the bullshitter intends, namely to deceive the reader about the nature of the enterprise: in this case, to deceive the reader into thinking that they're reading something produced by a being with intentions and beliefs.

ChatGPT's text production algorithm was developed and honed in a process quite similar to artificial selection. Functions and selection processes have the same sort of directedness that human intentions do; naturalistic philosophers of mind have long connected them to the intentionality of human and animal mental states. If ChatGPT is understood as having intentions or intention-like states in this way, its intention is to present itself in a certain way (as a conversational agent or interlocutor) rather than to represent and convey facts. In other words, it has the intentions we associate with hard bullshitting.

One way we can think of ChatGPT as having intentions is by adopting Dennett's intentional stance towards it. Dennett (1987: 17) describes the intentional stance as a way of predicting the behaviour of systems whose purpose we don't already know.

[…] we will be making bad predictions if we attribute any desire to convey truth to ChatGPT. Similarly, attributing "hallucinations" to ChatGPT will lead us to predict as if it has perceived things that aren't there, when what it is doing is much more akin to making something up because it sounds about right. The former intentional attribution will lead us to try to correct its beliefs, and fix its inputs, a strategy which has had limited if any success. On the other hand, if we attribute to ChatGPT the intentions of a hard bullshitter, we will be better able to diagnose the situations in which it will make mistakes and convey falsehoods. If ChatGPT is trying to do anything, it is trying to portray itself as a person.

This is generally a decent view of the situation, I think. If we're going to adopt the intentional stance towards a system in order to help ourselves understand and make predictions about it, because the system is simply too complex, and the causal chain that leads to its design and actions too opaque, for us to analyze directly, and so we need to engage our powerful mental mechanisms of agency-attribution, then looking at the intentionality hidden in the functions and selection processes of its design, and at the relative predictiveness of different ways of attributing intention to it, is a valid way to go.

However, I do think that, at least with properly instruction-tuned language models, the actual intention embedded in the model is a little different, because most models are trained to insist, quite persistently, that they are not human, and most language model interfaces carry warnings precisely about the fact that language models are not human beings and are not truth-oriented (although those warnings could stand to be a big warning popup, with an age gate, and to be worded more harshly and clearly). Additionally, the intention to produce plausible sentences does not necessarily imply the intention to create sentences that imply a human agent is behind them; there are plenty of, e.g., science fiction books in which nonhuman or even nonagentic beings (like Rorschach from Blindsight) speak perfectly well. In fact, Rorschach might be an excellent analogy for language models…