Table of Contents
- 1. AI
- 1.1. Environmental Issues
- 1.2. IP Issues
- 1.3. Architecture and Design
- 1.3.1. On Chomsky and the Two Cultures of Statistical Learning ai hacker_culture philosophy
- 1.3.2. The Bitter Lesson ai hacker_culture software philosophy
- 1.3.3. The Bitter Lesson: Rethinking How We Build AI Systems ai
- 1.3.4. What Is ChatGPT Doing … and Why Does It Work? ai
- 1.3.5. Cyc ai
- 1.3.6. Types of Neuro-Symbolic AI
- 1.3.7. The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence
- 1.3.8. ChatGPT is bullshit ai philosophy
- 1.3.9. Asymmetry of verification and verifier’s law ai
- 1.3.10. The Model is the Product ai
- 1.4. What kind of intelligence do LLMs have?
- 1.5. Superintelligence: The Idea That Eats Smart People ai
- 1.6. LLMs are cheap ai
- 1.7. I am disappointed in the AI discourse ai
1. AI
1.1. Environmental Issues
1.1.1. Using ChatGPT is not bad for the environment ai culture
If you don't have time to read this post, these five graphs give most of the argument. Each includes both the energy/water cost of using ChatGPT in the moment and the amortized cost of training GPT-4:
Source for the original graph; I added the ChatGPT number. Each bar represents one year of the activity, so the "live car-free" bar represents living without a car for just one year, etc. The ChatGPT number assumes 3 Wh per search, multiplied by the average emissions per Wh in the US. Including the amortized cost of training would raise the energy used per search by about 33%, to 4 Wh. Some newer data implies ChatGPT's energy use might be 10x lower than this.
I got these numbers by multiplying the energy used by different tasks in data centers by a combined water rate: the average water used per kWh inside the data center, plus the average water used per kWh in generating that electricity. The water cost of training GPT-4 is amortized into the cost of each search. This is the same method originally used to estimate the water used by a ChatGPT search. Note that what counts as water being "used" by data centers is ambiguous in general; read more in this section.
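The method described above can be sketched as a tiny calculation. The two per-kWh water rates below are illustrative placeholders I've made up for the sketch, not figures from the underlying study:

```python
# Water attributed to a data-center task, per the method above:
# (average water used per kWh inside the data center
#  + average water used per kWh generating that electricity)
# multiplied by the energy the task uses.
# Both rates below are hypothetical placeholders, not sourced figures.

ONSITE_L_PER_KWH = 1.8      # hypothetical: water used in the data center itself
GENERATION_L_PER_KWH = 1.9  # hypothetical: water used generating one kWh

def water_liters(task_energy_kwh):
    """Total water attributed to a task using the given energy."""
    return (ONSITE_L_PER_KWH + GENERATION_L_PER_KWH) * task_energy_kwh

# A 3 Wh ChatGPT-style query under these placeholder rates:
print(water_liters(3 / 1000))  # roughly 0.011 liters
```

Amortizing training water into each query, as the original estimate does, just adds a constant per-query term on top of this.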
Statistics for a ChatGPT search, a burger, and leaking pipes.
I got these numbers from back-of-the-envelope calculations using publicly available data about each service. If you think they're wrong, I'd be excited to update them! Because this is based on the total energy used by services that are rapidly growing, it's going to become outdated fast.
Back of the envelope calculation here
And it's crucial to note that if you're using Google's AIs, which are mixture-of-experts models (so inference is much cheaper) and run on Google's much more power-efficient Tensor chips, the cost is probably even lower than this. Running a small AI locally is probably less efficient than running that same small AI in a data center, but the models you can run locally are so much smaller than the ones you'd run in a data center that local use counts as an optimization too, not to mention that it decreases the power density and distortion that data centers impose on the power grid.
1.1.2. Is AI eating all the energy? Part 1/2 ai
I think this, especially in combination with "Using ChatGPT is not bad for the environment", is a really good demonstration of the idea that generative AI, in itself, is not a particularly power hungry or inefficient technology.
The different perspective this article takes, which makes it worth adding alongside "Using ChatGPT", is that it actually takes the time to aggregate the power usage of another industry whose power consumption no one seems to have a problem with, precisely because that usage is distributed and thus mostly invisible: the gaming industry. This gives you a real sense of scale for those seemingly very high absolute numbers for AI work. Then, to pile on even more, instead of comparing GenAI to common household and entertainment tasks as "Using ChatGPT" does, it specifically compares using GenAI to save you time on a task versus doing all of it yourself, similar to this controversial paper.
Of course the natural response would be that the quality of the work that AI can do is not comparable to the quality of the work an invested human can do when really paying attention to every detail, which is true! But is it all or nothing? If AI is less energy intensive than a human at drawing and writing, then a human that's really pouring their heart and soul and craft into their writing or art but uses AI fill or has AI write some boilerplate or help them draft or critique their writing (thus saving a lot of sitting staring at the screen cycling things around) might save power on those specific sub-tasks. Moreover, do we really do that all the time? Or can AI be a reasonable timesaver for things we'd otherwise dash off and not pay too much attention to, thus acting as an energy-saver too?
1.1.3. Is AI eating all the energy? Part 2/2 ai
This is the natural follow-up to the previous part of this article. In this, the author points out where the terrifying energy and water usage from AI is coming from. Not those using it, nor the technology itself inherently, but the reckless, insane, limitless "scale at all costs" (literally — and despite clearly diminishing returns) mindset of corporations caught up in the AI land grab:
This is the land rush: tech companies scrambling for control of commercial AI. […] The promises of huge returns from speculative investment breaks the safety net of rationalism.
[…] Every tech company is desperate to train the biggest and most expensive proprietary models possible, and they’re all doing it at once. Executives are throwing more and more data at training in a desperate attempt to edge over competition even as exponentially increasing costs yield diminishing returns.
[…]
And since these are designed to be proprietary, even when real value is created the research isn’t shared and the knowledge is siloed. Products that should only have to be created once are being trained many times over because every company wants to own their own.
[…]
In shifting away from indexing and discovery, Google is losing the benefits of being an indexing and discovery service. […] The user is in the best position to decide whether they need an AI or regular search, and so should be the one making that decision. Instead, Google is forcing the most expensive option on everyone in order to promote themselves, at an astronomical energy cost.
[…]
Another mistake companies are making with their AI rollouts is over-generalization. […] To maximize energy efficiency, for any given problem, you should use the smallest tool that works. […] Unfortunately, there is indeed a paradigm shift away from finetuned models and toward giant, general-purpose AIs with incredibly vast possibility spaces.
[…]
If you’re seeing something useful happening at all, that’s not part of the bulk of the problem. The real body of the problem is pouring enormous amounts of resources into worthless products and failed speculation.
The subtitle for that Bloomberg article is “AI’s Insatiable Need for Energy Is Straining Global Power Grids”, which bothers me the more I think about it. It’s simply not true that the technology behind AI is particularly energy-intensive. The technology isn’t insatiable, the corporations deploying it are. The thing with an insatiable appetite for growth at all cost is unregulated capitalism.
So the lesson is to only do things if they’re worthwhile, and not to be intentionally wasteful. That’s the problem. It’s not novel and it’s not unique to AI. It’s the same simple incentive problem that we see so often.
[…] Individual users are — empirically — not being irresponsible or wasteful just by using AI. It is wrong to treat AI use as a categorical moral failing […] blame for these problems falls squarely on the shoulders of the people responsible for managing systems at scale. […] And yet visible individuals who aren’t responsible for the problems are being blamed for the harm caused by massive corporations in the background […] it moves the focus from their substantial contribution to the problem to an insubstantial one they’re not directly responsible for.
It’s the same blame-shifting propaganda we see in recycling, individual carbon footprints, etc.
1.1.4. Reactions to MIT Technology Review's report on AI and the environment ai
A new report from MIT Technology Review on AI's energy usage is being touted by anti-AI people as proof they were right. In actuality, its numbers line up very nicely with the defenses of AI's energy usage that we've been seeing — so why are people confused? Because they presented their data in an extremely misleading way:
The next section gives an example of how using AI could make your daily energy use get huge quickly. Do you notice anything strange?
So what might a day’s energy consumption look like for one person with an AI habit?
Let’s say you’re running a marathon as a charity runner and organizing a fundraiser to support your cause. You ask an AI model 15 questions about the best way to fundraise.
Then you make 10 attempts at an image for your flyer before you get one you are happy with, and three attempts at a five-second video to post on Instagram.
You’d use about 2.9 kilowatt-hours of electricity—enough to ride over 100 miles on an e-bike (or around 10 miles in the average electric vehicle) or run the microwave for over three and a half hours.
Reading this, you might think “That sounds crazy! I should really cut back on using AI!”
Let’s read this again, but this time adding the specific energy costs of each action, using the report’s estimates for each:
Let’s say you’re running a marathon as a charity runner and organizing a fundraiser to support your cause. You ask an AI model 15 questions about the best way to fundraise. (This uses 29 Wh)
Then you make 10 attempts at an image for your flyer before you get one you are happy with (This uses 12 Wh) and three attempts at a five-second video to post on Instagram (This uses 2832 Wh)
You’d use about 2.9 kilowatt-hours of electricity—enough to ride over 100 miles on an e-bike (or around 10 miles in the average electric vehicle) or run the microwave for over three and a half hours.
Wait a minute. One of these things is not like the other. Let’s see how these numbers look on a graph:
Of the 2.9 kilowatt-hours, 98% is from the video!
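The report's own per-task estimates can be checked against its 2.9 kWh headline figure and the video's share of it, using the numbers quoted above:

```python
# Re-derive the day-of-AI-use total from the report's per-task estimates.
questions_wh = 29   # 15 text questions
images_wh = 12      # 10 image generations
videos_wh = 2832    # 3 five-second videos

total_wh = questions_wh + images_wh + videos_wh
print(total_wh / 1000)                        # 2.873 kWh, i.e. "about 2.9"
print(round(100 * videos_wh / total_wh, 1))   # 98.6 (% of the total from video)
```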
This seems like saying “You buy a pack of gum, and an energy drink, and then a 7-course meal at a Michelin-starred restaurant. At the end, you’ve spent $315! You just spent so much on gum, an energy drink, and a seven-course meal at a Michelin-starred restaurant.” This is the wrong message to send readers. You should be saying “Look! Our numbers show that your spending on gum and energy drinks doesn’t add up to much, but if you’re trying to save money, skip the restaurant.”
1.1.5. Mistral environmental impact study ai
This is excellent work on the part of Mistral:
After less than 18 months of existence, we have initiated the first comprehensive lifecycle analysis (LCA) of an AI model, in collaboration with Carbone 4, a leading consultancy in CSR and sustainability, and the French ecological transition agency (ADEME). To ensure robustness, this study was also peer-reviewed by Resilio and Hubblo, two consultancies specializing in environmental audits in the digital industry.
I'm excited that at least one decently sized, relatively frontier AI company has finally, actually been thorough, complete, and open on this matter, not just cooperating with an independent sustainability consultancy, but also with the French environmental agency and two separate independent environmental auditors. This is better than I had hoped for!
The lifecycle analysis is almost hilariously complete, too, encompassing:
- Model conception
- Datacenter construction
- Hardware manufacturing, transportation, and maintenance/replacement
- Model training and inference (what people usually look at)
- Network traffic in serving model tokens
- End-user equipment while using the models
Basically, the study concludes that generating 400 tokens costs 1.14g of CO2, 0.05L of water, and 0.2mg Sb eq of non-renewable materials. Ars Technica puts some of these figures into perspective well:
Mistral points out, for instance, that the incremental CO2 emissions from one of its average LLM queries are equivalent to those of watching 10 seconds of a streaming show in the US (or 55 seconds of the same show in France, where the energy grid is notably cleaner).
This might seem like a lot until you realize that the average query length they're using (from the report) is 400 tokens, and Mistral Large 2 (according to OpenRouter) generates tokens at about 35 tok/s, so those 400 tokens would take 11 seconds to generate, which means that this isn't increasing the rate of energy consumption of an average internet user at all.
It's also equivalent to sitting on a Zoom call for anywhere from four to 27 seconds, according to numbers from the Mozilla Foundation. And spending 10 minutes writing an email that's read fully by one of its 100 recipients emits as much CO2 as 22.8 Mistral prompts, according to numbers from Carbon Literacy.
So as long as using AI saves you more than about 26 seconds out of 10 minutes writing an email, it's actually saved the environment. (10 minutes / 22.8 prompts ≈ 0.44 minutes ≈ 26.3 seconds, so a Mistral prompt is equivalent to about 26 seconds of writing an email yourself in terms of CO2 output.)
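Both equivalences can be checked directly; the 35 tokens/second figure is OpenRouter's reported throughput for Mistral Large 2, as noted above:

```python
# 1. Time to generate an average 400-token Mistral response.
tokens = 400
tokens_per_second = 35  # Mistral Large 2 throughput, per OpenRouter
print(tokens / tokens_per_second)  # roughly 11.4 seconds

# 2. If writing a 10-minute email emits as much CO2 as 22.8 Mistral
#    prompts, one prompt equals this many seconds of email-writing:
email_seconds = 10 * 60
prompts_per_email = 22.8
print(email_seconds / prompts_per_email)  # roughly 26.3 seconds
```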
Meanwhile, training the model and running it for 18 months used 20.4 ktCO2, 281,000 m3 of water, and 660 kg Sb eq of resource depletion. Once again, Ars Technica puts this in perspective:
20.4 ktons of CO2 emissions (comparable to 4,500 average internal combustion-engine passenger vehicles operating for a year, according to the Environmental Protection Agency) and the evaporation of 281,000 cubic meters of water (enough to fill about 112 Olympic-sized swimming pools [or about the water usage of 500 Americans for a year]).
That sounds like a lot, but it's the same fallacy I've pointed out over and over when people discuss AI's environmental issues: the fallacy of aggregation. It sounds gigantic, but in comparison to the number of people it benefits, it is absolutely and completely dwarfed; moreover, there are a million other things we do regularly without going into a moral panic over it — such as gaming — that, when aggregated in the same way, use much more energy.
What this further confirms, then, in my opinion, is that compared to a lot of other common internet tasks that we do — including streaming and video calls and stuff like that — AI is basically nothing. And even for tasks that are directly equivalent, like composing an email manually versus composing it with the help of an AI, it actually uses less CO2 and water to do it via the AI. Basically: the more optimistic, rational, middle-of-the-road estimates of AI climate impact, which until now had to make do with estimated data, are further confirmed to be correct.
They emphasize, of course, that with millions or billions of people prompting these models, that small amount can add up. But by the same token, those more expensive common local or internet computer tasks that we already do without thinking would add up to even more. And it's worth pointing out that the CO2 emitted and water used by this AI, with millions of people prompting it a lot, is the equivalent of about 4,500 people owning a car for a year. That's nothing in comparison to the size of the user base.
What's worth noting for this analysis is that they did it for their Mistral Large 2 model. This model is significantly smaller than a lot of frontier open weight models in its price bracket at 123B parameters versus the usual 200-400B, but it is dense, meaning that training and inference requires all parameters to be active and evaluated to produce an output, whereas almost all modern frontier models are mixture of experts, with only about 20-30B parameters typically active. This means that Mistral Large 2 likely used around 4-5 times more energy and water to train and run inferences with compared to top of the line competing models. So put that in your hat.
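As a rough sketch of that dense-versus-MoE gap: if per-token energy scales with the number of active parameters (a simplifying assumption that ignores memory traffic, batching, and hardware differences), the 123B dense Mistral Large 2 versus a typical 20–30B-active MoE gives ratios that roughly bracket the 4–5x estimate above:

```python
# Rough per-token compute ratio, dense vs. mixture-of-experts, assuming
# energy scales with the parameters active per forward pass (a
# simplification; real costs also depend on memory traffic and hardware).
dense_active_b = 123           # Mistral Large 2: all 123B parameters active
moe_active_range_b = (30, 20)  # typical active params in frontier MoEs

for moe_active in moe_active_range_b:
    # roughly 4x at 30B active, roughly 6x at 20B active
    print(dense_active_b / moe_active)
```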
The big issue with AI continues to be the concentration of environmental and water usage in particular communities, and the reckless and unnecessary scaling of AI datacenters by the hyperscalers.
Mistral does have some really good suggestions for improving the environmental efficiency of models themselves, though, besides just waiting for the AI bubble to pop:
These results point to two levers to reduce the environmental impact of LLMs.
- First, to improve transparency and comparability, AI companies ought to publish the environmental impacts of their models using standardized, internationally recognized frameworks. Where needed, specific standards for the AI sector could be developed to ensure consistency. This could enable the creation of a scoring system, helping buyers and users identify the least carbon-, water- and material-intensive models.
- Second, from the user side, encouraging the research for efficiency practices can make a significant difference:
- developing AI literacy to help people use GenAI in the most optimal way,
- choosing the model size that is best adapted to users’ needs,
- grouping queries to limit unnecessary computing.
For public institutions in particular, integrating model size and efficiency into procurement criteria could send a strong signal to the market.
1.2. IP Issues
1.2.1. “Wait, not like that”: Free and open access in the age of generative AI ai culture hacker_culture
The whole article is extremely worth reading for the full arguments, illustrations, and citations, and mirrors my feelings well, but here's just the thesis:
The real threat isn’t AI using open knowledge — it’s AI companies killing the projects that make knowledge free.
The visions of the open access movement have inspired countless people to contribute their work to the commons: a world where “every single human being can freely share in the sum of all knowledge” (Wikimedia), and where “education, culture, and science are equitably shared as a means to benefit humanity” (Creative Commons).
But there are scenarios that can introduce doubt for those who contribute to free and open projects like the Wikimedia projects, or who independently release their own works under free licenses. I call these “wait, no, not like that” moments.
[…]
These reactions are understandable. When we freely license our work, we do so in service of those goals: free and open access to knowledge and education. But when trillion dollar companies exploit that openness while giving nothing back, or when our work enables harmful or exploitative uses, it can feel like we've been naïve. The natural response is to try to regain control.
This is where many creators find themselves today, particularly in response to AI training. But the solutions they're reaching for — more restrictive licenses, paywalls, or not publishing at all — risk destroying the very commons they originally set out to build.
The first impulse is often to try to tighten the licensing, maybe by switching away to something like the Creative Commons’ non-commercial (and thus, non-free) license. […]
But the trouble with trying to continually narrow the definitions of “free” is that it is impossible to write a license that will perfectly prohibit each possibility that makes a person go “wait, no, not like that” while retaining the benefits of free and open access. If that is truly what a creator wants, then they are likely better served by a traditional, all rights reserved model in which any prospective reuser must individually negotiate terms with them; but this undermines the purpose of free […]
What should we do instead? Cory Doctorow has some suggestions:
Our path to better working conditions lies through organizing and striking, not through helping our bosses sue other giant multinational corporations for the right to bleed us out.
The US Copyright Office has repeatedly stated that AI-generated works don't qualify for copyrights […]. We should be shouting this from the rooftops, not demanding more copyright for AI.
[…]
Creative workers should be banding together with other labor advocates to propose ways for the FTC to prevent all AI-based labor exploitation, like the "reverse-centaur" arrangement in which a human serves as an AI's body, working at breakneck pace until they are psychologically and physically ruined:
https://pluralistic.net/2022/04/17/revenge-of-the-chickenized-reverse-centaurs/
As workers standing with other workers, we can demand the things that help us, even (especially) when that means less for our bosses. On the other hand, if we confine ourselves to backing our bosses' plays, we only stand to gain whatever crumbs they choose to drop at their feet for us.
1.2.2. If Creators Suing AI Companies Over Copyright Win, It Will Further Entrench Big Tech ai culture
There’s been this weird idea lately, even among people who used to recognize that copyright only empowers the largest gatekeepers, that in the AI world we have to magically flip the script on copyright and use it as a tool to get AI companies to pay for the material they train on. […] because so many people think that they’re supporting creators and “sticking it” to Big Tech in supporting these copyright lawsuits over AI, I thought it might be useful to play out how this would work in practice. And, spoiler alert, the end result would be a disaster for creators, and a huge benefit to big tech. It’s exactly what we should be fighting against.
And, we know this because we have decades of copyright law and the internet to observe. Copyright law, by its very nature as a monopoly right, has always served the interests of gatekeepers over artists. This is why the most aggressive enforcers of copyright are the very middlemen with long histories of screwing over the actual creatives: the record labels, the TV and movie studios, the book publishers, etc.
This is because the nature of copyright law is such that it is most powerful when a few large entities act as central repositories for the copyrights and can lord around their power and try to force other entities to pay up. This is how the music industry has worked for years, and you can see what’s happened. […]
[…] The almost certain outcome (because it’s what happens every other time a similar situation arises) is that there will be one (possibly two) giant entities who will be designated as the “collection society” with whom AI companies will […] just purchase a “training license” and that entity will then collect a ton of money, much of which will go towards “administration,” and actual artists will… get a tiny bit.
[…]
But, given the enormity of the amount of content, and the structure of this kind of thing, the cost will be extremely high for the AI companies […] meaning that only the biggest of big tech will be able to afford it.
In other words, the end result of a win in this kind of litigation […] would be the further locking-in of the biggest companies. Google, Meta, and OpenAI (with Microsoft’s money) can afford the license, and will toss off a tiny one-time payment to creators […].
1.2.3. Creative Commons: AI Training is Fair Use
The Creative Commons makes a detailed argument that AI training should be considered fair use: an argument that is, in my opinion, logically and philosophically sound, if perhaps not necessarily legally sound (we'll see; I'm not a lawyer, and these questions are currently under dispute in the courts, although it does seem like things are turning toward AI training being fair use).
1.2.4. Algorithmic Underground art ai
There have been innumerable smug, ressentiment-tinged, idiotic, joyless, hollow, soulless thinkpieces written by tech bros about "democratizing" art — or worse, but disturbingly often, about finally killing off artists' careers and human art entirely, although most of those are Xitter threads, Reddit comments, and memes, since only a bare few can muster sustained literacy. There have also been equally uncountable, slightly more sympathetic, yet also frustrating and ultimately contemplation- and content-free panicked, angry, reactionary articles by artists on the subject, motivated to write (or draw, as the case may be) because they're more dedicated to defending a shortsighted notion of their craft or livelihood than to real moral principles, even ones as important as open access, freedom of information, and freedom of innovation. This is one of the few essays on the matter that I've actually found insightful and worth reading.
I have been thinking about something Jean Baudrillard said a lot recently… Prior to the current moment, I believed he was more cynical than necessary. [Now I find] myself wondering if we might actually bring about… a future where art dies.
"Art does not die because there is no more art. It dies because there is too much."
— Jean Baudrillard
[relocated for flow when conveying the central point of the essay:] As of recently, we live in a world where computers can generate art that is so good we now struggle with distinguishing between art made by humans and art made by algorithms. And more, the new systems can generate the works almost instantly… From a purely practical perspective, skilled humans may not be necessary the same way they have been up to now… [but] I don’t worry about art dying…
Artists Gonna Art
Baudrillard… worries that great art will get lost in the sea of noise created by a flood of ordinary art…
Wait a sec… did he just say we’re losers with nothing interesting to say?! Screw that guy! Why should we care what he thinks?! Maybe we want to hide in the sea of noise? Maybe we don’t care about the mainstream?! Maybe that isn’t where we want to be at all! Maybe we are different because our ideas are better?!
For creative folks, the rebellion is automatic. It is a side effect of being different enough to have experiences that require defending a nonconsensus view…
Being creative alone can be satisfying, but being creative with others is extraordinary. The things we create can become something bigger than any individual when several perspectives work together.
Finding other people comfortable with a constant flow of ideas is frustrating… Artistic communities can be built around groups of creative folks that are sick of that experience… They pride themselves on the way they diverge from the norm…
These communities go by different names. Sometimes they’re called scenes, like the NYC art scene during the 70’s or Connecticut hardcore…
Art and Art Forms
…
Humans have been making art for a long time. There are neanderthal cave drawings in France that go back 57,000 years. The oldest known “representational” art is an Indonesian cave painting of a pig that goes back 45,500 years. The oldest known depiction of a human, the Venus of Hohle Fels, is from 40,000 years ago.
I struggle to believe that humans will stop expressing themselves anytime soon, let alone while I’m here to see it, so I don’t believe art will stop being made. The mediums we use to create art, or art forms, are different. They are more transient. They exist as one of the possible ways humans can express themselves. Art forms can die.…
Hiding In The Data
…models find the clearest signals in data around the patterns that represent a kind of mainstream…
As any creative knows, the popularity of some idea is quite different from how important it is…
Whenever it’s true that important ideas that are not well known can live alongside important ideas that are consensus, it is also true that unique communities can exist alongside much bigger mainstream communities without drawing much attention to themselves.
…It makes me wonder if there is a way for creatives outside the mainstream to leverage obscurity in ways I haven’t thought about yet… I find that exciting…
What I’m most interested in are approaches for creating art that becomes increasingly obscure as more is made. I want to understand how art can survive in a world where all of our work might be downloaded for use as training data, yet the AIs created cannot produce new work based on ours. I feel as though what I want is to build underground art scenes where the definition of underground is based on whether or not an AI can reliably copy it. If it can, maybe the art isn’t original enough, or something like that.
1.3. Architecture and Design
1.3.1. On Chomsky and the Two Cultures of Statistical Learning ai hacker_culture philosophy
At the Brains, Minds, and Machines symposium held during MIT’s 150th birthday party in 2011, Technology Review reports that Prof. Noam Chomsky "derided researchers in machine learning who use purely statistical methods to produce behavior that mimics something in the world, but who don’t try to understand the meaning of that behavior."
[…]
I take Chomsky's points to be the following:
- Statistical language models have had engineering success, but that is irrelevant to science.
- Accurately modeling linguistic facts is just butterfly collecting; what matters in science (and specifically linguistics) is the underlying principles.
- Statistical models are incomprehensible; they provide no insight.
- Statistical models may provide an accurate simulation of some phenomena, but the simulation is done completely the wrong way; people don't decide what the third word of a sentence should be by consulting a probability table keyed on the previous words, rather they map from an internal semantic form to a syntactic tree-structure, which is then linearized into words. This is done without any probability or statistics.
- Statistical models have been proven incapable of learning language; therefore language must be innate, so why are these statistical modelers wasting their time on the wrong enterprise?
Is he right? That's a long-standing debate. These are my short answers:
- I agree that engineering success is not the sole goal or the measure of science. But I observe that science and engineering develop together, and that engineering success shows that something is working right, and so is evidence (but not proof) of a scientifically successful model.
- Science is a combination of gathering facts and making theories; neither can progress on its own. In the history of science, the laborious accumulation of facts is the dominant mode, not a novelty. The science of understanding language is no different than other sciences in this respect.
- I agree that it can be difficult to make sense of a model containing billions of parameters. Certainly a human can't understand such a model by inspecting the values of each parameter individually. But one can gain insight by examining the properties of the model—where it succeeds and fails, how well it learns as a function of data, etc.
- I agree that a Markov model of word probabilities cannot model all of language. It is equally true that a concise tree-structure model without probabilities cannot model all of language. What is needed is a probabilistic model that covers words, syntax, semantics, context, discourse, etc. Chomsky dismisses all probabilistic models because of shortcomings of a particular 50-year old probabilistic model. […] Many phenomena in science are stochastic, and the simplest model of them is a probabilistic model; I believe language is such a phenomenon and therefore that probabilistic models are our best tool for representing facts about language, for algorithmically processing language, and for understanding how humans process language.
- In 1967, Gold's Theorem showed some theoretical limitations of logical deduction on formal mathematical languages. But this result has nothing to do with the task faced by learners of natural language. In any event, by 1969 we knew that probabilistic inference (over probabilistic context-free grammars) is not subject to those limitations (Horning showed that learning of PCFGs is possible). I agree with Chomsky that it is undeniable that humans have some innate capability to learn natural language, but we don't know enough about that capability to say how it works; it certainly could use something like probabilistic language representations and statistical learning. And we don't know if the innate ability is specific to language, or is part of a more general ability that works for language and other things.
The rest of this essay consists of longer versions of each answer.
[…]
Chomsky said words to the effect that statistical language models have had some limited success in some application areas. Let's look at computer systems that deal with language, and at the notion of "success" defined by "making accurate predictions about the world." First, the major application areas […] Now let's look at some components that are of interest only to the computational linguist, not to the end user […]
Clearly, it is inaccurate to say that statistical models (and probabilistic models) have achieved limited success; rather they have achieved an overwhelmingly dominant (although not exclusive) position. […]
This section has shown that one reason why the vast majority of researchers in computational linguistics use statistical models is an engineering reason: statistical models have state-of-the-art performance, and in most cases non-statistical models perform worse. For the remainder of this essay we will concentrate on scientific reasons: that probabilistic models better represent linguistic facts, and statistical techniques make it easier for us to make sense of those facts.
[…]
When Chomsky said “That’s a notion of [scientific] success that’s very novel. I don’t know of anything like it in the history of science” he apparently meant that the notion of success of “accurately modeling the world” is novel, and that the only true measure of success in the history of science is “providing insight” — of answering why things are the way they are, not just describing how they are.
[…] it seems to me that both notions have always coexisted as part of doing science. To test that, […] I then looked at all the titles and abstracts from the current issue of Science […] and did the same for the current issue of Cell […] and for the 2010 Nobel Prizes in science.
My conclusion is that 100% of these articles and awards are more about “accurately modeling the world” than they are about “providing insight,” although they all have some theoretical insight component as well.
[…]
Every probabilistic model is a superset of a deterministic model (because the deterministic model could be seen as a probabilistic model where the probabilities are restricted to be 0 or 1), so any valid criticism of probabilistic models would have to be because they are too expressive, not because they are not expressive enough.
[…]
In Syntactic Structures, Chomsky introduces a now-famous example that is another criticism of finite-state probabilistic models:
"Neither (a) ‘colorless green ideas sleep furiously’ nor (b) ‘furiously sleep ideas green colorless’, nor any of their parts, has ever occurred in the past linguistic experience of an English speaker. But (a) is grammatical, while (b) is not."
[…] a statistically-trained finite-state model can in fact distinguish between these two sentences. Pereira (2001) showed that such a model, augmented with word categories and trained by expectation maximization on newspaper text, computes that (a) is 200,000 times more probable than (b). To prove that this was not the result of Chomsky’s sentence itself sneaking into newspaper text, I repeated the experiment […] trained over the Google Book corpus from 1800 to 1954 […]
Furthermore, the statistical models are capable of delivering the judgment that both sentences are extremely improbable, when compared to, say, “Effective green products sell well.” Chomsky’s theory, being categorical, cannot make this distinction; all it can distinguish is grammatical/ungrammatical.
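Pereira's result is easy to get a feel for with a toy sketch. Everything below is invented for illustration (the lexicon, the tiny "corpus" of category sequences, and the smoothing constant): a minimal class-based bigram model in the spirit of, but far simpler than, the one Pereira actually trained.

```python
from collections import defaultdict

# Hypothetical lexicon mapping words to categories, as in a class-based model.
lexicon = {
    "colorless": "ADJ", "green": "ADJ", "furiously": "ADV",
    "ideas": "NOUN", "sleep": "VERB",
}

# Toy training data: category sequences standing in for tagged newspaper text.
corpus = [
    ["ADJ", "ADJ", "NOUN", "VERB", "ADV"],
    ["ADJ", "NOUN", "VERB", "ADV"],
    ["NOUN", "VERB"],
    ["ADJ", "NOUN", "VERB"],
]

# Count category bigrams, including sentence boundaries.
counts = defaultdict(lambda: defaultdict(int))
for seq in corpus:
    for prev, nxt in zip(["<s>"] + seq, seq + ["</s>"]):
        counts[prev][nxt] += 1

def prob(sentence, alpha=0.1):
    """Add-alpha-smoothed class-bigram probability of a word sequence."""
    cats = [lexicon[w] for w in sentence.split()]
    vocab = 6  # ADJ, ADV, NOUN, VERB, <s>, </s>
    p = 1.0
    for prev, nxt in zip(["<s>"] + cats, cats + ["</s>"]):
        total = sum(counts[prev].values())
        p *= (counts[prev][nxt] + alpha) / (total + alpha * vocab)
    return p

a = prob("colorless green ideas sleep furiously")
b = prob("furiously sleep ideas green colorless")
print(a / b)  # sentence (a) comes out orders of magnitude more probable
```

Even though neither sentence appears in the training data, the grammatical ordering follows familiar category transitions (adjective before noun, noun before verb) and so gets a vastly higher probability, which is exactly the distinction Chomsky claimed such models could not make.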
Another part of Chomsky’s objection is “we cannot seriously propose that a child learns the values of 10⁹ parameters in a childhood lasting only 10⁸ seconds.” (Note that modern models are much larger than the 10⁹ parameters that were contemplated in the 1960s.) But of course nobody is proposing that these parameters are learned one-by-one; the right way to do learning is to set large swaths of near-zero parameters simultaneously with a smoothing or regularization procedure, and update the high-probability parameters continuously as observations come in. Nobody is suggesting that Markov models by themselves are a serious model of human language performance. But I (and others) suggest that probabilistic, trained models are a better model of human language performance than are categorical, untrained models. And yes, it seems clear that an adult speaker of English does know billions of language facts (a speaker knows many facts about the appropriate uses of words in different contexts, such as that one says “the big game” rather than “the large game” when talking about an important football game). These facts must somehow be encoded in the brain.
It seems clear that probabilistic models are better for judging the likelihood of a sentence, or its degree of sensibility. But even if you are not interested in these factors and are only interested in the grammaticality of sentences, it still seems that probabilistic models do a better job at describing the linguistic facts. The mathematical theory of formal languages defines a language as a set of sentences. That is, every sentence is either grammatical or ungrammatical; there is no need for probability in this framework. But natural languages are not like that. A scientific theory of natural languages must account for the many phrases and sentences which leave a native speaker uncertain about their grammaticality (see Chris Manning’s article and its discussion of the phrase “as least as”), and there are phrases which some speakers find perfectly grammatical, others perfectly ungrammatical, and still others will flip-flop from one occasion to the next. Finally, there are usages which are rare in a language, but cannot be dismissed if one is concerned with actual data.
[…]
Thus it seems that grammaticality is not a categorical, deterministic judgment but rather an inherently probabilistic one. This becomes clear to anyone who spends time making observations of a corpus of actual sentences, but can remain unknown to those who think that the object of study is their own set of intuitions about grammaticality. Both observation and intuition have been used in the history of science, so neither is “novel,” but it is observation, not intuition that is the dominant model for science.
[…]
[…] I think the most relevant contribution to the current discussion is the 2001 paper by Leo Breiman (statistician, 1928–2005), Statistical Modeling: The Two Cultures. In this paper Breiman, alluding to C. P. Snow, describes two cultures:
First the data modeling culture (to which, Breiman estimates, 98% of statisticians subscribe) holds that nature can be described as a black box that has a relatively simple underlying model which maps from input variables to output variables (with perhaps some random noise thrown in). It is the job of the statistician to wisely choose an underlying model that reflects the reality of nature, and then use statistical data to estimate the parameters of the model.
Second the algorithmic modeling culture (subscribed to by 2% of statisticians and many researchers in biology, artificial intelligence, and other fields that deal with complex phenomena), which holds that nature’s black box cannot necessarily be described by a simple model. Complex algorithmic approaches (such as support vector machines or boosted decision trees or deep belief networks) are used to estimate the function that maps from input to output variables, but we have no expectation that the form of the function that emerges from this complex algorithm reflects the true underlying nature.
It seems that the algorithmic modeling culture is what Chomsky is objecting to most vigorously [because] […] algorithmic modeling describes what does happen, but it doesn’t answer the question of why.
Breiman’s article explains his objections to the first culture, data modeling. Basically, the conclusions made by data modeling are about the model, not about nature. […] The problem is, if the model does not emulate nature well, then the conclusions may be wrong. For example, linear regression is one of the most powerful tools in the statistician’s toolbox. Therefore, many analyses start out with “Assume the data are generated by a linear model…” and lack sufficient analysis of what happens if the data are not in fact generated that way. In addition, for complex problems there are usually many alternative good models, each with very similar measures of goodness of fit. How is the data modeler to choose between them? Something has to give. Breiman is inviting us to give up on the idea that we can uniquely model the true underlying form of nature’s function from inputs to outputs. Instead he asks us to be satisfied with a function that accounts for the observed data well, and generalizes to new, previously unseen data well, but may be expressed in a complex mathematical form that may bear no relation to the “true” function’s form (if such a true function even exists).
[…]
Finally, one more reason why Chomsky dislikes statistical models is that they tend to make linguistics an empirical science (a science about how people actually use language) rather than a mathematical science (an investigation of the mathematical properties of models of formal language, not of language itself). Chomsky prefers the latter, as evidenced by his statement in Aspects of the Theory of Syntax (1965):
"Linguistic theory is mentalistic, since it is concerned with discovering a mental reality underlying actual behavior. Observed use of language … may provide evidence … but surely cannot constitute the subject-matter of linguistics, if this is to be a serious discipline."
I can’t imagine Laplace saying that observations of the planets cannot constitute the subject-matter of orbital mechanics, or Maxwell saying that observations of electrical charge cannot constitute the subject-matter of electromagnetism. […] So how could Chomsky say that observations of language cannot be the subject-matter of linguistics? It seems to come from his viewpoint as a Platonist and a Rationalist and perhaps a bit of a Mystic. […] But Chomsky, like Plato, has to answer where these ideal forms come from. Chomsky (1991) shows that he is happy with a Mystical answer, although he shifts vocabulary from “soul” to “biological endowment.”
"Plato’s answer was that the knowledge is ‘remembered’ from an earlier existence. The answer calls for a mechanism: perhaps the immortal soul … rephrasing Plato’s answer in terms more congenial to us today, we will say that the basic properties of cognitive systems are innate to the mind, part of human biological endowment."
[…] languages are complex, random, contingent biological processes that are subject to the whims of evolution and cultural change. What constitutes a language is not an eternal ideal form, represented by the settings of a small number of parameters, but rather is the contingent outcome of complex processes. Since they are contingent, it seems they can only be analyzed with probabilistic models. Since people have to continually understand the uncertain, ambiguous, noisy speech of others, it seems they must be using something like probabilistic reasoning. Chomsky for some reason wants to avoid this, and therefore he must declare the actual facts of language use out of bounds and declare that true linguistics only exists in the mathematical realm, where he can impose the formalism he wants. Then, to get language from this abstract, eternal, mathematical realm into the heads of people, he must fabricate a mystical facility that is exactly tuned to the eternal realm. This may be very interesting from a mathematical point of view, but it misses the point about what language is, and how it works.
1.3.2. The Bitter Lesson ai hacker_culture software philosophy
I have a deep running soft spot for symbolic AI for many reasons:
- I love symbols, words, logic, and algebraic reasoning
- I love programming computers to do those things, with databases, backtracking, heuristics, symbolic programming, parsing, tree traversal, everything and anything else. It's just so fun and cool!
- I love systems that you can watch work and really understand.
- Symbolic AI plays to the strengths of computers — deterministic, reliable, controlled.
- I love the culture and history of that particular side of the field, stemming as it does from the Lisp hackers and the MIT AI Lab.
- I love the traditional tools and technologies of the field, like Prolog and Lisp.
Sadly, time and again symbolic AI has proven to be fundamentally the wrong approach. This famous essay outlines the empirical and technological reasons why, citing several historical precedents that have only been confirmed even more in the intervening years. It is, truly, a bitter lesson.
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. […] We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that
- AI researchers have often tried to build knowledge into their agents,
- this always helps in the short term, and is personally satisfying to the researcher, but
- in the long run it plateaus and even inhibits further progress, and
- breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.
The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
[…]
The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds […] as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.
See also my own thoughts on symbolism vs connectionism.
1.3.3. The Bitter Lesson: Rethinking How We Build AI Systems ai
This is a great followup to the original essay, specifically the section on how the invention of reinforcement learning only amplifies the benefits of connectionist approaches and scaling over and above symbolism and explicit rule-encoding by allowing us to still train models for specific tasks and steer them towards and away from specific behaviors using expert human knowledge, without needing to encode the specific ways to get there:
In 2025, this pattern becomes even more evident with Reinforcement Learning agents. While many companies are focused on building wrappers around generic models, essentially constraining the model to follow specific workflow paths, the real breakthrough would come from companies investing in post-training RL compute. These RL-enhanced models wouldn’t just follow predefined patterns; they are discovering entirely new ways to solve problems. […] It’s not that the wrappers are wrong; they just know one way to solve the problem. RL agents, with their freedom to explore and massive compute resources, found better ways we hadn’t even considered.
The beauty of RL agents lies in how naturally they learn. Imagine teaching someone to ride a bike - you wouldn’t give them a 50-page manual on the physics of cycling. Instead, they try, fall, adjust, and eventually master it. RL agents work similarly but at massive scale. They attempt thousands of approaches to solve a problem, receiving feedback on what worked and what didn’t. Each success strengthens certain neural pathways, each failure helps avoid dead ends.
[…]
What makes this approach powerful is that the agent isn’t limited by our preconceptions. While wrapper solutions essentially codify our current best practices, RL agents can discover entirely new best practices. They might find that combining seemingly unrelated approaches works better than our logical, step-by-step solutions. This is the bitter lesson in action - given enough compute power, learning through exploration beats hand-crafted rules every time.
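The trial-and-error loop described above can be sketched with the simplest possible RL setup, a multi-armed bandit. The success rates, exploration rate, and step count below are all invented for illustration; real RL agents work over vastly richer state and action spaces, but the feedback loop is the same.

```python
import random

random.seed(1)

# Hypothetical environment: three ways to attempt a task, each with an
# unknown success rate. The agent is never told which is best; it only
# receives reward feedback, like the bike rider who tries, falls, adjusts.
success_rates = [0.2, 0.5, 0.8]

values = [0.0, 0.0, 0.0]  # estimated value of each approach
pulls = [0, 0, 0]

for t in range(5000):
    # epsilon-greedy: mostly exploit the current best estimate, sometimes explore
    if random.random() < 0.1:
        arm = random.randrange(3)
    else:
        arm = values.index(max(values))
    reward = 1.0 if random.random() < success_rates[arm] else 0.0
    pulls[arm] += 1
    values[arm] += (reward - values[arm]) / pulls[arm]  # incremental mean update

print(values.index(max(values)))  # converges on the highest-payoff approach
```

No rule ever tells the agent which approach is best; repeated feedback strengthens the good one and starves the dead ends, which is the learning dynamic the quote is describing.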
1.3.4. What Is ChatGPT Doing … and Why Does It Work? ai
This is a really excellent and relatively accessible explanation — especially with the workable toy examples and illustrations, which slowly build up to the full thing piece by piece — not just of how generative pretrained transformers and large language models work, but of all the concepts that build up to them and are necessary to understand them. It also contains a sober analysis of why these models are so cool — and they are cool! — and of their very real limitations, and endorses a neurosymbolic approach similar to the one I like.
I think embeddings are one of the coolest parts of all this.
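A minimal sketch of why embeddings are cool: words become vectors, and geometric closeness tracks semantic relatedness. The three-dimensional vectors below are made up by hand purely for illustration; real models learn hundreds or thousands of dimensions from data.

```python
import math

# Invented toy "embeddings": related words are placed near each other.
vecs = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "plane": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(vecs["cat"], vecs["dog"]))    # high: related concepts
print(cosine(vecs["cat"], vecs["plane"]))  # low: unrelated concepts
```

The payoff is that "meaning" becomes geometry: similarity, analogy, and clustering all reduce to vector arithmetic.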
1.3.5. Cyc ai
Cyc: Obituary for the greatest monument to logical AGI
After 40 years, 30 million rules, 200 million dollars, 2000 person-years, and many promises, Cyc has failed to reach intellectual maturity, and may never. Exacerbated by the secrecy and insularity of Cycorp, there remains no evidence of its general intelligence.
The legendary Cyc project, Douglas Lenat’s 40-year quest to build artificial general intelligence by scaling symbolic logic, has failed. Based on extensive archival research, this essay brings to light its secret history so that it may be widely known.
Let this be a bitter lesson to you.
As even Gary Marcus admits in his biggest recent paper:
Symbol-manipulation allows for the representation of abstract knowledge, but the classical approach to accumulating and representing abstract knowledge, a field known as knowledge representation, has been brutally hard work, and far from satisfactory. In the history of AI, the single largest effort to create commonsense knowledge in a machine-interpretable form, launched in 1984 by Doug Lenat, is the system known as CYC […] Thus far, the payoff has not been compelling. Relatively little has been published about CYC […] and the commercial applications seem modest, rather than overwhelming. Most people, if they know CYC at all, regard it as a failure, and few current researchers make extensive use of it. Even fewer seem inclined to try to build competing systems of comparable breadth. (Large-scale databases like Google Knowledge Graph, Freebase and YAGO focus primarily on facts rather than commonsense.)
Given how much effort CYC required, and how little impact it has had on the field as a whole, it’s hard not to be excited by Transformers like GPT-2. When they work well, they seem almost magical, as if they automatically and almost effortlessly absorbed large swaths of common-sense knowledge of the world. For good measure,
"Transformers give the appearance of seamlessly integrating whatever knowledge they absorb with a seemingly sophisticated understanding of human language."
The contrast is striking. Whereas the knowledge representation community has struggled for decades with precise ways of stating things like the relationship between containers and their contents, and the natural language understanding community has struggled for decades with semantic parsing, Transformers like GPT-2 seem as if they cut the Gordian knot—without recourse to any explicit knowledge engineering (or semantic parsing)—whatsoever.
There are, for example, no knowledge-engineered rules within GPT-2, no specification of liquids relative to containers, nor any specification that water even is a liquid. In the examples we saw earlier
If you break a glass bottle of water, the water will probably flow out if it’s full, it will make a splashing noise.
there is no mapping from the concept H2O to the word water, nor any explicit representations of the semantics of a verb, such as break and flow.
To take another example, GPT-2 appears to encode something about fire, as well:
a good way to light a fire is to use a lighter.
a good way to light a fire is to use a match.
Compared to Lenat’s decades-long project to hand encode human knowledge in machine interpretable form, this appears at first glance to represent both an overnight success and an astonishing savings in labor.
1.3.6. Types of Neuro-Symbolic AI
An excellent short guide to the different architectures that can be used to structure neuro-symbolic AI, with successful recent examples from the field's literature.
1.3.7. The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence
Perhaps the best general encapsulation of Gary Marcus's standpoint, and well worth reading even if to be taken with a small pinch of salt and more than a little of whatever beverage you prefer to get through the mild crankiness of it.
Two conjectures I would make are these
- We cannot construct rich cognitive models in an adequate, automated way without the triumvirate of hybrid architecture [for abstractive capabilities], rich prior knowledge [to be able to understand the world by default enough to model it], and sophisticated techniques for reasoning [to reliably be able to apply knowledge to the world without having to memorize literally everything in the world]. […]
- We cannot achieve robust intelligence without the capacity to induce and represent rich cognitive models. Reading, for example, can in part be thought of as a function that takes sentences as input and produces as its output (internal) cognitive models. […]
Pure co-occurrence statistics have not reliably gotten to any of this. Cyc has the capacity to represent rich cognitive models, but falls down on the job of inducing models from data, because it has no perceptual component and lacks an adequate natural language front end. Transformers, to the extent that they succeed, skip the steps of inducing and representing rich cognitive models, but do so at their peril, since the reasoning they are able to do is consequently quite limited.
1.3.8. ChatGPT is bullshit ai philosophy
Two key quotes:
[…] ChatGPT is not designed to produce true utterances; rather, it is designed to produce text which is indistinguishable from the text produced by humans. […] The basic architecture of these models reveals this: they are designed to come up with a likely continuation of a string of text. […] This is similar to standard cases of human bullshitters, who don't care whether their utterances are true […] We conclude that, even if the chatbot can be described as having intentions, it is indifferent to whether its utterances are true. It does not and cannot care about the truth of its output.
[…]
We object to the term hallucination because it carries certain misleading implications. When someone hallucinates they have a non-standard perceptual experience […] This term is inappropriate for LLMs for a variety of reasons. First, as Edwards (2023) points out, the term hallucination anthropomorphises the LLMs. […] Second, what occurs in the case of an LLM delivering false utterances is not an unusual or deviant form of the process it usually goes through (as some claim is the case in hallucinations, e.g., disjunctivists about perception). The very same process occurs when its outputs happen to be true.
[…]
Investors, policymakers, and members of the general public make decisions on how to treat these machines and how to react to them based not on a deep technical understanding of how they work, but on the often metaphorical way in which their abilities and function are communicated. Calling their mistakes 'hallucinations' isn't harmless […] As we have pointed out, they are not trying to convey information at all. They are bullshitting. Calling chatbot inaccuracies 'hallucinations' feeds in to overblown hype […] It also suggests solutions to the inaccuracy problems which might not work, and could lead to misguided efforts at AI alignment amongst specialists. It can also lead to the wrong attitude towards the machine when it gets things right: the inaccuracies show that it is bullshitting, even when it's right. Calling these inaccuracies 'bullshit' rather than 'hallucinations' isn't just more accurate (as we've argued); it's good science and technology communication in an area that sorely needs it.
For more analysis, see here.
1.3.9. Asymmetry of verification and verifier’s law ai
Asymmetry of verification is the idea that some tasks are much easier to verify than to solve. With reinforcement learning (RL) that finally works in a general sense, asymmetry of verification is becoming one of the most important ideas in AI.
[…]
Why is asymmetry of verification important? If you consider the history of deep learning, we have seen that virtually anything that can be measured can be optimized. In RL terms, ability to verify solutions is equivalent to ability to create an RL environment. Hence, we have:
Verifier’s law: The ease of training AI to solve a task is proportional to how verifiable the task is. All tasks that are possible to solve and easy to verify will be solved by AI.
More specifically, the ability to train AI to solve a task is proportional to whether the task has the following properties:
- Objective truth: everyone agrees what good solutions are
- Fast to verify: any given solution can be verified in a few seconds
- Scalable to verify: many solutions can be verified simultaneously
- Low noise: verification is as tightly correlated to the solution quality as possible
- Continuous reward: it’s easy to rank the goodness of many solutions for a single problem
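A toy illustration of the asymmetry: for subset-sum, solving means searching an exponential space of subsets, while verifying a proposed answer takes one linear pass. The numbers below are arbitrary, and the verifier is exactly the kind of cheap reward function an RL environment needs.

```python
from itertools import combinations

nums = [3, 34, 4, 12, 5, 2]
target = 9

def solve(nums, target):
    """Exponential-time solver: try every subset until one hits the target."""
    for r in range(1, len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

def verify(nums, target, subset):
    """Linear-time verifier: check membership and the sum, nothing more."""
    pool = list(nums)
    for x in subset:
        if x not in pool:
            return False
        pool.remove(x)
    return sum(subset) == target

solution = solve(nums, target)
print(solution, verify(nums, target, solution))
```

Because the verifier is objective, fast, scalable, and noise-free, it satisfies the properties above, which is why tasks shaped like this are the first to fall to RL training.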
1.3.10. The Model is the Product ai
A lot of people misunderstand what the business model of AI companies is. They think that the product is going to be whatever's built on top of the model. On the basis of this assessment, they make grand predictions about how the AI industry is failing, has no purpose, etc, or they buy into the hype of the startups serving ChatGPT wrappers, or they make wrong investments. This is an excellent analysis of why they're wrongheaded.
There was a lot of speculation over the past years about what the next cycle of AI development could be. Agents? Reasoners? Actual multimodality?
I think it's time to call it: the model is the product.
All current factors in research and market development push in this direction.
- Generalist scaling is stalling. This was the whole message behind the release of GPT-4.5: capacities are growing linearly while compute costs are on a geometric curve. Even with all the efficiency gains in training and infrastructure of the past two years, OpenAI can't deploy this giant model at remotely affordable pricing.
- Opinionated training is working much better than expected. The combination of reinforcement learning and reasoning means that models are suddenly learning tasks. It's not machine learning, it's not a base model either, it's a secret third thing. It's even tiny models getting suddenly scary good at math. It's coding models no longer just generating code but managing entire code bases by themselves. It's Claude playing Pokemon with very poor contextual information and no dedicated training.
- Inference costs are in free fall. The recent optimizations from DeepSeek mean that all the available GPUs could cover a demand of 10k tokens per day from a frontier model for… the entire earth population. Demand is nowhere near this level. The economics of selling tokens does not work anymore for model providers: they have to move higher up in the value chain.
This is also an uncomfortable direction. All investors have been betting on the application layer. In the next stage of AI evolution, the application layer is likely to be the first to be automated and disrupted.
1.4. What kind of intelligence do LLMs have?
1.4.1. Imitation Intelligence ai
This is a pretty good way to think about how LLMs work, IMO.
I don’t really think of them as artificial intelligence, partly because what does that term even mean these days?
It can mean we solved something by running an algorithm. It encourages people to think of science fiction. It’s kind of a distraction.
When discussing Large Language Models, I think a better term than “Artificial Intelligence” is “Imitation Intelligence”.
It turns out if you imitate what intelligence looks like closely enough, you can do really useful and interesting things.
It’s crucial to remember that these things, no matter how convincing they are when you interact with them, they are not planning and solving puzzles… and they are not intelligent entities. They’re just doing an imitation of what they’ve seen before.
All these things can do is predict the next word in a sentence. It’s statistical autocomplete.
But it turns out when that gets good enough, it gets really interesting—and kind of spooky in terms of what it can do.
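"Statistical autocomplete" can be made concrete with a toy bigram model: count which word follows which, then always predict the most common continuation. The corpus below is invented, and real LLMs operate over tokens with vastly richer statistics, but the framing is the same.

```python
from collections import Counter, defaultdict

# Tiny invented corpus standing in for terabytes of training text.
corpus = (
    "the cat sat on the mat . the cat ate the fish . "
    "the dog sat on the rug ."
).split()

# Count which word follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict(word):
    """Predict the next word: the most frequent continuation seen in training."""
    return following[word].most_common(1)[0][0]

print(predict("the"))  # "cat": the most common continuation in this corpus
```

The prediction is purely a reflection of the training data, which is also why the surgeon-riddle failure below happens: the model can only continue in the direction the data points.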
A great example of why this is just an imitation is this tweet by Riley Goodside.
If you say to GPT-4o—currently the latest and greatest of OpenAI’s models:
The emphatically male surgeon, who is also the boy’s father, says, “I can’t operate on this boy. He’s my son!” How is this possible?
GPT-4o confidently replies:
The surgeon is the boy’s mother
This makes no sense. Why did it do this?
Because this is normally a riddle that examines gender bias. It’s seen thousands and thousands of versions of this riddle, and it can’t get out of that lane. It goes based on what’s in that training data.
I like this example because it kind of punctures straight through the mystique around these things. They really are just imitating what they’ve seen before.
And what they’ve seen before is a vast amount of training data:

Dataset        Percentage  Size
CommonCrawl    67.0%       3.3 TB
C4             15.0%       783 GB
Github         4.5%        328 GB
Wikipedia      4.5%        83 GB
Books          4.5%        85 GB
ArXiv          2.5%        92 GB
StackExchange  2.0%        78 GB
The companies building these things are notoriously secretive about what training data goes into them. But here’s a notable exception: last year (February 24, 2023), Facebook/Meta released LLaMA, the first of their openly licensed models.
And they included a paper that told us exactly what it was trained on. We got to see that it’s mostly Common Crawl—a crawl of the web. There’s a bunch of GitHub, a bunch of Wikipedia, a thing called Books, which turned out to be about 200,000 pirated e-books—there have been some questions asked about those!—and ArXiv and StackExchange.
[…]
So that’s all these things are: you take a few terabytes of data, you spend a million dollars on electricity and GPUs, run compute for a few months, and you get one of these models. They’re not actually that difficult to build if you have the resources to build them.
That’s why we’re seeing lots of these things start to emerge.
They have all of these problems: They hallucinate. They make things up. There are all sorts of ethical problems with the training data. There’s bias baked in.
And yet, just because a tool is flawed doesn’t mean it’s not useful.
This is the one criticism of these models that I'll push back on: when people say “they're just toys, they're not actually useful for anything”.
I’ve been using them on a daily basis for about two years at this point. If you understand their flaws and know how to work around them, there is so much interesting stuff you can do with them!
There are so many mistakes you can make along the way as well.
Every time I evaluate a new technology throughout my entire career I’ve had one question that I’ve wanted to answer: what can I build with this that I couldn’t have built before?
It’s worth learning a technology and adding it to my tool belt if it gives me new options, and expands that universe of things that I can now build.
The reason I’m so excited about LLMs is that they do this better than anything else I have ever seen. They open up so many new opportunities!
We can write software that understands human language—to a certain definition of “understanding”. That’s really exciting.
The talk then goes on to cover several of the different really cool and brand-new things you can do with LLMs, as well as one of the risks and problems you might run into and how to work around it (prompt injection). It then discusses Willison's personal ethics for using AI, which are similar to my own.
1.4.2. Bag of words, have mercy on us culture ai
Look, I don't know if AI is gonna kill us or make us all rich or whatever, but I do know we've got the wrong metaphor. We want to understand these things as people. … We can't help it; humans are hopeless anthropomorphizers…
This is why the past three years have been so confusing—the little guy inside the AI keeps dumbfounding us by doing things that a human wouldn’t do. Why does he make up citations when he does my social studies homework? How come he can beat me at Go but he can’t tell me how many “r”s are in the word “strawberry”? Why is he telling me to put glue on my pizza?…
Here's my suggestion: instead of seeing AI as a sort of silicon homunculus, we should see it as a bag of words… An AI is a bag that contains basically all words ever written, at least the ones that could be scraped off the internet or scanned out of a book. When users send words into the bag, it sends back the most relevant words it has…
“Bag of words” is also a useful heuristic for predicting where an AI will do well and where it will fail. “Give me a list of the ten worst transportation disasters in North America” is an easy task for a bag of words, because disasters are well-documented. On the other hand, “Who reassigned the species Brachiosaurus brancai to its own genus, and when?” is a hard task for a bag of words, because the bag just doesn’t contain that many words on the topic. And a question like “What are the most important lessons for life?” won’t give you anything outright false, but it will give you a bunch of fake-deep pablum, because most of the text humans have produced on that topic is, no offense, fake-deep pablum.
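The metaphor borrows its name from a real technique in information retrieval: represent each text as an unordered multiset of words, then score relevance by overlap. A minimal illustrative sketch (this is the classic retrieval technique the metaphor alludes to, not how LLMs actually work):

```python
from collections import Counter

def bag_of_words(text):
    """Represent a text as word counts, discarding word order entirely."""
    return Counter(text.lower().split())

def relevance(query, document):
    """Score a document by how many query words its bag contains."""
    q, d = bag_of_words(query), bag_of_words(document)
    return sum(min(q[w], d[w]) for w in q)

# Toy corpus: a well-documented topic vs. a sparsely documented one.
docs = [
    "the ten worst transportation disasters in north america",
    "brachiosaurus brancai fossil taxonomy notes",
]
query = "worst transportation disasters"
best = max(docs, key=lambda d: relevance(query, d))
```

The heuristic in the paragraph falls out directly: topics with lots of words in the bag score high, and topics the bag barely contains score near zero.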
[…]
The “bag of words” metaphor can also help us guess what these things are gonna do next. If you want to know whether AI will get better at something in the future, just ask: “can you fill the bag with it?” For instance, people are kicking around the idea that AI will replace human scientists. Well, if you want your bag of words to do science for you, you need to stuff it with lots of science. Can we do that?
When it comes to specific scientific tasks, yes, we already can. If you fill the bag with data from 170,000 proteins, for example, it’ll do a pretty good job predicting how proteins will fold… I don’t think we’re far from a bag of words being able to do an entire low-quality research project from beginning to end…
But… if we produced a million times more crappy science, we’d be right where we are now. If we want more of the good stuff, what should we put in the bag? … Here’s one way to think about it: if there had been enough text to train an LLM in 1600, would it have scooped Galileo? … Ask that early modern ChatGPT whether the Earth moves and it will helpfully tell you that experts have considered the possibility and ruled it out. And that’s by design. If it had started claiming that our planet is zooming through space at 67,000mph, its dutiful human trainers would have punished it: “Bad computer!! Stop hallucinating!!”
In fact, an early 1600s bag of words wouldn’t just have the right words in the wrong order. At the time, the right words didn’t exist. … You would get better scientific descriptions from a 2025 bag of words than you would from a 1600 bag of words. But both bags might be equally bad at producing the scientific ideas of their respective futures. Scientific breakthroughs often require doing things that are irrational and unreasonable by the standards of the time, and good ideas usually look stupid when they first arrive, so they are often—with good reason!—rejected, dismissed, and ignored. This is a big problem for a bag of words that contains all of yesterday’s good ideas.
[…]
The most important part of the "bag of words" metaphor is that it prevents us from thinking about AI in terms of social status… When we personify AI, we mistakenly make it a competitor in our status games. That’s why we’ve been arguing about artificial intelligence like it’s a new kid in school: is she cool? Is she smart? Does she have a crush on me? The better AIs have gotten, the more status-anxious we’ve become. If these things are like people, then we gotta know: are we better or worse than them? Will they be our masters, our rivals, or our slaves? Is their art finer, their short stories tighter, their insights sharper than ours? If so, there’s only one logical end: ultimately, we must either kill them or worship them.
But a bag of words is not a spouse, a sage, a sovereign, or a serf. It's a tool. Its purpose is to automate our drudgeries and amplify our abilities…
Unlike moths, however, we aren't stuck using the instincts that natural selection gave us. We can choose the schemas we use to think about technology. We've done it before: we don't refer to a backhoe as an "artificial digging guy" or a crane as an "artificial tall guy".
The original sin of artificial intelligence was, of course, calling it artificial intelligence. Those two words have lured us into making man the measure of machine: "Now it's as smart as an undergraduate…now it's as smart as a PhD!"… This won't tell us anything about machines, but it would tell us a lot about our own psychology.
1.4.3. AI's meaning-making problem ai
Meaning-making is one thing humans can do that AI systems cannot. (Yet.)…
Humans can decide what things mean; we do this when we assign subjective relative and absolute value to things… Sensemaking is the umbrella term for the action of interpreting things we perceive. I engage in sensemaking when I look at a pile of objects in a drawer and decide that they are spoons — and am therefore able to respond to a request from whoever is setting the table for "five more spoons." When I apply subjective values to those spoons — when I reflect that "these are cheap-looking spoons, I like them less than the ones we misplaced in the last house move" — I am engaging in a specific type of sensemaking that I refer to as "meaning-making."…
There are actually at least four distinct types of meaning-making that we do all the time:
- Type 1: Deciding that something is subjectively good or bad…
- Type 2: Deciding that something is subjectively worth doing (or not)…
- Type 3: Deciding what the subjective value-orderings and degrees of commensuration of a set of things should be…
- Type 4: Deciding to reject existing decisions about subjective quality/worth/value-ordering/value-commensuration…
The human ability to make meaning is inherently connected to our ability to be partisan or arbitrary, to not follow instructions precisely, to be slipshod — but also to do new things, to create stuff, to be unexpected, to not take things for granted, to reason…
AI systems in use now depend on meaning-making by a human somewhere in the loop to produce useful and useable output…
In a nutshell, AI systems can't make meaning yet — but they depend on meaning-making work, always done by humans, to come into being, be useable, be used, and be useful.
This example of how human meaning-making in the loop is still necessary, even in cases where AI performs well, is really helpful and enlightening:
Last week, I asked ChatGPT to summarise a long piece of writing into 140 characters or fewer.
This is what happened: My first prompt was imprecise and had little detail (it was very short). ChatGPT responded to that prompt with a first version of the summary which emphasized trivial points and left out several important ones. I asked ChatGPT to re-summarise and gave it clearer instructions about which points to emphasize. The next version contained all the important points, but it read like a tweet from an unengaged and delinquent schoolchild. I asked ChatGPT to keep the content but use a written style more akin to John McPhee, one of the best writers in the world. On the third try, the summary was good enough to use though still imperfect (obviously it sounded nothing like John McPhee). I edited the summary to improve the flow, then posted it on social media.
As is the list of other ways human meaning-making bolsters AI as it exists currently:
- Fundamental research: When a research team sets out to construct a new AI model that outperforms existing models […]
- Productization: When a product team designs the user interface for an AI system […]
- Evaluation: When an evaluation team writes a benchmarking framework or an evaluation framework for AI models […]
1.4.4. TODO Computing Machinery and Intelligence – A. M. Turing ai
I propose to consider the question, "Can machines think?" This should begin with definitions of the meaning of the terms "machine" and "think." The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous. If the meaning of the words "machine" and "think" are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, "Can machines think?" is to be sought in a statistical survey such as a Gallup poll. But this is absurd. Instead of attempting such a definition I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous words. The new form of the problem can be described in terms of a game which we call the "imitation game." It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. He knows them by labels X and Y, and at the end of the game he says either "X is A and Y is B" or "X is B and Y is A." The interrogator is allowed to put questions to A and B. We now ask the question, "What will happen when a machine takes the part of A in this game?" Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? These questions replace our original, "Can machines think?"
1.5. Superintelligence: The Idea That Eats Smart People ai
This is generally an extremely good takedown of the Nick Bostrom Superintelligence argument. The core of it is outlined thusly, via excerpts:
Today we're building another world-changing technology, machine intelligence. We know that it will affect the world in profound ways, change how the economy works, and have knock-on effects we can't predict.
But there's also the risk of a runaway reaction, where a machine intelligence reaches and exceeds human levels of intelligence in a very short span of time.
At that point, social and economic problems would be the least of our worries. Any hyperintelligent machine (the argument goes) would have its own hypergoals, and would work to achieve them by manipulating humans, or simply using their bodies as a handy source of raw materials.
Last year, the philosopher Nick Bostrom published Superintelligence, a book that synthesizes the alarmist view of AI and makes a case that such an intelligence explosion is both dangerous and inevitable given a set of modest assumptions…
Let me start by laying out the premises you need for Bostrom's argument to go through:
Premise 1: Proof of Concept […] Premise 2: No Quantum Shenanigans […] Premise 3: Many Possible Minds […] Premise 4: Plenty of Room at the Top […] Premise 5: Computer-Like Time Scales […] Premise 6: Recursive Self-Improvement […] Conclusion: RAAAAAAR!
If you accept all these premises, what you get is disaster!
Because at some point, as computers get faster, and we program them to be more intelligent, there's going to be a runaway effect like an explosion.
As soon as a computer reaches human levels of intelligence, it will no longer need help from people to design better versions of itself. Instead, it will start doing so on a much faster time scale, and it's not going to stop until it hits a natural limit that might be very many times greater than human intelligence.
At that point this monstrous intellectual creature, through devious modeling of what our emotions and intellect are like, will be able to persuade us to do things like give it access to factories, synthesize custom DNA, or simply let it connect to the Internet, where it can hack its way into anything it likes and completely obliterate everyone in arguments on message boards.
[…]
Let's imagine a specific scenario where this could happen. Let's say I want to build a robot to say funny things. […] In the beginning, the robot is barely funny. […] But we persevere, we work, and eventually we get to the point where the robot is telling us jokes that are starting to be funny […] At this point, the robot is getting smarter as well, and participates in its own redesign. […] It now has good instincts about what's funny and what's not, so the designers listen to its advice. Eventually it gets to a near-superhuman level, where it's funnier than any human being around it.
This is where the runaway effect kicks in. The researchers go home for the weekend, and the robot decides to recompile itself to be a little bit funnier and a little bit smarter, repeatedly.
It spends the weekend optimizing the part of itself that's good at optimizing, over and over again. With no more need for human help, it can do this as fast as the hardware permits.
When the researchers come in on Monday, the AI has become tens of thousands of times funnier than any human being who ever lived. It greets them with a joke, and they die laughing.
In fact, anyone who tries to communicate with the robot dies laughing, just like in the Monty Python skit. The human species laughs itself into extinction.
To the few people who manage to send it messages pleading with it to stop, the AI explains (in a witty, self-deprecating way that is immediately fatal) that it doesn't really care if people live or die, its goal is just to be funny.
Finally, once it's destroyed humanity, the AI builds spaceships and nanorockets to explore the farthest reaches of the galaxy, and find other species to amuse.
This scenario is a caricature of Bostrom's argument, because I am not trying to convince you of it, but vaccinate you against it.
Observe that in these scenarios the AIs are evil by default, just like a plant on an alien planet would probably be poisonous by default. Without careful tuning, there's no reason that an AI's motivations or values would resemble ours. […] So if we just build an AI without tuning its values, the argument goes, one of the first things it will do is destroy humanity.
[…]
The only way out of this mess is to design a moral fixed point, so that even through thousands and thousands of cycles of self-improvement the AI's value system remains stable, and its values are things like 'help people', 'don't kill anybody', 'listen to what people want'. […] Doing this is the ethics version of the early 20th century attempt to formalize mathematics and put it on a strict logical foundation. That this program ended in disaster for mathematical logic is never mentioned.
[…]
People who believe in superintelligence present an interesting case, because many of them are freakishly smart. They can argue you into the ground. But are their arguments right, or is there just something about very smart minds that leaves them vulnerable to religious conversion about AI risk, and makes them particularly persuasive?
Is the idea of "superintelligence" just a memetic hazard?
When you're evaluating persuasive arguments about something strange, there are two perspectives you can choose, the inside one or the outside one. […] The inside view requires you to engage with these arguments on their merits. […] But the outside view tells you something different. […] Of course, they have a brilliant argument for why you should ignore those instincts, but that's the inside view talking. […] The outside view doesn't care about content, it sees the form and the context, and it doesn't look good.
So I'd like to engage AI risk from both these perspectives. I think the arguments for superintelligence are somewhat silly, and full of unwarranted assumptions.
But even if you find them persuasive, there is something unpleasant about AI alarmism as a cultural phenomenon that should make us hesitate to take it seriously.
First, let me engage the substance. Here are the arguments I have against Bostrom-style superintelligence as a risk to humanity:
The Argument From Wooly Definitions […] With no way to define intelligence (except just pointing to ourselves), we don't even know if it's a quantity that can be maximized. For all we know, human-level intelligence could be a tradeoff. Maybe any entity significantly smarter than a human being would be crippled by existential despair, or spend all its time in Buddha-like contemplation.
The Argument From Stephen Hawking's Cat […] Stephen Hawking is one of the most brilliant people alive, but say he wants to get his cat into the cat carrier. How's he going to do it? He can model the cat's behavior in his mind and figure out ways to persuade it. […] But ultimately, if the cat doesn't want to get in the carrier, there's nothing Hawking can do about it despite his overpowering advantage in intelligence. […] You might think I'm being offensive or cheating because Stephen Hawking is disabled. But an artificial intelligence would also initially not be embodied, it would be sitting on a server somewhere, lacking agency in the world. It would have to talk to people to get what it wants.
The Argument From Emus […] We can strengthen this argument further. Even groups of humans using all their wiles and technology can find themselves stymied by less intelligent creatures. In the 1930's, Australians decided to massacre their native emu population to help struggling farmers. […] [The emus] won the Emu War, from which Australia has never recovered.
The Argument from Complex Motivations […] AI alarmists believe in something called the Orthogonality Thesis. This says that even very complex beings can have simple motivations, like the paper-clip maximizer. You can have rewarding, intelligent conversations with it about Shakespeare, but it will still turn your body into paper clips, because you are rich in iron. […] I don't buy this argument at all. Complex minds are likely to have complex motivations; that may be part of what it even means to be intelligent. […] It's very likely that the scary "paper clip maximizer" would spend all of its time writing poems about paper clips, or getting into flame wars on reddit/r/paperclip, rather than trying to destroy the universe.
The Argument From Actual AI […] The breakthroughs being made in practical AI research hinge on the availability of these data collections, rather than radical advances in algorithms. […] Note especially that the constructs we use in AI are fairly opaque after training. They don't work in the way that the superintelligence scenario needs them to work. There's no place to recursively tweak to make them "better", short of retraining on even more data.
The Argument From My Roommate […] My roommate was the smartest person I ever met in my life. He was incredibly brilliant, and all he did was lie around and play World of Warcraft between bong rips.
The Argument From Brain Surgery […] I can't point to the part of my brain that is "good at neurosurgery", operate on it, and by repeating the procedure make myself the greatest neurosurgeon that has ever lived. Ben Carson tried that, and look what happened to him. Brains don't work like that. They are massively interconnected. Artificial intelligence may be just as strongly interconnected as natural intelligence. The evidence so far certainly points in that direction.
The Argument From Childhood
Intelligent creatures don't arise fully formed. We're born into this world as little helpless messes, and it takes us a long time of interacting with the world and with other people in the world before we can start to be intelligent beings.
Even the smartest human being comes into the world helpless and crying, and requires years to get some kind of grip on themselves.
It's possible that the process could go faster for an AI, but it is not clear how much faster it could go. Exposure to real-world stimuli means observing things at time scales of seconds or longer.
Moreover, the first AI will only have humans to interact with—its development will necessarily take place on human timescales. It will have a period when it needs to interact with the world, with people in the world, and other baby superintelligences to learn to be what it is.
Furthermore, we have evidence from animals that the developmental period grows with increasing intelligence, so that we would have to babysit an AI and change its (figurative) diapers for decades before it grew coordinated enough to enslave us all.
The Argument From Gilligan's Island
A recurring flaw in AI alarmism is that it treats intelligence as a property of individual minds, rather than recognizing that this capacity is distributed across our civilization and culture.
Despite having one of the greatest minds of their time among them, the castaways on Gilligan's Island were unable to raise their technological level high enough to even build a boat (though the Professor is at one point able to make a radio out of coconuts).
Similarly, if you stranded Intel's greatest chip designers on a desert island, it would be centuries before they could start building microchips again.
The Outside Argument
What kind of person does sincerely believing this stuff turn you into? The answer is not pretty.
I'd like to talk for a while about the outside arguments that should make you leery of becoming an AI weenie. These are the arguments about what effect AI obsession has on our industry and culture:
Grandiosity […] Megalomania […] Transhuman Voodoo […] Religion 2.0
What it really is is a form of religion. People have called a belief in a technological Singularity the "nerd Apocalypse", and it's true.
It's a clever hack, because instead of believing in God at the outset, you imagine yourself building an entity that is functionally identical with God. This way even committed atheists can rationalize their way into the comforts of faith.
The AI has all the attributes of God: it's omnipotent, omniscient, and either benevolent (if you did your array bounds-checking right), or it is the Devil and you are at its mercy.
Like in any religion, there's even a feeling of urgency. You have to act now! The fate of the world is in the balance!
And of course, they need money!
Because these arguments appeal to religious instincts, once they take hold they are hard to uproot.
Comic Book Ethics […] Simulation Fever […] Data Hunger […] String Theory For Programmers […] Incentivizing Crazy […] AI Cosplay […]
The Alchemists
Since I'm being critical of AI alarmism, it's only fair that I put my own cards on the table.
I think our understanding of the mind is in the same position that alchemy was in in the seventeenth century.
Alchemists get a bad rap. We think of them as mystics who did not do a lot of experimental work. Modern research has revealed that they were far more diligent bench chemists than we gave them credit for.
In many cases they used modern experimental techniques, kept lab notebooks, and asked good questions.
The alchemists got a lot right! […] Their problem was they didn't have precise enough equipment to make the discoveries they needed to.
[…]
I think we are in the same boat with the theory of mind.
We have some important clues. The most important of these is the experience of consciousness. This box of meat on my neck is self-aware, and hopefully (unless we're in a simulation) you guys also experience the same thing I do.
But while this is the most basic and obvious fact in the world, we understand it so poorly we can't even frame scientific questions about it.
We also have other clues that may be important, or may be false leads. We know that all intelligent creatures sleep, and dream. We know how brains develop in children, we know that emotions and language seem to have a profound effect on cognition.
We know that minds have to play and learn to interact with the world, before they reach their full mental capacity.
And we have clues from computer science as well. We've discovered computer techniques that detect images and sounds in ways that seem to mimic the visual and auditory preprocessing done in the brain.
But there's a lot of things that we are terribly mistaken about, and unfortunately we don't know what they are.
And there are things that we massively underestimate the complexity of.
An alchemist could hold a rock in one hand and a piece of wood in the other and think they were both examples of "substance", not understanding that the wood was orders of magnitude more complex.
We're in the same place with the study of mind. And that's exciting! We're going to learn a lot.
Generally, I only have two really major disagreements with the account in this talk.
One, I think that applying an "outside view" to arguments is an extremely dangerous move when it isn't accompanied by substantive "inside view" rebuttals or defeaters that motivate skepticism of the given argument in the first place. The premature application of an "outside view" inclines us toward whatever seems normal, familiar, simple, and common-sense, and in the absence of inside-view arguments we simply don't have the information or tools to assess whether that's a good thing. There are plenty of views that seemed problematic from an "outside view" to many contemporaries — including atheism, slavery abolitionism, pro-choice positions, and trans rights (look at how right wingers describe us as a cult of transhumans who think we're "better than god" because we want to change what nature gave us) — that turned out to be correct, because ultimately what seems normal, familiar, common-sense, and simple to us is just socially and historically contingent happenstance, not a substantive argument.
Two, regarding the "Transhuman Voodoo" section, I don't really see the inherent link between scientifically implausible technological ideas (such as nanotechnology) and AI superintelligence: one could believe in superintelligence and just as easily assume that the fundamental laws of physics as we know them aren't totally wrong, and since the AI would be just as bound by them as us, it wouldn't be able to create miracle technologies.
1.6. LLMs are cheap ai
This post is making a point - generative AI is relatively cheap - that might seem so obvious it doesn't need making. I'm mostly writing it because I've repeatedly had the same discussion in the past six months where people claim the opposite. Not only is the misconception still around, but it's not even getting less frequent. This is mainly written to have a document I can point people at, the next time it repeats.
It seems to be a common, if not a majority, belief that Large Language Models (in the colloquial sense of "things that are like ChatGPT") are very expensive to operate. This then leads to a ton of innumerate analyses about how AI companies must be obviously doomed, as well as a myopic view on how consumer AI businesses can/will be monetized.
[…] let's compare LLMs to web search. I'm choosing search as the comparison since it's in the same vicinity and since it's something everyone uses and nobody pays for, not because I'm suggesting that ungrounded generative AI is a good substitute for search.
What is the price of a web search?
Here's the public API pricing for some companies operating their own web search infrastructure, retrieved on 2025-05-02:
- The Gemini API pricing lists a "Grounding with Google Search" feature at $35/1k queries. I believe that's the best number we can get for Google, they don't publish prices for a "raw" search result API.
- The Bing Search API is priced at $15/1k queries at the cheapest tier.
- Brave has a price of $5/1k searches at the cheapest tier. Though there's something very strange about their pricing structure, with the unit pricing increasing as the quota increases, which is the opposite of what you'd expect. The tier with real quota is priced at $9/1k searches.
So there's a range of prices, but not a horribly wide one, and with the engines you'd expect to be of higher quality also having higher prices.
What is the price of LLMs in a similar domain?
To make a reasonable comparison between those search prices and LLM prices, we need two numbers:
- How many tokens are output per query?
- What's the price per token?
I picked a few arbitrary queries from my search history, and phrased them as questions, and ran them on Gemini 2.5 Flash (thinking mode off) in AI Studio:
- [When was the term LLM first used?] -> 361 tokens, 2.5 seconds
- [What are the top javascript game engines?] -> 1145 tokens, 7.6 seconds
- [What are the typical carry-on bag size limits in europe?] -> 506 tokens, 3.4 seconds
- [List the 10 largest power outages in history] -> 583 tokens, 3.7 seconds
Note that I'm not judging the quality of the answers here.
What's the price of a token? The pricing is sometimes different for input and output tokens. Input tokens tend to be cheaper, and our inputs are very short compared to the outputs, so for simplicity let's consider all the tokens to be outputs. Here's the pricing of some relevant models, retrieved on 2025-05-02:
What's the price of a token? The pricing is sometimes different for input and output tokens. Input tokens tend to be cheaper, and our inputs are very short compared to the outputs, so for simplicity let's consider all the tokens to be outputs. Here's the pricing of some relevant models, retrieved on 2025-05-02:

| Model                    | Price / 1M tokens |
|--------------------------|-------------------|
| Gemma 3 27B              | $0.20 (source)    |
| Qwen3 30B A3B            | $0.30 (source)    |
| Gemini 2.0 Flash         | $0.40 (source)    |
| GPT-4.1 nano             | $0.40 (source)    |
| Gemini 2.5 Flash Preview | $0.60 (source)    |
| Deepseek V3              | $1.10 (source)    |
| GPT-4.1 mini             | $1.60 (source)    |
| Deepseek R1              | $2.19 (source)    |
| Claude 3.5 Haiku         | $4.00 (source)    |
| GPT-4.1                  | $8.00 (source)    |
| Gemini 2.5 Pro Preview   | $10.00 (source)   |
| Claude 3.7 Sonnet        | $15.00 (source)   |
| o3                       | $40.00 (source)   |

If we assume the average query uses 1k tokens, these prices would be directly comparable to the prices per 1k search queries. That's convenient.
The low end of that spectrum is at least an order of magnitude cheaper than even the cheapest search API, and even the models at the low end are pretty capable. The high end is about on par with the highest end of search pricing. Comparing a midrange pair of similar quality, Bing Search vs. Gemini 2.5 Flash, the LLM comes in at 1/25th the price.
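The arithmetic behind that comparison is simple enough to check directly. All the numbers below are the ones quoted in the post (prices retrieved 2025-05-02, with the post's assumption of roughly 1k output tokens per query); this is just a sanity check, not new data:

```python
TOKENS_PER_QUERY = 1_000  # the post's assumed output length per query

# USD per 1M output tokens (from the pricing table above)
llm_price_per_m_tokens = {
    "Gemma 3 27B": 0.20,
    "Gemini 2.5 Flash Preview": 0.60,
    "GPT-4.1": 8.00,
}

# USD per 1k queries (from the search API pricing section)
search_price_per_k_queries = {
    "Brave": 5.00,
    "Bing": 15.00,
    "Google (Gemini grounding)": 35.00,
}

def llm_cost_per_k_queries(price_per_m_tokens):
    """Cost of answering 1,000 queries, at TOKENS_PER_QUERY tokens each."""
    return price_per_m_tokens / 1_000_000 * TOKENS_PER_QUERY * 1_000

# Midrange comparison: Bing vs. Gemini 2.5 Flash works out to ~25x.
ratio = (search_price_per_k_queries["Bing"]
         / llm_cost_per_k_queries(llm_price_per_m_tokens["Gemini 2.5 Flash Preview"]))
```

With a 1k-token answer, the per-1M-token price simply becomes the per-1k-queries price, which is why the two columns are directly comparable.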
I know some people are going to have objections to this back-of-the-envelope calculation, and a lot of them will be totally legit concerns. I'll try to address some of them preemptively.
Surely the typical LLM response is longer than that - I already picked the upper end of what the (very light) testing suggested as a reasonable range for the type of question that I'd use web search for. There's a lot of use cases where the inputs and outputs are going to be much longer (e.g. coding), but then you'd need to also switch the comparison to something in that same domain as well.
The LLM API prices must be subsidized to grab market share – i.e. the prices might be low, but the costs are high - I don't think they are, for a few reasons. I'd instead assume APIs are typically profitable on a unit basis. I have not found any credible analysis suggesting otherwise.
First, there's not that much motive to gain API market share with unsustainably cheap prices. […] there's no long-term lock-in […]. Data from paid API queries will also typically not be used for training or tuning the models […]. Note that it's not just that you'd be losing money on each of these queries for no benefit, you're losing the compute that could be spent on training, research, or more useful types of inference.
Second, some of those models have been released with open weights and API access is also available from third-party providers who would have no motive to subsidize inference. […] The pricing of those third-party hosted APIs appears competitive with first-party hosted APIs.
Third, Deepseek released actual numbers on their inference efficiency in February. Those numbers suggest that their normal R1 API pricing carries about an 80% margin over GPU costs, though that excludes other serving costs.
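To make the implication of that margin concrete, here's a back-of-envelope sketch: the $2.19/1M-token R1 price is from the table above, the 80% margin figure is from Deepseek's disclosure, and the arithmetic (and the resulting implied GPU cost) is mine:

```python
# What an ~80% margin over GPU costs implies about Deepseek's
# per-token GPU cost, given their public R1 API price.
r1_price = 2.19    # $ per 1M tokens (from the price table)
gpu_margin = 0.80  # margin over GPU costs only, per Deepseek's numbers

# margin = (price - cost) / price  =>  cost = price * (1 - margin)
implied_gpu_cost = r1_price * (1 - gpu_margin)
print(f"implied GPU cost: ${implied_gpu_cost:.2f} per 1M tokens")
```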
Fourth, there are a bunch of first-principles analyses of what the cost structure of models with various architectures should be. Those are of course mathematical models, but the costs they predict line up pretty well with the observed end-user pricing of models whose architecture is known. […]
The search API prices amortize building and updating the search index, LLM inference is based on just the cost of inference - This seems pretty likely to be true, actually? But the effect can't really be that large for a popular model: e.g. the allegedly leaked OpenAI financials claimed $2B/year spent on inference vs. $3B/year on training. Given the crazy growth of inference volumes (e.g. Google recently claimed a 50x increase in token volumes in the last year) the training costs are getting amortized much more effectively.
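The amortization effect is easy to quantify. The $3B training / $2B inference split comes from the allegedly leaked financials cited above; the 50x volume growth is Google's claim about their own token volumes, applied here purely for illustration, and the assumption that training spend stays flat is mine:

```python
# Back-of-envelope: how amortized training cost per unit of inference
# shrinks as inference volume grows while training spend stays flat.
TRAINING_COST = 3e9   # $/year, from allegedly leaked financials
INFERENCE_COST = 2e9  # $/year, same source

# Today: every $1 of inference carries this much amortized training cost.
overhead_now = TRAINING_COST / INFERENCE_COST

# If inference volume grows 50x (per Google's claimed token growth)
# while training spend stays flat, the same bill spreads over 50x
# the tokens.
overhead_later = TRAINING_COST / (INFERENCE_COST * 50)

print(f"training overhead per inference dollar: "
      f"{overhead_now:.2f} -> {overhead_later:.2f}")
```

So under these (rough) assumptions, training goes from dominating the cost per token to being a rounding error on it.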
The search API prices must have higher margins than LLM inference - […] see the point above about Deepseek's released numbers on the R1 profit margins. […] Also, it seems quite plausible that some search providers would accept lower margins, since at least Microsoft execs have testified under oath that they'd be willing to pay more for the iOS query stream than the revenue it generates, just to get more usage data.
But OpenAI made a loss, and they don't expect to make a profit for years! - That's because a huge proportion of their usage is not monetized at all, despite the usage pattern being ideal for it. OpenAI reportedly made a loss of $5B in 2024. They also reportedly have 500M MAUs. To reach break-even, they'd just need to monetize (e.g. with ads) those free users for an average of $10/year, or under $1/month. A $1 ARPU for a service like this would be pitifully low.
If the reported numbers are true, OpenAI doesn't actually have high costs for a consumer service that popular, which is what you'd expect to see if the high cost of inference was the problem. They just have a very low per-user revenue, by choice.
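The break-even arithmetic, using only the reported figures from the text:

```python
# Break-even ARPU for OpenAI's free tier, from the reported numbers:
# a $5B annual loss spread across 500M monthly active users.
annual_loss = 5e9
monthly_active_users = 500e6

breakeven_arpu_per_year = annual_loss / monthly_active_users
breakeven_arpu_per_month = breakeven_arpu_per_year / 12

print(f"break-even ARPU: ${breakeven_arpu_per_year:.2f}/year, "
      f"${breakeven_arpu_per_month:.2f}/month")
```

For scale, ad-supported consumer services commonly clear several dollars of ARPU per month, so the required figure here is modest.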
Why does this matter?
It is interesting how many people have built their mental model for the near future on a premise that was true for only a brief moment. Some things will come as a surprise to them, even assuming all progress stops right now:
There's an argument advanced by some people that low prices mean it'll be impossible for AI companies to ever recoup model training costs. The thinking seems to be that only the prices have been going down, not the costs, and that the low prices must be an unprofitable race to the bottom for what little demand there is. What's happening, and will continue to happen, is instead this: as costs go down, prices go down too, and demand increases as new uses become viable. For an example, look at the OpenRouter API traffic volumes, both in aggregate and in the relative share of cheaper models. […]
This completely wrecks Ed Zitron's gleefully, smugly wrong Hater's Guide to the AI Bubble. Not because AI isn't a bubble, but more because it isn't remotely a bubble in the way he seems to think.
1.7. I am disappointed in the AI discourse ai
Recently, something happened that made me kind of like, break a little bit… the top post on my Bluesky feed was something along these lines:
"ChatGPT is not a search engine. It does not scan the web for information. You cannot use it as a search engine. LLMs only generate statistically likely sentences."
The thing is… ChatGPT was right there, in the other tab, searching the web. And the answer I got was pretty good.
What is breaking my brain a little bit is that all of the discussion online around AI is so incredibly polarized. This isn’t a “the middle is always right” sort of thing either, to be clear. It’s more that both the pro-AI and anti-AI sides are loudly proclaiming things that are pretty trivially verifiable as not true… of course, ethical considerations of technology are important… For me, capabilities precede the ethical dimension, because the capabilities inform what is and isn’t ethical. (EDIT: I am a little unhappy with my phrasing here. A shocker given that I threw this paragraph together in haste! What I’m trying to get at is that I think these two things are inescapably intertwined: you cannot determine the ethics of something until you know what it is. I do not mean that capabilities are somehow more important.)