It's fascinating to think about the space of problems which are amenable to RL scaling of these probability distributions.
Before, we didn't have a fast (we had to rely on human cognition) way to try problems - even if the techniques and workflows were known by someone. Now, we've baked these patterns into probability distributions - anyone can access them with the correct "summoning spell". Experts will naturally use these systems more productively, because they know how to coerce models into the correct conditional distributions which light up the right techniques.
One question this raises to me is how these models are going to keep up with the expanding boundary of science. If RL is required to get expert behavior into the models, what happens when experts start pushing the boundary faster? In 2030, how is Anthropic going to keep Claude "up-to-date" without either (a) continual learning with a fixed model (expanding context windows? seems hard) or (b) continual training (expensive)?
This is the most fundamental argument that they are not, directly, an intelligence. They are not ever storing new information on a meaningful timescale. However, if you viewed them on some really large macro timescale, where LLMs are now injecting information into the universe and then re-ingesting it, maybe in some very philosophical way they are a /very/ slowly oscillating intelligence right now. And as we narrow that gap (maybe with a totally new non-LLM paradigm), perhaps that is ultimately what gen AI becomes. Or maybe some new insight lets the models update themselves in some fundamental way without the insanely expensive training costs they have now.
A very good point. For anyone not familiar with anterograde amnesia, the classical case is patient H.M. (https://en.wikipedia.org/wiki/Henry_Molaison), whose condition was researched by Brenda Milner.
There's nothing to say that you can't build something intelligent out of them by bolting a memory on it, though.
Sure, it's not how we work, but I can imagine a system where the LLM does a lot of heavy lifting and allows more expensive, smaller networks that train during inference and RAG systems to learn how to do new things and keep persistent state and plan.
Memory is not just bolted on top of the latest models. They undergo training on how and when to use memory effectively, and on how to use compaction to avoid running out of context when working on problems.
Data sharing agreements permitting, today's inference runs can be tomorrow's training data. Presumably the models are good enough at labeling promising chains of thought already.
I could totally imagine "free" inference for researchers under the condition that the reasoning traces get to be used as future training data.
Agreed, there's no doubt this will happen. It's likely already happening (it feels safe to assume that Anthropic is curating data from the data they record from Claude Code?)
As far as I understand RL scaling (we've already maxxed out RLVR), these machines only get better as long as they have expert reasoner traces available.
Having an expert work with an LLM and successfully solve a problem is high-signal data; it may be the only path forward?
My prior is that these companies will take this data without asking you as much as they can.
Exactly, or functionally equivalently, asking you in paragraph 37 of a 120-page PDF (bonus points: in an agreement update).
And importantly, this can be cross-lab/model too. I suspect there's a reason why e.g. Google has been offering me free Claude inference in Google Antigravity on a free plan...
I wonder how long we have until we start solving some truly hard problems with AI. How long until we throw AI at "connect general relativity and quantum physics", give the AI 6 months and a few data centers, and have it pop out a solution?
I think a very long time because part of our limit is experiment.
We need enough experimental results to explain before we can solve these theoretical mismatches, and we don't have them; at present we can't explore that frontier.
Once we have more results at that frontier we'd build a theory out from there that has two nearly independent limits for QFT and GR.
What we'd be asking of the AI is something that we can't expect a human to solve even with a lifetime of effort today.
It'll take something on par with Newton realising that the heavens and apples are under the same rules to do it. But at least Newton got to hold the apple and only had to imagine he could hold a star.
What prevents us from giving this system access to other real systems that live in physical labs? I don't see much difference between parameterizing and executing a particle accelerator run and invoking some SQL against a provider. It's just JSON on the wire at some level.
If AGI ever comes, then. Currently, AI is only a statistical machine, and solutions like this are purely based on distributions, with no logic or actual intelligence.
I swear that AI could independently develop a cure for cancer and people would still say that it's not actually intelligent, just matrix multiplications giving a statistically probable answer!
LLMs are at least designed to be intelligent. Our monkey brains have much less reason to be intelligent, since we only evolved to survive nature, not to understand it.
We are at this moment extremely deep into what most people would have considered to be actual artificial intelligence a mere 15 years ago. We're not quite at human levels of intelligence, but it's close.
The issue to my mind is a lack of data at the meeting of QFT/GR.
After all, few humans historically have been capable of the initial true leap between ontologies. But humans are pretty smart, so we can't say that is a requirement for AGI.
Aren't LLMs supposed to just find the most probable word that follows next, like many people here have touted? How can this be explained under that premise? Is this way of problem solving 'thinking'?
In some sense that is still correct, i.e. the words are taken from some probability distribution conditional on previous words, but the key point is that this probability distribution is not just some sort of internet-wide average of word probabilities. In the end this probability distribution is really the whole point of intelligence. And I think the LLMs are learning those.
> just find the most probable word that follows next
Well, if in all situations you can predict which word Einstein would probably say next, then I think you're in a good spot.
This "most probable" stuff is just absurd handwaving. Every prompt of even a few words is unique, there simply is no trivially "most probable" continuation. Probable given what? What these machines learn to do is predicting what intelligence would do, which is the same as being intelligent.
That description is really only fair for base models†. Something like Opus 4.6 has all kinds of other training on top of that which teach it behaviors beyond "predict most probable token," like problem-solving and being a good chatbot.
(†And even then is kind of overly-dismissive and underspecified. The "most probable word" is defined over some training data set. So imagine if you train on e.g. mathematicians solving problems... To do a good job at predicting [w/o overfitting] your model will have to in fact get good at thinking like a mathematician. In general "to be able to predict what is likely to happen next" is probably one pretty good definition of intelligence.)
I'd disagree: the other training on top doesn't alter the fundamental nature of the model, which is that it's predicting the probabilities of the next token (and then there's a sampling step which can roughly be described as picking the most probable one).
It just changes the probability distribution that it is approximating.
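To make the two steps in that description concrete, here is a minimal Python sketch: the model emits scores over a vocabulary, and a separate sampling step turns them into one token. The three-word vocabulary and both sets of logits are made-up numbers for illustration, not taken from any real model; the only point is that further training changes the distribution while the sampling step stays the same.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(tokens, logits, temperature=1.0, rng=random):
    """Pick one token: greedy when temperature is 0, otherwise a weighted draw."""
    if temperature == 0:
        best = max(range(len(tokens)), key=lambda i: logits[i])
        return tokens[best]
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical numbers: the same context scored before and after extra training.
TOKENS = ["the", "delve", "answer"]
BASE_LOGITS = [2.0, 0.1, 1.0]
TUNED_LOGITS = [1.0, 2.5, 1.2]

print(sample_next(TOKENS, BASE_LOGITS, temperature=0))   # greedy pick under the base scores
print(sample_next(TOKENS, TUNED_LOGITS, temperature=0))  # greedy pick under the tuned scores
```

With a nonzero temperature the draw is random, which is the "roughly" in "roughly picking the most probable one".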
To the extent that thinking is making a series of deductions from prior facts, it seems to me that thinking can be reduced to "pick the next most probable token from the correct probability distribution"...
Put a loop around an LLM and it can trivially be made Turing complete, so it boils down to whether thinking requires exceeding the Turing computable, and we have no evidence to suggest that is even possible.
As typically deployed [1], LLMs are not Turing complete. They're closer to linear bounded automata, but because transformers have a strict maximum input size they're actually a subset of the weaker class of deterministic [2] finite automata. These aren't like Python programs or something that can work on as much memory as you supply them; their architecture works on a fixed maximum amount of memory.
I'm not particularly convinced Turing completeness is the relevant property though. I'm rather convinced that I'm not Turing complete either... my head is only so big after all.
[1] i.e. in a loop that appends output tokens to the input and has some form of sliding context window (perhaps with some inserted instructions to "compact" and then sliding the context window right to after those instructions once the LLM emits some special "done compacting" tokens).
[2] Common sampling procedures make them mildly non-deterministic, but I don't believe they do so in a way that changes the theoretical class of these machines from DFAs.
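The finite-memory argument can be sketched in a few lines of Python. This is a toy under stated assumptions: `toy_model`, the three-token vocabulary, and the window size of 4 are all hypothetical stand-ins, decoding is treated as deterministic, and compaction is ignored. The loop appends each output token to a sliding window; since there are only finitely many possible windows, the loop must revisit one, after which its behavior is periodic.

```python
from collections import deque

VOCAB = ("a", "b", "c")   # hypothetical toy vocabulary
CONTEXT_LEN = 4           # hypothetical maximum window size

def toy_model(window):
    """Stand-in for a forward pass plus greedy decoding: a pure function of the window."""
    score = len(window) + sum(VOCAB.index(tok) for tok in window)
    return VOCAB[score % len(VOCAB)]

def state_bound(vocab_size, context_len):
    """Number of distinct windows of length 0..context_len."""
    return sum(vocab_size ** k for k in range(context_len + 1))

def steps_until_repeat(prompt, max_steps):
    """Run the append-and-slide loop until a window repeats; return that step index."""
    window = deque(prompt, maxlen=CONTEXT_LEN)  # old tokens fall off the left
    seen = set()
    for step in range(max_steps):
        state = tuple(window)
        if state in seen:
            return step          # pigeonhole: from here on the loop is periodic
        seen.add(state)
        window.append(toy_model(state))
    return None

bound = state_bound(len(VOCAB), CONTEXT_LEN)
print(bound, steps_until_repeat(["a"], bound + 1))
```

For a real model the bound is astronomically large (vocabulary size raised to the context length), which is one reason the finite-automaton classification says little about practical capability.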
I think it's pretty likely that "intelligence" is emergent behavior that comes when you predict what comes next in physical reality well enough, at varying timescales. Your brain has to build all sorts of world model abstractions to do that over any significant timescale. Big LLMs have to build internal world models, too, to do well at their task.
In some cases solving a problem is about restating the problem in a way that opens up a new path forward. “Why do planets move around the sun?” vs “What kind of force exists in the world that makes planets tethered to the sun with no visible leash?” (Obviously very simplified but I hope you can see what I am saying.) Given that a human is there to ask the right questions it isn’t just an LLM.
Further, some solutions are like running a maze. If you know all the wrong turns/next words to say and can just brute force the right ones you might find a solution like a mouse running through the maze not seeing the whole picture.
Whether this is thinking is more philosophical. To me this demonstrates more that we are closer to bio computers than an LLM is to having some sort of divine soul.
Thanks for your input. The way I saw this, and how it looks like Knuth interpreted it, is that there were some reasoning steps taken by Claude independently: some internal decisions in the model that made it try different things, finally succeeding.
>Are not LLMs supposed to just find the most probable word that follows next like many people here have touted?
The base models are trained to do this. If a web page contains a problem, and then the word "Answer: ", it is statistically very likely that what follows on that web page is an answer. If the base model wants to be good at predicting text, at some point learning the answers to common questions becomes a good strategy, so that it can complete text that contains these.
NN training tries to push models to generalize instead of memorizing the training set, so this creates an incentive for the model to learn a computation pattern that can answer many questions, instead of just memorizing. Whether they actually generalize in practice... it depends. Sometimes you still get copy-pasted input that was clearly pulled verbatim from the training set.
But that's only base models. The actual production LLMs you chat with don't predict the most probable word according to the raw statistical distribution. They output the words that RLHF has rewarded them to output, which includes acting as an assistant that answers questions instead of just predicting text. RLHF is also the reason there are so many AI SIGNS [1] like "you're absolutely right" and way more use of the word "delve" than is common in western English.
Imagine training a chess bot to predict a valid sequence of moves, or a valid game, using the standard algebraic notation for chess.
Great! It will now correctly structure chess games, but we've created no incentive for it to create a game where white wins or to make the next move be "good"
Ok, so now you change the objective. Now let's say "we don't just want valid games, we want you to predict the next move that will help that color win"
And we train towards that objective and it starts picking better moves (note: the moves are still valid)
You might imagine more sophisticated ways to optimize picking good moves. You continue adjusting the objective function, you might train a pool of models all based off of the initial model and each of them gets a slightly different curriculum and then you have a tournament and pick the winningest model. Great!
Now you might have a skilled chess-playing-model.
It is no longer correct to say it just finds a valid chess game, because the objective function changed several times throughout this process.
This is exactly how you should think about LLMs, except the ways the objective function has changed are significantly, significantly more complicated than for our chess bot.
So to answer your first question: no, that is not what they do. That is a deep oversimplification that was accurate for the first two generations of the models and is sort of accurate for the "pretraining" step of modern LLMs (except not even that accurate, because pretraining does instill other objectives, almost like swapping our first step "predict valid chess moves" with "predict Stockfish outputs").
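For readers who prefer symbols, the shift in objective described above can be written in its generic textbook form. This is a sketch, not any lab's actual recipe: $p_\theta$ is the model, $x$ a prompt or prefix, $y$ a sampled completion, and $R$ a reward function (a win/loss signal for the chess bot, a learned or verifiable reward for an LLM); all of these symbols are introduced here for illustration.

```latex
% Stage 1: next-token prediction, i.e. maximize the likelihood of observed text
\mathcal{L}_{\text{pretrain}}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})

% Later stages: maximize expected reward of the model's own outputs
J(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)} \left[ R(x, y) \right]
```

Both stages adjust the same next-token distribution $p_\theta$; what changes is which distribution the training pushes it toward.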
Are you feigning ignorance? The best way to answer a question, like completing a sentence, is through reasoning; an emergent behavior in complex models.
TLDR (story, not math) - Knuth poses a problem, his friend uses Claude to conduct 30 some explorations, with careful human guidance, and Claude eventually writes a Python program that can find a solution for all odd values. Knuth then writes a proof of the approach and is very pleased by Claude's contribution. Even values remain an open question (Claude couldn't make much progress on them)
I asked Claude to solve the pentominoes puzzle made famous by Arthur C. Clarke. It struggled mightily until I told it how I'd solved the problem using 64-bit unsigned integers to represent the board and pieces. Then it created a C# program that solved the problem very quickly. However, in the 20x3 case it found four solutions when there are only two. It turns out it had incorrectly mapped one of the pentominoes. Sort of a silly mistake; the sort a human might make.
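The C# program isn't shown, so here is a rough Python sketch of the same bitboard idea, not the commenter's actual code. Each of the 60 cells is one bit (which is why a 64-bit unsigned integer is enough in C#), a placement is a mask, and an overlap test is a single AND. The piece coordinates, function names, and column-major bit layout are my own choices. Counting all 12 pieces' distinct orientations is a cheap sanity check against exactly the kind of mis-mapped piece described above.

```python
# One orientation of each pentomino as (row, col) cells; the letters are the usual names.
SHAPES = {
    "F": [(0, 1), (0, 2), (1, 0), (1, 1), (2, 1)],
    "I": [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)],
    "L": [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1)],
    "N": [(0, 1), (1, 1), (2, 0), (2, 1), (3, 0)],
    "P": [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)],
    "T": [(0, 0), (0, 1), (0, 2), (1, 1), (2, 1)],
    "U": [(0, 0), (0, 2), (1, 0), (1, 1), (1, 2)],
    "V": [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)],
    "W": [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)],
    "X": [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1)],
    "Y": [(0, 1), (1, 0), (1, 1), (2, 1), (3, 1)],
    "Z": [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)],
}

def normalize(cells):
    """Shift a shape so its bounding box starts at (0, 0)."""
    r0 = min(r for r, _ in cells)
    c0 = min(c for _, c in cells)
    return tuple(sorted((r - r0, c - c0) for r, c in cells))

def orientations(cells):
    """All distinct rotations and reflections of one shape."""
    seen = set()
    for _ in range(2):
        cells = [(r, -c) for r, c in cells]        # mirror
        for _ in range(4):
            cells = [(c, -r) for r, c in cells]    # rotate 90 degrees
            seen.add(normalize(cells))
    return seen

def placements(cells, height, width):
    """Bitmasks for every position of every orientation that fits on the board."""
    masks = []
    for shape in orientations(cells):
        h = max(r for r, _ in shape) + 1
        w = max(c for _, c in shape) + 1
        for r0 in range(height - h + 1):
            for c0 in range(width - w + 1):
                mask = 0
                for r, c in shape:
                    mask |= 1 << ((c0 + c) * height + (r0 + r))  # column-major bit index
                masks.append(mask)
    return masks

def count_tilings(height, width):
    """Count tilings with all 12 pieces; rotated/reflected boards count separately."""
    if height * width != 60:
        raise ValueError("the 12 pentominoes cover exactly 60 cells")
    full = (1 << 60) - 1
    by_low_bit = {}   # (piece, lowest set bit) -> masks, so lookups hit only useful ones
    for name, cells in SHAPES.items():
        for mask in placements(cells, height, width):
            by_low_bit.setdefault((name, mask & -mask), []).append(mask)

    def solve(board, remaining):
        if board == full:
            return 1
        low = ~board & full
        low &= -low                      # lowest empty cell must be covered next
        total = 0
        for name in remaining:
            for mask in by_low_bit.get((name, low), ()):
                if not mask & board:     # overlap test is a single AND
                    total += solve(board | mask, remaining - {name})
        return total

    return solve(0, frozenset(SHAPES))

print(sum(len(orientations(c)) for c in SHAPES.values()))  # total fixed orientations
print(count_tilings(3, 20))
```

Note that this counts rotated and reflected boards as distinct, so the two 20x3 solutions show up four times each; a count that isn't a multiple of four, or an orientation total other than 63, would point to a wrongly entered piece.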
> Shock! Shock! I learned yesterday that an open problem I’d been working on for several weeks had just been solved by Claude Opus 4.6— Anthropic’s hybrid reasoning model that had been released three weeks earlier! It seems that I’ll have to revise my opinions about “generative AI” one of these days. What a joy it is to learn not only that my conjecture has a nice solution but also to celebrate this dramatic advance in automatic deduction and creative problem solving.
I would like to note that it would be trivial to definitively prove or disprove such things if we had a searchable public archive of the training data. Interestingly, the same people who loudly claim that LLMs are creating original work seem to be utterly uninterested in having public access to such data.
Was it? It was an open problem to Knuth, who generally knows how to search literature. However, there is enough literature to search that it wouldn't be a surprise at all to discover it was already solved but he just used slightly different terms and so didn't find it. Or maybe it was solved because this is a specialization of something that looks unrelated, and so he wouldn't have realized it when he read it. Or...
Overall I'm going with unsolved, because Knuth is a smart person who I'd expect to not miss the above. I'm also sure he falls for the above all the time even though the majority of the time he doesn't.
Agreed with all of that, but with the added point that Knuth has done a lot of work in this exact area in The Art of Computer Programming Volume 4. If he considers this conjecture open given his particular knowledge of the field, it likely is (although agreed, it's not guaranteed).
Crazy times.
But we cannot yet experiment at the GR/QFT frontier.
To do so with a particle accelerator, it would need to be the size of the Milky Way.
Time to sit down, read, digest and understand it without the help of LLM.
[1]: https://en.wikipedia.org/wiki/WP:AISIGNS
But that does not mean that the results cannot be dramatic. Just like stacking pixels can result in a beautiful image.