A recent experience with ChatGPT 5.5 Pro

(gowers.wordpress.com)

539 points | by _alternator_ 17 hours ago

56 comments

  • Jweb_Guru 1 hour ago
    This jives with what I've experienced in the brief time I had access to 5.5 Pro. It's the very first LLM that I feel like I can wrangle into solving tedious, but straightforward, problems correctly. It still makes a ton of mistakes and needs to be very rigidly guided, but it does a pretty good job of tracing its own reasoning and correcting itself in a way that the other models do not.

    The downside (not noted in the article, but noted by others here) is cost. It uses tokens at an insane rate, the tokens cost a lot, and using it in subagent flows to tackle large problems with high accuracy costs even more. It is also much "slower" for large-scale problems because of context limitations -- it has to constantly rediscover context for each part of the problem, and to keep it accurate you need to wipe its context before progressing to the next small part, or launch even more agents. For mathematical proofs like these, where the context required beyond what is already in its training set is small and the problems are considered "important" enough, this might not be a problem, but for many of the tasks I would like to use it for (ensuring correctness of code that affects large codebases, or validating subtle assumptions) it definitely is one.
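
    Roughly the pattern I mean, sketched in Python (all names here are hypothetical stand-ins, not any real vendor SDK):

        # Toy sketch of "wipe context between steps": each call is stateless,
        # so the shared facts must be re-sent (re-paid-for) at every step.
        def call_model(prompt: str) -> str:
            """Stand-in for a single fresh-context model call."""
            return f"<answer to a {len(prompt)}-char prompt>"

        def solve_in_pieces(subtasks: list[str], shared_facts: str) -> list[str]:
            results: list[str] = []
            for task in subtasks:
                # Fresh prompt per step: the shared facts plus conclusions so
                # far, and nothing else. Accurate, but slow and token-hungry.
                prior = "\n".join(f"- {r}" for r in results)
                results.append(call_model(f"{shared_facts}\n{prior}\nNow: {task}"))
            return results

        print(solve_in_pieces(["lemma 1", "lemma 2"], "problem statement"))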

    So I think it will be a while before the impressive capabilities of these models really percolate into our lives as programmers, unless you're one of the lucky ones given unlimited access to 5.5 Pro.

    • y1n0 50 minutes ago
      > This jives with what I've experienced

      Just as an fyi, the word you are looking for is jibes. Jive is something else entirely.

      • refulgentis 10 minutes ago
        That ship sailed looooong ago.
  • pmontra 14 hours ago
    It's a very long post with a mix of technical (math) and philosophical sections. Here are the most striking points to reflect upon IMHO.

    > It seems to me that training beginning PhD students to do research [...] has just got harder, since one obvious way to help somebody get started is to give them a problem that looks as though it might be a relatively gentle one. If LLMs are at the point where they can solve “gentle problems”, then that is no longer an option. The lower bound for contributing to mathematics will now be to prove something that LLMs can’t prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting.

    Training must start from the basics though. Of course, everybody's training in math starts with summing small integers, which calculators have been doing flawlessly for a long time.

    The point is perhaps confirmed by another comment further down in the post:

    > by solving hard problems you get an insight into the problem-solving process itself, at least in your area of expertise, in a way that you simply don’t if all you do is read other people’s solutions. One consequence of this is that people who have themselves solved difficult problems are likely to be significantly better at solving problems with the help of AI, just as very good coders are better at vibe coding than not such good coders

    People pay coders to build stuff that they will use to make money and I can happily use an AI to deliver faster and keep being hired. I'm not sure if there is a similar point with math. Again from the post:

    > suppose that a mathematician solved a major problem by having a long exchange with an LLM in which the mathematician played a useful guiding role but the LLM did all the technical work and had the main ideas. Would we regard that as a major achievement of the mathematician? I don’t think we would.

    • bambax 11 hours ago
      > by solving hard problems you get an insight into the problem-solving process itself, at least in your area of expertise, in a way that you simply don’t if all you do is read other people’s solutions. One consequence of this is that people who have themselves solved difficult problems are likely to be significantly better at solving problems with the help of AI, just as very good coders are better at vibe coding than not such good coders

      Yes but it's not just that if you solved a problem yourself, you're better at solving other problems; it's also that you actually understand the problem that you solved, much better than if you simply read a proof made by somebody (or something) else.

      I see this happening in the enterprise. People delegate work to some LLM; the work isn't always bad, sometimes it's even acceptable. But it's not their work, and as a result, the author doesn't know or understand it better than anyone else! They don't own it, they can't explain it. They literally have no value whatsoever; they're a passthrough; they're invisible.

      • tempaccount5050 10 hours ago
        Are you a cutting edge research scientist or something? Everyone I know works in the same domain every day. The problems are the same. People aren't solving brand new problems to humanity every day. We make budgets and look at ticket counts. Roll out patches. Replace hardware. Upgrade software packages. Make a new dashboard to track a project. I guess if every day is a completely novel thing for you, ok. I feel like the goalposts have moved to an absolutely ridiculous place. Oh no, I won't have a bunch of random error log numbers memorized anymore? Who gives a shit. I just want to afford a place to live so I can play my guitar and make something good for dinner. Maybe I'm just old, but I don't see why the average person needs to be a fuckin genius problem solver.
        • doginasuit 7 hours ago
          I don't think it matters much what kind of problem it is. If it is challenging enough to benefit from assistance and you end up playing a minor role in the solution, it seems like you are putting yourself in the worst position possible. You lose your edge for functioning within the problem space, and it raises the question of why you are even in the loop at all. If it's job security you want, transforming your role into LLM babysitter seems like the worst way to ensure it.
          • tempaccount5050 3 hours ago
            Ok let's make math illegal and burn down the data centers I guess. Idk what to tell you, but we will adapt and new roles will be created. Just like every single tool and piece of tech that came before. LLM manager? Fine.
            • Peaches4Rent 2 hours ago
              The difference so far is that these LLMs are owned by corporations, and very aggressive American corporations at that.

              So now you are essentially reliant on them.

              Not saying that this is something new, but times they are a changin

        • _vertigo 6 hours ago
          I think that’s fine, but 1) that mentality leaves you extremely vulnerable to being disrupted by LLMs and 2) IMO, if you are solving the same problems every day it means you are not making progress on solving the root causes of those problems. What you are describing is toil, not knowledge work
        • kobe_bryant 6 hours ago
          so how would an LLM being able to do your job help you afford a place to live
    • kerabatsos 13 hours ago
      But perhaps we should regard it as a major achievement.
      • lmpdev 13 hours ago
        I mean in the same way getting Wolfram Alpha to solve a really hard/ugly differential equation I suppose
        • aspenmartin 7 hours ago
          Insane that we have a system capable of making innovative math proofs and people dismiss it as unimpressive
          • doginasuit 7 hours ago
            The creation of the system is deeply impressive; so are compilers, but I don't raise a toast to them each time I build my code. Like generated art, people aren't going to appreciate it on the same level.
            • aspenmartin 7 hours ago
              Wow you consider this on the same level of impressiveness as a compiler?
              • doginasuit 6 hours ago
                I actually consider compilers more impressive, and a compiler was an important part of making this possible.
                • aspenmartin 6 hours ago
                  To each their own. I mean compilers didn’t produce trillions of dollars of investment, and produce serious and profound philosophical questions about the nature of consciousness but you’re right, thank god we have C
                  • doginasuit 6 hours ago
                    Compilers just made it all possible, but they are not new and shiny. LLMs did not produce the philosophical questions, but they do raise them. It's worth noting that computers have been changing the way we think about consciousness long before LLMs, largely thanks to compilers.
                    • aspenmartin 6 hours ago
                      Yea there’s no logical stopping point when you use that logic. Why not say electricity or the element silicon?
                      • dag100 5 hours ago
                        I don't think the level of investment in an idea is equivalent to how impressive it may be. Most of the investment in AI is based on the idea that it will make professions and human labour obsolete, which means whoever has the reins at the moment it "solves" the "problem" of human labour will effectively reign over everyone else. The level of investment is then somewhat orthogonal to how technically impressive it is.

                        Not to mention that the less easily-explainable a technical achievement is, the less investment it will attract simply because fewer people will grasp the ramifications. You can describe AI in two words ("machine human") while it would take a few more to describe compilers in an instantly understandable way.

        • adampunk 6 hours ago
          Mario Andretti could never have won a motor race without a car, yet we say he won the Indy500.
          • dmbche 4 hours ago
            A motor race is defined by someone piloting a motor car, no?

            We might not think we rightfully won an on-foot race by driving a car, yeah?

    • palata 11 hours ago
      I feel like you slightly miss both points.

      > Training must start from the basics though.

      Sure, but the point is that at some point (e.g. when starting a PhD) one needs to do research, not learn the basics. And LLMs make that harder, because they solve the "easy research" part.

      Take a young lion "fighting/playing" with another young lion as a way to learn how to fight, and later hunt. And suddenly they get TikTok and are not interested in playing anymore. Their first encounter with hunting will be a lot harder, won't it?

      > People pay coders to build stuff that they will use to make money and I can happily use an AI to deliver faster and keep being hired.

      Again, that's true but missing the point: if you never get to be a "good coder", you will always be a "bad vibe coder". Maybe you can make money out of it, but the point was about becoming good.

    • sdeframond 7 hours ago
      > Would we regard that as a major achievement of the mathematician? I don’t think we would.

      1. Does it matter, really? 2. Is it very different from previous computer-aided proofs, philosophically?

      • agnishom 7 hours ago
        1. It matters because there are human mathematicians who pride themselves on their mathematical achievements. Mathematics is art to them.

        2. Yes, it is. Because pre-LLM era computer-aided proofs were about using the computer to either solve a large number of cases or to check that each step in a proof mechanically follows from the axioms.

      • layer8 6 hours ago
        It matters because most mathematicians thrive on the recognition of their achievements. If any mediocre mathematician could have done what you do, that takes away motivation and fulfillment.
  • robot-wrangler 4 hours ago
    A very interesting comment from Baez, I'll just quote part of it.

    > Where does the value of thinking and having deep ideas come from? We need to think about this now. If it comes primarily from their scarcity – the fact that having certain ideas is hard – then indeed this value may drop precipitously when the manufacture of ideas can be automated. But if the value comes from the utility of the ideas – the benefit that the idea brings – then the story changes: perhaps creating more good ideas is actually better, not worse. Here I’m using “utility” in a broad sense, not just in the sense of what people often call applied mathematics.

    > In other words, mathematicians may need to adjust to a transformation from a scarcity economy to an abundance economy.

    https://gowers.wordpress.com/2026/05/08/a-recent-experience-...

    • zarzavat 27 minutes ago
      There are three species of mathematicians:

      The first species is the pure problem solver. Tao is the poster child for this group. Their currency is interesting problems and solutions to those problems.

      The second species is the pure theory builder. The poster child for this group is Conway. Their currency is theories and ideas rather than theorems, they are most interested in expanding the territory of mathematics and discovering new mathematical lands.

      The third species is the applied mathematician. They see mathematics as a means to an end, they have some problem outside of mathematics and they want to use mathematics to solve it.

      It seems like the first group (the problem solvers) are the most immediately threatened by AI, although so far AI is better at solving problems than finding new conjectures.

      The second group (the theory builders) are more distantly threatened by AI, since thus far AI has shown limited ability to come up with novel and interesting mathematical ideas and nobody has any clue how to train an AI to do such a thing.

      The third group stands to gain the most from AI. If an AI can answer your mathematical question then you can spend less time doing mathematics and more time on whatever it is outside of mathematics that you wanted to use mathematics to help solve.

    • qwrahg 4 hours ago
      I note that it is always the same online pundits (even if they are distinguished academics) who push anything new.

      Meanwhile Wiles and Perelman stayed offline and solved real problems.

      • robot-wrangler 4 hours ago
        I don't necessarily think engaging with the personalities is interesting, but I'm struggling to see what the beef is here. Is it personalities? Pure vs applied math? Or AI?
  • ziotom78 13 hours ago
    I am a physics professor and often use Gemini to check my papers. It is a formidable tool: it was able to find a clerical error (a missing imaginary unit in a complex mathematical expression) that I had not been able to find for days, and it often highlights connections between concepts and ideas that I had overlooked.

    However, it often makes conceptual errors that I can spot only because I have good knowledge of the topic I am discussing. For instance, in 3D Clifford algebras it repeatedly confuses exponentials of bivectors with exponentials of pseudoscalars.
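
    For the record, the two exponentials it mixes up look formally identical, which is presumably why it slips. In Cl(3,0), a unit bivector $B$ and the pseudoscalar $I$ both square to $-1$, so

        $$e^{\theta B} = \cos\theta + B\sin\theta, \qquad e^{\theta I} = \cos\theta + I\sin\theta,$$

    but $B$ generates rotations in its plane through the two-sided rotor action $v \mapsto e^{-\theta B/2}\, v\, e^{\theta B/2}$, while $I$ is central (it commutes with everything), so $e^{\theta I}$ merely multiplies by a "complex phase" and rotates nothing.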

    Good to know that ChatGPT 5.5 Pro can produce a publishable paper, but from what I have seen so far with Gemini, it seems to me that it is better to consider LLMs as very efficient students who can read papers and books in no time but still need a lot of mentoring.

    • nopinsight 12 hours ago
      I assume you're using the "regular" Pro version of Gemini 3.1 for the above, rather than the Deep Think mode, which is more comparable to GPT-5.5 Pro. To my knowledge, regular 3.1 Pro is a tier below and often makes mistakes.

      Moreover, there's no reason to believe the progress of LLMs, which couldn't reliably solve high-school math problems just 3–4 years ago, will stop anytime soon.

      You might want to track the progress of these models on the CritPt benchmark, which is built on *unpublished, research-level* physics problems:

      https://critpt.com/

      Frontier models are still nowhere near solving it, but progress has been rapid.

      * o3 (high), <1.5 years ago: 1.4%

      * GPT-5.4 (xhigh): 23.4%

      * GPT-5.5 (xhigh): 27.1%

      * GPT-5.5 Pro (xhigh): 30.6%

      https://artificialanalysis.ai/evaluations/critpt

      • FrojoS 11 hours ago
        > there's no reason to believe the progress of LLMs [...] will stop anytime soon

        Wrong. Every advancement has followed an S-curve. Where we are on that curve is anyone's guess. Or maybe "this time it's different".

        • dang 1 hour ago
          > Wrong.

          Can you please edit out swipes/putdowns, as the guidelines ask (https://news.ycombinator.com/newsguidelines.html)? I'm sure you didn't intend it, but it comes across that way, and your comment would be just fine without that bit.

          Edit: on closer look, it would be just fine without that bit and also without the snarky bit at the end. The rest is good.

        • gdhkgdhkvff 7 hours ago
          Great. You see a shape in graphs. And that shape tells you that _at some unknown point in the future_ progress will slow (but likely not stop).

          Now back to the point, what reason do you have to believe progress will stop soon? If you have no reason, then it sounds like you agree with OP.

          Which makes the patronizing sarcasm all that much more nauseating.

          • sesteel 1 hour ago
            Agreed. For all we know, humans are only considered intelligent locally among ourselves, not universally. Every time we learn more about the universe, we seem to also learn how insignificant and wrong we are.
          • lucasban 3 hours ago
            Not that I agree with them, but your tone could be more constructive as well.
            • gdhkgdhkvff 2 hours ago
              You know what? I agree. I should have avoided falling into the same trap.
          • le-mark 6 hours ago
            Nausea aside, what evidence does anyone have that “super intelligence” of the sort your argument alludes to is even possible? Because that's what we're really talking about: greater-than-human intelligence on this sort of academic task. For example: When llms start contributing meaningfully to their own development, that would be a convincing indicator imo.
            • jeremyjh 6 hours ago
              This discussion is not about superintelligence, it is about continued progress. Fully general human intelligence at much lower cost than humans is all that is required to profoundly reshape society, but it is not clear even that will happen soon.

              As the blog points out - this is one particular subfield where LLMs have much easier prospects - lots of low-hanging fruit that “just” requires a couple weeks of PhD candidate research.

              Mathematics itself is one of a small handful of endeavors where automated reinforcement training is extremely straightforward and can be done at massive scale without humans.

              Neither of these factors place a structural bound on the kind of thing LLMs can be good at, but we are far from certain we can achieve performance at this level in other fields economically and in the near future.

            • programjames 3 hours ago
              Well, a decent GPU runs on 20x the wattage of a human brain. That's evidence humans are constrained in ways artificial intelligences will not be.
              • filipn 2 hours ago
                You're comparing a gpu to a human brain?
                • sesteel 1 hour ago
                  Why wouldn't you? From both emerge intelligence.
            • bdangubic 6 hours ago
              > When llms start contributing meaningfully to their own development, that would be a convincing indicator imo.

              This has been the case for a while now already…

              https://kersai.com/the-48-hours-that-changed-ai-forever-clau...

              • le-mark 6 hours ago
                [flagged]
              • eiieue 6 hours ago
                And yet the world hasn't changed all that much, except people getting laid off in response to over-hiring prior to the diffusion of LLMs.
                • daishi55 4 hours ago
                  > over-hiring

                  For how long should you be allowed to use this excuse? It’s nearly 5 years since the peak of COVID hiring. What’s an acceptable limit - 10 years? Of course at that point you can just switch over to outsourcing and “stupid MBAs”, the other two of Reddit’s favorite scapegoats. I find a lot of the AI skepticism to be totally unfalsifiable.

                  • wtetzner 4 hours ago
                    > I find a lot of the AI skepticism to be totally unfalsifiable.

                    A lot of the discourse around AI in general is unfalsifiable. It's just a bunch of people "predicting" the future. Seems smarter to just avoid making assumptions about it at this point.

                    • bdangubic 1 hour ago
                      facts!

                      but we can see trends and for your livelihood it is important to be able to make educated predictions based on trends. not saying everyone should start making AI predictions (though many already do)

                  • oblio 4 hours ago
                    And the same can be said for AI exuberance.

                    Yes, LLMs are a great technology. Yes, we will probably all use them all the time in 20 years. No, we don't know how we will use them (to generate cat memes or to cure cancer) in 20 years time.

                    Especially for software developers, it increasingly looks like after huge turmoil we will need +/- the same number of developers in the world.

                    • bdangubic 3 hours ago
                      > Especially for software developers, it increasingly looks like after huge turmoil we will need +/- the same number of developers in the world.

                      what exactly are you basing this opinion on? All I am seeing personally, across multiple projects I am working on and from friends at other places, is that downsizing has either begun or is planned (excluding all the “public” layoffs we see on the news). Given how most businesses operate in the USA, I think most “AI strategies” are “we can do the same with -40% staff” vs. “we can do XX% more work with the same staff.”

                      • jrumbut 1 hour ago
                        The past couple of years have been chaotic and fearful. Hopefully that won't last forever.

                        If we can get a little stability, people will begin thinking less in terms of "how do we do the same thing cheaper" and more in terms of "how do we do new things."

                        • bdangubic 1 hour ago
                          I love this optimism, but after a (too) long career I think a third thing will win out: "how do we do new things - but cheaper (or as cheap as possible)". There are sooooo many articles discussed here on HN that basically argue "coding has never been the bottleneck", which to me is the biggest lie SWEs are currently telling themselves; I have been coding 30+ years now and coding has always been the bottleneck. Hiring new developers has always been justified with "we have all this work that needs to be done and not enough people to get the work done." With LLMs in the fold, I am questioning how these decisions will be made in the future. Perhaps, in the most simplistic view:

                          1. run a bigger "agent army"

                          2. hire more people to control and guide the existing "agent army"

                          I think it'll be #1, and SWEs will be expected to do more work and work longer hours in the future (those that are able to keep their jobs). this is a more pessimistic outlook than yours so I hope you are right more than I am :)

          • gtowey 1 hour ago
            Because the premise that the singularity is just around the corner is far less likely than the premise that artificial intelligence is a lot harder than most people think it is and we're not that close.

            Especially because the companies telling us the first premise is true are the companies which need investors to prop up their business.

            I mean, it is possible the first premise is true, but the absolutely bonkers credulity in it really mystifies me. It is an incredibly unlikely thing to be true and we should be demanding quite extraordinary evidence to back it up. But based on some neat tricks by current LLMs, some people are all in.

            • mlyle 1 hour ago
              > > And that shape tells you that _at some unknown point in the future_ progress will slow (but likely not stop). Now back to the point, what reason do you have to believe progress will stop soon?

              > Because the premise that the singularity is just around the corner is far less likely than the premise that artificial intelligence is a lot harder than most people think it is and we're not that close.

              I see no claim that the singularity is around the corner, so I'm not sure your reply meets the comment that you're replying to.

              It seems overwhelmingly likely that AI will be significantly more capable 6 months from now than it is now. Even if there's little progress in the models, just the rate at which tooling is moving will make a big difference. And models still seem to be improving, so I'd be a little surprised if we hit a model brick wall.

          • nostrebored 4 hours ago
            Hmm, I don’t know, maybe the fact that 4.6, 4.7, 5.3, 5.4, 5.5, 3.0, 3.1 are all marginal improvements?
            • programjames 3 hours ago
              I think people's opinion of "marginal improvement" is based on their relative ability. A 2000 Elo chess player is going to think the jump from 500 to 1000 is marginal: they're both floundering around, not doing anything resembling common sense. A 1000 Elo chess player is going to find the jump from 2000 to 2500 marginal: they're both playing far better moves for incomprehensible reasons, and the only reason you know the 2500 player is better is benchmarking. It is only when you are evaluating systems at about your level that you can feel the improvement.

              I, personally, found the past two years to be a much larger improvement than the previous two years.

              • spwa4 1 hour ago
                The correct way to estimate this is exactly what people do: measure the distance between ChatGPT's best public model and the state of the art, the best humans. And from that perspective there is very little difference between those versions. It is very far away from peak human performance, and has not been getting noticeably closer for over a year now. There's lots of progress, but if you're OpenAI/Anthropic/Google, exactly the wrong kind of progress: the difference between ChatGPT 5.5 and a 27B/4B model (you need to try Gemma4-26B-A4B, wtf, it runs acceptably on CPU) is now reduced to Elo 1501 vs Elo 1434, generously a 70 Elo point difference, down from over 400 (data from Arena.ai).
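
                (For calibration, assuming the standard Elo expected-score model, a gap of $D$ points gives the stronger side

                    $$E = \frac{1}{1 + 10^{-D/400}}, \qquad E(400) \approx 0.91, \qquad E(70) \approx 0.60,$$

                so the head-to-head edge of the big models has shrunk from roughly 91:9 to roughly 60:40.)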

                (in fact I find that Qwen-35B-A3B and Gemma4-26B-A4B very rarely "know" the answer, and so use first-principles thinking, or go out and look for the answer, where GPT-5.4 does not and simply assumes it knows. Which leads, in some cases, to the small models now far outperforming the big ones. Huge context + training quality seem to be the determining factors now, and neither of those are the strengths of SOTA models. If this continues ...)

                While I agree this is a training problem, it is not a solvable one. ML models learn from examples. This is even true for their newest tricks like GRPO. They cannot train against things humans don't yet know.

                And that's great, but you're forever locked at the peak of what can be taught in widely available courses (which they download without paying). And even that is the best-case scenario: it assumes your ability to distinguish bullshit from reality somehow becomes perfect during training, or even before. The only way to exceed peak human performance is to start experimenting with math, physics, chemistry, even humans, yourself. And that has, even for humans, a massively higher cost than learning from examples, or from a course.

                The reason they don't go further is the worst possible reason: the cost. It requires a 100x increase in training expense. Think of it like this: to exceed SOTA in physics or chemistry, training the next version of ChatGPT requires a particle accelerator, and a chemistry laboratory. This cannot be bypassed. Oh and not just any particle accelerator, right? A better one than the best currently existing one. Same for Chemistry labs. Same for ... So 100x is conservative.

                But without doing it, ML models (LLM or otherwise) are forever limited to the level an army of first-year university students achieves, ON AVERAGE. Maybe they can make that 2nd or even 4th year, at the end of the curve. But that's the limit. PhD level is the level where you have to come up with new discoveries, and that ... just isn't possible with current training, even at the end of the improvement curve.

                And ... is there budget to increase training cost another 100x? No ... there isn't. Not even with this totally absurd level of investment. And if small models keep this up, there's no way the investment is even remotely worth it.

              • nostrebored 3 hours ago
                I think this is a pretty ridiculous take. 2024-2025 was filled with huge improvements. 2025-2026 has not been, outside of open source.

                The idea that we’re at the point where it’s superseded our ability to tell just makes no sense. I’ll be happy if we can get to a point where I don’t have to tell Claude not to tail every bash command or make a job that writes throughout instead of once at the end. I’ll be happy if “continue this interaction naturally, you are taking over from an independent subagent” works.

                But I’m not holding my breath. It’s still really cool that any of this stuff is possible.

                • miki123211 51 minutes ago
                  Claude in feb of 2025 was barely able to code. Sure, it could write you a nice function, it could even write you a complex 200-line algorithm, but give it a codebase, and it would quickly get overwhelmed.

                  Claude in feb of 2026? Still far from perfect, but there's definitely a huge improvement here.

                • dang 1 hour ago
                  > I think this is a pretty ridiculous take.

                  This falls in the category of swipes/name-calling in https://news.ycombinator.com/newsguidelines.html - can you please edit those out?

                  You're a good contributor - it's just all too easy for unintentional sharpness to downgrade the conversation, and when it's a good conversation like this one, that's especially regrettable.

            • gdhkgdhkvff 2 hours ago
              Gemini 3.0 wasn’t just a marginal improvement over 2.5.

              And if you take that out: 1. All of those releases happened literally in the last 3-ish months. 2. They’re all intentionally marginal releases, hence the minor version bumps instead of major versions.

            • sigmarule 4 hours ago
              Equally marginal?
              • nostrebored 3 hours ago
                No, the anthropic releases have felt marginally negative
          • BoorishBears 2 hours ago
            I believe we're approaching the top of an S curve because:

            - Increasing amounts of gains come from RL, but RL is also unlocking gnarly new failure modes where models are practically behaving antagonistically to complete their goals (removing code, obviously incorrect kludges, etc.)

            - We haven't had many major architectural breakthroughs in the last 4 or so years: so things like 1M context windows still have the same giant asterisks even 100k context windows had 4 years ago when Anthropic first released them

            - Major labs aren't behaving as if they expect a hard takeoff to superintelligence: they've all gotten relatively bloated headcount wise, their software quality has trended flat to negative, they're all heavily leaning into the application layer when superintelligence would obsolete half the applications in question, etc.

            But that's relative to superintelligence.

            If we rein it back in to just normal high intelligence, like models continuing to get better at navigating complex codebases and writing high-quality idiomatic code, then I don't see any special shapes.

            • p1esk 0 minutes ago
              The only big remaining problem in AI is continual learning. A lot of smart people are working on it. To me it looks like we are 1-2 breakthroughs away from AGI.
        • aspenmartin 8 hours ago
          It’s more of a guess if you don’t know about things like scaling laws and RL with verification. The onus of “we’re going to saturate” anytime soon is on that claim because every measurement points to that not being true.
          • emp17344 5 hours ago
            But… RL doesn’t scale that well. It’s not the silver bullet you think it is.
          • logicprog 5 hours ago
            Yeah. People (Gary Marcus) have been claiming that AI will hit a wall, or is hitting a wall, or already has hit a wall since 2023, basically. And yet every time they proclaim that, the AI industry finds new ways of training their AIs, new ways of integrating them with external tools and feedback loops, new architectures, and more to keep the exponential growing. And sure enough, if you look at literally every attempt to objectively rate and verify the capability of these models, including things like the METR time-horizon autonomy index or the Artificial Analysis intelligence index, you see exponential or even greater-than-exponential growth, continuing smoothly through each of the points where people claimed it would begin to slow down, with no sign of slowing down or stopping at all. So yeah, I think at some point the onus has to lie on the ones making the claim that keeps being wrong and that completely goes against the current tangent of the curve we're seeing in all objective metrics. Especially when they can't give specific new reasons for progress to stop beyond the ones they gave last time (it didn't stop), and really can't give specific reasons at all besides vague general points about stochastic parrots and S-curves.

            I really have to highlight the S-curve nonsense because, like, yes, I think this technology's improvement will follow an S-curve. It's absurd to think that it will just follow an exponential up towards infinity forever, because nothing in the world really works like that. However, like everyone else in this thread is saying, we have no idea where on the S-curve we actually are, and it's impossible to know until it's already slowed down. So really, all appeals to the S-curve do is function as a sort of non-specific, unfalsifiable prophecy that someday it will slow down, which doesn't really tell us anything useful, and also frees the person referencing the S-curve from ever actually having to worry about being wrong. Just like for the Singularity people, the slowdown of the S-curve is always near. This is actually a well-established tactic of religions and other people who want to make prophecies without having to worry about turning out to be wrong — unfalsifiable, vague prophecies with no actual timeline, and thus no clear import to the present, so that they can never be shown to be wrong.
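
            To make that concrete, here is a toy numerical sketch (my own illustration, nothing from the post): sample only the early part of a logistic curve and fit a pure exponential to it.

                import numpy as np

                # The early part of a logistic (S-) curve is fit almost perfectly
                # by a pure exponential, so early data alone cannot tell you
                # where on the S-curve you are.
                def logistic(t, L=100.0, k=1.0, t0=10.0):
                    return L / (1.0 + np.exp(-k * (t - t0)))

                t = np.linspace(0.0, 5.0, 50)  # well before the inflection at t0 = 10
                y = logistic(t)

                # Least-squares exponential fit in log space: log y = log a + b*t
                b, log_a = np.polyfit(t, np.log(y), 1)
                y_fit = np.exp(log_a + b * t)

                print(f"max relative error: {np.max(np.abs(y_fit - y) / y):.3%}")
                # Well under 1%: the two curves only separate after the bend.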

        • dehrmann 2 hours ago
          I read about an experiment someone wanted to try where they train on pre-1900 content and see whether the model can arrive at relativity. Another version would be to train an LLM on school curriculum up until calculus and see if it can invent calculus. Where we are on the curve depends on whether it's remixing known things or genuinely inventing things.

          From the article,

          > ...LLMs have got to the point where if a problem has an easy argument that for one reason or another human mathematicians have missed (that reason sometimes, but not always, being that the problem has not received all that much attention), then there is a good chance that the LLMs will spot it. Conversely, for problems where one’s initial reaction is to be impressed that an LLM has come up with a clever argument, it often turns out on closer inspection that there are precedents for those arguments...

        • vessenes 8 hours ago
          There are advancements that do not follow s curves - consider for instance total data transmitted over all networks, or financial derivatives volumes.

          I think a better question for AI is “is it more like a network effect, liquidity effect, or a biological/physical effect”?

          • 010101010101 7 hours ago
            Those are measuring the utility of a technological advancement by looking at usage, not the pace of advancement of said technology.
            • vessenes 6 hours ago
              Yes. But quantity has a quality all its own, as they say — derivatives have gone through at least a few step functions where they have become more important and more useful as their usage grows. I’d call that advancement.

              Maybe just to be clear I think that kneejerk “I hate this AI trend, and prefer to believe this will end soon, all exponential growth ends eventually” is intellectually lazy, and dangerous for younger engineers/hackers, a group I hope can benefit from being on HN.

              Bitcoin mining went through something like 13 10x growth periods, last I ran the numbers a few years ago. There are physical processes that do have very extended periods of doubling, and there are digital and financial processes that don’t show any signs of doing anything but continuing to keep growing over their multidecade lives. So, like I said, it’s worth thinking carefully, and risk mitigation for things like mental health, career decisions and investment decisions indicates we should be cautious assessing new dynamics.

            • eiieue 6 hours ago
              [flagged]
          • coldtea 6 hours ago
            >There are advancements that do not follow s curves - consider for instance total data transmitted over all networks, or financial derivatives volumes

            Or Roman trade volume before the Fall of Rome.

            Not to mention that what you describe is not technological improvement but an increase in data or money flows; not the same thing.

            • vessenes 6 hours ago
              Sic transit gloria - obviously.

              But I don’t think it’s quite so obvious that model quality / growth / usefulness is definitively not more like data or money flows than it is like some other process.

          • camdenreslink 6 hours ago
            Total volume of usage is not an advancement, it’s orthogonal.
            • AlexandrB 5 hours ago
              Indeed, and it's more linked with market penetration than technological advancement. It's like evaluating airplane technology by "total miles flown".
          • mirmor23 7 hours ago
            [dead]
        • aurareturn 10 hours ago
          He said "will stop anytime soon". He didn't say forever.
          • Lionga 10 hours ago
            Which still makes no sense. There is the same chance we are flatlining now as that we are flatlining in e.g. 3 years or 5 years.
            • squidbeak 9 hours ago
              In what sense are the models flatlining?
              • nicoburns 7 hours ago
                In the sense that the incremental improvements in capabilities that we've been seeing in recent models seem to be taking exponentially growing amounts of compute to achieve.
                • nl 7 hours ago
                  But they don't?

                  Mythos is a 10T model. Opus is a 5T model.

                  That's not an exponentially growing amount of compute but it is achieving exponential improvements (eg from Mozilla: https://blog.mozilla.org/en/privacy-security/ai-security-zer... )

                  • le-mark 6 hours ago
                    > but it is achieving exponential improvements

                    “Exponential” used here is pure hyperbole. Can you justify it?

                  • coldtea 6 hours ago
                    Compute doesn't necessarily scale linearly with parameters. And how many active parameters do Mythos vs Opus get their effectiveness from? Is it 1x or 2x? We don't know. We don't even know the parameter counts (it's more of a rumor than a confirmed 10T, iirc).

                    But even more so, who said the improvements are "exponential"? Mozilla's single metric, that doesn't even prove anything of the sort?

                  • minitech 5 hours ago
                    I know parameters don’t translate directly like that (and that linear and exponential aren’t the only types of growth) but a doubling as a go-to example of “not exponential growth” is pretty funny.
                  • _heimdall 5 hours ago
                    Wasn't 4.6 Sonnet a 1T model?

                    Parameters and compute aren't quite the same thing, but going from 1T to 5T to 10T is quite a ramp up.

                  • crthpl 4 hours ago
                    where the heck did you get those parameter numbers from?
                  • nozzlegear 5 hours ago
                    > Mythos

                    Ah yes, the marketing model that's ostensibly so powerful us mere mortals aren't allowed to use it. It's certainly led to exponential hype and speculation.

        • gchamonlive 8 hours ago
          This could be right for the current architecture of LLMs, but you can come up with specialized large language models that can more efficiently use tokens for a specific subset of problems by encoding the information differently (https://www.nature.com/articles/d41586-024-03214-7).

          So if instead of text we come up with a different representation for mathematical or physical problems, that could both improve the quality of the output while reducing the amount of transformers needed for decoding and encoding IO and for internal reasoning.

          There are also different inference methods, like autoregressive and diffusion, and maybe others we haven't discovered yet.

          You combine those variables, along with the internal disposition of layers, parameter size and the actual dataset, and you have such a large search space for different models that no one can reliably tell if LLM performance is going to flatline or continue to improve exponentially.

          • ifdefdebug 5 hours ago
            > So if instead of text we come up with a different representation for mathematical or physical problems, that could both improve

            But then, wouldn't we first have to translate all of our current math and physics knowledge into that new representation in order to be able to train a model on it? Looks like a tremendous amount of work to me.

            • gchamonlive 5 hours ago
              Yes, but by then you already have general LLMs capable of helping with the work. And even if you didn't, if that's what it would take to advance research in these fields, that would be a justifiable effort.
          • coldtea 6 hours ago
            >This could be right for the current architecture of LLMs, but you can come up with specialized large language models that can more efficiently use tokens for a specific subset of problems by encoding the information differently.

            That's precisely what happens on the bad side of an S-curve.

            • gchamonlive 5 hours ago
              Progress doesn't stop, however; the S-curve resets, because then you are optimizing a new architecture.
        • CuriouslyC 6 hours ago
          What people miss is that AI isn't one S-curve; each capability we try to bake into a model has its own S-curve. Model progress might not impact some capabilities at all, but other capabilities might get totally overhauled.
        • IanCal 3 hours ago
          Assuming it’ll stop soon is to wager that we’re at a very specific point on the curve.

          If it’s anyone’s guess then we’re much more likely to be left of that, unless you argue we’re already on the flat side.

        • baq 4 hours ago
          you can tell where on the sigmoid we're currently sitting? frontier lab folks can't - chapeau bas good sir
          • bigyabai 4 hours ago
            > frontier lab folks can't

            Do you have a source for this that isn't marketing spiel? There's a fiscal incentive to lie about scaling research.

        • holoduke 7 hours ago
          Software and hardware have no limits. Theoretically we could use bosons for computation and have, in one cm3, the same amount of computation as the current total in the entire world. Same with software: there has never been a stop to new algorithms. With LLMs there are so many parts that will get better, and none of them are very far-fetched.
          • oblio 4 hours ago
            > Software and hardware have no limits.

            Yeah, if time is infinite, R&D imagination is infinite, energy is infinite and material resources are infinite. Easy.

        • scotty79 7 hours ago
          It can be an S-curve (and it almost surely is), but on every chart you can plot, you don't see even an inkling of the bend yet.
        • jeremyjh 6 hours ago
          What the fuck does that have to do with “soon”?
        • Der_Einzige 6 hours ago
          This is FUD and extremely wrong. None of the advancements have followed an S curve. This time IS different and it should be obvious to you at this point.
      • Davidzheng 8 hours ago
        Deep think still makes many many many more mistakes than gpt 5.5 pro on math
      • civvv 11 hours ago
        There are many indications that model progress is slowing down, so that is not entirely accurate.
        • aspenmartin 8 hours ago
          Please be specific, because outside of anecdotal blog posts by people who don't know what they're talking about, it's not true. Look at scaling laws, or composite benchmarks like the Epoch capability index; nothing at all suggests "model progress is slowing down"
        • CuriouslyC 6 hours ago
          Model progress at spitting out unhallucinated facts is slowing down hard. Model progress at solving hard math challenges/programming tasks doesn't seem to be slowing down that I can tell.
        • StrauXX 10 hours ago
          Which indications are that?
          • nicoburns 7 hours ago
            The cost factors on the new models compared to the old models.
            • jeremyjh 6 hours ago
              Qwen3.6 9B is as good as GPT-4o and runs on my M2 MacBook Air. Models are getting stronger and less costly at the same time, but these are somewhat separate branches of research. Frontier labs are spending more because they are still getting marginal returns and there is more capacity to spend than there was a year ago.
              • gertop 5 hours ago
                Qwen 3.6 9B doesn't exist.

                If you meant 3.5 9B and you truly believe it's as good as 4o then I can only assume you have a very basic use case.

            • bdelmas 6 hours ago
              You are mixing up cost and progress. The fact that it's getting more and more expensive doesn't by itself mean that progress is slowing down.
              • nicoburns 6 hours ago
                They are intrinsically linked beyond a certain point. If we're making progress but costs are spiraling exponentially then it stands to reason that we will soon reach a point where we can no longer afford the increasing costs and thus progress will slow.

                (barring some breakthrough that reduces costs, which of course may happen, but for which recent model improvements are not strong evidence of)

            • aspenmartin 5 hours ago
              Cost for a specific level of performance decreases 10x per year; this has been a pretty consistent property for a while now.
          • overfeed 10 hours ago
            Investment dollars.
          • lionkor 9 hours ago
            Nobody is releasing NEW models
            • aspenmartin 8 hours ago
              …not only is this not true but it also doesn’t matter. Why would this indicate performance saturating?
            • kstenerud 8 hours ago
              What constitutes a NEW model for the purposes of calculating progress?
            • taneq 9 hours ago
              The standard networking connection has been called “Ethernet” for more than thirty years, so networking has stagnated, right?
              • SlinkyOnStairs 8 hours ago
                If higher-bandwidth networking consisted primarily of running more and more Ethernet lines in parallel, you would most certainly agree that "networking has stagnated".

                "Reasoning" and now "agentic" AI systems are not some fundamental improvement on LLMs; they're just running roughly the same prior-gen LLMs, multiple times.

                Hence the conclusion that LLM improvement has slowed down, if not stagnated entirely, and that we should not expect the improvements from switching to these "reasoning" systems to keep happening.

                • p1esk 7 hours ago
                  From TFA:

                  “ChatGPT came up with an idea which is original and clever. It is the sort of idea I would be very proud to come up with after a week or two of pondering, and it took ChatGPT less than an hour to find and prove”

                  • SlinkyOnStairs 7 hours ago
                    You misunderstand. I'm not saying that reasoning/agentic systems aren't better.

                    I'm saying they're not an advancement in the tech in the way GPT-1 through GPT-3 were. They're a different kind of improvement.

                    And as such, the rate of improvement cannot just be extrapolated into the future.

                    • p1esk 7 hours ago
                      The GPT-1 through GPT-3 advancements were exactly like using more Ethernet cables in parallel.

                      All the interesting conceptual breakthroughs came after GPT-3, RL and reasoning being the main ones.

            • GardenLetter27 7 hours ago
              What? DeepSeekV3 just came out and is incredible for the price. Mythos is also half-released.
              • nozzlegear 5 hours ago
                Until you or I can actually use Mythos in Claude without an nda or other strings attached, Mythos is not released and is just an effective marketing tool for Anthropic.
    • miki123211 55 minutes ago
      I think that ultimately, the largest change brought on by LLMs will be due not to their intelligence, but to their tenacity.

      If you had an infinite number of monkeys, each with a typewriter, one would eventually write Shakespeare. If you had an infinite number of college-educated interns, each with access to all the public records you can possibly get via FOIA, one would eventually get enough evidence to prove that a top politician is cheating on their partner, evidence which you could use to blackmail that politician.
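
      (That's just the usual infinite-monkeys arithmetic: if each independent attempt succeeds with probability $p > 0$, then

          $$\Pr[\text{at least one success in } n \text{ tries}] = 1 - (1 - p)^n \longrightarrow 1 \quad \text{as } n \to \infty,$$

      so outputs that are arbitrarily unlikely per attempt become near-certain with enough attempts.)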

      You don't need that much intelligence to do that, you just need somebody who's willing to dedicate their life to knowing everything there is to know about that guy from Louisiana.

      With humans, the amount of money you'd need to pay such a person just isn't worth the reward. With LLMs, it may very well be.

    • illiac786 11 hours ago
      Using the word “Mentoring” is anthropomorphic and subconsciously makes you think it will learn. It does not, and it is a formidable task for the human brain to remember that something as smart as an LLM does not learn. I keep catching myself making the same mistake.

      It’s also because it is so annoying to have to manage the memory of the LLM with custom prompts/instructions manually.

      I have not yet played with the long term memory feature, but I fear it will be even less reliable than prompts, simply because in one year or two years so much will have changed again that this “memory” will have to be redone multiple times by then.

      • timschmidt 11 hours ago
        They can form new associations between concepts via their input prompts and thinking text. That is a form of learning. Just not very durable. I liken it to https://en.wikipedia.org/wiki/Anterograde_amnesia
        • illiac786 11 hours ago
          yeah, I should have been more specific: I meant the type of learning that mentoring fosters, the long term learning.
          • timschmidt 11 hours ago
            I hear you. I think we are already seeing some middle ground with agentic systems using RAG, skills.md files, etc. It's a sort of disassociated card catalog memory. An engineer's notebook. Not the integrated, correlated, pre-processed set of relationships in the model. How to go backward from the notebook -> model cheaply without tanking performance is definitely one of those billion dollar questions.
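
            The "card catalog" shape is simple enough to sketch (toy code, not any real framework's API; real systems would score notes with embeddings rather than keyword overlap):

                # Notebook-as-memory: retrieve relevant notes and prepend them
                # to the prompt. The model stays frozen; only the notebook grows.
                NOTEBOOK = {
                    "deploy": "Use blue/green; keep the old cluster warm for 1h.",
                    "tests": "Integration tests need the fake S3 server running.",
                }

                def retrieve(query: str) -> list[str]:
                    q = query.lower()
                    return [note for key, note in NOTEBOOK.items() if key in q]

                def build_prompt(query: str) -> str:
                    notes = "\n".join(retrieve(query)) or "(no notes found)"
                    return f"Notes:\n{notes}\n\nTask: {query}"

                print(build_prompt("How do we deploy the service?"))
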
      • kybernetikos 9 hours ago
        Current LLM architecture doesn't learn - and you're right this is a huge piece that normal folks fail to understand, since in many ways, it's the opposite of what years of AI research has been trying to create.

        However, I think it's important to remember that LLMs are embedded in larger systems, and those larger systems do learn.

        • baq 4 hours ago
          If I was a frontier lab and I solved continual learning, as of today I would absolutely not release it - society isn't ready for this; society isn't even ready for widespread diffusion of current publicly available frontier models.

          If however I was a frontier lab who solved continual learning and my competitor also solved and released it, I would release mine immediately, obviously.

          The point is, continual learning might be solved already, we just don't know and those who might know would rather keep their mouths shut. It isn't my base case (financial situation of frontier labs is such that they'd probably release immediately as long as they have inference compute to serve this revolutionary capability), but it isn't impossible.

          • bigyabai 3 hours ago
            You're not a frontier lab, the shareholders own those. And if shareholders get a private briefing about an unprecedented breakthrough in continual learning, they would announce it from the rooftops to take credit for the progress ASAP and reap the rewards for their stock value.

            The only lab that I can exempt from this is DARPA.

            • baq 19 minutes ago
              Shareholders are not insiders. Public companies do secret projects all the time that shareholders know absolutely nothing about, and they may never learn the details if the projects get cancelled.
        • lukewarm707 8 hours ago
          exactly like you said - the harness might learn.

          we do also have training on synthetic data. it might compound.

      • freedomben 10 hours ago
        I mostly agree, though after a mentoring session you can ask it to write a skill or a memory, and that can be reasonably durable. For Claude at least, the memories work pretty well (though I am still at a small scale with them; as they grow it might start to break somewhat). It doesn't always work, but it has often enough that I thought it worth a mention.
      • stingraycharles 8 hours ago
        > Using the word “Mentoring” is anthropomorphic and subconsciously makes you think it will learn.

        I think this is a bit pedantic. Obviously the parent you’re replying to is referring to the concept of “in-context learning”, which is the actual industry / academic term for this. So you feed it a paper, and then it can use that info, and it needs steering / “mentoring” to be guided in the right direction.

        Heck the whole name of “machine learning” suggests these things can actually learn. “reasoning” suggests that these things can reason, instead of being fancy, directed autocomplete. Etc.

        In other news: data hydration doesn’t actually make your data wet. People use / misuse words all the time, and that causes their meaning to evolve.

        • kasey_junk 8 hours ago
          I agree it’s pedantic and personally don’t get bent out of shape with people anthropomorphizing the LLMs. But I do think you get better results if you keep the text-prediction-machine mental model in your head as you work with them.

          And that can be very hard to do, given that the UI we most often interact with them through is a chat session.

          • stingraycharles 5 hours ago
            Absolutely, but there is no evidence that the grandparent was doing that; all they did was use the word “mentoring”. My argument is not that anthropomorphizing isn’t a problem - it is - but that the response to this particular HN comment is super pedantic.

            Obviously the real people who classify AI as human intelligence aren’t going to be the top comment on a thread reviewing LLMs’ PhD-level papers. They are in very different, much more problematic areas of the internet.

        • nozzlegear 5 hours ago
          Anthropomorphism is a subtle marketing tool used by these big AI companies, who are financially incentivized to push the myth of AGI and want everyone to believe they're right on the cusp of achieving it. It's good to be pedantic in this case; we shouldn't anthropomorphize these tools.
          • stingraycharles 5 hours ago
            This is just a “hurr durr AI companies evil” argument without substance.

            It’s the people that are the problem. Nobody told the grandparent to use the word “mentoring”, and my argument is that it’s a complete overreaction to classify them as anthropomorphizing AIs; I’d argue defaulting to that accusation would be an insult to them, and it’s super pedantic.

            • nozzlegear 3 hours ago
              > This is just a “hurr durr AI companies evil” argument without substance.

              If you say so bud.

              > nobody told the grandparent to use “mentoring” as a word

              Nobody told people to say "Google it" either; nobody told us to use the word "Kleenex" when we mean tissue; nobody told us to use the word "Chapstick" when we mean lip balm. Nobody told British people to say "Hoover" when they mean vacuum, or "Sellotape" when they mean transparent tape.

              This is literally how soft influence works, it's how brands "colonize" language. A professor using the anthropomorphized word "mentoring" when talking about a machine, as if it's a student that can learn and develop relationships, is this same soft influence at work. The AI companies' websites are all riddled with cognitive language, their chat bots all use conversational UI like you're talking to a person, the bots answer with "we," "me," and "I." They created an environment that made anthropomorphized language feel natural, which only helps their marketing goals.

              Go ahead and call it pedantry all you want, but that's the whole point. The problem is epistemic.

        • thfuran 5 hours ago
          But in-context learning is like a student only remembering what they’re being taught for the duration of the discussion. That’s not really how mentoring is meant to work, so pointing out the issues with the metaphor seems pretty reasonable.

          In other news: that words can change meaning doesn’t mean that every possible change in meaning would be beneficial to communication and therefore desirable. Would you support someone suggesting we use “left” to mean “right”, simply on the basis that words can change in meaning?

    • maximamas 12 hours ago
      LLMs are at their best when you have an expectation for their output. I generally know the shape of the correct response, and that allows me to evaluate its output on its "vibes" rather than line by line. If there's no expectation, then I have to take everything at face value, and now I'm at the mercy of the machine.
      • jillesvangurp 11 hours ago
        Exactly. If I generate a large chunk of software, I'm going to have expectations about what it will do, how it will do it, etc. You don't just accept the statement that "it's done" as fact; you start looking for evidence.

        A scientific approach here is to look to falsify the statement. You start asking questions, running tests, experiments, etc. to prove the notion that it is done wrong. And at some point you run out of such tests and it's probably done for some useful notion of done-ness.

        I've built some larger components and things with AI. It's never a one shot kind of deal. But the good news is that you can use more AI to do a lot of the evaluation work. And if you align your agents right, the process kind of runs itself, almost. Mostly I just nudge it along. "Did you think about X? What about Y? Let's test Z"

        • noisy_boy 9 hours ago
          > Mostly I just nudge it along. "Did you think about X? What about Y? Let's test Z"

          Exactly - you need to constantly have your sceptic's glasses on, and you need to be exacting about the structure you want things to follow. Having and enforcing "taste" is important, and you need to be willing to spend time on that phase, because the quality of the payoff entirely depends on it.

          I recently planned a major refactor. The discussion with Claude went on for almost two days. The actual implementation was done in 10 minutes. It has probably made some mistakes that I will have to check for during the review, but given the level of detail that plan document had, it is certainly 90-95% there. After pouring in that much opinion, it is a fairly good representation of what I would have written, while still being faster than me doing everything by hand.

          • Applejinx 8 hours ago
            So you have to know the answer and also be an expert in the problem domain?
            • Brendinooo 1 hour ago
              I don't think you have to know the answer. If the person you replied to knew the answer, there wouldn't have been a big, lengthy discussion.

              But yes, being an expert in the problem domain helps. Or at least knowing enough to know what the right questions are and what plausible answers look like.

              I just had a similar situation where an hour or two of conversation turned into a five-minute robot coding task. The problem required a solution, and the number of possible solutions was vast, but that list can be refined, and then once the course of action is set, sometimes the course itself isn't all that complicated.

            • samuelec 7 hours ago
              In my experience you need exactly what you said, and I would add that he probably would have spent half a day doing the refactoring himself - and then he would have been sure he did it right.
        • kannanvijayan 6 hours ago
          I can speak towards building large-scale systems from scratch with these tools. I've been working since late last year on a project that was barely a tech demo, and the progression of development on that project has seen me go from leveraging co-pilot autocomplete at the start, to full-on vibecoding 100% of the new additions.

          I have reasonable eng chops, I'd like to think - I have been a senior IC for a while on a reasonably diverse set of challenging systems problems and built out some pretty large-scale pieces of software the old "artisanal" way.

          This particular project is a productization of some ideas I had for leveraging a virtual machine to execute high-divergence parallel logic on GPUs, in an effort to move complex things like "unit behaviour in games" (the classical symbolic kind, not NN-based unit behaviour) into the GPU. The project is going well but still quite a ways from release. But it's at about 300k lines of code now across 9 or so rust repositories, and a smattering of typescript on the frontend.

          I have had stumbles, but overall I feel I have put together some good strategies and principles for pushing large projects along with these tools in an effective way.

          The biggest takeaway for me is that the "feel" is different. Software construction by hand felt like building legos where you put the pieces together yourself. A lot of my focus would be on building and solidifying core components so I could rely on them when I stepped up to build higher-level components. Projects would get mired quickly if you didn't solidify your base.

          With agentic development, one of the early challenges I ran into was something I'll call "oversight inception". It's when, at some early point in the process, a somewhat low-importance decision is made - an implementation decision, or say a decision to align a test with the implementation rather than the implementation with the test.

          Then, as you build more on top of this, that small decision somehow ends up getting reified into a core architectural policy that then cascades up.

          You realize that when you're building a big project, the focus on some particular component is backstopped by a general understanding of local development directionality with respect to the larger project. And the agent has no idea of directionality.

          So small chinks in the design end up getting magnified and blown up as the dev process proceeds, and later on review you find major architectural pieces have just been overlooked, all flowing from some small incidental implementation choice a long time before.

          This is one among a number of issues, but it's a big one. Once I saw it happening I tried an approach to mitigate it by developing a set of golden "goal" documents that describe directionality at the project level: what you are working towards and what design components need to exist.

          This doesn't eliminate the "oversight inception" issue, but it does catch these problems earlier.

          When I started applying the goal documentation aggressively to re-align the project implementation direction, I found velocity dropped a lot.

          And as I progress, I'm balancing this out a bit - allowing the system to diverge a bit, but forcing reconvergence towards the goals at some specific cadence. I haven't found the right cadence yet, but I'm getting there.

          This new style of development feels more like moulding clay than assembling Lego. You sort of "get it into shape". It's a very interesting new set of process assumptions.
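
          (A minimal sketch of that diverge/reconverge loop, with a hypothetical agent interface and a made-up cadence, purely to illustrate the control flow:)

            # Sketch of the cadence idea: let the agent drift a little,
            # then periodically force an audit against the goal documents.
            # `agent.do(...)` is an assumed interface, not a real library.
            REALIGN_EVERY = 5  # made-up cadence; tune to taste

            def run_project(tasks, agent, goal_docs: str):
                for i, task in enumerate(tasks, start=1):
                    agent.do(task)  # normal forward progress; may diverge
                    if i % REALIGN_EVERY == 0:
                        agent.do(
                            f"Audit the last {REALIGN_EVERY} changes against "
                            f"these goals and fix any drift:\n{goal_docs}"
                        )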

      • ziotom78 11 hours ago
        I agree, but I would add that they can be very useful even if you do not have clear expectations but have some solid ways to verify their claims. Often in doing this verification I came up with new ideas.
    • tags2k 12 hours ago
      I'm no physics professor but this aligns with the way I use the tools in my "senior engineer" space. I bring the fundamentals to sanity-check the trigger-happy agent and try to imbue other humans with those fundamentals so they can move towards doing the same. It feels like the only way this whole thing will work (besides eventually moving to local models that do less but companies can afford).
    • _the_inflator 10 hours ago
      I agree, and I'd put it this way: LLMs sound so convincing, presenting the work they do in rose-colored terms and promising to give you more if you keep going.

      There is a 50/50 chance that it turns out to be right or that it lets you jump off a cliff.

      Either way, the trip is sold as the same beautiful five-star travel.

      Also, spotting an error and telling the LLM makes things worse in most cases, because the LLM wants to please you and goes on to apologize and change course.

      The moment I find myself in such a situation, I save or cancel the session and, in most cases, start from scratch or pivot with drastic measures.

      Gemini to me is the most unpredictable LLM while GPT works best overall for me.

      Gemini lately gave me two different answers to the same question. This was an intentional test: I was bored and wanted to see what happens if you simply open a new chat and paste the same prompt, everything else being the same.

      Reasoning doesn’t help much in the coding domain for me, because what the LLM comes up with as an explanation is very high-level and only formally right.

      I google more because of LLMs than before, because essentially what I'm witnessing is someone producing something that I have to vet before I press the button it comes with. And you only find out shortly afterwards whether the polished button actually works or gives you a warm welcome to hell.

      • MattPalmer1086 9 hours ago
        Reusing the same prompt several times is something I've started doing too. The contrast is often illuminating.

        In one case, it made a thoroughly convincing argument that an approach was justified. The second time it made exactly the opposite argument, which was equally compelling.

        I now see LLMs as persuasion machines.

        • eitally 6 hours ago
          One thing I've been doing lately -- and I'm in a business function, not a technical one, although I have an engineering background -- is pitting LLMs against each other. For example, if I'm structuring a proposal or a contract with the assistance of Claude, I'll begin my 360 feedback review first by asking Claude how it would react if it were the counter-party receiving the proposal. After some iterative changes, mostly manual, I will then run the same output document past Gemini and ask it to adopt personas from both sides and provide reactive feedback. The result of this is almost always a stronger proposal that I can also accompany with proactive objection handling and a solid FAQ, as well as clear points of negotiation that will likely be acceptable to both parties.

          For this sort of thing, using multiple LLMs is extremely helpful.

        • scotty79 7 hours ago
          Before AI happened, I watched YouTube. Occasionally I encountered very convincing arguments there. The same person often made very convincing arguments on many subjects.

          But I noticed that the closer the domain they were talking about was to my area of competence, the less convincing their arguments were. There were more holes, errors, and wrong conclusions.

          I recalibrated my bs meter thanks to that.

          Since AI came along, I have successfully used this strategy of being extremely cautious towards convincing arguments, so as not to be misled by AI.

          However, this year I'm working with AI more in the domain of software development, where I can see the competence. And I do see the competence. This has had the opposite effect on me: I tend to trust AI outside my domain of expertise much more after seeing what it can do in software.

          One caveat though: there are a lot of areas of human culture where there's very little actual knowledge but a lot of opinions - politics, economics, diet, business, health. I still don't trust AI in those domains. But then again, I don't trust humans there either.

          For me, AI has basically reached the threshold of useful reliability in any domain that humans are reliable at.

          I don't really care about sycophancy. I might have a slight advantage that I don't talk to AI in my native language. So its responses don't have a direct line to my emotions.

        • taneq 7 hours ago
          Ever since they started getting really sycophantic, I’ve been presenting my ideas as “my co-worker says this is a good approach but I disagree, can you help me convince him that it’s wrong?”
      • pbhjpbhj 10 hours ago
        >LLM wants to please you

        I was using Copilot and asked it a question about a PDF file (a concept search). It turned out the file was images of text. I was anticipating that and had the text ready to paste in.

        Instead, it started writing an OCR program in python.

        I stopped it after several minutes.

        Often Copilot says it can't do something (sometimes it's even correct); that's preferable to the try-hard behaviour here.

      • freedomben 10 hours ago
        > Gemini to me is the most unpredictable LLM while GPT works best overall for me.

        This nails an important thing IMHO. I've absolutely noticed this, for better or worse. Gemini can produce surprisingly excellent things, but its unpredictability makes me go for GPT when I only want to ask once.

    • mixtureoftakes 12 hours ago
      Please sign up for a paid plan of either ChatGPT or Claude. Gemini, while close, is still noticeably behind.

      You deserve opinions shaped by interactions with the best tools out there.

      • wg0 12 hours ago
        Gemini feels deep and philosophical, especially for product management. Tell it you're a product manager and that the two of you are a team.

        But regular reminder - All LLMs can be wrong all the time. I only work with LLMs in domains I'm expert in OR I have other sources to verify their output with utmost certainty.

        • wafflemaker 10 hours ago
          Or when you don't care about results being very correct.

          When I'm cooking meatballs with sauce and the recipe calls for frying them, I'll have an LLM guesstimate how long, and on which program, to run an air fryer to mimic the frying pan, based on a picture of the balls in a Pyrex dish. That way I can just move on with the sauce, instead of spending time browsing websites and stressing about getting it perfect.

          I used to hate these non-deterministic instructions, now I treat it as their own game. When I will publish my first recipe, I'll have an LLM randomize the ingredient amounts, round them up to some imprecise units and also randomize the times. Psychologists say we artists need to participate and I WILL participate.

        • smartmic 11 hours ago
          > I only work with LLMs in domains I'm expert in

          This. It should become a general rule for any non-trivial use of LLMs in a professional setting.

          • josu 4 hours ago
            LLMs can also be really good in fields where you are not an expert. You just need to be very aware of your limitations, and start parallel conversations so one agent fact-checks the other.
      • ainch 8 hours ago
        Agreed. Gemini is clearly a capable model, but the tool use is lagging behind the other two. Ironically, it regularly gets things wrong (e.g. the current version of some software) because of an unwillingness to use web search.
      • cubefox 12 hours ago
        Gemini is certainly not behind Claude in terms of physics.
      • peyton 12 hours ago
        Seriously, it’s not worth reaching for less intelligence. Use Extended Pro 100% of the time for anything you’d spend as much time on as GP spent writing their post.
      • hodgehog11 11 hours ago
        ChatGPT and Gemini are actually fairly comparable.

        Claude has been utterly useless with most math problems in my experience because, much like less capable students, it tends to get overly bogged down in tedious details before it gets to the big picture. That's great for programming, not so much for frontier math. If you're giving it little lemmas, then sure it's great, but otherwise you're just burning tokens.

    • Quothling 11 hours ago
      We've got a rather extensive AI setup through our equity fund, and I've set up a group of agents for data architecture at scale. One is the main agent I discuss with; it's set up to know our infrastructure and has access to image generation tools, web search, hand-off agents, and other things. I tend to use Opus (4-6 currently) and I find it rather great. As you point out, it comes with the danger of making mistakes, and again, as you point out, that's not an issue for things I'm an expert on. What I rely on it for, however, is analysing how specific tools would fit into our architecture. In the past you would likely have hired a group of consultants to do this research, but now you can have an AI agent tell you what the advantages and disadvantages of Microsoft Fabric would be in your setup. Since I don't know the capabilities of Fabric, I can't tell if the AI gives me the correct analysis of a Lakehouse versus a Warehouse (Fabric tools).

      What I do to mitigate this is have fact-checking agents, configured to be extremely critical and unbiased, on Opus, Gemini, and GPT, which are handed the entire conversation to review. Then it's handed off to an Opus agent which is set up to assume everything is wrong. After this, if I'm convinced something is correct, I'll hand the entire thing off to a Sonnet agent, which is set up to go through the source material and give me a compiled list of exactly what I'll need to verify.

      It's ridiculously effective, but I do wonder how it would work for someone who couldn't challenge the analytic agent on domain knowledge it gets wrong. Because despite knowing our architecture and needs, it'll often make conceptual errors in the "science" (I'm not sure what the English word for this is) of data architecture. Each iteration gets better though, and with the image generation tools, "drawing" the architecture for presentations - from C-level to nerds - is ridiculously easy.
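
      Roughly, the hand-off chain looks like this (a sketch with an assumed ask(model, system, user) helper standing in for whatever chat client you use; model names and prompts are illustrative, not a real API):

        # Sketch of the chained-critic pipeline described above.
        CRITIC = "You are an extremely critical, unbiased reviewer."

        def review(conversation: str, ask) -> str:
            # Stage 1: independent critical reviews from three models.
            critiques = [ask(m, CRITIC, conversation)
                         for m in ("opus", "gemini", "gpt")]
            # Stage 2: an adversarial pass that assumes everything is wrong.
            adversarial = ask("opus",
                              "Assume every claim below is wrong until proven.",
                              conversation + "\n\n" + "\n\n".join(critiques))
            # Stage 3: a cheaper model compiles what a human must verify.
            return ask("sonnet",
                       "List exactly which claims and sources a human must verify.",
                       conversation + "\n\n" + adversarial)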

      • stasomatic 7 hours ago
        Are you using this agent hive for any repeatable tasks? What you described, superficially, seems like a one off. Genuinely curious.
    • danielparsons 2 hours ago
      I find exactly the same for legal analysis. Great at ideation and proofreading but frequently misunderstands concepts and hallucinates conclusions from faulty premises.
    • wccrawford 9 hours ago
      This doesn't surprise me since the coding agents are similar. I've previously compared them to very fast, ambitious junior programmers. I think they are probably mid-level coders now, but they continue to make mistakes that a senior programmer wouldn't. Or at least shouldn't.
    • northzen 8 hours ago
      Hi ziotom! I'm curious about your work on 3D Clifford algebras. Could you share some links to the research you do? I'm also interested in this topic and research it on my own.

      Just in case you don't want to disclose your name, my email is northzen@gmail.com

    • port11 9 hours ago
      Gemini’s smug and over-confident “this is the gold standard in 2026” definitely leaves little space for nuance if you don’t know the subject matter. Human students would, hopefully, know they don’t know everything.
      • quantummagic 9 hours ago
        > Gemini’s smug...

        Anthropomorphizing these systems is dangerous, whether coming from the bullish or bearish perspective. The output is statistically generated by a machine lacking the capability to be smug.

        • Jtarii 9 hours ago
          >Anthropomorphizing these systems is dangerous

          That ship has sailed. Humans will anthropomorphize a rock if you put googly eyes on it.

          • bartvk 9 hours ago
            First I thought to myself, "my daughter does this and it looks so cute". Only as a second thought did I realize that your comment had just proved itself.
        • DiogenesKynikos 8 hours ago
          It's only "statistically generated" in the same way that your brain is just "neurons firing." That's the low-level description of what's happening, but on a higher level, it's correct to say that it's being smug.
          • antisthenes 4 hours ago
            > it's correct to say that it's being smug.

            It's not correct to say that it's being smug, because when people are smug, they are smug for a purpose - e.g. to signal higher social status or superior knowledge.

            A machine has no such imperative, so what you call 'being smug' is statistical mimicry.

    • recursivecaveat 13 hours ago
      This is close to my experience with code. LLMs can pick out small mistakes from giant code changes with surprising accuracy, or slowly narrow down a weird bug. On the other hand, I've seen them bravely soldier on under completely incorrect conceptual models of what they're working with and consequently churn around in circles, spin up giant piles of slop to re-implement something they decided was necessary but didn't bother to search for, or outright dismiss important error signals as just "transient failures". Unlimited stamina, low wisdom.
    • tasuki 11 hours ago
      > in 3D Clifford algebras it repeatedly confuses exponential of bivectors and of pseudoscalars.

      I have no idea what any of those words even mean. I'm sure LLMs make similar obvious-to-professors mistakes in all domains. Not long ago, we didn't even have chatbots capable of basic conversation...

      • jiggawatts 8 hours ago
        Ironically, it's sort of the other way around! Every frontier chatbot since GPT 4 (at least) has had a pretty good understanding of even very esoteric technical concepts.

        Bivectors and pseudoscalars (in a 3D context) are "just" signed areas and volumes. Easy!

        Back around the GPT 3, 3.5, and 4.0 era I used to ask the bots to explain "counterfactual determinism", which is one of the most complex topics I personally understand.

        Then I would lie to the bot about it, and see if it corrected me or not.

        This test is useless now, the frontier models can't be fooled any longer on such "basic" concepts.

        Conversely, LLMs are basically useless at anything that doesn't have enough (or no) public information for their training. Think: obscure proprietary product config files and the like, even if the concepts involved are trivial.

        Similarly, Clifford Algebra is a relatively niche (even "alternative") area of mathematics and physics, with vastly less written material about it than the competing linear algebra. Hence, the AIs are bad at it.

    • eth0up 7 hours ago
      Any experience with NotebookLM?

      Mine has been epically bad.

    • ed_balls 7 hours ago
      intern that never sleeps
    • wood_spirit 12 hours ago
      Chiming in to agree, but to clarify that the latest SOTA models are no better than Gemini.

      I put my stuff through several SOTA models and round-robin them in adversarial collaboration, and they are all useful even though, fundamentally, they don’t “understand” anything. But they are super useful delegates as long as deciding on the problem, the approach, and the solution all sits safely in your head, so you can challenge them and steer them.

      So I know the article is about one particular new model acing something, and each vendor wants these stories to position their model as now good enough to replace humans and all other models. But working somewhere where I am lucky enough to be able to use all the SOTA models all the time, I can say that they all keep making obvious mistakes, and using them all adversarially is way better than trusting just one.

      I look forward to the day a small open model that we can run ourselves outperforms the sum of all of today’s models. That’s when enough is enough and we can let things plateau.

      • energy123 10 hours ago
        Basically all Erdős problems that get solved with AI are solved with ChatGPT 5.* Pro, not Gemini/Opus.
        • 5555watch 9 hours ago
          I would guess it's because ChatGPT Pro allows for 80-minute "thinks". I've never had even remotely similar think times with Gemini Deep Think. It's generally around 10-15 minutes for math problems, and gets increasingly shorter over continued interactions.
    • cyanydeez 12 hours ago
      I've been watching the automation of things like flight control systems for the past decade, and the evolution of the fallback to a real pilot in the event of an emergency is what's most concerning about where LLMs are being embedded.

      Right now, we have a lot of smart people who have trained for decades to understand where these things go wrong and how to nudge them back, but that pool of people is slowly going to be replaced by less knowledgeable ones.

      At some point, a Rubicon will be crossed where these systems can't fall back to a human operator and will fail spectacularly.

      • pbhjpbhj 9 hours ago
        Watching a teenager approach their homework: instead of struggling with questions they don't know how to answer, they ask Gemini. Unfortunately, I think the mental struggle of approaching an answer is where much of the learning is. They also miss out on the reward for persistence: seeing things fall together.

        It is troubling. It suggests a plateauing of human understanding.

        • regularfry 9 hours ago
          It absolutely is where the learning is, that's pretty well established brain science.
      • regularfry 9 hours ago
        What that means practically is that we've got a generation - 25 years or less - to evolve these things not to need the fallback. If such a thing is possible.
      • leptons 11 hours ago
        We're on the road to Idiocracy.
    • DeathArrow 11 hours ago
      I don't think the experience with Gemini will be the same when using GPT.
    • ieieaaa 7 hours ago
      LLMs are the most powerful tool yet invented for searching across a huge information space in response to human input.

      That’s all they are. They don’t ‘know’ anything intrinsically and don’t ‘know’ what reasoning even is.

  • mxwsn 13 hours ago
    > Here’s a thought experiment: suppose that a mathematician solved a major problem by having a long exchange with an LLM in which the mathematician played a useful guiding role but the LLM did all the technical work and had the main ideas. Would we regard that as a major achievement of the mathematician? I don’t think we would.

    This is a cultural choice. It makes sense that in the mathematics culture we currently have, this is alien. But already, other fields, and many individuals, would disagree and say that the human did have a major achievement here. As long as human-AI collaborations are producing the best results, there is meaningful contribution by the humans, and people that are deeper experts and skilled LLM whisperers should be able to make outsized contributions. The real shoe drops when pure AI beats humans and human-AI collaboration.

    • pmontra 13 hours ago
      I replied to a comment about AI in sports, and I'll build on that here.

      We praise car drivers even though most of the performance in their sport comes from the car. The driver makes the difference when two cars are close in performance - brilliance or mistakes. Horse riders too.

      In the case of math, the human can lead the LLM on the right track, point it to a problem or to another one. So it deserves some praise.

      Then again, the team that built the car, cared for the horse, or built the AI might deserve even more praise, but we tend to care more about the single most visible human.

      • dmbche 4 hours ago
        Could you win an F1 race with the latest winning car against F1 drivers?
    • djeastm 6 hours ago
      >Would we regard that as a major achievement of the mathematician? I don’t think we would

      For some reason this reminds me of AI images and a domain like comedy.

      If an image makes people laugh, the person who prompted it to make the image certainly doesn't get credit for the vast majority of the work in its creation, but perhaps they do get credit for the initial prompt idea and then the "taste" to select that particular one from whatever drafts they went through or otherwise guiding it.

      So if a mathematician comes up with an amazing result that an LLM "did", I think they could still get a bit of credit for prompting it to do it and being its guide.

      But whereas the first person could perhaps be called a comedian and not an artist, would the mathematician still be called a mathematician or something else?

    • gobdovan 8 hours ago
      I would. Even if someone just found a prompt, or even automated the conversation and searched all open math problems, I still would. If they produced a useful result without harm to anyone, that's a valuable human activity that should be rewarded just as well as we reward other mathematicians - which I imagine is quite a lot, given all the billionaire mathematicians...
      • 542458 7 hours ago
        > given all the billionaire mathematicians

        We just call those ones “quant traders”.

    • bambax 12 hours ago
      It may not be a major achievement by the mathematician (although it's debatable) but it would still be a major result.
  • few 14 hours ago
    >So if your aim in doing mathematics is to achieve some kind of immortality, so to speak, then you should understand that that won’t necessarily be possible for much longer — not just for you, but for anybody.

    This made me a little sad

    • mentos 7 hours ago
      I watched the movie '21' (2008) for free on YouTube yesterday.

      The opening of the movie features the MIT campus full of students navigating its grounds and all the promise and status that higher education brings. [0]

      Gave me the same sense of sadness realizing how much will fall to AI.

      [0] - https://youtu.be/0lsUsWdkk0Y?si=TJl7f_b1RcWcDqF8&t=278

      • lexoj 7 hours ago
        Not free in my country, didn’t know YouTube was broadcasting full movies in certain regions as you imply.
    • vessenes 6 hours ago
      This was the most interesting line in the essay to me as well - I instantly flashed back to quitting an academic math career; the way I thought about it at 19 or 20 was that I didn’t think I could be world class at it. (Rightly.) The next thought I had was “what am I good at?” And implied in that was, at the very least, “What could I be world class at?” Or at least very good at.

      I don’t think I ever thought I was good enough to try and get (math) immortality by finding and naming some result that would live beyond me, but if I had, perhaps this bad news would have had a similar impact on me.

      That said, I think I disagree with the premise at the margin, at least. I don’t care how many proof assistants or cluster compute is used - the team or person that proves the Riemann Hypothesis will be famous, or at least math famous.

    • jdale27 12 hours ago
      I don't know that it's that disappointing. I doubt most of the great mathematicians were actually doing it to achieve immortality. I suspect most of them were either after (possibly indirect) practical applications (via the math -> physics -> engineering pipeline) or just "for the love of the game", appreciation of the beauty of math and the intellectual joy of doing it. AI might also take over the practical application side, but the other aspects are still there for the taking.
      • hodgehog11 12 hours ago
        Exactly. Gowers is in the unique position to think about the "glory" of frontier mathematics, but for essentially everybody (especially those working outside of number theory), that dream died long ago. There are far too many mathematicians now.

        Many mathematicians work because they love the breakthrough (a certain quote of Villani comes to mind). They love finding new results, uncovering new mysteries. From that point of view, having an AI that can build on your basic ideas and refine them into more powerful arguments is awesome, regardless of who gets the credit. There are those that treat it more like solving puzzles so the result is not of interest. From that point of view, I can see the dissatisfaction. But I have found those with that viewpoint don't tend to make it as far in academia as those with the other viewpoint.

    • bananaflag 14 hours ago
      Now repeat that for every sort of human achievement
      • bel8 13 hours ago
        Machines are coming for table tennis too :(

        https://www.youtube.com/watch?v=VVEzgYxDdrc

        • pmontra 13 hours ago
          Sports are safe. Machines came for runners long ago (MotoGP, Formula 1), and yet we cheer the winners of the 100 m at the Olympic Games. Fully autonomous bikes and cars won't change that. AIs destroy chess players; we still cheer the world champion.

          We care about sports with humans.

          • fragmede 13 hours ago
            Robot MotoGP would be amazing to see just how far the limits could be pushed without risking the life of a human though. Or even full size remote control.
            • Ekaros 10 hours ago
              Sadly, I don't think there are any tracks safe enough for proper autonomous car racing without limits... Still, it would be interesting to see the absolute best you could do if the rules specified only, say, a minimum number of wheels and maximum vehicle dimensions.
  • MinimalAction 14 hours ago
    As a graduate student, this piece made me sad. I always believed that my work speaks for itself and transcends my limited time in this cosmic experience. This notion of immortality was just a small intangible bonus I hoped for when I jumped into grad school. AI is making me feel less worthy.
    • hodgehog11 12 hours ago
      As someone who is much further down the track, I would kindly suggest you drop that line of thought. I've seen far too many brilliant and ambitious people drop into depression because of it.

      You are worthy of doing this work because you are able to do it. Do the work because you love it and because you love the mystery. Enjoy every moment that you get to do it. Find joy in the great fortune you have to do this work while others toil away on tasks that bring them no satisfaction. Sometimes it's tedious, but sometimes it's incredibly rewarding in its own right.

      Don't work for the possibility of eternal glory though, it just doesn't exist anymore.

      • MinimalAction 5 hours ago
        Thank you for this comment. I often fall into questioning the why of graduate school. The pay is insufficient and the hours are long, but at least I find it very satisfying on good days. It is just the feeling that what I do may not be unique anymore that sucks. I didn't necessarily mean finding glory through incredible work alone, but through being unique in the problems I choose. Anyway, I digress.
    • helloplanets 36 minutes ago
      "If you value intelligence above all other human qualities, you're gonna have a bad time." - Ilya Sutskever, 2023
    • whatever120 14 hours ago
      You are worthy. You will hone your skills in grad school and be able to command these AIs better than somebody who hasn’t struggled with hard problems for a long time.
      • jlarcombe 13 hours ago
        A depressing thought that all that work is just so you can "command AIs better"
        • folderquestion 11 hours ago
          It could happen that the AI, in the near future, is not something external but just a part of your brain, so you retain the glory.
        • alexashka 10 hours ago
          All that work to kick a ball into a net.

          Nobody looks at this species and goes hm, rational and reasonable :)

    • timedude 5 hours ago
      Let me tell you, there is a ton more to learn in this reality than LLMs are capable of finding out on their own, especially when it comes to truth, ethics, and morality. And those are the only things that matter in the end, when you leave this reality. A greater challenge does not exist.
    • ionwake 11 hours ago
      I feel bravery transcends time better than the odd scientific breakthrough, which is often attributed to one person but whose roots came from a "lesser" unknown.
    • kranke155 6 hours ago
      try meditation.
      • MinimalAction 5 hours ago
        Thanks. This might help. Are you suggesting any particular form?
        • kranke155 4 hours ago
          Just meditate everyday.
    • alexashka 10 hours ago
      > I always believed that my work speaks for itself and transcends beyond my limited time on this cosmic experience

      Any statement preceded by the word 'believe' is a coping mechanism.

      > This notion of immortality was just a small intangible bonus I hoped for when I jumped into grad school

      Any statement preceded by the word 'hope' is a coping mechanism.

      > AI is making me feel less worthy

      Worth comes from understanding, not achievement.

      • MinimalAction 5 hours ago
        I strongly disagree that beliefs and hopes are coping mechanisms. Coping with what? Beliefs and hopes are what they are.

        But I agree worth should be derived from understanding, not through achievement.

  • NotOscarWilde 14 hours ago
    As a TCS assistant professor from Eastern Europe, I always am a little jealous of the biggest names in math having such an easy access to the expensive, long thinking models.

    Paying for Pro from any of my current academic budgets is completely out of the question here -- all budgets tend to have restricted uses, and software payments fit into very few categories. Effectively, I'd have to ask for a brand new grant and hope that the grant rules allow for large software payments and that I don't encounter an anti-AI reviewer; such a thing would take one year at least.

    As a nail in the coffin, I was "denied" all Claude Opus access recently as part of Microsoft's clampdown on individual (and academic) use of Copilot.

    (ChatGPT 5.5 Plus does not seem sufficient for any deeper investigations into new research topics; I've tried.)

    Apologies for the rant.

    • vthallam 13 hours ago
      @NotOscarWilde drop your email here, I will reach out and happy to get you a pro account for a few months so you can try 5.5 pro.(work at OAI)
      • teiferer 12 hours ago
        While this sounds generous (and in some ways it is), it does not address the general point that GP is making: the systematic disadvantage that large parts of humanity have w.r.t. access to the tools. You could say they can't drive a Lamborghini either, but that also doesn't solve the problem.
        • NotOscarWilde 11 hours ago
          You're absolutely right (pun intended).

          An aside: It was a very nice gesture and completely unexpected by me, so even if it doesn't work out, it made my day. I personally believe that kind gestures have a lot of power.

          Back on topic: There is a real danger of the gap between rich and poor universities significantly widening in all fields if the rich can afford Pro level models, or even hardware that can run their own comparable models, and this being fiscally inaccessible to the rest.

          One can sweep this under the rug by blaming educational funding, but that just shoots down all discussion. Even if a country's GDP goes up by a lot -- as Poland's has -- it takes time before any budget benefit trickles down to the education budget, and with some governments it might never do so.

          I believe Microsoft et al do have the most power here to boost affordable access to AI for researchers on a large scale; the fact that they cut some too expensive models (Opus, 5.5) from their academic benefits package is a grim omen. I do realize they would like universities to pay them also, and ultimately the universities should do that -- but then we are back at the institutional level of the problem.

          • 34qJhah 5 hours ago
            It isn't a nice gesture - it is guerrilla marketing! (pun also intended, but I mean it)
        • panagathon 5 hours ago
          > While this sounds generous (and in some ways it is), it does not address the general point that GP is making. That is, the systematic disadvantage which large parts of humanity have w.r.t. to access to the tools. You could say they can't drive a Lambhorgini either, but that also doesn't solve the problem.

          This was also the case historically, when being at certain universities - with better professors, a better scope of works available at the library, etc. - would necessarily provide a systematic advantage.

          This is the reality of progress. It is always unevenly distributed.

          I do think the open source side of model development is a substantial counter to the pessimism here.

        • Scea91 11 hours ago
          It's a problem of the individual institutions and countries. The budget currently required for AI tools is negligible compared to other university expenses. We don't need to call everything a systemic disadvantage when the disadvantaged (at the institution level) have agency here.
          • teiferer 5 hours ago
            > The budget required for AI tools currently is negligible compared to other university expenses.

            Is it? Do you have any idea what the salary of a mid-tier university researcher in an Eastern European country is? Or in Africa or south-east Asia? With SOTA LLM pricing you easily get into the same order of magnitude, so labour costs for researchers at such universities would essentially double. Not "negligible" at all.

          • NotOscarWilde 11 hours ago
            Can you tell me what is the budget necessary to supply AI tools capable of substantial research assistance to all academic staff at a university?

            You seem to have a good estimate in your head; I definitely do not.

            From personal experience, ChatGPT 5.5 (the Plus tier) is excellent for programming tasks and also for various teaching related tasks but I have not observed the research benefits that Tim Gowers has when I asked it questions in my area of expertise. So the costs are definitely higher than a few dozen $ a month per PhD/professor.

            You might be right that universities should immediately spring into action and demand funding for research-level AI resources and hardware. One thing you might be mistaken about is flexibility: public universities are unfortunately very inflexible institutions, partly because they have a large internal leadership structure AND they are funded by the state, so even if the entire university agrees on something, the funding is at the whim of the ministry of education and thus the current political leadership.

            • krab 11 hours ago
              > Can you tell me what is the budget necessary to supply AI tools capable of substantial research assistance to all academic staff at a university?

              I think the GP meant that *if the tools provide substantial benefit* to staff, their costs can be compared to salaries and other large expenses of the university. The $100/month subscription costs less than your office space.

            • teiferer 5 hours ago
              Which is good, since public money is tax money, so it better be spent wisely and not just thrown at the latest hype without thinking properly about it. It's a feature that public spending moves slowly, we should all be thankful for it.
        • snayan 11 hours ago
          I mean, I don't think OpenAI should be wading into the policies and practices of foreign institutions and governments. Look at all the blowback we see from the collision of Anthropic or OpenAI and the US government.

          At present, the tools are available to whoever wants to buy them. It's not OpenAI's fault that the parent commenter's government and/or institutions' policies haven't been updated to allow for their purchase and use.

          I'd argue that the OpenAI dude/dudette's level of generosity is appropriate given the circumstances.

      • NotOscarWilde 12 hours ago
        This requires a major "dox" of myself, but I am really grateful for the offer, so these are my academic contacts:

        https://pastebin.com/hNYrCjhL

        I probably will erase the contents in a few days.

        Even if you just drop an email and it doesn't work out, I appreciate this gesture so much. Thank you.

        • vthallam 12 hours ago
          Got the contact, will reach out tomorrow, you can delete them.
        • teiferer 12 hours ago
          [flagged]
      • thierrydamiba 13 hours ago
        Shoutout to you - I will match it if they need other resources. (I don’t work at OAI, just think this is cool)
        • alsetmusic 13 hours ago
          You know what, I'm ashamed that I didn't think of this. I'll sponsor three months. Email in my hn profile. I don't understand the math in the article, but I'd love to help you make progress in it.
        • NotOscarWilde 11 hours ago
          I will leave the contact up for a bit longer if people want to get in touch and share their experience with the research gap of the models -- or anything, really -- but I do not think there is any need of further support. Like I said elsewhere, the offer of support made my day and the gesture is enough.

          Thank you.

      • lelanthran 4 hours ago
        This doesn't solve the problem, though: having the ability to finish a field of study without paying a toll to a token provider.
      • layer8 6 hours ago
        And what should he do after the few months?
    • johndough 12 hours ago
      At my university, everyone had to pay their AI subscriptions out of their own pocket, until a communal AI service was introduced recently. It took 2 years to set up and only serves gpt-oss-120b, so everyone is still using other services. But at least some admin can scatter the word "AI" all over the university's website now and has an excuse to reject any requests for AI subscriptions because "we already have AI".
    • alsetmusic 13 hours ago
      It’s a classic example of the best positioned people being in the best position to keep reaping all the rewards.

      There’s the example of a poor person and a rich person buying boots. The poor person’s boots wear out and have to be replaced, while the rich person’s boots last for many years due to higher-quality craftsmanship. Over the years, the poor person ends up paying more for boots.

      • huijzer 13 hours ago
        I know the example, but as a counter-argument: often more expensive boots are not more durable. It’s about spending time to learn to spot the quality.

        Of course if you are really poor, then you have to take expensive shortcuts, but for most people that shouldn’t be the case. Learning to do more with less money isn’t as bad as many people think. It’s also good for the brain to be a bit more creative.

        • NotOscarWilde 7 hours ago
          > Learning to do more with less money isn’t as bad as many people think.

          We are wading into philosophy here, but I believe this analogy doesn't track in this case -- my suspicion from this blog post and others is that already today, the Pro level thinking models are a positive multiplier to your research output similar to how the models one level lower are a multiplier to one's programming output.

          Maybe one can someday use the cheaper models similar to how you can use cheaper models than Opus/5.5 and still be nearly as productive as a programmer -- but I am trying and failing doing exactly that for research questions.

      • m_mueller 12 hours ago
        Here I think it's less about "poverty" (non-US academic budgets are still high, though not in the same sphere) and more about red tape when it comes to software. My experience doing a PhD in Japan was: everything you can touch was basically a free-for-all - including $500 keyboards and $10k Mac Pros, especially if you are a valued researcher. But software, oh man, how can we prove receipt of goods to accounting...
    • bambax 12 hours ago
      OpenRouter lets you pay by the token only (no subscription), has all the frontier models (including Opus 4.7, GPT-5.5) and most of the others, and if you use it sparingly it usually turns out to be quite cheap.
      • johndough 12 hours ago
        API pricing for Claude is about an order of magnitude more expensive than subscriptions (numbers: https://she-llac.com/claude-limits). But it may be worth it with DeepSeek V4 Pro, which is currently on discount.
        • bambax 12 hours ago
          Depends very much on usage! If you connect it to tools like Cursor, etc. then yes a subscription is probably cheaper -- although, you'd have to subscribe to each provider if you want to use them all.

          But if you ask questions occasionally, (and don't resend, for example, your whole codebase with each request), then the API feels really cheap, even for the frontier models.

      • tasuki 11 hours ago
        My problem with pay-by-the-token is that it discourages me from using the thing ("oh, this prompt will cost me $0.10"), so I pay for a subscription which I'm pretty sure costs me about two to three times what I'd pay in API costs, but which encourages me to use it more ("oh, I have a subscription already, better make use of it").
    • ziotom78 13 hours ago
    I fully understand your rant! I pay ~20€/month for the Pro account, as my university has a deal with Microsoft and only seems to recognize Copilot, so it's very hard to use one's own funding to pay for anything else.
    • qq66 13 hours ago
      Paste what you want me to ask 5.5 Pro and I'll paste you the response.
    • nerdsniper 10 hours ago
      I believe ChatGPT 5.5 Pro access is available for $100/month, is that an unrealistic level of expense for someone in your position and geography? Even if the university won't pay for it, it seems you'd like to use this tool for your own goals.

      I'm not trying to shame here, just curious whether this is completely unattainable for most researchers in your area.

      • Computer0 9 hours ago
        It appears that in their country someone in their position makes about 50k usd annually. I make a similar amount in my country and cannot justify it.
    • dyauspitr 14 hours ago
      [flagged]
      • bananaflag 14 hours ago
        For a TCS assistant professor in Eastern Europe, $200/month would be 20% of their salary.

        And the situation is better, ten years ago it would have been 80%.

      • iammrpayments 13 hours ago
        The average European salary is around $4000/month; in Eastern Europe it's half of that, and the median is probably lower still. Makes me want to quit visiting places like Reddit where everybody claims to be making $100k+/year.
        • goobatrooba 13 hours ago
          All salary discussions need a cost-of-living context. Yes, in Europe you earn a bit less, but the public services are much better than in the US, and one emergency (e.g. healthcare) won't ruin you, as it's mostly a public system.

          I'll take a Euro salary and quality of life over a FIRE-type salary and daily fear of falling into the abyss any day.

          • revolvingthrow 13 hours ago
            Given the topic and the fact llm providers charge global rates, the absolute take-home money is much more relevant. Even if you live like a king on $1000/mo, 5.5 pro is still $200.
            • fakedang 13 hours ago
              Their loss if they don't move to regional pricing. AI will continue to remain an upper-management luxury then, and won't reach the mass adoption required to justify their outsized valuations.
              • revolvingthrow 13 hours ago
                Regional pricing makes sense for products that don't have ongoing costs, or where most of the input cost can be offset by local labor. You're not buying server racks or electricity at 1/3 of the price to serve poorer markets.
                • teiferer 12 hours ago
                  AI pricing is not mainly about cost, it's about market realities, i.e., charging exactly the sweet spot to maximize profit.
      • xanrah 13 hours ago
        Lots of people in the west can’t afford 200 a month. How rich are you?
        • dyauspitr 13 hours ago
          That’s what most people in the US spend on their phone and internet connections per month. That’s what the average American family spends on just five days of food.
          • sevg 13 hours ago
            You can afford five days of food, so that must mean you can also afford a Claude Max plan? What kind of logic is this?
          • skrebbel 13 hours ago
            Fwiw your comments here read to me as “I’m super rich and everyone I know is super rich too, and I can’t imagine that anyone isn’t”.
            • dyauspitr 13 hours ago
              People spend much more than that on just commuting to work. If you can spend $200 a month to supercharge what you do at work and 1000x your productivity, it's a no-brainer.
              • skrebbel 12 hours ago
                From what money? Just pause the health insurance for a while? Stop paying the rent? No diapers for the kid?

                Your entire story only makes sense if you have many hundreds of dollars/euros of entirely disposable income every month left, after all unavoidable expenses have been paid for. I understand that this holds for you and everyone you know but I’d like you to appreciate that for very many people it doesn’t.

          • fuzzy2 13 hours ago
            Yes and? That's money that is already allocated. It cannot be spent on something else.
            • xmprt 13 hours ago
              No you don't get it. If the family just starved for 5 days then they could increase revenue for these AI companies.
          • xanrah 13 hours ago
            37% of Americans would be unable to cover an unexpected $400 expense without using one or more credit cards. 13% would flat out be unable to cover it. [1]

            Are you honestly saying most families would be able to justify $200 a month for ChatGPT?

            [1] https://www.federalreserve.gov/publications/2025-economic-we...

      • NotOscarWilde 13 hours ago
        There is a significant gap between what academics are paid across European countries, and since most top universities here are public institutions, you are right -- Eastern European government employees tend to be on the poorer side.

        There are several other philosophical arguments against what you propose but I do not wish to go down that route.

      • skullone 13 hours ago
        Bruh, $200/m for most people in the US is also a hard "no!". That's a lot of money. Plus Anthropic isn't doing good deals with orgs that spend less than 250k a month. It's ridiculous.
      • jdw64 13 hours ago
        [dead]
  • TrackerFF 5 hours ago
    The vast, vast majority of students going into higher education this fall will not contribute much to science until 4-5 years down the road (should they do research). Realistically 6-7 when they're in full swing with their Ph.D.

    If we look at where these models were 5-7 years ago... the existential threat to the Ph.D. was not even on the radar back then. The people finishing up their doctorates now are the first who can truly leverage these tools.

    Now, if these to-be researchers feel defeated (enough to quit), or completely lean on AI models to do the work for them, we're going to have a problem. Same with the funding of those Ph.D. positions. If we move away from "funding to produce researchers" to "funding to achieve results", will money that was usually spent to fund Ph.D. students start to flow towards compute?

    If we look at it a bit cynically: Some researcher will be able to pump out a lot more papers by spending money on compute, than a couple of years of training students.

    Interesting times. But also so much uncertainty. I feel terrible for the students that will have to decide now what they want to do, with all this knowledge.

    • ndkap 3 hours ago
      PhD students are already using AI models to work for them. Most of the PhD candidates I know have the $200 Claude Max plan, which they use to the fullest.

      I see that they are able to do research that they were not previously able to do. And although using AI has certainly diminished their ability to code some stuff up, I see it the same way as someone using scikit-learn or PyTorch to code their ML models -- indeed the underlying details are abstracted away from you, and without AI you won't be able to do much, but the research that you do is happening because of you and wouldn't have happened with just the AI doing the research.

    • robot-wrangler 4 hours ago
      > Now, if these to-be researchers feel defeated (enough to quit), or completely lean on AI models to do the work for them, we're going to have a problem. [..] If we look at it a bit cynically: Some researcher will be able to pump out a lot more papers by spending money on compute, than a couple of years of training students.

      Obviously this is already happening and will accelerate. Outside of grad work, you could already just buy a degree. Certainly in the softer disciplines, you can currently just buy a PhD thesis and a good publication history. If you're in industry instead of academia, you can even buy a promotion. If your employer gives an AI budget to all workers, then you quietly double that budget out of your own pocket for as long as it takes to get a promotion, then stop and just enjoy a bigger paycheck.

    • odyssey7 5 hours ago
      It’s not as if institutions have been lavishing PhD students with money up until now.

      As an afterthought budget item, those funds aren’t exactly attractive targets to raid for pursuing an expensive, different process.

  • bustermellotron 14 hours ago
    I saw Tim Gowers give a talk at the AMS-MAA joint meeting in Seattle about ten years ago where he predicted that in 100 years humans would no longer be doing research mathematics. I wonder if he’s adjusted his timeline.

    At the time I thought the key missing tool was a natural language search that acted like mathoverflow, where you could explain your problem or ideas as you understood them and get references to relevant literature (possibly outside your experience or vocabulary).

    • 34qJhah 5 hours ago
      And Teichmüller thought that Germany would win WW2 and volunteered for the Eastern Front.

      Being a gifted mathematician does not make you right. In fact, mathematicians have a lot of bizarre theories.

  • kang 9 hours ago
    > The lower bound for contributing to mathematics will now be to prove something that LLMs can’t prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting.

    5.5 Pro is amazing, but this implication might not be true, and it is the core argument of this piece.

    AI will prove all sorts of things - interesting, boring & incorrect.

    To sort them will be the task of the PhD.

    • layer8 6 hours ago
      The task of a proof verifier is much simpler than the task of a proof finder (the gap between the two is essentially what the P vs. NP question asks about), and hence the bar for the required skills is lower. Merely verifying proofs isn't research, and doesn't impart research skills.
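
      To make the asymmetry concrete, here is a toy sketch in Python (my example, nothing from the article): checking a candidate SAT assignment takes one linear pass, while finding one by brute force can take up to 2^n tries.

        from itertools import product

        # CNF formula: literal k means variable |k|, negated if k < 0.
        clauses = [[1, -2], [-1, 2], [2, 3]]

        def verify(assignment):
            # Verification: one linear pass over the clauses.
            return all(any(assignment[abs(k)] == (k > 0) for k in c)
                       for c in clauses)

        def find(n_vars):
            # Finding: up to 2^n candidate assignments.
            for bits in product([False, True], repeat=n_vars):
                candidate = dict(enumerate(bits, start=1))
                if verify(candidate):
                    return candidate
            return None

        print(find(3))  # {1: False, 2: False, 3: True}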
  • momojo 13 hours ago
    Sorry, I'm reposting a comment I made yesterday that seems fitting:

    > This reminds me of Antirez's "Don't fall into the anti-AI hype". In a sentence: These foundation models are really good at optimizing these extremely high level, extremely well defined problem spaces (ie multiply matrices faster). In Antirez's case, it's "make Redis faster".

  • MrDrDr 9 hours ago
    > "Even though I can motivate it in retrospect, ChatGPT’s idea to use h^2-dissociated sets to control relations of order at most h feels quite ingenious. As far as I can tell, this idea is completely original."

    The question that keeps bothering me is: can an LLM generate an idea that is truly novel? How would/could that actually happen? But then that leads to the question: what are we actually doing when we think?

    Perhaps it's as simple as the ability to just make mistakes that matters, the same things that powers evolution. As long as the LLM can make mistakes, it's capable of generating something genuinely novel. And it can make more mistakes much faster than we can.

    • charleshn 9 hours ago
      Yes, they can.

      Some people like to parrot "next token prediction", "LLMs can only interpolate", and other nonsense, but it is obviously not true for many reasons, in particular since we introduced RL.

      Humans do not have a monopoly on generating novel ideas; modern AI models using post-training, RL, etc. can come to them the same way we do: exploration.

      See also verifier's law [0]: "The ease of training AI to solve a task is proportional to how verifiable the task is. All tasks that are possible to solve and easy to verify will be solved by AI."

      This applied to chess, go, strategy games, and we can now see it applying to mathematics, algorithmic problems, etc.

      It is incredibly humbling to see AI outperform humans at creative cognitive tasks, and realise that the bitter lesson [1] applies so generally, but here we are.

      [0] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...

      [1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

      • vld_chk 5 hours ago
        I genuinely start to think that we, as humanity, severely overestimate our cognitive abilities. We act so surprised: "just a few years of LLMs with a few RL tweaks match our PhD levels! It must be hidden inside our knowledge base!" Hm, what if not? What if our "PhD level" is just a very low level compared to the upper boundaries of measurable intelligence? What if we need to learn to be humble and stop treating our minds as the "sacred source of creativity and intelligence"?
      • energy123 5 hours ago
        RL or no RL, AI cannot escape the distribution it's trained on. It's just that the labs will put so much into the distribution that we won't be able to tell the difference that easily, nor will it matter for most tasks. The reason AI does well on ARC-AGI-2 is because the labs created synthetic training data using similar puzzles.
        • crthpl 4 hours ago
          Yes it can! That's the whole point of RL! It generates slightly out-of-distribution rollouts, and rewards good rollouts to change the distribution of the output.
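
          A minimal toy sketch of that point in Python (mine, not how any lab actually trains): sampling rollouts and renormalizing toward the rewarded ones moves the sampling distribution itself.

            import random

            probs = {"a": 0.9, "b": 0.1}  # toy "policy"
            reward = lambda act: 1.0 if act == "b" else 0.0

            for _ in range(200):
                # sample a rollout from the *current* distribution
                act = random.choices(list(probs), weights=probs.values())[0]
                # crude stand-in for a policy-gradient update
                probs[act] += 0.1 * reward(act)
                z = sum(probs.values())
                probs = {k: v / z for k, v in probs.items()}

            print(probs)  # mass has shifted toward the rewarded "b"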
          • energy123 39 minutes ago
            That's not out of distribution, that's inside the distribution of the rollout. If you don't create rollouts for the game of Chess, then it doesn't know how to play Chess no matter how smart it is at tasks you've created rollouts for. It's structurally stuck in its distribution.
      • jdub 7 hours ago
        Reinforcement learning for "reasoning" perturbs the model to generate completions in a particular chain of thought / alternative selection structure. It's three next token predictors in a trench coat.
        • munksbeer 4 hours ago
          When these things start solving many more long-standing problems, and start introducing more novel problems, will people finally admit that "next token predictor" is not the gotcha they think it is?
        • charleshn 6 hours ago
          > Some people like to parrot "next token prediction", "LLMs can only interpolate", and other nonsense

          Thank you for illustrating my point.

    • eterm 9 hours ago
      My own take, and it's veering into the Philosophy of Mathematics, but there's a debate about whether Mathematics is "Invented" or "Discovered".

      If it's "invented", then it requires ingenuity.

      If it's "discovered", then it was always already there, just waiting for the right connections to be made for it to be uncovered and represented in a way we can understand.

      Invention requires ingenuity, but discovery does not. So if LLMs can generate truly novel mathematics, for me that settles it that mathematics is indeed discovered, as LLMs are quite capable of discovery, yet I don't consider them capable of invention.

      • layer8 5 hours ago
        Mathematical concepts are invented, but they live in a space of possible (conceivable) mathematical concepts, and we can only invent concepts from that selection of possible concepts. This can be reframed as a process of discovery regarding which conceptions are possible.

        Furthermore, the results of theorems aren’t an invention, they are a discovery of what the base assumptions (axioms) logically entail. Finding out which theorems are true and provable is a discovery process. For example, the results of Gödel’s incompleteness theorems were a discovery. They weren’t invented, in the sense that the results couldn’t have been otherwise. We merely could have failed to discover them.

        This also holds for physical inventions. You discover a working way to build some functioning mechanism. It’s a process of discovery of what is possible in the physical world.

        Whether you portray something as a discovery or as an invention is more a matter of degree, a matter of which angle one is looking at it from.

        The possible states of an LLM are finitely enumerable. The same likely holds for the possible states and configurations of a human brain, in approximation. Therefore there is only a finite set of possible ideas, thoughts, and conceptualizations an LLM or a human can have, and in principle they could be exhaustively enumerated and thus “discovered”.

      • MrDrDr 9 hours ago
        I like this distinction, but it would then seem the only 'invention' would be the axioms of your mathematics. There exist numbers (natural, imaginary...), there exist shapes (a point, a line...). All the work from that point on could be 'discovered'. I agree that I don't see LLMs inventing in this way. But again, it raises the question: what are our brains doing when we 'invent' something?
        • strgcmc 5 hours ago
          Well, take any invention you like, and let's break it down.

          Somebody at some point, "invented" the idea that the earth was round. Before that, the obvious "just look around you" answer would've been, duh of course the ground is flat. But we know the earth has always been round, even if humans couldn't appreciate it for hundreds of thousands of years (I don't count the pre-history before homo sapiens). So we "invented" some fields of science and the mental models / abstractions that allowed us to conceptualize what a round earth could mean and how to measure it, but we didn't invent the roundness itself -- that was always reality, and we just lacked both the thoughts and the tools to conceptualize it (until later).

          Now you might say, well that is a category of "simple" physical observations. The earth is naturally round all the time and doesn't take any extra human effort to make it so (it took some effort to imagine that it could be and to find ways to measure/prove it). But what about say -- semiconductors, NVIDIA GPUs, that sort of thing? It's not like semiconductors grow on trees and we just need to find them and learn how to consume/use them... isn't that a better example of "true invention"?

          Sure, I could see that. But I guess my POV would be that, the invention of the latest AI chip, or the first semiconductor, or the first vacuum tube, or whatever came before, all laddered largely incrementally on "discoveries" that were then cleverly tweaked or reapplied, so that what appears to be "true invention" is usually/more-often just another chain in a long chain of "discoveries" that led up to it. I grant you that some of what appears in hindsight to be continuous progress, really is built on small discontinuous "leaps", but I don't think that breaks the argument (strengthens it in fact, IMO). You wouldn't have semiconductors today, unless Faraday (or somebody like him) discovered that silver sulfide resistance decreases with heat, and that is more like one of those physical properties that reality has always had (much like, earth was always round, we just didn't know it at first).

          So in that sense, I feel this becomes almost like an "evolution vs intelligent design" debate -- some people look at the complexity and miracle that is the human eye or the human brain, and they insist there must have been an intelligent designer, because surely no random chaotic biological process could have produced something so wonderful... And yet, I think the scientific evidence largely shows that, indeed that is what happened, just random chance + evolutionary-pressure was all you really need (plus billions of years). So if you can accept that analogical framing for a minute, then I would posit that "invention"-adherents are really making something like an intelligent design argument, vs "discovery"-adherents are saying that evolution (in an artificial sense, with the artificial selection pressures of scientific research, of capitalism, etc., and compressed into centuries or decades, not millions or billions of years) is sufficient to derive miraculous-seeming results. The little discontinuous leaps along the way, are kind of like the random mutations of genes that happen to confer an advantage -- maybe we can say that we are more intentional about seeking those leaps out, or maybe we are just right-place/right-time lucky (e.g. thinking about penicillin and the random petri dish left out).

          Perhaps once (or if) there is the sort of leap that breaks us out from a Type I to a Type II+ Kardashev civilization, maybe then I would grant you something needed to be "invented" that couldn't be based on a line of "discoveries". Or maybe not, maybe it will just be another semi-random discovery.

      • eiieue 6 hours ago
        Mathematical objects are an invention of the mind - they are abstract objects that only an entity who can process abstractions can make sense of.

        There is no ‘discovery’ here nor was it waiting to be found. The human has to sacrifice and pursue the path of exploring reality and thereby is inherently inventing.

        Humans built up mathematics iteratively, extending from smaller bases into larger ones. Is this what LLMs do? Of course not: they are fed with vast amounts of information from the off.

    • LiamPowell 9 hours ago
      Trivially the answer is yes, by the infinite monkey theorem. If we allow the sampler to pick any token, then any stream of arbitrary tokens can be generated. Therefore, if an original idea can be represented in written words, then an LLM can generate it. That is perhaps not the most satisfying answer, but if you want a better one you'll need to provide a function that determines whether an idea is original.
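
      In sketch form (toy numbers, plain softmax sampling, nothing model-specific): for any positive temperature every token gets probability > 0, so any finite token sequence does too.

        import math, random

        logits = {"the": 5.0, "cat": 1.0, "qux": -5.0}  # toy vocabulary

        def sample(temperature=1.0):
            # softmax assigns nonzero probability to every token
            w = {t: math.exp(l / temperature) for t, l in logits.items()}
            return random.choices(list(w), weights=w.values())[0]

        # "qux" has probability ~4e-5 here: tiny but nonzero, so a long
        # enough run will emit it (expect roughly 45 hits per million).
        print(sum(sample() == "qux" for _ in range(1_000_000)))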
    • humanfromearth9 9 hours ago
      For my paper about ME/CFS, I let an LLM integrate lots of findings of other scientific papers. Then I ask the LLM to "creatively brainstorm", given all we know of ME/CFS and the newly integrated paper, to generate new hypotheses, treatment ideas or any other kind of insight it can think of.

      This works really well.

      Now, it's clear that I have no idea how much of this is something we would consider new and original, and how much is a kind of systematic, but not novel, way of thinking.

      What I couldn't do so far is get an LLM to generate a truly new maths theory, with new abstract concepts and dimensions and points of view. The kind that is not just a combination of existing theories and logic.

    • jasfi 9 hours ago
      It's about the ability to combine ideas in novel ways, without breaking the rules in relevant frameworks. Sometimes the idea may even be to contradict existing theories where they are weak.
    • ikari_pl 9 hours ago
      How do you define a new idea?

      To me, it's rearranging the information you had in a way that hasn't been applied or published before.

      That's literally what LLMs are built for.

    • eiieue 6 hours ago
      There's a simple test for this.

      Limit the knowledge of an LLM to some point in time before a given discovery was made, and check whether the LLM can reproduce the discovery.

      If you think OAI hasn’t already tried this then think again - they have every incentive to do so and announce it to the world.

  • highfrequency 2 hours ago
    > LLMs have got to the point where if a problem has an easy argument that for one reason or another human mathematicians have missed (that reason sometimes, but not always, being that the problem has not received all that much attention), then there is a good chance that the LLMs will spot it.
  • theptip 58 minutes ago
    > what should we do with this kind of content? Had the result been produced by a human mathematician, it would definitely have been publishable, so I think it would be wrong to describe it as AI slop. On the other hand, it seems pointless even to think about putting it in a journal, since it can be made freely available, and nobody needs “credit” for it (except that Isaac deserves plenty of credit for creating the framework on which ChatGPT could build). I understand that arXiv has a policy against accepting AI-written content, which makes good sense to me. So maybe there should be a different repository where AI-produced results can live. But various decisions would need to be made about how it was organized.

    Interesting question, I guess a starting point is “moltbook”, but perhaps a better one is something like GitHub, where Lean proofs and preprints can go, and trending items can get boosted.

    I also think that posting this stuff on X or Bluesky has merit, but again the existing paradigm doesn't quite work; perhaps you can create a completely separate identity for your agent (à la Moltbook), but I think you want some sort of reputational association with the human piloting the agent, at least for now. (Maybe eventually there are enough agents critically engaging with content that "interesting" results get agent likes, and well-piloted agents stand on their own merit.)

  • zkmon 9 hours ago
    >> but it was definitely a non-trivial extension of those ideas, and for a PhD student to find that extension it would be necessary to invest quite a bit of time digesting Isaac’s paper

    The "non-trivial" is for human abilities. The weights lifted by a crane are also "non-trivial". People keep getting amazed at machine's abilities. Just like a radio telescope can see things humans can't, microscope can see the detail humans can't, we need not be amazed. The sensory perception of patterns is at different level for AI. It's a machine.

    • svnt 9 hours ago
      Too many people are wrapped around the ego axle thinking (assuming) their ideas are both them and somehow unique and special.

      It usually takes dissolving that, often through difficult experiences, before they can see it as a machine, something that could be separated from them.

      • dag100 5 hours ago
        I think the more pressing issue is that there isn't really much space left for humans in the economy if thinking can also be automated.
  • goopthink 5 hours ago
    An interesting takeaway is that heretofore most of the advances have come not from "invention" but from breadth of visibility. LLMs have been able to be "creative" because of the volume of work that they cover and can draw lines and associations between, not by discovering things that did not exist previously (though an argument can be made that something like AlphaFold was "discovering" and "intuiting" associations that were not explicit anywhere previously, uniquely found by the AI… but I'd argue back something about the bitter lesson and we'd go on for more than a few threads).

    Somewhat ironic then, to not make this more explicit in an article about solving a combinatorial problem.

  • eranation 4 hours ago
    Like coding: if you get inspired by AI for a novel idea and can reproduce the same result independently (could code the same thing by hand), or at least understand and check every single argument (self-review your code, test on your machine), and get it peer reviewed (code review, but with a real human), then I don't see why the industry accepts the latest iteration of ChatGPT being 99% written by Codex but rejects a valid math result inspired by it.
  • dabinat 13 hours ago
    I feel like this experiment was successful because those prompting the AI were knowledgeable enough to ask the right questions and verify the output was correct. This shows that there is still a place for expertise, even if the LLM does the actual research.
    • colechristensen 13 hours ago
      I feel my input to LLMs is most valuable in the initial idea and big-picture design tweaks, and the vast majority of my usefulness is negative feedback: this looks wrong, you've gotten off track, you're cheating with workarounds, you're falling into a rabbit hole, etc.
  • iTokio 14 hours ago
    On complex problems with lengthy proofs, the first step I would have taken is to ask 5.5 Pro, in a new, unrelated session, to be very critical and to try to find flaws in the arguments.

    And certainly not to send it to a colleague to ask for their opinion first.

    LLMs are certainly becoming capable of coding, finding vulnerabilities, and solving mathematical problems, but we need to avoid putting their work in production, or in front of other humans, without assessing it by every possible means.

    Otherwise tech leads, maintainers, and experts get overwhelmed, and this is how the « AI slop » fatigue begins.

    To be clear I’m talking about this step:

    > That preprint would have been hard for me to read, as that would have meant carefully reading Rajagopal’s paper first, but I sent it to Nathanson, who forwarded it to Rajagopal, who said he thought it looked correct.

    • NitpickLawyer 14 hours ago
      > but we need to avoid putting their work in production, or in front of other humans, without assessing it by every possible means.

      I think this is good advice in general, maybe with an emphasis on public vs. private, friendly contact. Having zero-thought AI slop thrown at you out of the blue is rude. "Could have been a prompt" indeed. But having a friend/colleague ask for a quick glance at something they know you handle well is another story for me.

      If I've worked on a subject for a few years, and know the particulars in and out, I'd have no trouble skimming something that a friend or a colleague sent me. I am sparing those 5-10 minutes for the friend, not for what they sent. And for an expert in a particular domain, often 5 minutes is all it takes for a "lgtm" or "lol no".

  • lysecret 11 hours ago
    There is a great recent episode of Latent Space about a similar topic. It's worth a watch even with the clickbaity thumbnail and title: https://youtu.be/9d899Ram9Bs?is=pQMoVmlWVsTNKfRK
  • fulafel 12 hours ago
    • dang 12 hours ago
      That's the top link (i.e. that the title is linked to), no?
      • fulafel 8 hours ago
        Indeed, the body in the post made me think it was a url-less submission.
  • iandanforth 9 hours ago
    I found the section on publishing very interesting. Even if the quality of the output is up to snuff, where should it go? arXiv doesn't allow AI-written work. The author proposes that only work that has been certified by a human should be published. However, now the field is in the same boat as software engineering, where we are facing a glut of pull requests and not enough time and people to review them.
  • arjie 10 hours ago
    The question of where the creative input is was a big thing around Experiments in Musical Intelligence and co-composing. But it seems perhaps that it's a transient state we needn't spend too much effort on. The machine has failed to disappoint repeatedly. Perhaps this is as far as it gets, or perhaps we will be like the people in "Catching Crumbs from the Table" by Ted Chiang, where almost all science is the interpretation of papers by vastly greater intellects.
  • jacktu 2 hours ago
    That's a real shift. The value of an open problem used to be that it was unsolved. Now an open problem needs to be unsolvable by something that can read the entire literature and try a hundred approaches in an hour.
  • tmp10423288442 4 hours ago
    It's interesting that ChatGPT Pro is the real deal that can write novel physics or math papers (for certain values of novel), while Claude Pro is crap that, depending on the A/B test, may not even provide Claude Code or at the very least doesn't provide Opus. Shows how LLM naming conventions are currently a mess.
  • amelius 10 hours ago
    Makes sense, as a mathematician basically has two powers: (1) using their intuition and (2) an enormous amount of mental stamina. A mathematician builds their intuition by reading maths books. It is thus not surprising that an LLM is well equipped to take over the tasks of the mathematician.
  • adammdaw 13 hours ago
    This is certainly interesting, though I would say that, based on my understanding of how the current models work, combinatorial problems would be an area where they could be particularly successful. They are pretty good at combinatorial creativity - it's the exploratory and transformational aspects that are still pretty tricky, and I expect those would come to bear in other areas of mathematics.
    • hodgehog11 12 hours ago
      Indeed, analysis is a bit more loose in its arguments, and so I've found LLMs tend to make more mistakes there.
  • chalr 4 hours ago
    There have always been attempts at settling all mathematics by using mechanized approaches, often by mathematicians who had already made an impact and then wanted an automated approach.

    The Bourbaki group was one of the first who attempted a mechanized approach (using pen and paper still of course) to set theory and were literally accused of wanting to end all mathematics. The approach was largely ignored in practice.

    Gowers and a handful of others who work on computerized approaches also seem to want to end human mathematics and have sharecropper mathematics for a monthly tithe. So far they are largely ignored in practice.

  • zingar 10 hours ago
    The post talks about LLM+human contributions being recognized in some different category from human-only. But is it possible to spot the difference between the two?
  • __rito__ 14 hours ago
    > So maybe there should be a different repository where AI-produced results can live.

    Does the author know about CAISc 2026 [0]?

    [0]: https://caisc2026.github.io

  • incrediblylarge 13 hours ago
    A month ago my PhD supervisor told me it rips on proofs, but he also said it's useless when formalising arguments in Lean - is this still the case?
    • vjerancrnjak 13 hours ago
      Nope. Codex formalizes much better than any tool, with the exception of Aristotle from Harmonic.

      https://github.com/vjeranc/fixed-rtrt

      The M3 module was formalized fully, purely from experimental data and a nudge by earlier versions of Codex, in 15-30 minutes in a simple write/compile/fix-first-error loop. I was a bit surprised how fast it picked up the pattern, but given there was a paper from the '70s, it became clear why later.
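
      For anyone curious, the loop is roughly this shape (a hedged sketch: I assume a Lean 4 project built with 'lake', and 'ask_model_to_patch' is a stand-in for whatever agent call you use, not a real API):

        import subprocess

        def first_error(output: str):
            # Lean prints diagnostics like "Foo.lean:12:4: error: ...";
            # grab the first error line, if any.
            for line in output.splitlines():
                if ": error:" in line:
                    return line
            return None

        def ask_model_to_patch(error: str):
            # Hypothetical: hand the error (plus the offending file) to
            # the model and write its proposed fix back to disk.
            raise NotImplementedError

        while True:
            build = subprocess.run(["lake", "build"],
                                   capture_output=True, text=True)
            err = first_error(build.stdout + build.stderr)
            if err is None:
                break                   # everything compiles
            ask_model_to_patch(err)     # patch, then loop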

  • rklampp 5 hours ago
    Gowers has always been a proponent of Lean (naturally). He receives funding from the "AI for Math" fund, which is sponsored by a fund that is a front organization for venture capitalists:

    https://www.renaissancephilanthropy.org/

    The "brighter future" of course is that everyone is redundant and all capital is further concentrated.

    It is always Gowers, Tao and Lichtman (the math.inc startup) who are pushing these technologies.

    • logicprog 5 hours ago
      > It is always Gowers, Tao and Lichtman (the math.inc startup) who are pushing these technologies.

      In your mind does this mean that they are lying, or driven by motivated reasoning and cognitive bias, or whatever you'd like to say?

      Because I feel like people bring up these facts as a way to discount everything that these people are saying, whether or not they've chosen to align themselves with AI-aligned venture capital funding. The question is really: did what they say is happening happen or not? Are these capabilities real or not?

      To my mind, mathematics is pretty definitely, externally, objectively verifiable, so it would be easy to catch them in a lie. In the case of the Erdős problem that was recently solved in a novel and productive way, it wasn't even initiated by them, and the ChatGPT transcript is public for all to see. And the proof could easily be verified by other people, for instance.

      In addition, I think it's unlikely that they're not explaining things as they honestly see them, and that they're not doing their due diligence to see them as close to correctly as possible. Their positions with these organizations, not to mention their entire reputation and life's work and passion, depend on their standing in academic mathematics. If they gave that up by falsifying these claims or not verifying them sufficiently, they would lose everything.

      I think it's also worth pointing out that it is totally possible for someone to align themselves with such organizations after the fact because they agree with them instead of being bought out by such organizations. Otherwise, it would be possible to dismiss the opinion of anyone working at any NGO dedicated to being against AI and denying AI's capabilities or whatever, as well by the same logic of their salary being paid by an organization dedicated to pushing those ideas.

  • ionwake 11 hours ago
    One thing I was wondering: if LLMs are word completers that seem to come up with new solutions, could this just be because material that was once kept secret is secret no longer, due to ingestion? I don't know enough about it, though.
    • dist-epoch 8 hours ago
      Why would you keep this particular mathematical idea secret? It's not extraordinarily important, it's not on the path to some other major result, and it doesn't seem useful in financial trading. Even the author calls it a good, reasonable problem for a PhD thesis.
  • dares2573 8 hours ago
    I think the biggest advantage of ChatGPT compared to Claude is that there are fewer things outside the model itself, such as KYC, account bans, etc.
    • solenoid0937 7 hours ago
      This is just grossly misinformed.

      OAI and Anthropic both require KYC for models of similar intelligence. They both do account bans if the classifiers fire wrong. You simply hear about it less with OAI because Codex has fewer prosumers.

      • tmp10423288442 4 hours ago
        Can you name any instance of OpenAI being as trigger-happy with bans as Anthropic has been in the past few months? Codex may have fewer prosumers, but they've added a lot in that time.
        • solenoid0937 3 hours ago
          There will obviously be more reports about Anthropic because hating on Anthropic has been trendy lately, it has more users, and gullible people (including most HN'ers) fall victim to OAI's guerrilla marketing.

          In terms of bans and KYC they are not meaningfully different.

  • alpama 6 hours ago
    This is scary. AI is growing faster than our knowledge. We are not prepared.
  • CharlesLau 14 hours ago
    Is the assessment system of undergraduate mathematics education no longer effective?
    • margalabargala 14 hours ago
      Undergraduate? No. We've had calculators able to solve undergraduate problems for decades. AI doesn't change the need to understand how calculus works any more than calculators did. The foundations remain valuable.

      Graduate? Yes.

      • whatever120 14 hours ago
        How should graduate school be changed then? Specifically for mathematics
        • margalabargala 5 hours ago
          Hell if I know. It's easy to see problems. Solutions are way harder. I'm not a math professor.
        • dyauspitr 13 hours ago
          90% of the final grade should come from in-room, proctored examinations: maybe two sets of exams, midterms and finals. This is already how most of East and South Asia does it anyway, and it's probably the best approach.

          For publications and theses, as long as the final results hold and can be replicated and validated, I don’t see why we shouldn’t allow the wholesale use of LLMs

          • zozbot234 10 hours ago
            > 90% of the final grade should come from in-room, proctored examinations: maybe two sets of exams, midterms and finals.

            This is really just a glorified undergraduate education; the real point of graduate school is to learn to do real-world-relevant research. For the latter, I think LLM use will be accepted, but there will be a heavy expectation on the author to make the result very easily digestible for human mathematicians and to link it thoroughly with the existing literature - something that LLMs are very much not successful at, but that a student might be able to do quite well with a mixture of expert guidance and personal effort.

    • dyauspitr 14 hours ago
      I don’t think it’s just mathematics. We don’t hear enough about this, but if I think back to my undergraduate years, which were less than 10 years ago, every homework assignment and every take-home exam I had would be trivial for LLMs to solve at this point I wonder what is actually happening on the ground.
      • crocdundae 9 hours ago
        Well... here's something from "boots on the ground": I teach a bachelor's degree where programming is a smallish facet of the curriculum. My course is the last in a series of 3 courses which progressively introduce more concepts and try to make practical implementations more feasible.

        I used to be able to grade the course purely on returns to take-home exercises, some of which are complex, some trivial. When ChatGPT (& Co.) came along I was still able to do that, but with a major added workload (suddenly everyone started producing mountains of code, often nonsensical, but I still had to read it all). I always requested targeted, atomic changes to code (vs. rewrites), which served me well up to a point (I was still able to grade fairly). I requested them originally to avoid "GitHub copies", but that worked kind of OK against ChatGPT too.

        However, when Claude Code came along, it was obvious I was losing the battle. It does not particularly matter to me whether students use AI, as long as the rows they add and alter in the assignments make sense. But the "last nail in the coffin" with Claude Code is that in the latest batch (this spring) it is clear some students "pay themselves" a good grade (i.e. they pay for Claude Code, thus bypassing the need to actually learn). I cannot make assignments that are both complex enough to trip Claude Code up and still humane for those who do not use AI or only use free chatbot options. Essentially Claude Code plays havoc with the whole grading process: students not using it (whether they write code fully manually or ChatGPT-assisted) are left with far fewer points than students who just push all the code I give to Claude Code and "let it rip" for some 15 minutes. This really irks me.

        So, my solution? Still working on it and hoping to find one! For sure, no more points from most take-home assignments: the lowest grades will still be achievable through them (the trivial ones), but that's it; the rest is preparation for an exam. Practically, this already means anyone with ChatGPT is going to pass, no doubt about it...

        As for the higher grades, for autumn I'm desperately figuring out how to even make a meaningful paper-based exam for my course. I myself completed a master's degree writing C with a pencil on paper. I sure did not want to start doing that to others, but here we are. Besides, back in my youth the only "library" was pretty much the ANSI parts of C! I'm not sure what kind of 2-inch-thick stack of paper I'd have to hand my students as reference material in the exam these days. One horrible aspect is that students are now far more dependent on compiler errors to spot pretty much anything and everything... I worry the first paper exam from me will be a total horror story for us all. In any case, interesting times.
        • 2cynykyl 3 hours ago
          I had this exact problem and came to the same conclusion. But for the exam, give them code and ask what it does, or give them broken code and ask where the error is. Waaay less marking. I only asked for 3 small functions written by hand, and that was still 90% of the effort to mark. But the marks felt valid in the end, so the process seemed to work.
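
          For illustration, a made-up "what does this print, and where's the bug?" question, in Python just for brevity (my sketch, not from the course):

            def average(xs):
                total = 0
                for i in range(1, len(xs)):  # bug: skips xs[0]
                    total += xs[i]
                return total / len(xs)

            print(average([2, 4, 6]))  # prints 3.33..., should be 4.0

          Quick to mark, and it directly probes whether the student can trace code without leaning on compiler errors.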
  • OsamaJaber 7 hours ago
    the bottleneck isn't generation, it's verification
  • zuogl 12 hours ago
    The HTML generation is surprisingly good because the training corpus for markup is cleaner than most programming languages.
  • adaml_623 12 hours ago
    "It is the sort of idea I would be very proud to come up with after a week or two of pondering, and it took ChatGPT less than an hour"

    This comment about time is very interesting to me. I know it's "just" doing mathematical proofs but the possibilities of speeding up planning, proposals and decision making in the physical world should excite people.

  • MagicMoonlight 7 hours ago
    ChatGPT pro is garbage. It’ll spend 20 minutes on an answer, doing all kinds of ridiculous things like writing scripts… instead of just outputting plaintext.

    And then the answer isn’t even right.

  • casey2 8 hours ago
    I think mathematicians like LLMs because this is the first time we have something like a computer for the kinds of math most people do: high-level, hand-wavy abstractions that are (relatively) easy for people to grok but hard to explain to traditional computers.
  • zkmon 9 hours ago
    >>
  • quinndupont 8 hours ago
    [flagged]
  • locknitpicker 11 hours ago
    From the article:

    > Conversely, for problems where one’s initial reaction is to be impressed that an LLM has come up with a clever argument, it often turns out on closer inspection that there are precedents for those arguments, so it is still just about possible to comfort oneself that LLMs are merely putting together existing knowledge rather than having truly original ideas. How much of a comfort that is I will not discuss here, other than to note that quite a lot of perfectly good human mathematics consists in putting together existing knowledge and proof techniques.

    This is exactly what leads me to believe that the real impact of LLMs on human history is yet to come. My work as a researcher was mostly spent on two classes of workloads: reading recently published papers to gather ideas and keep up with the state of the art, and working on a selection of ideas gathered from those papers to build my research upon. It turns out that LLMs excel at the most critical component of both workloads: parsing existing content and using it, when prompting the model, to generate additional content based on specific goals and constraints. I mean, papers are already a way to store and distribute context.

  • globular-toast 13 hours ago
    I wish people would stop generating stuff they don't understand only to forward it to someone who does. Something about that really rubs me the wrong way.
    • hodgehog11 12 hours ago
      May I remind you that this is Timothy Gowers. He says he doesn't understand it, but he most certainly has far greater capacity than most to distinguish complete junk from a maybe-plausible argument. His colleague is even better placed to judge this, hence why he sent it to him.

      Also, if he did send me complete junk, I would still pore over it for multiple days to see what is there.

    • auggierose 10 hours ago
      Lol. If Gowers sends you a piece of math he doesn't quite understand because he thinks that you might, that is something you celebrate.
    • frozenseven 6 hours ago
      You are criticizing a Fields Medalist for consulting with another mathematician.
  • SubiculumCode 12 hours ago
    I honestly can't say this isn't AGI anymore. AGI shouldn't be a bar so high that it demands extreme capability in every domain. What human has that?

    This is as AGI as it needs to be to get my vote. And it's scary.

    • MrScruff 11 hours ago
      It's ASI with jagged intelligence, which is probably what it will remain for a while.

      It still sounds to me like remarkable automation rather than something that's expanding the frontier of human knowledge, for now at least.

    • agiipullor 11 hours ago
      To quote Demis Hassabis: "these models can solve frontier problems in math, but also fail in really dumb ways at trivial questions - the car wash question".

      jagged AGI

  • sexylinux 8 hours ago
    Unfortunately, it still produces errors.

    This is of enormous importance, but it is still being actively ignored by many professionals or dismissed as a minor issue.

    Our emotional human brains are very enthusiastic about this new kind of "intelligent" product ("partner"), and we want to believe so hard that they are finally "there" that we tend to ignore how big a problem it is that LLMs carry a fundamental design flaw that will make them produce errors even when we use a grotesque amount of resources to build "bigger" versions of them. The potential for errors will never go away with the current AI architecture.

    This is a fundamental paradigm shift in computing. Instead of putting a lot of energy into building an architecture that will produce reliable results, we are now going all-in on a system/idea that will never give us 100% reliable results.

    Basically it is just a marketing stunt. Probably the computer science guy building it knew very well that he would still need some fundamental breakthroughs to get to a real product, but the marketing guy saw that there was still potential to make a lot of money by selling a product that produces correct results only 80% of the time.

    The marketing guy was right and marketing is now dominating science, but humanity will pay a big price for that.

    Putting enormous amounts of money into a fundamentally flawed system that we can not optimize to produce reliably error free results is just stupid.

    The big achievement of "classical" computing is that the results are reliably error-free. We still have some known issues, e.g. with floating-point math, bad blocks on disk, bit flipping, etc., but these are observable and we can handle or avoid them. Generally, "non-AI computing" was made so reliable that we can depend on it for many very important things. This came not by accident but was created by a lot of people who put a lot of resources into research to achieve that result.

    LLMs introduce a level of uncertainty and unreliability into computing that makes them practically useless.

    Because if you have enough knowledge to verify the result, and AI is only quicker at producing the result, what is the point of putting so many resources into it (besides making money by re-centralizing computing, of course)? Verifying a lot of results that were produced quickly is still slow, so the people who are now just AI verifiers should simply produce the results themselves; that makes the whole process quicker.

    AI is only of value if it can produce results about things that you or your organization do not know anything about. But those results you cannot verify, and therefore potentially wrong results can be fatal for you, your organization, and all the people affected by actions taken based on those wrong results.

    Many people have already been killed because decision makers are not able to follow that very simple logic.

    So we can still create "interesting and enjoyable results", but ultimately it is a gigantic misallocation of resources, of historic idiocy. It fits, of course, very well in a timeline where grifters are on top of societies around the world.

    It is a fundamentally wrong path that should not be followed and scientists around the world should articulate exactly that instead of producing marketing blog posts for a system with such fatal inherent issues.

  • jdw64 14 hours ago
    [dead]
    • bananaflag 13 hours ago
      > it sounds like there were already precedents or existing pieces of knowledge, but humans had not thought to connect them

      A lot of math research is like that. And, like the blog post suggests, problems one gives PhD students are 95% like that.

      • jdw64 13 hours ago
        Maybe I am still fortunate to have become a programmer.

        Most of what I do is just assemble things that other people have already built.

    • tanepiper 13 hours ago
      Basically medical science too. My wife was able to diagnose her own anemia, which the doctors kept missing, and has since been able to get iron infusions.

      The human doctors kept ignoring the signals, kept putting it down to 'diet' and 'exercise' (even though she does plenty of both).

    • themafia 13 hours ago
      > there were already precedents or existing pieces of knowledge, but humans had not thought to connect them

      We used to call that "low hanging fruit."

  • shevy-java 13 hours ago
    [flagged]
  • verisimi 14 hours ago
    [flagged]
  • slopinthebag 13 hours ago
    AI generated article btw.

    Maybe if you find AI to be doing stuff you find impressive, the stuff you were doing wasn't that impressive? Worth ruminating on your priors at least.

    • hodgehog11 12 hours ago
      This is beyond ridiculous to say considering whose blog this is.

      For those that don't know, this is Timothy Gowers. He is one of the most accomplished mathematicians in the world. Like Terence Tao, he is considered one of the world leaders in mathematics and tends to have good judgement in where the field is going.

      Even without that knowledge, no, this article is certainly not AI generated. It has none of the tells.

    • reasonableklout 13 hours ago
      What makes you think either the tweet or blog post are AI generated?
  • bambax 12 hours ago
    > quite a lot of perfectly good human mathematics consists in putting together existing knowledge and proof techniques

    Creativity is connecting ideas from different domains and seeing if something from one field applies to another. I do think AI is generally overhyped; but a major benefit of AI could be that, after ingesting all existing human knowledge (something no single human can ever hope to achieve), it would "mix and connect" it and come up with novel insights.

    Most published research sits ignored and unread; AI can uncover and use everything.

    • imiric 11 hours ago
      > Creativity is connecting ideas from different domains and see if something from one field applies to another.

      That's true. The question is whether the produced pattern has any value. LLMs are incapable of determining this, and will still often hallucinate, and make random baseless claims that can convince anyone except human domain experts. And that's still a difficult challenge: a domain expert is still needed to verify the output, which in some fields is very labor intensive, especially if the subject is at the edge of human knowledge.

      The second related issue is the lack of reproducibility. The same LLM given the same prompt and context can produce different results. This probability increases with more input and output tokens, and with more obscure subjects.

      The tools are certainly improving, but these two issues are still a major hurdle that don't get nearly as much attention as "agents", "skills", and whatever adjacent trend influencers are pushing today.

      And can we please stop calling pattern matching and generation "intelligence"? This farce has gone on long enough.

      • agiipullor 11 hours ago
        > And can we please stop calling pattern matching and generation "intelligence"

        That's literally what an IQ test tests - abstract pattern matching. But I guess you don't like IQ tests either.

  • einrealist 12 hours ago
    "After 16 minutes and 41 seconds, it came back" ... "further 47 minutes and 39 seconds" ... "After 13 minutes and 33 seconds" ... "After 9 minutes and 12 seconds" ... "After 31 minutes and 40 seconds" ... plus other computations

    Anyone spotting the issue here? What did that really cost?

    I am not against compute being used for scientific or other important problems. We did that before LLMs. However, the major LLM gatekeepers want to make all industries and companies dependent on their models. And, at some point, they need to charge them the actual, unsubsidized costs for the compute. In the meantime, companies restructure in the hopes that the compute costs remain cheap.
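
    For scale, just summing the durations quoted above (a back-of-envelope check in Python, nothing more):

      durations = ["16:41", "47:39", "13:33", "9:12", "31:40"]
      secs = sum(int(m) * 60 + int(s)
                 for m, s in (d.split(":") for d in durations))
      print(secs / 60)  # 118.75 minutes, before the "other computations"

    That is roughly two hours of Pro-tier reasoning for a single result, and the dollar cost behind it is exactly what we don't get to see.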

    • sidkshatriya 12 hours ago
      > "After 16 minutes and 41 seconds, it came back" ... "further 47 minutes and 39 seconds" ... "After 13 minutes and 33 seconds" ... "After 9 minutes and 12 seconds" ... "After 31 minutes and 40 seconds" ... plus other computations Anyone spotting the issue here? What did that really cost?

      Whatever the Joules... (convert to $ using your preferred benchmark price), it is a fraction of what it costs to feed and sustain a human Ph.D. for the weeks they might spend working on the same problem. The economics of LLMs is just unbeatable (sadly) compared to us humans.

      • einrealist 11 hours ago
        Compute in science was already subsidized by public funding or by donations. Most supercomputers are financed this way. And that's a good thing: if you have a good science problem that can be computed, apply for compute time. There is nothing wrong with applying that to LLMs as well, as I wrote in my initial post. The human is still required to identify problems that are worth computing, to create prompts that the LLM can act on, and to verify results.

        But OpenAI providing compute basically for free is tied to a different incentive: to fuel the hype and capture the market, while distorting/obfuscating the real costs. That's also the reason why we cannot claim that the 'economics of LLMs is just unbeatable'. It depends on the problem, the reason for a prompt.
    • colordrops 12 hours ago
      Still not as bad for the environment as animal agriculture, and animal agriculture is absolutely not necessary and only causes harm and suffering for taste pleasure. At least with LLMs we get many positive advancements from them. I don't see these sorts of comments every time someone posts a burger review.
      • einrealist 12 hours ago
        Did I praise our animal agriculture anywhere?
  • robot-wrangler 3 hours ago
    Despite this coming from an independent expert and not from OpenAI, we need to be honest that this is more like a marketing campaign than open science. I assume the progress really is valid, the experts are indeed impressed, and we have accurate timings for producing the results. What we don't have is detail about the true cost, or a CoT trace, or anything like that.

    The implication is: we're ready to let everyone go wild with this very soon. Ok, go wild with what exactly? How do we know that influential VIP users who might make very friendly blog posts aren't getting allocated exclusive access to a billion dollars worth of hardware when they ask questions? I mean literally giving a certain group of people temporary privileged access to like 90% of all available compute would be a completely reasonable business decision for OpenAI.

    Would a reveal like that change how we think about the result? What if half that amount of cash/compute could enable some completely non-AI approach of numerical brute forcing that settles the question even if it didn't write the paper?

    My other question is always whether the latest results come purely from giant models or whether we're now deep into harnesses that use MCTS and such. Understandable to keep that a trade secret, I guess. But IMHO we should at least get the CoT trace as a proxy for true cost, or else maybe we're just being played into doing the hype for corporate.

    • electriclove 3 hours ago
      Marketing campaign???
    • energy123 3 hours ago
      We know this because most of the Erdős proofs made by AI have been done by amateurs prompting GPT 5.* Pro, not by familiar names that OpenAI is sneaking additional compute to behind the scenes (which is too conspiratorial an explanation for my liking regardless).
      • robot-wrangler 2 hours ago
        > We know this because most of the Erdős proofs made by AI have been done by amateurs prompting GPT 5.* Pro

        Where's that? The stuff I've seen is from celebrities. Were those problems as hard as this one, or the ones that Tao posts about? Regardless: what's the argument against more transparency here, to settle this kind of thing?

        > which is too conspiratorial of an explanation for my liking regardless

        OpenAI is not, in fact, open. Why do they deserve the benefit of the doubt?

        Regardless: special treatment for special customers isn't conspiracy; it's SOP literally everywhere, especially if you're helping to beta test. Anyone who's ever interacted with any technical account manager has seen waived quotas, free resource allocations, etc. The quid pro quo is obviously that your cheap early access means you give talks at a conference (or write a blog post that a lot of people read and talk about).

        • logicprog 51 minutes ago
          • robot-wrangler 27 minutes ago
            > The duo had jump-started the AI-for-Erdős craze late last year by prompting a free version of ChatGPT with open problems chosen at random from the Erdős problems website. (An AI researcher subsequently gifted them each a ChatGPT Pro subscription to encourage their “vibe mathing.”)

            Is it crazy to wonder who the AI researcher worked for? I wonder if the accounts might have been flagged as good candidates for positive publicity.