6 comments

  • jdlshore 45 minutes ago
    “Our systematic study exposes a phenomenon of constraint decay in LLM-based coding agents. While current models excel at unconstrained generation, their performance drops when forced to navigate explicit architectural rules. For end-users, this dichotomy implies that agents are reliable for rapid prototyping but remain unreliable for production-grade backend development.”

    One major weakness of this study is that they didn’t fully test frontier models for cost reasons, so the specific performance results should be taken with a grain of salt. But the overall conclusion that models degrade when both behavior and architecture must be correct is interesting, and something to keep an eye on.

  • maxbond 47 minutes ago
    Reminds me of the recent paper about delegating document editing tasks to LLMs across different disciplines [1]. That paper found that programming was the only discipline in which most LLMs can perform long-horizon tasks without accumulating errors & corrupting the document.

    I've only read the abstract of this one so far, but it seems like this paper has zoomed in on programming with greater fidelity and shown a similar phenomenon. Not over long-horizon tasks, though, more like "long style horizons" of larger sets of structural constraints.

    [1] https://arxiv.org/abs/2604.15597

    Discussion: https://news.ycombinator.com/item?id=48073246

  • leecommamichael 6 minutes ago
    These things don’t think. We’re going to have to reiterate this for a long time, I fear.
    • emp17344 4 minutes ago
      There is now a trillion-dollar industry bent to the task of convincing people these things can think. It’s gonna cause some damage.
  • yomismoaqui 17 minutes ago
    Also, they used dynamically typed languages like Python & JS. In my experience a statically typed codebase is easier for humans to maintain, so maybe it is for agents too.

    When using Codex/Claude Code with Go code, I cannot count the times the agent has made some change, run a build to check for errors, found some, and fixed them.

  • gkfasdfasdf 42 minutes ago
    Odd that they used GPT-5.2 and not GPT-5.2-codex, i.e. the one optimized for coding agent tasks.