Transformers Are Inherently Succinct

(openreview.net)

38 points | by brandonb 2 hours ago

9 comments

  • thesz 52 minutes ago
    My comment in the previous discussion of that paper: https://news.ycombinator.com/item?id=48014197

    Authors used LTL (linear temporal logic) to express, basically, non-reduced non-ordered binary decision diagrams. Or just binary decision diagrams, BDDs.

    BDDs are almost guaranteed to have exponential size because they do not employ reduction (sharing of common expressions). Reduced BDDs are more succinct and reduced ordered BDDs are even more succinct.

    Also, transformers in the paper are constructed, not trained. Training any model to express some truth table is very hard. They also did not perform comparison with, say, Kolmogorov-Arnold representation, which is also universal approximator.

    So this paper is not as deep as one may think it is.

  • dang 2 hours ago
    Discussed (a bit) here:

    Transformers Are Inherently Succinct (2025) - https://news.ycombinator.com/item?id=48014197 - May 2026 (9 comments)

  • parasti 1 hour ago
    Paper went over my head but is this in any way related to my experience of Claude Opus 4.8 using increasingly terse language with very short, overloaded words? Lately I've been having trouble parsing the things it writes about my own code, it's using the kind of compressed language that you see typically in git commit message subject lines but relentless, always on.
    • AlotOfReading 1 hour ago
      No, this is in the same ballpark as ideas like big-O notation. The paper is saying that transformers can recognize a language with exponentially fewer symbols than other kinds of systems, i.e. they're more succinct.

      It's exactly as related to real models as computer science is to real computers.

    • dontwannahearit 55 minutes ago
      Not just me then.
      • nightshift1 24 minutes ago
        i noticed that with 4.7. i tried to add instructions in claude.md to unpack meaning when communicating to me but it did not work.
  • lee_ars 14 minutes ago
    "Why use lot word, when few word do trick?" —Optimus Prime
  • dfabulich 56 minutes ago
    The last line of the abstract has the most important takeaway.

    > As a consequence of this succinctness, we show that basic verification problems for transformers, such as emptiness and equivalence, are provably intractable: specifically, EXPSPACE-complete.

    If you were hoping to formally prove the correctness of a large transformer, it turns out that you're going to need an exponentially larger amount of space to do your verification, more than you could possibly afford.

  • lkm0 1 hour ago
    I had no idea that LLMs (or the transformer architecture) were within reach of complexity theory. But if transformers "can be" exponentially more succinct than RNNs, doesn't that mean we're approaching optimality?
    • thesz 36 minutes ago

        > doesn't that mean we're approaching optimality?
      
      No.

      Transformers are Markov chains [1]. Somewhere around this fascinating site [2] I read that stateful models have an advantage. Author provided an example, a state machine with two states A and B, where at state A transitions are to state A (output 0) and to state B (output 1) with equal probability and at state B the transition is always to state A and output is always 1.

      For this state machine just one bit of memory can make an optimal prediction that ones always go in pairs, whereas Markov chain will approximate this prediction and never reach optimality.

        [1] https://arxiv.org/abs/2410.02724
        [2] https://bactra.org/
      • pafoster 0 minutes ago
        Markov chains are themselves a particular case of state machine, namely a probabilistic deterministic finite automaton (PDFA), albeit where state is solely governed by the N most recent symbols. (Deterministic means that given a sequence, we can always infer the associated state transitions unambiguously). I believe the example in the reference you provide represents the more general case of PDFA, which is not representable as a Markov processs.
  • measurablefunc 1 hour ago
    What about the other direction? What languages are expressible w/ RNNs & LTLs that require exponential blowup for transformers?
    • yorwba 51 minutes ago
      Proposition 16. UHATs have polynomially bounded expansion over LTL. In particular, given an LTL formula φ, one can construct in polynomial time a UHAT T such that L(T) = L(φ).

      i.e. the blowup is only exponential in one direction.

      • measurablefunc 8 minutes ago
        That says every LTL formula can be compiled into UHAT w/ polynomial overhead. It doesn't say that all languages recognizable w/ UHATs necessarily do not have succinct recgonizers in LTLs or RNNs.
  • jalospinoso 41 minutes ago
    [flagged]
  • brandonb 2 hours ago
    This paper is being published at ICLR 2026 (top AI conference), and was selected as one of three outstanding papers.
    • dang 2 hours ago
      (We'll add that bit to the toptext as well. Thanks!)