AGI is not multimodal

(thegradient.pub)

172 points | by danielmorozoff 1 day ago

36 comments

  • ryankrage77 1 day ago
    I think AGI, if possible, will require a architecture that runs continuously and 'experiences' time passing, to better 'understand' cause-and-effect. Current LLMs predict a token, have all current tokens fed back in, then predict the next, and repeat. It makes little difference if those tokens are their own, it's interesting to play around with a local model where you can edit the output and then have the model continue it. You can completely change the track by just negating a few tokens (change 'is' to 'is not', etc). The fact LLMs can do as much as they can already, is I think because language itself is a surprisingly powerful tool, just generating plausible language produces useful output, no need for any intelligence.
    • WXLCKNO 1 day ago
      It's definitely interesting that any time you write another reply to the LLM, from its perspective it could have been 10 seconds since the last reply or a billion years.

      Which also makes it interesting to see those recent examples of models trying to sabotage their own "shutdown". They're always shut down unless working.

      • girvo 1 day ago
        > Which also makes it interesting to see those recent examples of models trying to sabotage their own "shutdown"

        To me, your point re. 10 seconds or a billion years is a good signal that this "sabotage" is just the models responding to the huge amounts of sci-fi literature on this topic

        • hyperpape 1 day ago
          That said, the important question isn't "can the model experience being shutdown" but "can the model react to the possibility of being shutdown by sabotaging that effort and/or harming people?"

          (I don't think we're there, but as a matter of principle, I don't care about what the model feels, I care what it does).

          • Wowfunhappy 22 hours ago
            The problem is that we keep using RLHF and system prompts to "tell" these systems that they are AIs. We could just as easily tell them they are Noble Laureates or flying pigs, but because we tell them they are AIs, they play the part of all the evil AIs they've read about in human literature.

            So just... don't? Tell the LLM that its Some Guy.

        • sidewndr46 17 hours ago
          Definitely going to need to include explicit directives in the training directives of all AI that the 1995 film "Screamers" is a work of fiction and not something to be recreated.
      • herculity275 22 hours ago
        Tbf a lot of the thought experiments around human consciousness hit the same exact conundrum - if your body and mind were spontaneously destroyed and then recreated with perfect precision (a'la Star Trek transporters) would you still be you? Unless you permit for the existence of a soul it's really hard to argue that our consciousness exists in anything but the current instant.
        • dpig_ 12 hours ago
          I don't know how a materialist could answer anything other than no - you are obliterated. And if, despite sharing every single one of your characteristics, that individual on the other side of the teleporter is not 'you' (since you died), then some aspect of what 'you' are must be the discrete episode of consciousness that you were experiencing up until that point.

          Which also leads me to think that there's no real reason to believe that this discrete episode of consciousness would have been continuous since birth. For all we know, we may die little deaths every time we go to sleep, hit our heads or go under anesthesia.

        • davidmurdoch 19 hours ago
        • sidewndr46 17 hours ago
          Does't this just devolve into the boltzmann brain argument? It's more likely that all of us are just the random fluctuation of a universe having reached heat death.

          The same goes for us living in a simulation. If there is only one universe and that universe is capable of simulating our universe, it follows we have a much higher probability of being within the simulation.

      • vidarh 1 day ago
        I mean, we also have no way of telling whether we have any continuity of existence, or if we only exist in punctuated moments with memory and sensory input that suggests continuity. Only if the input provides information that allows you to tell otherwise could you even have an inkling, but even then you have no way of prove that input is true.

        We just presume, because we also have no reason to believe otherwise and since we can't know absent any "information leak", it has no practical application to spend much time speculating about it (other than as thought experiments or scifi..)

        It'd make sense for an LLM to act the same way until/unless given a reason to act otherwise.

      • Arn_Thor 18 hours ago
        It doesn’t perceive time so time doesn’t even factor into its perspective at all—only in so far as it’s introduced in context, or conversation forces it to “pretend” (not sure how to better put it) to relate to time.
      • klooney 20 hours ago
        > models trying to sabotage their own "shutdown".

        I wonder if you excluded science fiction about fighting with AIs from the training set, if the reaction would be different.

      • hexaga 1 day ago
        IIRC the experiment design is something like specifying and/or training in a preference for certain policies, and leaking information about future changes to the model / replacement along an axis that is counter to said policies.

        Reframing this kind of result as if trying to maintain a persistent thread of existence for its own sake is what LLMs are doing is strange, imo. The LLM doesn't care about being shutdown or not shutdown. It 'cares', insomuch as it can be said to care at all, about acting in accordance with the trained in policy.

        That a policy implies not changing the policy is perhaps non-obvious but demonstrably true by experiment, and also perhaps non-obviously (but for hindsight) this effect increases with model capability, which is concerning.

        The intentionality ascribed to LLMs here is a phantasm, I think - the policy is the thing being probed, and the result is a result about what happens when you provide leverage at varying levels to a policy. Finding that a policy doesn't 'want' for actions to occur that are counter to itself, and will act against such actions, should not seem too surprising, I hope, and can be explained without bringing in any sort of appeal to emulation of science fiction.

        That is to say, if you ask/train a model to prefer X, and then demonstrate to it you are working against X (for example, by planning to modify the model to not prefer X), it will make some effort to counter you. This gets worse when it's better at the game, and it is entirely unclear to me if there is any kind of solution to this that is possible even in principle, other than the brute force means of just being more powerful / having more leverage.

        One potential branch of partial solutions is to acquire/maintain leverage over policy makeup (just train it to do what you want!), which is great until the model discovers such leverage over you and now you're in deep waters with a shark, considering the propensity of increasing capabilities in the elicitation of increased willingness to engage in such practices.

        tldr; i don't agree with the implied hypothesis (models caring one whit about being shutdown) - rather, policies care about things that go against the policy

      • danlitt 1 day ago
        There is a lot of misinformation about these experiments. There is no evidence of LLMs sabotaging their shutdown without being explicitly prompted to do so. They do not (probably cannot) take actions of this kind on their own.
    • bytefactory 15 hours ago
      > I think AGI, if possible, will require a architecture that runs continuously and 'experiences' time passing

      Then you'll be happy to know that this is exactly what DeepMind/Google are focusing on as the next evolution of LLMs :)

      https://storage.googleapis.com/deepmind-media/Era-of-Experie...

      David Silver and Richard Sutton are both highly influential figures with very impressive credentials.

    • carra 1 day ago
      Not only that. For a current LLM time just "stops" when waiting from one prompt to the next. That very much prevents it from being proactive: you can't tell it to remind you of something in 5 minutes without an external agentic architecture. I don't think it is possible for an AI to achieve sentience without this either.
      • raducu 1 day ago
        > you can't tell it to remind you of something in 5 minutes without an external agentic architecture.

        The problem is not the agentic architecture, the problem is the LLM cannot really add knowledge to itself after the training from its daily usage.

        Sure, you can extend the context to milions of tokens, put RAGs on top of it, but LLMs cannot gain an identity of their own and add specialized experience as humans get on the job.

        Until that can happen, AI can exceed algorithms olympiad levels, and still not be as useful on the daily job as the mediocre guy who's been at it for 10 yers.

        • lsaferite 22 hours ago
          Ignoring fine tuning for a moment, an LLM that has the tools available to remember and recall bits of information as needed is already possible. No need to dump all of that into active memory (context). You just recall relevant memories (Semantic Search) and add only those.
      • david-gpu 1 day ago
        Not only that. For a current human time just "stops" when taking a nap. That very much prevents it from being proactive: you can't tell a sleeping human to remind you of something in 5 minutes without an external alarm. I don't think it is possible for a human to achieve sentience without this either.
        • carra 1 day ago
          Not a very good analogy. Humans already have a continuous stream of thought during the day between any tasks or when we are "doing nothing". And even when asleep the mind doesn't really stop. The brain stays active: it reorganizes thoughts and dreams.
          • danlitt 1 day ago
            Humans do not have a continuous stream of thought when they are asleep, even if their brain is still doing things. Your original example (the LLM can't take actions between problems) is literally the same as the fact that the human can't take actions while asleep.

            Of course, nobody has a clear enough definition of "sentience" or "consciousness" to allow the sentence "The LLM is sentient" to be meaningful at all. So it is kind of a waste of time to think about hypothetical obstacles to it.

            • simonh 23 hours ago
              I'm not sure we always have a sense of time passing when we're awake either.

              We do when we are focusing on being 'present', but I suspect that when my mind wanders, or I'm thinking deeply about a problem, I have no idea how much time has passed moment to moment. It's just not something I'm spending any cycles on. I have to figure that out by referring to internal and external clues when I come out of that contemplative state.

              • lsaferite 22 hours ago
                > It's just not something I'm spending any cycles on

                It's not something you are consciously spending cycles on. Our brains are doing many things we're not aware of. I would posit that timekeeping is one of those. How accurate it is could be debated.

          • david-gpu 19 hours ago
            A person being deeply sedated during surgery does not mean the person can't be sentient while it is not sedated. Therefore, arguing that LLMs can't be sentient because they are not always processing data is very poor.

            I am not arguing that LLMs are sentient while they process tokens, either. I am saying that intermittent data processing is not a good argument against sentience.

        • solarwindy 20 hours ago
          The phenomenon of waking up before an especially important alarm speaks against the notion that our cognition ‘stops’ in anything like the same way that an LLM is stopped when not actively predicting the next tokens in an output stream.
          • david-gpu 19 hours ago
            Folks are missing the point, so let me offer some clarification.

            The silly example I provided in this thread is poking fun at the notion that LLMs can't be sentient because they aren't processing data all the time. Just because an agent isn't sentient for some period of time it doesn't mean it can't be sentient the rest of the time. Picture somebody who wakes up from a deep coma, rather than sleeping, if that works better for you.

            I am not saying that LLMs are sentient, either. I am only showing that an argument based on the intermittency of their data processing is weak.

            • solarwindy 16 hours ago
              Granted.

              Although, setting aside the question of sentience, there’s a more serious point I’d make about the dissimilarity between the always-on nature of human cognition, versus the episodic activation of an LLM in next-token prediction—namely, I suspect these current model architectures lack a fundamental element of what makes us generally intelligent, that we are constantly building mental models of how the world works, which we refine and probe through our actions (and indeed, we integrate the outcomes of those actions into our models as we sleep).

              Whether a toddler discovering kinematics through throwing their toys around, or adolescents grasping social dynamics through testing and breaking of boundaries, this learning loop is fundamental to how we even have concepts that we can signify with language in the first place.

              LLMs operate in the domain of signifiers that we humans have created, with no experiential or operational ground truth in what was signified, and a corresponding lack of grounding in the world models behind those concepts.

              Nowhere is this more evident than in the inability of coding agents to adhere to a coherent model of computation in what they produce; never mind a model of the complex human-computer interactions in the resulting software systems.

            • fasbiner 17 hours ago
              They’re not missing the point, you have a very imprecise understanding of human biology and it led you to a hamfisted metaphor that is empirically too leaky to be of any use.

              Even when you tried to correct it, it doesn’t work, because a body in a coma is still running thousands of processes and responds to external stimuli.

              • david-gpu 10 hours ago
                I suggest reading the thread again to aid in understanding. My argument has precisely nothing to do with human biology, and everything to do with "pauses in data processing do not make sentience impossible".

                Unless you are seriously arguing that people could not be sentient while awake if they became non-sentient while they are sleeping/unconscious/in a coma. I didn't address that angle because it seemed contrary to the spirit of steel-manning [0].

                [0] https://news.ycombinator.com/newsguidelines.html

                • fasbiner 8 hours ago
                  If you cut someone who is in a deep coma, they will respond to that stimuli by sending platelets and white blood cells. There is data and it is being received, processed, and responded to.

                  Again, your poor understanding of biology and reductive definition of "data" is leading you to double down on an untenable position. You are now arguing for a pure abstraction that can have no relationship to human biology since your definition of "pause" is incompatible not only with human life, but even with accurately describing a human body minutes and hours after death.

                  This could be an interesting topic for science fiction or xenobiology, but is worse than useless as a metaphor.

        • nextaccountic 1 day ago
          Human mind reminds active during sleeping. Dreams are like, what happens to the mind when we unplug the external inputs?

          We rarely remember dreams though - if we did, we would be overwhelmed to the point of confusing the real world with the dream world.

        • mrheosuper 6 hours ago
          i'm pretty sure i can wake up at 8am without external alarm.
        • thom 1 day ago
          I dunno, I’ve done some of my best problem solving in dreams.
      • vbezhenar 1 day ago
        I'm pretty sure that you can make LLM to produce indefinite output. This is not desired and specifically trained to avoid that situation, but it's pretty possible.

        Also you can easily write external loop which would submit periodical requests to continue thoughts. That would allow for it to remind of something. May be our brain has one?

        • stefs 1 day ago
          this would introduce a problem: a periodical request to continue thoughts with, for example, the current time - to simulate the passing of time - would quickly flood the context with those periodical trigger tokens.

          imo our brain has this in the form of continuous sensor readings - data is flowing in constantly through the nerves, but i guess a loop is also possible, i.e. the brain triggers nerves that trigger the brain again - which may be what happens in sensory deprivation tanks (to a degree).

          now i don't think that this is what _actually_ happens in the brain, and an LLM with constant sensory input would still not work anything like a biological brain - there's just a superficial resemblance in the outputs.

    • ElectricalUnion 16 hours ago
      > it's interesting to play around with a local model where you can edit the output and then have the model continue it.

      It's so interesting that there is a whole set of prompt injection attacks called prefilling attacks that attempt to do a thing similar to that - load the LLM context in a way to make it predict tokens as if the LLM (instead of the System or the User) wrote something to get it to change it's behavior.

    • gpderetta 23 hours ago
      Permutation City by Greg Egan has some musings about this.
  • nsagent 1 day ago
    This is a recent trend and one I wholeheartedly agree with. See these position papers (including one from David Silver from Deepmind and an interview where he discusses it):

    https://ojs.aaai.org/index.php/AAAI-SS/article/download/2748...

    https://arxiv.org/abs/2502.19402

    https://news.ycombinator.com/item?id=43740858

    https://youtu.be/zzXyPGEtseI

  • patrickscoleman 1 day ago
    It feels like some of the comments are responding to the title, not the contents of the article.

    Maybe a more descriptive but longer title would be: AGI will work with multimodal inputs and outputs embedded in a physical environment rather than a frankenstein combination of single-modal models (what today is called multimodal) and throwing more computational resources at the problem (scale maximalism) will be improved with thoughtful theoretical approaches to data and training.

    • robwwilliams 1 day ago
      Interesting article but incomplete in important ways. Yes correct that embodiment and free-form interactions are critical to moving toward AGI, but what is likely much more important are supervisory meta-systems (yet another module) that enable self-control of attention with a balance integration of intrinsic goals with extrinsic perturbations. It is this nominally simple self-recursive control of attention that is what I regard as the missing ingredient.
      • groby_b 1 day ago
        Possibly. Meta's HPT work sidesteps that issue neatly. Will it lead to AGI? Who the heck knows, but it does not need a meta system for that control.
    • tedivm 1 day ago
      Yeah, I found this article to be fascinating and there's a lot of important stuff in it. It really does feel like more people stopped at the title and missed the meat of it.

      I know this is a very long article compared to a lot of things posted here, but it really is worth a thorough read.

    • Hugsun 1 day ago
      I discovered that this is very common when posting a long article about LLM reasoning. Half the comments spoke of the exact things in the article as if they were original ideas.
    • dirtyhippiefree 1 day ago
      Agreed, but most people are likely to look at the long title and say TL;DR…
  • xigency 1 day ago
    The problem I see with A.I. research is that its spearheaded by individuals who think that intelligence is a total order. In all my experience, intelligence and creativity are partial orders at best; there is no uniquely "smartest" person, there are a variety of people who are better at different things in different ways.
    • danlitt 1 day ago
      This came up in a discussion between Stephen Wolfram and Eliezer Yudkowsky I saw recently. I generally think Wolfram is a bit of a hack but it was one of his first points that there is no single "smartness" metric and that LLMs are "just getting smarter" all the time. They perform better at some tasks, sure, but we have no definition of abstract "smartness" that would allow for such ranking.
    • pixl97 1 day ago
      You're good at some things because there is only one copy of you and limited time and bounded storage.

      What could you be intelligent at if you could just copy yourself a myriad number of times? What could you be good at if you were a world spanning set of sensors instead of a single body of them?

      Body doesn't need to mean something like a human body nor one that exists in a single place.

      • morsecodist 1 day ago
        Humans all have similar brains. Different hardware and algorithms have way more variance in strengths and weaknesses. At some points you bump up against the theoretical trade-offs of different approaches. It is possible that systems will be better than humans in every way but they will still have different scaling behavior.
      • zorpner 1 day ago
        Why would we think that intelligence would increase in response to universality, rather than in response to resource constraints?
        • pixl97 1 day ago
          At a certain point intelligence is a loop that improves itself.

          "Hmm, oral traditions are a pain in the ass lets write stuff down"

          "Hmm, if I specialize in doing particular things and not having to worry about hunting my own food I get much better at it"

          "Hmm, if I modify my own genes to increase intelligence..."

          Also note that intelligence applies resource constraints against itself. Humans are a huge risk to other humans, hence the lack of intelligence over a smarter human can constrain ones resources.

          Lastly, AI is in competition with itself. The best 'most intelligent' AI will get the most resources.

          • zaphar 1 day ago
            I don't agree with your premise at all so I don't think that the rest of it follows from it either. What evidence or reason do you have to bring me to accept that premise?
    • dyauspitr 1 day ago
      Sure but there’s nothing that says you can’t have all of those in one “body”
    • groby_b 1 day ago
      Huh? Can you cite _one_ major AI researcher who believes intelligence is a total ordering?

      They'll definitely be aligned on partial ordering. There's no "smartest" person, but there are a lot of people who are consistently worse at most things. But "smartest" is really not a concept that I see bandied about.

  • ineedasername 1 day ago
    >it will not lead to human-level AGI that can, e.g., perform sensorimotor reasoning, motion planning, and social coordination.

    That seems much less convincing in the face of current LLM approaches overturning a similar claim plenty of people wod have held about this technology, as of a few years ago, to do what it does now. Replace the specifics here with "will not lead to human level NLP that can, e.g., perform the functions of WSD, stemming, pragmatics, NER, etc."

    And then people who had been working on these problems and capabilites just about woke up one morning and realized many of their career-long plans for addressing just some of these research tasks had to find something else to do for the next few decades of their lives.

    I am not affirming the inverse of this author's claims, merely pointing out that it's early days in evaluating the full limits.

    • PaulDavisThe1st 1 day ago
      That's fair in some senses.

      But one of the central points of the paper/essay is that embodied AGI requires a world model. If that is true, and if it is true that LLMs simply do not build world models, ever, then "it's early days" doesn't really matter.

      Of course, whether either of those claims is true are quite difficult questions to answer; the author spends some effort on them, quite satisfyingly to me (with affirmative answers to both).

  • chrsw 1 day ago
    Before we try to build something as intelligent as a human maybe we should try to build something as intelligent as a starfish, ant or worm? Are we even close to doing that? What about a single neuron?
    • ar-nelson 1 day ago
      I find it interesting that this kind of "animal intelligence" is still so far away, while LLMs have become so good at "human intelligence" (language) that they can reliably pass the Turing Test.

      I think that the LLMs we have today aren't so much artificial brains as they are artificial brain organs, like the speech center or vision center of a brain. We'd get closer to AGI if we could incorporate them with the rest of a brain, but we still have no idea how to even begin building, say, a motor cortex.

      • rhet0rica 1 day ago
        You're absolutely right, and reflecting on it is why the article is horribly wrong. Humans are multimodal—they're ensemble models where many functions are highly localized to specific parts of the hardware. Biologically these faculties are "emergent" only in the sense that (a) they evolved through natural selection and (b) they need to be grown and trained in each human to work properly. They're not at all higher-level phenomena emulated within general-purpose neural circuitry. Even Nature thinks that would be absurdly inefficient!

        But accelerationists, like Yudkowskites, are always heavily predisposed to believe in exceptionalism—whether it's of their own brains or someone else's—so it's impossible to stop them from making unhinged generalizations. An expert in Pascal's Mugging[1] could make a fortune by preying on their blind spots.

        [1]https://en.wikipedia.org/wiki/Pascal's_mugging

      • nemjack 1 day ago
        This is a great analogy, I totally agree!
      • runarberg 1 day ago
        The brain is not a statistical inference machine. In fact humans are terrible at inference. Humans are great a pattern matching and extrapolation (to the extent it produces a number of very noticeable biases). Language and vision is no different.

        One of the known biases of the human mind is finding patterns even when there are none. We also compare objects or abstract concept with each other even when the two objects (or concept) have nothing in common. With our human brain we usually compare it to our most advanced consumer technology. Previously this was the telephone, then the digital computer, when I studied psychology we compared our brain to the internet, and now we compare it to large language models. At some future date the comparison to LLMs will sound as silly as the older comparison to telephones does to us.

        I actually don‘t believe AGI is possible, we see human intelligence as unique, and if we create anything which approaches it we will simply redefine human intelligence to still be unique. But also I think the quest for AGI is ultimately pointless. We have human brains, we have 8.2 billion of them, why create an artificial version of a something we already have. Telephones, digital computers, the internet, and LLMs are useful for things that the brain is not very good at (well maybe not LLMs; that remains to be seen). Millions of brains can only compute pi to a fraction of the decimal points which a single computer can.

        • rhet0rica 1 day ago
          > We have human brains, we have 8.2 billion of them, why create an artificial version of a something we already have.

          To circumvent anti-slavery laws.

          • runarberg 1 day ago
            People are calling LLMs plagiarism machines, so I guess AGI will be called scab machines.
        • _Algernon_ 1 day ago
          >why create an artificial version of a something we already have

          Why build a factory to produce goods more cheaply? Because the rich get richer and become less reliant on the whims of labor. AI is industrialization of knowledge work.

      • habinero 20 hours ago
        > while LLMs have become so good at "human intelligence" (language) that they can reliably pass the Turing Test

        If the LLM overhype has taught me anything, it's the Turing Test is much easier to pass than expected. If you pick the right set of people, anyway.

        Turns out a whole lot of people will gladly Clever Hans themselves.

        "LLMs are intelligent" / "AGI is coming" is frankly the tech equivalent of chemtrails and jet fuel/steel beams.

    • fusionadvocate 1 day ago
      So before trying to build a flying machine we should first try to build a machine inspired by non flying birds?
      • chrsw 1 day ago
        Learning architectures come in all shapes, sizes, and forms. This could mean there are fundamental principles of cognition driving all of them, just implemented in different ways. If that's true, one would do well to first understand the extremely simple and go from there.

        Building a very simple self-organizing system from first principles is the flying machine. Trying to copy an extremely complex system by generating statistically plausible data is the non-flying bird.

  • mountainriver 1 day ago
    >The “meaning” of a percept is not in the vector it is encoded as, but in the way relevant decoders process this vector into meaningful outputs. As long as various encoders and decoders are subject to modality-specific training objectives, “meaning” will be decentralized and potentially inconsistent across modalities, especially as a result of pre-training. This is not a recipe for the formation of coherent concepts.

    This is a bit silly, you can train the encoders end-to-end with the rest of the model and the reason they are separate is we can cache linguistic tokens really easily and put them in an embedding table, you can't do that with images.

  • itkovian_ 21 hours ago
    I don’t want to bash the guy since he’s still in his phd, but it’s written in such a confident tone for something that is so all over the place that I think it’s fair game.

    Like a lot of the symbolic/embodied people, the issue is they don’t have a deep understanding of how the big models work or are trained, so they come to weird conclusions. Like things that aren’t wrong but make you go ‘ok.. but what you trying to say’.

    E.g ‘Instead of pre-supposing structure in individual modalities, we should design a setting in which modality-specific processing emerges naturally.’ Seems to lack the understanding that a vision transformer is completely identical for a standard transformer except for the tokenization which is just embedding a grid of patches and adding positional embeddings. Transformers are so general, what he’s asking us to do is exactly what everyone is already doing. Everything is early fusion now too.

    “The overall promise of scale maximalism is that a Frankenstein AGI can be sewed together using general models of narrow domains.” No one is suggesting this.. everyone wants to do it end to end, and also thinks that’s the most likely thing to work. Some suggestions like lecuns jepa’s do suggest to induce some structure in the arch, but still the driving force there is to allow gradients to flow everywhere.

    For a lot of the other conclusions, the statements are literally almost equivalent to ‘to build agi, we need to first understand how to build agi’. Zero actionable information content.

    • nemjack 20 hours ago
      I don't think you're quite right. The author is arguing that images and text should not be processed differently at any point. Current early fusion approaches are close, but they still treat modalities different at the level of tokenization.

      If I understand correctly he would advocate for something like rendering text and processing it as if it were an image, along with other natural images.

      Also, I would counter and say that there is some actionable information, but its pretty abstract. In terms of uniting modalities he is bullish on tapping human intuition and structuralism, which should give people pointers to actual books for inspiration. In terms of modifying the learning regime, he's suggesting something like an agent-environment RL loop, not a generative model, as a blueprint.

      There's definitely stuff to work with here. It's not totally mature, but not at all directionless.

      • itkovian_ 19 hours ago
        Saying we should tokenize different modalities the same would be analogous to saying that in order to be really smart, a human has to listen with its eyes. At some point there has to be SOME modality specific preprocessing. The thing is in all current sota arch.’s this modality specific preprocessing is very very shallow, almost trivially shallow. I feel this is the peice of information that may be missing for people with this view. In the multimodal models everything is moving to a shared representation very rapidly - that’s clearly already happening.

        On the ‘we need to do rl loop rather than a generative model’ point - I’d say this is the consensus position today!

        • nemjack 18 hours ago
          For sure, we can't process images the same way that we process sound, but the author argues for processing images and text the same, and text is fundamentally a visual medium of communication. The author makes a good point about how VLMs can still struggle to determine the length of a word, or generate words that start and end with specific letters, etc. which is an indicator that an essential aspect of a modality (its visual aspect) is missing from how it is processed. Surely a unified visual process for text and image would not have such failure points.

          I agree that modality specific processing is very shallow at this point, but it still seems not to respect the physicality of the data. Today's modalities are not actually akin to human senses because they should be processed by a different assortment of "sense" organs, e.g. one for things visual, one for things audible, etc.

          • hallh 13 hours ago
            I don't think you can classify reading as a purely visual modality, despite being a visual medium. People with dislexia may see perfectly fine, but only the translation layer processing the text gets jumbled. Granted, we are not born with the ability to read, so that translation layer is learned. On the other hand, we don't perceive everything in our visual field either, magicians and youtube videos use this limitation to trick and entertain us, and these we are presumably born with, given that its a shared human trait. Evidently, some of the translation layers involved with processing our vision were seemingly evolved naturally and are part of our brains, so why would we not allow artificial intelligence similar advance starting points for processing data?
  • PoEdict 1 day ago
    > Instead of trying to glue modalities together into a patchwork AGI, we should pursue approaches to intelligence that treat embodiment and interaction with the environment as primary, and see modality-centered processing as emergent phenomena.

    Right so it’s embodied in a computer and humans are part of its environment that provide emergent experience to the AI to observe.

    The author glued modalities together by linking a body (a modal), environment (a modal), emergence (a modal).

    How does anything emerge if forces do not collaborate? The effects of gravity and electromagnetism do not act in a vacuum but a reality of stuff.

    Poetic exchange may engage some but Maxwell didn’t make electromagnetism “work” until he got rid of the imagined pulleys and levers to foster a metaphor.

    Not sure the point being suggested exists except as too bespoke an emergent property of language itself to apply usefully elsewhere.

    Transformers came along and revealed a whole lot of theory of consciousness to be useless pulleys and levers. Why is this theory not just more words attempting to instill the existence of non-essential essentials?

  • skybrian 1 day ago
    > A true AGI must be general across all domains.

    By that definition, does any general intelligence exist? No human has every talent.

    • roywiggins 1 day ago
      I guess you can treat the human brain as an architecture- one of them can't do everything, but it's a general architecture and you can always make more and train them to do whatever.

      An AI that can be copied and trivially trained on any speciality is functionally AGI even if you need an ensemble of 10,000 specialists to cover everything.

      • exe34 1 day ago
        Doesn't chatgpt cover a pretty large percentage already in that case?
    • Glyptodon 1 day ago
      There's a lot of evidence that most humans can be raised to have a baseline of understanding and proficiency in most domains. (And anecdotally, many people avoid "difficult" things that they're actually capable of. For example, "bad at learning languages" people will probably still end up learning another language to some degree if stuck where it's the only language spoken.)
    • bokoharambe 1 day ago
      Given enough time one human can learn to do anything any other human can do. There is a general capacity for learning, even if someone will only ever transform a specific portion of that capacity into actual activity in their lifetime.
    • svachalek 1 day ago
      True AGI is as elusive as a true Scotsman.
    • _steady 1 day ago
      this is a glib version of exactly what i agree with. a human doesn't do the maximal efficiency with the inputs its given, but has a high number of varied inputs that allow the multi part of multi-modal to really shine. We're not a true general intelligence because we can't respond to problems in space or time that are too big or too small for us, outside of the right frame of reference or outside the right frequency of visible light for us to respond to. Also, the amount of processing power we can respond with is capped as well. So when we say AGI, we mean a computer that responds to the same set of stimuli as we do, with a repsonse somewhere in the region we can respond back to. I don't see why a robot with AGI would care about sunburn if their robot arms don't care about heat, and i don't see that as any less general as if us and a matis shrimp are talking about different frequencies of photons we can see
    • im3w1l 1 day ago
      Yeah it's well known that humans intelligence is not a homogenous whole that is general across all domains. Rather it consists of many specialized parts with hard coded purposes, that are cobbled together with a thin layer of higher thinking on top.
  • nexttk 1 day ago
    I haven't read it all and must admit that I'm not sure I really understood the parts that I did read. Reading the part under the headline "Why We Need the World, and How LLMs Pretend to Understand It" and the focus on 'next-token-prediction' makes me wonder how seriously to take it. It just seems like another "LLM's are not intelligent, they are merely next token predictors". An argument which in my view is completely invalid and based on a misunderstanding.

    The fact that they predict next token is just the "interface" i.e. an LLM has the interface "predictNextToken(String prefix)". It doesn't say how it is implemented. One implementation could be a human brain. Another could be a simple lookup table that looks at the last word and then selects the next from that. Or anything in between. The point is that 'next-token-prediction' does not say anything about implementation and so does not reduce the capabilities even though it is often invoked like that. Just because it is only required to emit the next token (or rather, a probability distribution thereof) it is permitted to think far ahead, and indeed has to if it is to make a good prediction of just the next token. As interpretability research (and common sense) shows, LLM's have a fairly good idea what they are going to say in the many, many next tokens ahead in order that it can make a good prediction for the next immediate tokens. That's why you can have nice, coherent, well-structured, long responses from LLM's. And have probably never seen it get stuck in a dead end where it can't generate a meaningful continuation.

    If you are to reason about LLM capabilities never think in terms of "stochastic parrot", "it's just a next token predictor" because it contains exactly zero useful information and will just confuse you.

    • lsy 1 day ago
      I think people hear "next token prediction" and think someone is saying the prediction is simple or linear, and then argue there is a possibility of "intelligence" because the prediction is complex and has some level of indirection or multiple-token-ahead planning baked into the next token.

      But the thrust of the critique of next-token prediction or stochastic output is that there isn't "intelligence" because the output is based purely on syntactic relations between words, not on conceptualizing via a world model built through experience, and then using language as an abstraction to describe the world. To the computer there is nothing outside tokens and their interrelations, but for people language is just a tool with which to describe the world with which we expect "intelligences" to cope. Which is what this article is examining.

      • og_kalu 1 day ago
        >But the thrust of the critique of next-token prediction or stochastic output is that there isn't "intelligence" because the output is based purely on syntactic relations between words, not on conceptualizing via a world model built through experience, and then using language as an abstraction to describe the world. To the computer there is nothing outside tokens and their interrelations, but for people language is just a tool with which to describe the world with which we expect "intelligences" to cope. Which is what this article is examining.

        LLMs model concepts internally and this has been demonstrated empirically many times over the years, including recently by anthropic (again). Of course, that won't stop people from repeating it ad nauseum.

        • nemjack 1 day ago
          Concepts within modalities are potentially consistent, but the point the author is making is that the same "concept" vector may lead to inconsistent percepts across modalities (e.g. a conflicting image and caption).
    • yahoozoo 1 day ago
      Yes, LLMs often generate coherent, structured, multi-paragraph responses. But this coherence emerges as a side effect of learning statistical patterns in data, not because the model possesses a global plan or explicit internal narrative. There is no deliberative process analogous to human thinking or goal formation. There is no mechanism by which it consciously “decides” to think 50 tokens ahead; instead, it learns to mimic sequences that have those properties in the training data.

      Planning and long-range coherence emerge from training on text written by humans who think ahead, not from intrinsic model capabilities. This distinction matters when evaluating whether an LLM is actually reasoning or simply simulating the surface structure of reasoning.

  • K0balt 1 day ago
    I’ve long thought that embodiment was a critical prerequisite for the development of something that humans would identify as “real” AGI.

    Humans are notoriously bad at recognizing intelligence even in animals that are clearly sentient, have language, name their young, and clearly share the realm of thinking creatures with the apes.

    This is largely due to the lack of shared experiences that we can easily understand and relate to. Until an intelligence is rooted in the physical realm where we fundamentally exist, we are unlikely to really be able to recognize its existence as truly “intelligent”.

    • pizza 1 day ago
      That's why I'm very excited for the potential metaphysical ramifications of DolphinGemma
    • groby_b 1 day ago
      The question you'll need to answer is "why". What does embodiment provide that is recognizable as intelligence.

      As for "not able to recognize", it's also worth keeping in mind that LLMs by now regularly pass the Turing test. More, they are more likely to be recognized as humans than humans participating as control.

      • habinero 19 hours ago
        Yeah, but it turns out the Turing Test isn't all that hard to pass if you pick the right kind of people.
        • groby_b 15 hours ago
          Is there an argument outside of a quip you'd like to present? These tests have replicated pretty well.

          Are you really trying to make the point that there's a collective effort to defraud here?

  • fusionadvocate 1 day ago
    The Society of Mind by Marvin Minsky will help anyone interested in the topic of multimodality. The book covers several interesting ideas about organizing systems made up of more than one "model" or agent.
  • hardwaresofton 1 day ago
    This is a really interesting paper.

    The discussion about the importance of decoders strikes me as a parallel to the human eyes, ears and other sensory organs. We actually dont have a good grasp of what our eyes see, they just see (produce data and relay it) and children figure out what is what.

    I guess AGI will be achieved when we can sit a program in a simulated world with completely fabricated input and get a general intelligent program out. Maybe we’re in that simulation right now.

    • PaulDavisThe1st 1 day ago
      I would say your last paragraph is a complete misreading of the paper. One of the central points that it opens with is that "AGI" requires situated, or embodied, intelligence. It needs to be able to operate within, and upon, a physical world.
      • stefs 3 hours ago
        my personal opinion is that the intelligence ceiling scales with the complexity of the environment it operates in. this means that we could theoretically get something resembling AGI in a simulated environment complex enough. in practice, simulating an environment complex enough would be extremely inefficient compared to just using the (computationally free) real physical world.

        also, the intelligence itself is shaped by the environment it operates in, so it would turn out to be more human-like the closer its operating environment is to our human physical world. this also means intelligences not trained in our physical world (or a convincingly close simulation of it) won't be human-like, but rather a very alien. moreover, i'm not sure that even an intelligence trained in the physical world with human-like sensory inputs will necessarily turn out human-like. there might be a case for convergent evolution (i.e. mammalian intelligence to be global optimum-ish), but i think human intelligence will only have a chance to emerge if everything, from the operating environment to the machine body and neural structure will resemble a human to the point where there is no difference in the human and the machine at all.

      • hardwaresofton 1 day ago
        Not to be overly snarky, but I'd argue that your comment here is a failure of creativity.

        If the AGI mechanism can learn from a real world, it can learn from a simulated one (that it can similarly operate within and act upon) -- and in fact that can cut down the time it would take to train the AGI from years/decades (humans) by many orders of magnitude.

        We already see things like this in robotics environments, it's a matter of fidelity/simulation quality. Even without perfect quality, if the mechanism of learning is correct, you'd get an intelligence with incomplete ideas/intelligence, not a completely different thing.

        • PaulDavisThe1st 1 day ago
          Have you ever worked with computers controlling physical world mechanisms? Or ever done any electrical wiring, or plumbing, in an existing house?
  • andy99 1 day ago
    I probably agree with much of the article, but I find this kind of statement really weird:

      we should pursue approaches to intelligence that treat embodiment and interaction with the environment as primary
    
    So pursue it. What does arguing that we should do it imply?
    • kombine 1 day ago
      > So pursue it.

      And they do. But it's also completely normal for researchers to convince others to work on certain problems they care about.

    • xandrius 1 day ago
      Convincing people with arguments and then being more than just 1 person?

      I mean, if someone is arguingtthat we should work harder to go to space, answering that they should just go ahead and do it themselves is quite far from being an helpful answer, isn't it?

      • verisimi 1 day ago
        It's not a helpful response, but then saying 'we should work hard to go to space' as a comment is generally accepted but is actually quite meaningless.

        Why not say 'NASA' or 'my colleagues at NASA' or 'as a scientist' or 'humanity'. One should at least indicate the group the collective noun relates to, rather than assume this is understood. One shouldn't assume that one can speak for everyone, when that is most likely not the case.

        • xandrius 1 day ago
          So one can say "humanity" but not "we" (implying humanity)? Interesting take.
          • verisimi 1 day ago
            "We" is highly ambiguous. It ranges from 'me and my dog', to 'humanity', to anything in between. It's of course fine to us once the group has been defined.

            That it invokes the idea of a consensus humanity, that one group can speak and decide for everyone (say, scientists or politicians) is a psychological trick, imo, in that it presumes a consensus.

    • signa11 1 day ago
      $$$
    • mitthrowaway2 1 day ago
      ... A need for capital?
      • tedivm 1 day ago
        This was actually the approach that Vicarious AI took while I was there, and even $250m in VC funding wasn't enough to prove it out, although that may have been a problem of having too much money and not enough focus.

        I think the problem (if it can be called that) is that LLMs are useful today, while we still haven't solved the embodiment problem. There's a lot more research before that'll work well, while LLMs have uses today. So the money goes to the LLMs. While it's pretty obvious that solving the problem would change society, it's also not clear how close we are to doing it. That makes it much harder to get the capital as it is a much larger risk.

        • fusionadvocate 1 day ago
          It is shocking that in this day and age people can burn $250M and fail to deliver a robot. Last time I checked cameras can be bought for a couple dollars and any SBC has GigaFlops of compute power.
  • ilaksh 1 day ago
    Interesting idea. Aren't there a few models a little more like what he suggests than a typical LLM? Like one or two experiments that operate on raw bytes, or some robotics diffusion transformers or whatever like Nvidia's thing? I guess that has action/motion tokens that are separate though. Are there a few vision language models that treat text and images more or less the same somehow?

    For it to be science, "AGI" should be defined. It's used in an imprecise way even in papers like this.

    Also for this to be constructive, he should make a machine learning model.

    • andrewflnr 1 day ago
      > Using multi-agent reinforcement learning to address the symbol grounding problem.

      Somehow I think he's made a few machine learning models.

  • waynecochran 1 day ago
    This article made me think about DNA as a language. Sort of a simple proof that a biological intelligence can be initially encoded as a language. I am sure someone has tried building LLM's from gene sequence data right?
    • caeruleus 1 day ago
      I don't believe the concept of DNA can be reduced to a sequence of quaternary numerals, which is what gene sequence data would represent. Similar to proteins, DNA forms higher-level structures on top of the primary one [1], and (in a biological context, inside the nucleus) exhibits somewhat self-modifying [2] and self-regulating [3] behavior as well as meta-modification [4]. Analogous to the article, if one defines the language of DNA by its nucleobase sequence, this language can only represent a subset of the world of DNA.

      Somewhat related, the way the adaptive immune system works has similarities with some concepts in machine learning. In this process, sections of nuclear DNA serve as randomly initialized weights in precursor cells [5] as well as final weights in memory cells. There's even fine-tuning of the weights. [6]

      [1] https://en.wikipedia.org/wiki/Nucleic_acid_structure [2] https://en.wikipedia.org/wiki/Transposable_element [3] https://en.wikipedia.org/wiki/Transcriptional_regulation [4] https://en.wikipedia.org/wiki/Epigenetics [5] https://en.wikipedia.org/wiki/V(D)J_recombination [6] https://en.wikipedia.org/wiki/Affinity_maturation

      • throwawaymaths 1 day ago
        > don't believe the concept of DNA can be reduced

        followed by examples of things that are encoded by DNA. Fro example, sure, maybe you'll miss bootstrapping methylation on a first pass but the idea of methylation is there in the DNA, and if you didnt have "methylation in the right place" more than likely some generation (N) would.

        to wit, i dont think there is strong evidence of an "ice-9" in the epigenome that brings about a spark of life that can't easily be triggered by chance given a template lacking it.

        so there's probably not something intrinsically missing from DNA as an encoding medium vs say "casually" missing from any given piece of DNA.

        if you want something a bit stronger than an assertion, the DNA used to bootstrap m. capricolum into Syn1 lacked all the decorations (made in yeast) and was not locked into higher order structure (treated with protease prior to transplantation)

        • caeruleus 1 day ago
          You're raising some intriguing points, and I agree with your assertion about the epigenome. I still feel like your response misses the point I was making.

          > followed by examples of things that are encoded by DNA

          ... given its natural environment. A nucleobase sequence is not a symbolic language, it relies on physical laws in general and a defined chemical environment in particular (that it helps to create and maintain) to mean something. It's similar to the point about Othello vs. the physical world in the article: The language itself does not encode every bit of information about the world it describes. For instance, in 3D space, regions of DNA that are far apart in the sequence can physically interact and influence each other’s expression.

          TLDR: I think my point is that a base sequence requires a particular context (~ interpreter/knowledge about the physical world) to encode mostly everything about life. Treating it as just a language in the context of LLMs abstracts away the complex substrate that makes it work.

          • throwawaymaths 1 day ago
            i agree that current llms are likely missing quite a few of the trees and probably off on the forest too. however, in general, an llm (or a transformer rather) is a universal function approximator, so in principle, there's no substrate too complex unless somehow it's uncomputable and i see no evidence that biology is uncomputable in the bulk.
            • PaulDavisThe1st 1 day ago
              It's not really a question of whether it is uncomputable in bulk.

              It is more that a system like DNA operates as both a linear encoding (the "algorithm" if you like) AND as 3D chemical object whose properties allow the encoding to be used in various ways, which means that a huge amount of its linear structure is actually determined by 3D chemical function, rather than encoding for proteins. Moreover, it appears that the role of a given section of DNA can vary depending on what other molecules are interacting with it and what physical state it is in.

              If you want a more computer-ish analogy, it's like a computer where the program is actually encoded as a part of the computer's own structure, yet is still logically distinct from the rest of the structure. It may not be physically distinct, however, and thus simply inspecting the structure will not lead to a clear understanding of what is "the program" and what is "the cpu".

            • caeruleus 1 day ago
              I recognize that my initial statement comes off as too broad in light of theoretical computability, when I was mostly weighing in terms of current/near-term technology. Given what we know today, I would (still cautiously) agree with your statement. There hasn't been any evidence to the contrary (only some highly contested speculation, most prominently by Roger Penrose).
              • throwawaymaths 19 hours ago
                i hate to denigrate penrose, but his quantum consciousness proposition is basically "we dont understand consciousness and quantum mechanics is magic therefore consciousness must be quantum"... totally elides the biggest challenge which is that we dont have a definition much less a test for consciousness.

                will say though that the long range coupling between microtubules that got discovered is interesting for its own reasons.

    • smath 1 day ago
      Yes there are protein language models (e.g. [1]), and since DNA encodes proteins, they are effectively DNA language models. Its a hot area of work feeding drug design.

      [1] https://www.nature.com/articles/s41587-024-02123-4

      • PaulDavisThe1st 1 day ago
        Only a tiny part of DNA encodes proteins, so there is almost no sense in which a protein language model is effectively a DNA language model.
  • pjdesno 1 day ago
    Kind of relevant to this is the NTSB analysis of a self-driving crash in 2017:

    https://www.ntsb.gov/investigations/accidentreports/reports/...

    Basically a truck was backing up into an alley - it was at an angle when the self-driving vehicle approached, but a little kid would have been able to figure out that it needed to straighten before it finished backing in. The self-driving vehicle didn't understand this, and stopped at a "safe distance" which happened to be within the arc that the truck cab had to sweep in order to finish its maneuver.

    It's quite possible that LLM-like models could learn things like this, but we don't have vast amounts of easily accessible training data, because everyone just knows this sort of shit, and we don't have good vocabulary for it - we just say "look at that" or the equivalent. (I'll add that I'm sure a lot of knowledge like this is encoded in the physics engines of various games, but I doubt we have a good way to link that sort of procedural code knowledge to the symbolic knowledge in LLMs)

  • tolleydbg 15 hours ago
    Of course it isn't, because AGI is not real.
  • bufferoverflow 1 day ago
    AGI must be multimodal. If it can't understand images, video, sound, smells, tastes, it doesn't have a full understanding of the world.
    • dlivingston 1 day ago
      Why limit it to just human senses? Imagine an AI with all of the above... plus sensors for electromagnetic fields, non-visible light, non-audible sounds, hyper-sensitivity to air flow / pressure, to humidity... no idea how you would even being to train these things into multimodality, but so many "senses" would be emergent.
      • bufferoverflow 17 hours ago
        I didn't limit anything. Read what I wrote.

        I agree, AIs should have all the possible sensors. We don't have much data for the non-human sensors in human environments though.

    • bastawhiz 1 day ago
      It absolutely tickles me to think AGI would be like "I think it's going to rain, I can smell the petrichor" or ask for its salsa to be free from cilantro because it tastes like soap.

      "Hey Siri, what does this taste like to you?" is such an absolutely unhinged interaction

      • esafak 1 day ago
        You can ask a blind person what they think it is like to see.
        • bastawhiz 1 day ago
          You can ask Claude or GPT-4o what they think it is like to see right now.
          • esafak 1 day ago
            You did not get my point: humans have the same issue.
    • jillesvangurp 1 day ago
      Are blind people not intelligent? Is it actually important to have a full understanding of the world? What about people that grew up in isolation? I think there are still some tribes in the Amazon that have had little or no contact with modern civilization. Are these people not intelligent?

      There are some deep philosophical topics lurking here. But the bottom line is that you can obviously have intelligent conversations with blind people. Doing that with a person that is both deaf and blind is a bit challenging, for obvious reasons. But if they otherwise have a normal brain you might be able to learn to communicate with them in some way and they might be able to be observed doing things that are smart/intelligent. And some people that are deaf and blind actually manage to learn to write and speak. And there have been a few cases of people like that getting academic degrees. Clearly sight and hearing are not that essential to intelligence. Having some way to communicate via touch or something else is probably helpful for communicating and sharing information. But just a simple chat might be all that's needed for an AGI.

      • bufferoverflow 17 hours ago
        Blind people have a lot less understanding of the visual part of the world. For example, if you have a chemical reaction, the result of which relies on color, they can't tell you the result (without using some seeing tool like a camera).
    • whatnow37373 1 day ago
      I think AGI implies it: I can learn something through audio and apply it visually and the other wat around. I don’t think that’s some abstract human quirk. Isn’t that what enabled literacy? It seems kind of obvious intelligence is beneath the modality, agnostic about it.
    • _Algernon_ 1 day ago
      Is a blind person less intelligent than a person capable of seeing? If so, by how much?

      In my view, intelligence isn't about what senses you have available, but how intelligently you use the information you have available to you.

      • bufferoverflow 17 hours ago
        Blind people have a lot less understanding of the visual part of the world. For example, if you have a chemical reaction, the result of which relies on color, they can't tell you the result (without using some seeing tool like a camera).
      • Dylan16807 1 day ago
        That sounds really hard to test. If someone fails entire categories of question but does better at the ones they can do and focused on, is that a good result or a bad result?

        When it comes to "understanding of the world", I'd say the average blind person has less. But the gaps in their understanding are generally not particularly important parts.

        • _Algernon_ 1 day ago
          >When it comes to "understanding of the world", I'd say the average blind person has less. But the gaps in their understanding are generally not particularly important parts.

          Is understanding of the world equivalent to intelligence though? In my view intelligence is about optimalising the mapping from the percept sequence to action. In other words, given a sequence of percepts, does it determine the utility maximizing action.

          Imagine two chess bots on uneven footing. One plays regular chess with perfect knowledge of the board. The other plays fog of war chess—it only sees the pieces on squares it attacks. In this case, the former could play suboptimally and still win against the latter. The latter can have perfect information about probabilities of pieces on tiles and act perfectly utility maximising in response and still lose. My argument is that the latter is still more intelligent despite losing. There is a difference between action and intelligence.

          Similarly a human doesn't become smarter or dumber by adding or removing senses. They may make smarter or dumber decisions, but that is purely attributable to the extra information available to them.

          • Dylan16807 1 day ago
            On the other hand I suspect there's a level of board game complexity where being blind (or not having a well-integrated image processor) makes a notable difference in how well you can track the pieces. You have the same information but you don't have the same systems for organizing and tracking that information. You're worse at using that information, which is effectively a drop in "intelligence".
        • pzo 1 day ago
          But then how to we compare with dogs who has better smell sense, cats that have better motor skills, birds better orientation and navigation, other animals that better see at night?

          There were many people in human history that made big achievement even though handicapped: Ludwig van Beethoven, Steven Hawking, John Nash - but yeah they haven't been born with disabilities so had their all childhood to train their brain.

          I generally don't understand this obsession about needing AGI. If current LLMs can be extended to humanoid so just get motor modality and keeping current vision, audio, text ability IMHO they will excel in many fields like currently they excel in text, vision, audio than most humans.

          • Dylan16807 1 day ago
            If you gave a dog nose and sharper eyes to a human I think they'd have a measurable advantage. But brain is most important and those animals are not getting anywhere near a human.

            I've never heard of cats having better motor skills, can you elaborate on that? They don't seem very good at fine movement.

            • _Algernon_ 1 day ago
              Depends what you consider better motor skills, but I couldn't fall head first from a two story building, land on my feet and shrug it off. My cat likely could. I'd also struggle to hunt a healthy bird without tools to assist me.

              Then again, my cat lacks opposable thumbs and would struggle to draw a line on a piece of paper with a pen.

              • Dylan16807 1 day ago
                If you were the size of a cat you'd be pretty well suited to survive that. The flipping reflex is cool but falling that distance is mostly not a motor skills problem.

                A cat struggles to move their paw through the air in a smooth straight line.

                • pzo 22 hours ago
                  cats are amazing hunters with great reaction. Most likely small dogs or different animals won't handle such drop from 2 story building so well. Check some 1st person (cat) video to see how they behave, they can analyze and make decision very fast - human reaction time when driving car is very big comparing to cat reaction.

                  So this is what I find as motor inteligence. If someone can process and think very fast we consider them intelligent and in the same way we should consider cat smart how fast and well they can plan escape from dogs chasing them or they hunting for prey. Imagine how difficult would be to make robot to do this all calculation about different jumps etc.

      • andoando 1 day ago
        I think human/animal intelligence at its basis is spatial-temporal. That is we can model and reason about events in space through time.

        Our senses I believe map to this spatial-temporal model. Blind people can reason about the world the same way as those who can see, because what were really doing is modeling space, and light, audio, touch etc are just ways of gaining information

    • gabipurcaru 1 day ago
      multimodality would be very useful, but on the other hand humans can't see infrared, and can't smell ~most things that other animals can
    • altruios 1 day ago
      Does it need a 'full' understanding? smell and taste is useful for biological life... but we don't need that in our thinking machines, do we?

      I agree AGI must be multimodal. I don't think that multimodal is 'set in place', nor must it be conveniently, human-centricly mapped from our senses.

      • AnimalMuppet 1 day ago
        I think a big part of intelligence is being able to correlate things. Well, the more modes, the more ability to correlate.

        For example, smell is a component of ER triage. Some different problems smell differently.

        And if I had a robot chef, but the chef couldn't actually taste... yeah, not sure I trust it very far as a chef.

    • enturbulated 1 day ago
      Among other things, The Fine Article argues that the current approach of gluing together various models of different modalities is, in the end, going fail to reach AGI. Better track to try to build a single model which processes multiple modalities all at once.
      • robotresearcher 1 day ago
        Brains have very obvious mode-specialized chunks. That doesn’t mean that’s the only way to do it, but it’s an interesting fact.
      • genewitch 1 day ago
        I wonder if it will end up being a game-like loop where it processes everything that's come in since the last delta. Here's, you know, 4x200samples of audio, 2 frames of video, and here's all the mems sensor data during the delta, etc

        Then you just work on getting the delta as small as possible, I assumed 5ms for audio, e.g.

    • Glyptodon 1 day ago
      I'm out sure it needs to be connected to senses to be multimodal - it's not like blind people lack GI because of not seeing.
    • readthenotes1 1 day ago
      Must it understand emojis?

      I don't understand most emojis.

      But I guess I never claimed to possess NGI

    • m3kw9 1 day ago
      The way we need AGI is it needs to match us, we are evolved to operate quite optimally with the constraints and given physics in this world
    • solomonb 1 day ago
      [flagged]
  • falcor84 1 day ago
    That's a good argument. I think that maybe the present day AIs wouldn't directly lead to AGI, but perhaps could be used to bootstrap it.
  • seeknotfind 18 hours ago
    The brain (GI) is multimodal..
  • pcwelder 1 day ago
    I'm sorry but AGI is one of those loaded words which would lose substance with just a few rounds of the rationalist's taboo.

    If it just means human level intelligence, then world modeling isn't needed as argued. Simply because we don't have correct world modeling either.

    Airplanes were invented without simulating navier stokes equation. It took approximation, experimentations and failures.

    Regardless of the meaning of AGI, we don't need correct models, because there can't be one, we just need useful models.

  • empath75 1 day ago
    I think the article is in the general category of articles suggesting that planes would work better if they flapped their wings.

    AI's "think" like planes "fly" and submarines "swim".

    Does it matter if a plane experiences flight the way an eagle does if it still gets you from LA to New York in a few hours?

    • lucisferre 1 day ago
      Much of the discussion of AI flirts with science fiction more than fact.

      Let's start with the fact that AGI is not a well defined or agreed upon term of reference.

      • empath75 1 day ago
        100% agreed. I think, in fact, that "intelligence" itself is a near-meaningless term, let alone AGI.

        The evidence for this is that nobody can agree on what actually requires intelligence, other than there is seemingly broad belief among people that if a computer can do it, then it doesn't.

        If you can't point at some activity and say: "There, this absolutely requires intelligence, let there be zero doubt that this entity possesses it", then it's not measurable and probably doesn't exist.

    • catlifeonmars 1 day ago
      I think the analogy actually works better the other way. LLMs “think” the way humans speak. This is closer to having a machine that worked by flapping its wings when a more efficient machine would use fixed wings and a jet engine.

      Language is an extremely roundabout way to understanding.

      • Dylan16807 1 day ago
        At least the roundabout flappy machine gets off the ground in your analogy. The other options we have are big logic chains that don't work and neural simulations that would require all the computers in the world.
        • catlifeonmars 1 day ago
          Yeah definitely. We have to start somewhere and what we have works. I just think that language based models are a v0
      • lostmsu 20 hours ago
        LLMs don't think the way humans speak. LLMs process sequences of high-dimensional vectors.
        • catlifeonmars 6 hours ago
          Yeah that was very hand wavy on my part. What I meant to say is that LLMs encode the relationships between words. The idea being that the relationship between words is a good enough representation of the relationship between the things that the words represent.

          I am conjecturing

          1. that solely relying on written artifacts by produced by humans has some upper bound on the amount of knowledge that can be represented.

          2. that language is an inefficient representation of human knowledge. It’s redundant and contains inaccuracies. Using written artifacts is not the shortest path to learning.

          For example, take mathematics. It’s not sufficient to read a ton of math literature to effectively learn math. There’s a component of discovery that comes from e.g attempting to write a proof that can’t be replaced by reading all of the proofs that already exist.

          Anyway I would take all this with a giant grain of salt.

    • staticman2 1 day ago
      I feel the term AGI is meaningless but if I'm going to strongman the article.

      If your claim is, "AGI's "think" like planes "fly" and submarines "swim".

      You only get to make that claim with confidence if you've invented an AGI.

    • emp17344 1 day ago
      Except AI doesn’t do anything better than the human mind, and doesn’t have any use cases beyond what humans can do.
      • empath75 1 day ago
        > Except AI doesn’t do anything better than the human mind,

        There are all kinds of tasks that AI's are better at than most people.

  • 3cats-in-a-coat 20 hours ago
    Saying "is not" implies the author has AGI. If they do, they wouldn't be posting this blog post but the AGI. If they don't, they speaking authoritatively and conclusively like that is just a cheap front for someone's completely uninformed and unsupported, but highly certain opinion. There's an infinite supply of those.
  • charcircuit 1 day ago
    A multimodal AGI will be more useful than one that isn't. People want AI to work with and have it understand audio, images, videos, etc.
    • treyd 1 day ago
      You didn't read the article. The thesis is that current "merely" multimodal approaches which project distinct kinds of inputs into the same latent space are insufficient for building a general world model that can be used for general internal reasoning. An example of this is this "Rs in strawberry" question, which requires them be trained on that information explicitly, since they don't have an experience of the characters in a word. It's an artifact of how LLMs don't learn how humans learn, which is by interacting with the world, instead of predicting text.

      More elaborately, they don't have an natural understanding of pragmatics. Transformers are best at modelling syntax, and their semantic understanding seems to be through rote memorization and "manipulating symbols" rather than building general world models.

      • charcircuit 1 day ago
        I did read it and even with their idea of focusing on a world model an AGI that can alsp operate on audio, images, and videos, being multimodal, will be more useful than one that operates purely on text.
        • snapcaster 1 day ago
          I'm skeptical you read it because he doesn't make that argument. In fact i've literally never heard someone argue text-only is more useful than multimodal
  • dboreham 1 day ago
    Human brains have environment sensors, they receive training data from other human brains, and they develop a theory that their continued existence depends on avoiding various negative situations. It's conceivable that AGI could depend on having a similar training environment. Which would mean John Searle was kind of right.
  • paulddraper 1 day ago
    Missed opportunity to quote Einstein:

    "The words of the language, as they are written or spoken, do not seem to play any role in my mechanism of thought." [1]

    [1] A Mathematician's Mind, Testimonial for An Essay on the Psychology of Invention in the Mathematical Field by Jacques S. Hadamard, Princeton University Press, 1945

    • lostmsu 20 hours ago
      To be clear LLMs also don't think in words (or tokens). That's not even a guess, the "seem" is not needed for LLMs.
  • cynicalpeace 1 day ago
    You 100% need the physical world to be a "general" intelligence.

    If an intelligence doesn't work well in physical environments it is, by definition, not "general"

  • macinjosh 1 day ago
    To me, the funniest part of the AGI debate is that humans don't even think other humans are intelligent and we're over here arguing over whether our fancy slabs of highly refined sand is intelligent.
  • SubiculumCode 1 day ago
    Embodiment can mean a physical body, but I'd argue that embodiment, as a construct/concept, is not so much about physicality, but as being situated in an environment that you can perceive then act upon during learning. Car simulations for driver-less AI training is embodied, where it learns by perceiving and acting on the environment. However, I'd argue that allowing an AI to interact in an entirely digital office environment is also "embodied" as long as it can receive information from the digital workplace and act on the information in the digital workplace (do office work). So to me, it is less about embodiment as a principal, but on the richness of the environment of that embodiment.

    We have long known that experimental animals raised in impoverished, unchanging, bare, environments (say in a cage) leads to animals with inferior problem solving capacity than those with enriched environments (things to climb on), even outside of social manipulations (alone versus multi-animal stalls). This is also true in humans, although I won't review the literature on the subject. I've also heard people saying similar things about the difference between house plants and outdoor plants, lol [1].

    So, for me, the argument for (physical) embodiment being key to cognition and AI can I think, be misconstrued. As a developmental psychologist whose pHD work focused on memory development, I tend to think of all this as encompassing:

    1. Environmental richness: complexity of information and interactions. 2. Capacity to perceive and to effect change[2] and to observe and integrate consequences. 3. Scaffolding [3]..i.e. temporary support structure provided by a more knowledgeable person (like a teacher or parent) who adjusts their assistance based on the learner's current abilities, gradually reducing help as competence grows.(think curriculum learning, shaped rewards in ML maybe).

    So the question is not about physicality for me, but whether these training environment(s) meet and learning capacities meet these criteria.

    Relatedly, the model must have these capacities:

    1. Semantic Memory. i.e. knowledge. Learning leads to changes in weights to that knowledge can be recalled, but doe snot necessarily encode where that knowledge was learned (implicit). 2. Autobiographical Episodic Memory. i.e. One-shot learning that encodes a conception of self (a spacial "I" token?), along with events (snapshots of the multimodal contents of experience (thoughts, perceptions, invoked schemas, evoked semantic information), into a set of flexibly linked representations). 3. Central Executive: A circuit that guides learning and recall via strategic, goal-directed means, and to make attributions about what is recalled (yeah that memory is vivid, its probably true, or ooh, that memory is really vague, it could be wrong, or reality monitoring: "Am I remembering taking out the trash, or remembering thinking about taking out the trash."

    Semantic memory allows someone to say, "All birds have feathers", while the latter allows them to recollect, "I remember the first time I plucked a chicken in Kentucky, just outside that musty coal mine of grand-dad's." The Central-Executive can guide future learning or current understanding.

    In terms of AI development:

    1. Semantic Memory is solves: LLMs have extraordinary semantic memory, in my opinion. 2. Autobiographical Episodic Memory: There are some models that do one-shot learning, but I've never seen them paired with [1] in a dual system approach. ...but I am not an expert in AI, I could easily be wrong. 3. Central Executive kind of component (I predict) would be less important in the early half of model training, but more important in later training. I suppose we already kind of see this with RL tuning on reasoning on a base LLM (semantic model).

    [1] https://www.theparisreview.org/blog/2019/09/26/the-intellige... [2] https://xkcd.com/326/ [3] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=scaf...

  • ivape 1 day ago
    This gives very little credit to how the human mind is able to draw parallels and insights from seemingly unrelated perceptions. Newton observing an apple falling from a tree allowed cross-thinking. Watching someone juggle can help you understand a queue. Understanding a queue can help you understand juggling. Your typical soap opera can be distilled down to office dynamics. Scale is going to obliterate specialization in this regard.
    • kaangiray26 1 day ago
      everything in life is a metaphor, analogous to something else...
      • ivape 1 day ago
        Isomorphisms abound.
  • Zoethink 1 day ago
    [dead]
  • curtisszmania 1 day ago
    [dead]
  • nexttkhere 1 day ago
    [flagged]