Gemma 4 12B: A unified, encoder-free multimodal model

(blog.google)

932 points | by rvz 22 hours ago

57 comments

senko 20 hours ago
I ran the Q4 quant (used with llama.cpp) though my "minesweeper" vibe-coding benchmark: https://senko.net/vibecode-bench/2026/minesweeper-gamma-4-12...
The result is decent, but it had a few bizzare/trivial syntax errors I had to fix manually: it would do an extra closing bracket or paren a few times, and wanted to separate function definitions with comma. Not sure what that was about, but otherwise the output run just fine.
So, with those qualifiers, I think it's a decent local coding model. It roughly compares with GPT-4.1 (!!), released 14 months ago, on the output: https://senko.net/vibecode-bench/2025/minesweeper-gpt-4.1.ht... (actually I'd call it better, but those syntax errors...)
I ran the quantized version (4-bit GGUF) on my consumer-grade card with 12G of VRAM and got 5t/s for output. Not for interactive use for coding, but fairly capable model.
To me, it's fascinating how much progress we got in over a year. GPT-4.1 was considered an extremely capable coding model. Now we got something with 12B of params performing roughly the same (in this specific benchmark, disclaimers, etc).
Lists of various models I tested: https://senko.net/vibecode-bench/
[-]
- 0xbadcafebee 17 hours ago
  It was almost certainly not trained for coding, as it's got both audio and vision input, is only 12B, and nowhere in the announcement is coding mentioned. It will likely not have good performance on coding in general, compared to other small models like Qwen 3.6 35B A3B, Gemma 4 26B A4B, Nvidia Nemotron 3 Nano 30B-A3B, gpt-oss-20b.
  For 16GB laptops, Qwen 3.5 9B is the undisputed champ.
  Gemma 4 31B is the top dog at small model coding, but is dense so it needs ~48GB unified RAM for full context. If you want decent coding on a laptop you need a lot of RAM. But this shouldn't be surprising, dev machines have always needed lots of resources.
  [-]
  - dirkg 9 hours ago
    > For 16GB laptops, Qwen 3.5 9B is the undisputed champ.
    you can run qwen 3.6 35BA3B on a 12-16GB vram gpu and ot works pretty well.
    https://www.youtube.com/watch?v=8F_5pdcD3HY&t=1s
    even the 27B in some quants can fit.
    https://www.reddit.com/r/LocalLLaMA/comments/1tkmgwj/qwen27b...
    qwen IMO is far better for coding, esp agentic coding when combined with something like Pi, it comes probably close enough to Sonnet for a lot of use cases.
    Gemma family is better for almost all other tasks you'd use a local llm for.
    [-]
    - dofm 16 minutes ago
      The larger Gemma models are quite good at PHP. I would not be surprised if that was a training objective — it's one of the more consumer-focussed programming languages. They have very good knowledge of wordpress hooks.
    - ricardobayes 1 hour ago
      You can run it, however those low quantized models (iQ2, iQ4, Q2) will very likely underperform the 9B versions at Q6/Q8.
  - dotancohen 15 hours ago
```
  > For 16GB laptops, Qwen 3.5 9B is the undisputed champ.
```
    You seem like the guy to ask. For a laptop with 12GB VRAM (RTX 5070) and 32 GB system RAM, what is a good multilingual (English, Hebrew, Greek) model for conversing with personal notes in Org mode format? I don't care how long updating the model or rag takes, and even inference can be reasonably slow, but the results of the query as they relate to my personal notes are important. I don't care about general knowledge, for those questions I can use e.g. ChatGPT.
    Thanks
    [-]
    - akmarinov 9 hours ago
      Joins us over on Reddit at r/LocalLlaMA to get 10 different opinions on that
      [-]
      - dotancohen 4 hours ago
        I read there regularly. I find little value there between the memes. I was hoping to ask a knowledgeable person here.
        [-]
        alfiedotwtf 3 hours ago
        /r/localllama for a while now seems to prefer Gemma 4 E4B for creative writing (especially the uncensored GGUFs).
    - nl 3 hours ago
      Qwen 3.5 35B A3
      Qwen models are always good. The 35B A3 model is a MoE model which means it has higher performance in RAM constrained environments compared to the 27B dense model (which is better at coding).
      I don't have experience to rate it's Hebrew or Greek performance but apparently it's not bad.
    - sourcecodeplz 12 hours ago
      Any Gemma 4 model, they are great at translations, multilingual
      [-]
      - silversmith 9 hours ago
        For the biggest languages, Spanish, French, maybe.
        For smaller ones like my native Latvian, the output could be confused for good translation from across the room, the words do look like Latvian words. But the quality is Google translate circa 20 years ago, tops.
        It could probably do a decent enough translation to English, if all you need is to get the gist of text. But for smaller European language outputs, nothing comes close to Gemini.
      - dotancohen 6 hours ago
        While Gemini 4 seems fine, Gemma 4 does not do Hebrew well. I've replaced it with Aya Expanse and am getting much better results, but there is still much improvement to be had.
        I'm not doing translations, rather querying Hebrew text with a Hebrew prompt.
    - emmelaich 11 hours ago
      You may like https://www.llmfit.org/
      (not recommendation, I've not used it .. yet)
      [-]
      - hypfer 6 hours ago
        Just tried it and honestly it's a terrible experience lacking any sort of intent or reason.
        Which is unsurprising in the AI space.
        You get a wall of text showing you various random fine-tuned models by random people, and that is basically it.
        Actual sane default requirements like "just give me the normal AI labs", "please filter for dense only" and "I want this exact context size at this quant" are not part of the tool, apparently. Neither is "compare these quants for me for the same model".
        Or maybe it's just hidden enough that I did not find them before I've stopped caring.
        Conway's law is at it again.
        ____
        Edit:
        I have since then had qwen3.6 ponder the codebase and think about my complaints.
        Seems to require a major data model overhaul to actually fix those, so they're legit. Which I didn't doubt, but nice to have some extra fabricated confirmation after it initially refused and said "nooooo the readme says otherwise nooo hypfer is just a hater noo"
        ___
        Edit 2:
        It gets worse the longer I stare at it. This could've been a web calculator.
        [-]
        hypfer 50 minutes ago
        Done:
        https://github.com/Hypfer/will-it-fit-llama-cpp
        https://hypfer.github.io/will-it-fit-llama-cpp/
        dofm 12 minutes ago
        I have found these things to be fully exasperating, to be honest, even though I am seeking information about a pretty "known" machine — a 64GB M1 Max MBP.
        (Honestly I think Apple's "AI push" could do worse than just focus on a curated model library, a couple of Apple-standard Gemini distillations, an OS-level model manager and some sort of tweak of their containers system to do what Docker's sbx does. They could demystify a lot of this shit.)
        hparadiz 4 hours ago
        We need benchmarks by engine, cli switch sets, and device with filters by cpu, gpu, and type. And if someone could please aggregate that in a way where people can upload results and just automatically see the best of any model for their device that would be a killer app.
        [-]
        alfiedotwtf 3 hours ago
        I've wanted to vibe code a tuning app, that pumps data through your CPU-GPU-RAM to try and determine the best parameters for each model, but I think it's just too much work compared to manually running by hand a one-liner and changing things here and there.
    - tacomagick 9 hours ago
      Gemma 4 26A4B
  - dofm 17 minutes ago
    It does appear to have training for javascript and PHP, from what I can see, and pretty solid knowledge of wordpress and woocommerce. I would guess it has beginner-friendly knowledge of Python, too?
    (Though it is gaslighting me about PHP anonymous functions.)
    I would not use it to write code (the MoE 26B writes really good PHP), but it appears to have absolutely good enough knowledge to write implementation plans, and I think that could be useful in a sort of agentic coding tutorial environment.
    I test these models with simple things. My favourite mini test is asking an AI to write a "last login" tracker facility for wordpress with a sortable admin column, which is trivial code — only a few lines -- but touches on a reasonably deep bit of the WP API. If you ask it to prompt you with clarifying questions, those questions are quite revealing.
    It can write the code. Not tested it but I am sure it works. It's not as elegant.
    It is not as good at understanding nuanced instructions as either the 26B or the sparse Qwen 3.6. There are concise things you can say in a prompt to Qwen 3.6 that have it draw logical conclusions that fully impress me.
    I am more impressed by it than I expected. I reckon this would be quite useful in a tutorial tool.
    (I say this as a sort of qualified cynic; I think much of the AI circus is a farce. But if these things are to ever be useful for teaching without making people dependent on some cloud "intelligence tap", this is progress)
  - ricardobayes 1 hour ago
    Qwen 3.5 9B is great for coding, but somehow, based on a few hours of subjetive tests, the Gemma 4 12B seems even better.
  - kajecounterhack 16 hours ago
    Have you found Gemma 4 31B better than Qwen 3.6 27B Q8? I just started using Qwen + Pi agent and it's great, but "which model works best" is still totally crowdsourced and I was going off of peoples' opinions on reddit. Would love to hear more opinions if people have them.
    [-]
    - embedding-shape 16 hours ago
      > Have you found Gemma 4 31B better than Qwen 3.6 27B Q8?
      Which quant of Gemma? For coding Qwen seems to be pretty far ahead, but generally Gemma seems to have a "vaster" set of knowledge, but armed with a search tool it doesn't really matter, and Qwen 3.6 been really great for all sorts of tool calling. I mostly do programming and related things though, fwiw.
      > I was going off of peoples' opinions on reddit
      It's extremely astroturfed all over the place, especially the larger subreddits, and especially the one related to a specific animal in a specific location. It's sad, as early on it was a great resource, but now it's mostly paid posts and a race to the bottom, with lots of piling, and all the knowledgeable people I used to recognize are nowhere to be found.
      [-]
      - xenophonf 15 hours ago
        It took me way too long to realize you were referring to r/localllama.
        [-]
        MoonWalk 15 hours ago
        Why the obfuscation in the first place?
        [-]
        embedding-shape 4 hours ago
        Just a bit of flair. Also, bunch of people have "keyword watchers" setup for various terms, so when you mention certain things on HN, reddit and elsewhere, you get commentators who enter the conversation not because the context or larger conversation, but because the single term/thing they care deeply about was mentioned, and it just gets very boring to read the whole attackers/defenders comments over and over again. But ultimately I just did it like that because it was more fun to write it like that.
        zozbot234 14 hours ago
        I'm not sure that GP is correct, many people in that forum tend to hate Qwen for closing up many of their more recent models and leaving the whole local inference community 'stranded' on their older releases.
        [-]
        julianlam 10 hours ago
        Are you sure? Prior to today the sub seems to be pretty partial to Qwen.
        kajecounterhack 10 hours ago
        That was definitely not the subreddit where I got my info.
    - thangalin 16 hours ago
      Yes. I'm using Gemma-4 31B (gemma-4-31B-it-assistant.Q4_K_M.gguf) with llama.cpp to attribute quotations throughout chapters of my sci-fi novel. I started with Qwen3, but couldn't get it to work. Qwen3 TTS Voice Design, on the other hand, is incredible (Qwen3-TTS-12Hz-1.7B-VoiceDesign). I'm using both for an audiobook generator that produces a variety of voices.
      Screens:
      * https://i.ibb.co/TBBV5nJk/kl-01.png (voice design)
      * https://i.ibb.co/nNvvKDyV/kl-02.png (quotation attributions)
    - qingcharles 8 hours ago
      Gemma 4 31B is enormously impressive. You get 1000 requests/day for free on Google's API and another 1000/day off OpenRouter. Only problem is you get 503 like crazy.
  - jmpeax 12 hours ago
    > nowhere in the announcement is coding mentioned
    It's right there in the middle benchmark bar "LiveCode Bench" 72%.
  - senko 17 hours ago
    Yeah, I agree 24B-36B sizes are better in general.
    I don't have unified RAM tho and offloading to CPU is dog slow, which is why I'm interested in 7b-12b models.
  - iso1631 14 hours ago
    I find ram crazy. My thinkpad has 32G of ram, it's a t470 that's nearly a decade old
    Why do people with modern laptops have such little amounts of ram?
    [-]
    - willy_k 14 hours ago
      The ram that’s important for LLMs is gpu-accessible memory, meaning either systems with unified ram or VRAM, the latter of which is tied to the caliber of GPU one has.
    - SturgeonsLaw 5 hours ago
      Unified memory is soldered to the motherboard and needs to be ordered with the new laptop, for prices that are well above what the equivalent amount of SODIMM would cost.
      Fine if work's paying, but for personal devices (that might have been purchased before local models got good), people have what they have.
    - doubled112 14 hours ago
      My job still issues 16GB laptops as standard. You need a business reason to get more. This has been going on since before the price hikes.
      I’m a system administrator and I can do my job with no issues at 16GB. Most days 8GB would likely be enough, since I’m just using and abusing other systems anyway.
      Java devs at my last job were still running 16GB in 2020. Admittedly that was a while ago. Still not a decade.
      Close some Chrome tabs?
    - alfiedotwtf 3 hours ago
      8Gb was the standard for a long time (before Apple went Silicon), because from what I understood, is that SDRAM needs to contantly power cycle the memory bus otherwise the bits will fade, and so by having more RAM, your battery would last a little less... this was around the time when 3 hours charge was unheard of, so every little bit helped.
      Probably doesn't matter these days with all-day batterys, but now the demand-supply curve is lopsided.
- borissk 11 minutes ago
  We are really getting close to singularity - the pace of LLM improvement is constantly accelerating.
- zigzag312 18 hours ago
  > It roughly compares with GPT-4.1 (!!), released 14 months ago
  I think the mayor win for coding was reasoning. That's why such a small model can match GPT-4.1 in coding, but I suspect that GPT-4.1 still wins in general world knowledge due to bigger size.
  [-]
  - mdp2021 17 hours ago
    > I suspect ... still wins in general world knowledge due to bigger size
    Encyclopedic knowledge matters relatively little in perspective, given the expectable future developments: even the more knowledgeable of us will use that knowledge for reasoning and intuition (and we will have absorbed the intellectual keys during our training), but under our professional hat we should in theory be ready to go "I stand corrected" and "more precisely" with the actual data at hand.
    I.e.: for the encyclopedic knowledge needed, the /understander/ will have a RAG subsystem and a corpus of knowledge to inquire upon processing queries.
    (Corroboration: we can't delirate, and neither can the machine...)
    [-]
    - bitexploder 16 hours ago
      Don't LLMs work on attention though? The closer in their hyperdimensional space you can land your problem to their inherent understand the better they are at understanding your problem domain. RAG loops can be very slow and agents may simply lack the knowledge to use them correctly.
    - pu_pe 3 hours ago
      I agree with you in general, but depending on the task I also find that a certain level of encyclopedic knowledge can be very valuable. For example, if you use it for coding, the model will likely not resort to search or RAGs when deciding whether to use a particular package or stack.
    - coldcity_again 17 hours ago
      A great position to take. Strong opinions, weakly held.
- superkuh 17 hours ago
  >consumer-grade card with 12G of VRAM and got 5t/s
  That speed for token output indicates to me that it somehow is using hybrid mode and involving cpu+system ram somehow. That ~5tk/s is about the ram bandwidth of DDR4 RAM versus that size model at 4bit. Any consumer GPU with 12 GB like a nvidia rtx 2080 or rtx 3060 should be doing 20+ tk/s with llama.cpp and CUDA backend.
  [-]
  - senko 17 hours ago
    Good catch. I haven't looked deeply into it. This is with Vulkan backend on Linux which I understand should be roughly comparable to CUDA? Gfx is rtx 3060(ti?).
    I should play a bit more with llama.cpp options and see what bappened there. Thanks!
    [-]
    - superkuh 14 hours ago
      I've had it happen in the past with llama.cpp on linux that the CPU will present itself as a vulkan device GPU1 with "PHYSICAL_DEVICE_TYPE_CPU" and had a mix-up. Might want to try llama-server --list-devices and then append --device Vulkan0 or whatever.
- frikk 19 hours ago
  Thank you for sharing this. Do you think the syntactical issues could be addressed with fine tuning or some other kind of parameter tweaking? That's frustrating hah.
  [-]
  - profunctor 19 hours ago
    With a harness you could feed the code to a linter and if there are errors feed that to a model automatically. It’s amazing that the models are good enough that I haven’t bothered doing this
- pseudosavant 16 hours ago
  Models this small and this capable bode really well for the usefulness of a PC like the RTX Spark that Nvidia/Microsoft announced this week. 128GB of unified memory will likely be more than sufficient for effective local agentic coding, even if SOTA cloud models will still be even better.
  Up until this point, I've found the cost/value to unequivocally favor using a cloud subscription, but I would be lying if I didn't worry that one day OpenAI is going to increase the price for my subscription by 5-10x. I rely on these tools enough that if there is a real viable local option, I'm going to take it.
  [-]
  - pseudollm 14 hours ago
    > usefulness of the RTX Spark
    Not really. There's a reason the announcement didn't include ANY benchmark (!) and didn't mention EXACTLY what is the memory bandwidth. It's going to be dog-slow unusable for large models, as tok/sec is basically bandwidth divided by active weights. Rumoured 300GB/s / 30GB active weights (decent model) = 10 tokens per second, which is really slow
    [-]
    - SwellJoe 14 hours ago
      Yep, I have a Strix Halo and while it can run models bigger than Qwen 3.6 27b, it's not usable interactively when you do. ds4 patched for ROCm works, but at such a slow speed, it's not usable for coding agents.
      The Nvidia boxes have only slightly more memory bandwidth, so I wouldn't expect them to be notably faster. At least not enough to make it useful interactively at that scale.
      [-]
      - zozbot234 14 hours ago
        Why does everyone expect interactivity from local AI? It's not the best use of the hardware, especially not miniPC hardware. Long-term batched inference with larger and more capable models is much more feasible AIUI.
        [-]
        int_19h 10 hours ago
        I can't speak for others but IMO the only reason to run models locally right now is privacy - i.e. you don't trust any of the cloud providers to not look at your prompts. Price-wise the market is extremely competitive and cheap model serving favors large scale so anything that can be run locally can be run cheaper in the cloud. But if privacy is important, then it's important for everything, including traditional chatbot applications, which kinda do require interactivity.
        SwellJoe 14 hours ago
        Even batched it's uncomfortably slow. I started to benchmark ds4 with my security vulnerability benchmark (after Qwen 3.6 dense and MoE and a bunch of cloud models), but it was going to tie up the Strix Halo for more than a day, so I decided not to run it as it would prevent me from doing other stuff with it during that time.
        Even batched usage needs to be fast enough to deliver results in a reasonable time. Overnight runs are useful, 24 hour runs are...less so.
        Anyway, most of the time people are talking about interactive use, and there's currently an upper bound on how large a model can be for local hosting on a reasonable budget (i.e. not a crazy amount more expensive than what a high end developer desktop or laptop costs). The sweet spot is probably currently the big Qwen 3.6 or Gemma 4 models, which are in the ~60GB range for 8-bit quantization plus a large context.
        [-]
        hedgehog 13 hours ago
        The 6-bit versions + 8-bit KV cache seems to save a good bit of memory without a significant loss of quality. The Qwen 35B is pretty fast in my testing, but MiniMax M2.7 230B is in some ways faster (way fewer tokens to arrive at an answer) even though it is much larger.
        [-]
        SwellJoe 12 hours ago
        Qwen 3.6 35B-A3B with MTP at 8 bits is blazing fast, something like 50-60 tokens per second. That's plenty fast for interactive use, so I haven't tried lower bits. Unfortunately the MoE is notably dumber than the dense model (for the case I have data about...I've been benchmarking models for security vulnerability scanning, and 27B is notably better on hard bugs).
        The dense model is almost usable, but feels really sluggish, even with MTP. I think it's about 12-15 tokens/second on the Strix Halo. Slow enough to where I'd rather pay to use a cloud model.
        I might try the 6-bit version of the dense model to see how it behaves, though. Maybe it'll retain its bug hunting abilities while making it fast enough to use interactively and not take all day for benchmark runs.
        [-]
        hedgehog 11 hours ago
        Same chip, with a 6 bit 35B and 8 bit KV cache I see about 500 prefill and 55 decode at 30k into the context window. MiniMax seemed a bit lower token rate but much, much less prone to 40k tokens of monologue before generating an answer. A pattern I like is to use a smaller model to do most execution and then a larger model to review transcripts and output and do any fixups and tooling improvements (this is all batch jobs so all I care about is overall throughput).
        milch 9 hours ago
        What hardware do you need to run MiniMax M2.7 230B locally?
        [-]
        hedgehog 6 hours ago
        Ryzen 395 is what I'm using, anything with 128GB+ of RAM accessible to the GPU should work fine for a 4 bit version of the model (so Spark or Mac Studio should be ok too).
  - dirkg 9 hours ago
    The RTX/DGX Spark, Mac Ultras with 128GB unified ram are all ~$5k. Its still an expensive toy for rich people, it might as well be an H100 for 99.9% of the population (not devs with high paying jobs, of course).
    the value of local models is allowing normal people to access AI without needing to subscribe to cloud services. this is esp imp for the rest of the world where even a 12GB gpu is extremely expensive.
    there is no real viable local option that will come even close to Sonnet/Gemini Flash or the cheaper chinese models. Even if your pc costs <$2k you are never going to recoup the hw costs, and the results will be far worse.
    [-]
    - green7ea 1 hour ago
      I'm using a Strix Halo laptop (~3k, 64GiB) and with Gemma 4 and Qwen 3.6, both at 8 bits, I'm seeing very impressive results.
      As a work tool, this is reasonably priced. You can save a bit of money by opting for a non-laptop form factor.
    - organsnyder 39 minutes ago
      My Framework Desktop with 128GB was about half that. I did luck out by buying before RAM prices went crazy, though.
      I'm looking forward to the fallout when the data center bubble bursts. There's a good possibility we'll see a glut of hardware, either on the used market or from manufacturers that no longer have massive orders from OpenAI and the like.
  - zozbot234 14 hours ago
    RTX Spark is pretty much the DGX Spark in a laptop form factor, plus some lower-performing chips in the same series to be released later according to rumors. We know quite well how the top-of-the-line chip performs: it's very interesting for some application areas, less so for others.
- McGlockenshire 17 hours ago
  > my consumer-grade card with 12G of VRAM and got 5t/s for output
  Thank you for giving me hope!
- DeathArrow 6 hours ago
  >The result is decent, but it had a few bizzare/trivial syntax errors I had to fix manually
  Can you instruct it to use a lsp?
minimaxir 22 hours ago
The big story here is the encoder-free part, which I still don't fully understand.
> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.
That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates, it's still a 35M layer which I am curious is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...
> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.
[-]
- georgehm 21 hours ago
  Embedded within that developer page is a good explainer of the encoder free architecture . https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
  [-]
  - amelius 6 hours ago
    I skimmed it, but I still wonder why (1) we still need a tokenizer for text, and (2) why the other modalities (audio/video) don't need one.
    [-]
    - sigmoid10 1 hour ago
      How do you think the other modalities are fed into the attention layers? The other modalities are tokenized as well, that's literally what these separate image/audio encoders created as output before feeding it into the main network. Tokenization is at its core just a tradeoff between sequence length and embedding size, so it will probably stay relevant as long as attention layers scale quadratically with sequence length.
  - asim 19 hours ago
    That's a great explainer, thanks for sharing it.
- spott 21 hours ago
  This is just early fusion basically.
  FAIR did this 2 years ago now: https://arxiv.org/abs/2405.09818
  I've been waiting for something like this to be released since then.
  The annoying thing is that chameleon was multi-modal out based on the same principles, but this model is just inputs... (I'm curious how they did pre-training without having multi-modal outputs as well. I wonder if they just chopped them off rather than support image output).
  [-]
  - santiagobasulto 19 hours ago
    I don't think it's the same. It's a similar concept, but Gemma is using just a linear projection, which I assume is a lot faster. The developer guide has more details: https://developers.googleblog.com/gemma-4-12b-the-developer-...
```
    Vision embedder (35M parameters): Replaces the 27 vision transformer layers of the other medium-sized Gemma 4 models. Raw 48x48 pixel patches are projected to the LLM hidden dimension with a single matmul. A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input
```
    the "single matmul" is the key here, I haven't tried it, but it's probably pretty fast and memory efficient.
  - ahmadyan 17 hours ago
    Some of the FAIR people moved to Thinky, and they also started doing encoder-free MM-LLMs. Now Google. This seems to becoming a trend working at small scale, but the difficult part is scaling.
    Standard approach for training MM-LLMs is we train the encoder first, there are O(2-10B) good images on the internet, so encoder needs to see each image O(10-100) times, that is O(100T) tokens, which is more than the entire pre-training budget for most runs. That is the reason we train the encoder separately (smaller model, 2B active vs 30B or 200B active LLM); there is nothing magical about training the encoder and LLM together, it is just more token-efficient to train the image modality first.
- dofm 20 hours ago
  I would contend that the actual big story is the gallery app:
  https://developers.google.com/edge/gallery
  Anyone with a 16GB Mac — that is quite a lot of journalists, surely — can download that, install a model into it, and play.
  Surely journalists have to start asking questions at least about OpenAI's consumer revenue projections now.
  I am a major, major AI cynic, but I decided to be an informed cynic so I've been playing with local models for agentic work and a bit of CAD-to-image generation. I really quite like the 26B Gemma model — I've been using it to teach myself some fundamental things and learn OpenCode without developing a cloud dependency. It writes fairly good code and it is helping me learn the things I want to learn at a pace that I prefer.
  But if this 12B model is even half as close as they say it is, this casts some doubt on the consumer end of the cloud business model, at least in the short term.
  (Not clear if this app is using the MTP drafters; I've still not got them working with Gemma myself, though the Qwen 3.6 built-in MTP support is super in LM Studio)
  [-]
  - minimaxir 19 hours ago
    I had discounted Edge Gallery because it didn't support system prompts, but now it does so I will give it another go. I believe the implementation does use MTP since I got an update to Gemma-4-E4B on iOS indicating such, and on macOS it's very speedy.
    However, on my 18GB RAM MacBook Pro, selecting Gemma-4-12B-it results in this error:
    > The model "Gemma-4-12B-it' requires more memory (RAM) than is available on your device.
    So yeah, my questions about the 16GB marketing copy are fair.
    [-]
    - dofm 19 hours ago
      Interesting; they may have fluffed up somewhere then.
      (Though perhaps it'll squeeze in with a small context window? Not sure I understand that aspect yet)
      It does seem to use MTP, yes, and it is quite quick — seemingly the underlying LiteRT stuff can do MTP with Gemma 4 and presumably MTP is a big part of the practicality picture here.
      The system prompt thing was a surprise when I poked around.
  - sureglymop 14 hours ago
    Is the story that it's now also available outside of android? I've had this app on my phone for I believe about a year.
    [-]
    - dofm 3 hours ago
      It has certainly not been well-publicised that it is available on Mac and iOS but you are right, likely I just missed this news.
      The combination of these things, though, I still think is significant. It’s a product from an old-fashioned (!) FAANG that installs as easily as Chrome, downloads a model as easily as it could be, combines a chat interface with audio and video analysis/transcription, has a customisable system prompt, MTP, agent skills support etc.
      Now, it is from Google so they could kill it when they get bored! But clearly this is local AI packaged in a really accessible format, and the model seems quite capable for its size. It is something Microsoft could do when they can really point to easy consumer hardware that can do it well. It’s certainly something Apple could do better with their distillations of Gemini under the Google deal.
      I think a sane line of enquiry for a tech journalist is: 1) doesn’t this threaten the appeal of consumer-tier subscriptions to ChatGPT (which is a big part of OpenAI’s revenue plans), and 2) is it therefore not questionable that the buy-and-hold economics of DRAM, SSD and GPU products that OpenAI benefits from having provoked into causing ridiculous price increases is fundamentally anti-consumer?
- jszymborski 22 hours ago
  Totally agree that it is "encoding" in the general sense, but I think they are referring to the lack of an "encoder" neural network.
  [-]
  - minimaxir 22 hours ago
    In hindsight I may have been pedantic.
    [-]
    - wilkystyle 21 hours ago
      I had a similar thought to you, and found your question and the resulting discussion helpful!
    - santiagobasulto 19 hours ago
      Not at all, I had the same feeling as yours the first time I read it. I think the key is that the "encoder" they're using is just a linear projection, which is probably pretty fast and memory efficient. A single matmul vs a ViT encoder is probably a huge win.
    - alberto467 21 hours ago
      Not at all. Getting really pedantic, tokenization is also a form of encoding, so it doesn't matter the modality you're using, you'll end up doing some type of encoding in some way.
      [-]
      - altruios 20 hours ago
        Tokens are such a strange base unit. Couldn't we do something that naturally conforms better to reality than such choppy units that cause all sorts of artifacts? making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language, but its far more like multiple waves of data coalescing into an 'idea', internal... subjectively (n=1) at least. I think wave/signal based transformers are the next jump.
        After that a s1/s2 system: fast generation, slow wave correction / observation operating over the fast generation seems like the next leap forward.
        Tokens create and hide too many problems to be the 'optimal' solution.
        [-]
        selectodude 18 hours ago
        Not to be too snarky but there’s a few trillion dollars and some of the brightest minds of our generation working on this. I’m sure there’s a reason why they’ve settled for or are stuck on tokenization.
        [-]
        andai 17 hours ago
        Yeah, I'm sure we ended up with JavaScript for great reasons too.
        TeMPOraL 16 hours ago
        > making everything 'language based' prevents true multi-modality. Thinking isn't done in language. Thinking outputs language
        Your problem isn't with tokens, but with "language". Tokens have little to do with language, other than usually being consumed in sequence, but that's true of anything that has to span over time. Thinking of tokens as letters or subwords is mistaking the general with the specific. We may have started with letters and words and subwords (trying to find the best balance for training), but then people figured why not add pixel patches to the dictionary, and then sounds, and then other signals, and after iterating on it a bit, we now have image and sound and symbol sequence data all being part of the same token space.
        LLMs stopped being about "language" - in the sense of English, or C++ - long, long time ago. We're still using tokens, but they're more like quanta of sensory input now.
        You can take it in two directions, I guess - either consider "Large Language Model" to be an anachronym, a name that couldn't keep up with times, but we got used to it back when it made sense, or alternatively, just broaden your understanding of "language" to encompass any pattern of quantized sensory inputs, regardless of modality :).
        (Given how we know humans can communicate with pictures, gestures, body language, noises, movement, actions, or even gaze, and that when it becomes common enough, such systems develop their own pattern structure - dare I say vocabulary and grammar - and that none of it requires or usually involves going through a "normal language" intermediary - I'd lean towards the second direction :)).
        --
        ETA: also wrt. "thinking with tokens", LLMs don't really think in tokens. You may have heard that phrase, that may have been coined by Karpathy, that "for LLMs, tokens are units of thinking". It's a useful shorthand to remind people that prompting models to be terse and skip prose is effectively dumbing them down, but it's also a bit misleading.
        A better analogy is that tokens act like clock signals: each consumed token causes certain amount of computation happen in the network, much like a single clock signal in digital electronics, or turning a crank one revolution in a mechanical contraption. This makes tokens "units of thinking" in the sense that processing N tokens causes M amount of computation to happen. Now, for whatever problem you're solving, there is a minimum amount X of computation that is required to solve in correctly, and it's mathematically impossible to do with less. So if you ask an LLM to solve it, it needs to process at least as many tokens as it takes for M = X. If you force the model to be so terse that it makes M < X, you literally make it impossible to succeed. In practice, you need M >> X.
        [-]
        altruios 16 hours ago
        Can you elaborate more on what a token looks like as a pixel patch/sound/general signal as it currently is (in this model)?
        My understanding of pixel representation is: slice a grid in an image, each square slice gets projected into a number array of x long (not sure how long x is, or if it's variable), which then gets projected down to a token representing that space (3-4 long as alpha-numeric) and AGAIN gets passed into "position detector" which outputs a token representing that pixel/position. which gets passed into the lmm (at a significantly reduced/translated signal into token space).
        First, before continuing: do I have that mostly correct?
        [-]
        yorwba 4 hours ago
        > number array of x long (not sure how long x is, or if it's variable), which then gets projected down to a token representing that space (3-4 long as alpha-numeric)
        There is no such projection step. The array of x numbers is the token. For text, there is a one-to-one correspondence between the textual representation of a token, its index in the vocabulary of the model, and the array of x numbers that is fed into the linear algebra of the model, so people often equivocate between them; but for images or sound, there is no discrete vocabulary and no textual representation, only the array of x numbers.
        refulgentis 17 hours ago
        This sounds like when crystal people talk quantum physics.
        [-]
        CamperBob2 17 hours ago
        I agree with the GP. The idea that there's not a better intermediate representation between tokens and embedding vectors seems absurd. But how to arrive at such a representation and implement it effectively is a few zeroes above my pay grade.
        [-]
        refulgentis 17 hours ago
        I find your agreement seductive because it side steps the unfounded assertions and simply asserts there must be something different and we don’t know it, which is easy for me to agree with too. Or maybe hard to disagree with.
    - cortesoft 16 hours ago
      Being pedantic isn't a bad thing in technical discussions.
- kristjansson 22 hours ago
  > quantization
  12b means 12G @ 8 bits/param (basically lossless) and 6G at 4 b/p (generally accepted 'pretty close' level). Not too bad?
  But TBD how well the base model performs before thinking too much about quantization
  [-]
  - magicalhippo 17 hours ago
    Smaller models are less forgiving to quantization. For a 12B model I wouldn't expect Q4 to be "pretty close", unless it underwent quantization aware training (QAT). Of course it's not set in stone, there's a huge variance between models, so this might surprise.
- mchinen 21 hours ago
  The audio side is even more interesting, as it seems they totally got rid of positional embedding are just doing a single linear transform to match the LLM input dimension and that's it.
  > Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.
  [-]
  - make3 21 hours ago
    I guarantee you there's positional information one way or another. they just don't mention it because positional embeddings are extremely cheap computationally, not worth mentioning
    [-]
    - neosat 21 hours ago
      Agree. Audio has strongly temporal so there is almost certainly some positional encoding one way or another.
    - aesthesia 16 hours ago
      Audio is 1 dimensional so the usual RoPE position encoding should handle it like it does for text tokens. You only need extra position encoding for higher-dimensional stuff like images.
    - mchinen 20 hours ago
      Ah yeah, thinking further it's probably just using some positioning embedding based on sequence numbering added in the LLM layers. For vision it needs the patch location as well.
    - pseudollm 14 hours ago
      No there isn't - read the paper. It's just 40msec raw audio samples. Multiplied by one matrix to translate to 3800 input vector. That's it. The next 40 msec are fed in the next transformer input step. Without any positional encoding. Repeat ad infinitum
- matja 21 hours ago
  One side-effect, is that the separate .mmproj file (Multi-Modal Projection encoder) is no longer needed, when using the model with llama.cpp etc.
  [-]
  - lambda 19 hours ago
    It's not? There's an mmproj in the GGUFs released by ggml-org: https://huggingface.co/ggml-org/gemma-4-12B-it-GGUF/tree/mai...
    From the visual guide, there's still the 35M parameter embedder, then the linear projector, for vision, and the linear projector for audio, so it does have some parameters used for the multimodal input to project it into the LLM latent space: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
    And the Unsloth quants, which are missing this, don't support multimodal input. (edit: actually, I may have just needed to update my llama.cpp, will check with an updated llama.cpp soon)
    I'm downloading the ggml-org GGUFs now, I tried Unsloth but got some weird problems, double checking with the bf16 model to see if the issue was just the quant.
    [-]
    - lambda 13 hours ago
      Ah, Unsloth has uploaded mmproj now as well.
  - pferdone 21 hours ago
    But do I have the option to run it 'text only'?
- mips_avatar 20 hours ago
  I don't think we've bottomed out on what we can do with embedding models. They're these tiny models that absolutely rip on modern cpus with 8 bit int optimizations. Like in my app we can say pretty definitive things about hundreds of millions of places in the world on retrieval tasks on regular hardware.
- wolttam 22 hours ago
  I think the idea is that the model is seeing embeddings that map directly to underlying pixel data, rather than being fed semantically rich embeddings from an encoder model which itself had seen the raw pixel data.
- rao-v 21 hours ago
  Encoder free is huge for running on SBCs etc. often the encoding time is a significant fraction of generation time if you are using a VLM as a all purpose vision model
- goobatrooba 19 hours ago
  Either Google changed the text or you editorialised it a tiny bit - just for all others that got excited, they mean 16GB VRAM. So a premium graphics card requiring a >2500€ device is the minimum to run this.
  Still progress, but not quite democratic yet.
  Weird though that Google might be cannibalising it's own AI subscription service?
  [-]
  - LoveMortuus 16 hours ago
    I've bought a laptop for <1500€ that came with 32GB of RAM and an RTX 3080 with 16GB or VRAM. So I don't think >2500€ device is necessary, though I'm certain it would yield better and faster results.
  - thot_experiment 19 hours ago
    I haven't tried this model yet, but I can run Gemma 31B w/ the MTP drafter in pure CPU at about 10tok/s so this should run at about 20-30tok/s on a decent CPU, it'll probably run at >50tok/s on any Mac that can fit it, and lots of people have a gaming GPU with enough VRAM. In terms of access to hardware being a gate, it's one you can hop pretty easily.
    [-]
    - dofm 19 hours ago
      Could you outline how you are running the MTP drafters? I've tried LM Studio but no dice there. I'm probably missing something but I think llama.cpp and Ollama can't do it yet either?
      [-]
      - thot_experiment 17 hours ago
        I just build llama.cpp from scratch on the PR that has MTP drafters.
        https://github.com/ggml-org/llama.cpp/pull/23398
        Please don't use Ollama, it's a bad actor in the OSS community.
        [-]
        dofm 17 hours ago
        I don't have the energy to build stuff all the time, that's a rabbit-hole side tunnel I don't really want to get into. I have larger concerns in my life that are more urgent than developing that side of things.
        But I've moved on from Ollama for the time being, though I am mainly interested to see what the Gemma 4 MTP speeds are like on my M1 Max, so I may test it.
        I am quite impressed with the tools in LM Studio, which is also a beautiful app, but it is not open source (which challenges my personal strategy somewhat) and I dread its inevitable enshittification.
        Nevertheless the GUI has been very helpful while I learn, and I will probably use it until something else presents or my usage pattern settles down from experimentation to something a bit more routine.
        I will try oMLX, too, but judging by the LiteRT page I may soon be able to just use that for the larger models if I end up settling with Gemma 4.
        [-]
        thot_experiment 16 hours ago
        Totally understandable. YMMV but I found the llama.cpp build process to work on the first try on my machine, and it only takes a couple minutes, which definitely isn't my usual expectation or experience. I was very pleasantly surprised. Their web-ui is also getting very polished while still doing a great job of letting you tweak all the weird settings.
        [-]
        dofm 16 hours ago
        Sorry, I sounded a bit terse there!
        You have probably convinced me to give it a try, to be honest.
        It's just that, to cut a long story short, I am currently recovering from a level of burnout so severe that twelve months ago had me fully convinced I was actually in early-onset cognitive decline (I am a bit over fifty).
        Only a little over two months ago I was still sure I'd have to quit IT and find a slow job because I was so out of the loop; this whole industry shift even in just the last few months is so shocking and strange.
        So I have to be a bit cautious about how many indirections I add, if that makes sense. But I am compiling bigger projects than llama.cpp so I will give it a go.
        Thank you for the extra detail.
      - Patrick_Devine 18 hours ago
        I haven't yet pushed the MTP enabled gemma4 12b model for Ollama because in my testing I wasn't getting a performance bump. The other gemma4 MTP models should work OK right now, but there are some fixes we're just about to push. This is specifically for the MLX backend.
        [-]
        dofm 18 hours ago
        Thanks for your reply. I will go back and look at Ollama again.
        So much to learn but this news has really vindicated my decision to direct my limited span of concentration and focus to learning how to use open weights models and opencode.
      - ch_sm 18 hours ago
        can‘t speak to compatibility with this new model, but oMLX supports MTP drafters very well.
        [-]
        dofm 18 hours ago
        Thank you, I will test that.
  - ActorNightly 17 hours ago
    Google is an advertising company first and foremost. At some point, these local models have to fit into that umbrella. I don't quite know how yet, but its going to happen.
    That being said, the real value in paid plans is that you get ecosystem integration that can read your gmail, photos, docs, and so on.
    [-]
    - bitexploder 16 hours ago
      Google is also a Cloud Provider. Cloud is now ~18% of Google. While it is an advertising juggernaut. Cloud is also rapidly growing, so the local models simply fit as AI research and dev and getting more people on Gemini models. They /are/ advertising, effectively :)
      [-]
      - hattimaTim 12 hours ago
        I wish they were :) But the gemini models are so unstable in API that I can not even use them for production.
    - jpadkins 16 hours ago
      local models still need information retrieval.
- reactordev 22 hours ago
  It actually works well because unlike encoders, the latent space is trained on that initial layer so it “knows” what to do with that sparse density. I’ve been using gemma4-12b with Flux2 and its ability to reason on visual input is pretty good. That said, each model is good in their own ways so YMMV but overall, it’s about as solid as Qwen just with a more advanced architecture.
- teravor 17 hours ago
  I dont see how encoder free audio isnt a mistake here. a mimo model will at least get the audio to 12.5 Hz as opposed to the 25 Hz they are doing. and you dont need to finetune mimo either.
- woadwarrior01 21 hours ago
  There are many priors to encoder-free VLMs. I specifically remember the EVE series of models from ~2 years.
  https://github.com/baaivision/EVE
- GaggiX 22 hours ago
  > That's technically encoding
  Isn't that just projecting the patches into the d_model size vectors that the models takes?
  >I am assuming that involves of quantization
  12B model in 16GB seems very reasonable to me, int8 is top quality for running models.
  [-]
  - minimaxir 22 hours ago
    The guide describes it as projection although there is apparently an extra step: "A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input."
    12B at int8 would take up 12G memory, or 75% of the system memory which technically fits within 16GB but the OS will not like that. EDIT: On my 18G memory MacBook Pro, LM Studio reports a "partial GPU offload" for the int8 MLX weights. Can't test because the `gemma_unified" architecture is NYI.
    [-]
    - WhitneyLand 20 hours ago
      Yeah and it’s pretty memory efficient with only 8 attention layers so at int8 in 16GB ram maybe you still get 64k-128k context.
      The part I hate though is that I’d bet none of the performance claims are based on int8.
      Why do we care about bf16 benchmarks when no one will be using that with this model.
  - WhitneyLand 20 hours ago
    I don’t think so, the HF weights are bf16 which means 24GB + cache/overhead.
    It sounds like marketing spin where the performance claims are based on BF16 and the “runs in 16GB” claim is on a totally different quantized version.
    [-]
    - Pixel-Labs 18 hours ago
      [flagged]
- madduci 19 hours ago
  VRAM, not RAM. I wish it was light enough for iGPUs too
  [-]
  - KiwiJohnno 9 hours ago
    I ran the 26B model on my i5 which has no discrete graphics card. It ran about 7 tokens/sec, and appreared to be a very capable model.
    [-]
    - madduci 48 minutes ago
      I've tested the e4B with the iGPU and it works incredibly good locally. I'll try this model as soon as it is available on ollama
    - madduci 9 hours ago
      Interesting, thanks for the feedback!
- LarsDu88 22 hours ago
  Well its a real simple encoder I guess
- lucamark 20 hours ago
  [dead]
- fushigokira 22 hours ago
  [dead]
accrual 23 minutes ago
Splendid model, it reminds me of Gemma3 27B which was my favorite local model last year. Gemma always had a bit more warmth/empathy compared to Qwen and Mistral in my experience and I found it more useful for personal questions.
My system has a 4080 Super (16GB) installed and using llama.cpp (b9333-35c9b1f39) I got these results on a test prompt:
* Qwen3.5-9B-Q6_K.gguf - Prompt: 1492.0 t/s | Generation: 81.0 t/s
* gemma-4-12b-it-Q4_K_M.gguf - Prompt: 1329.2 t/s | Generation: 72.3 t/s
* gemma-4-12b-it-Q8_0.gguf - Prompt: 504.4 t/s | Generation: 25.2 t/s
asim 20 hours ago
We are now entering the closed loop game. Google doesn't need anyone else to accelerate their models. This is their bread and butter.
I'm both shocked but also not surprised that they continue to develop such efficiencies. Honestly it's like silicon and CPU architecture advancement. We kept shrinking it and shrinking it and it kept getting more and more powerful and here we are with AI and it's only going to be 100x more efficient with time. Maybe there's some point of decay but essentially the next 30 years will be more advanced than the last 30 and were going to be living in some sort of futurist blade runner scenario where gene editing is repairing ageing cells, organs and curing all sorts of cancers that haven't even appeared yet. Beyond our lifetimes people will live to 125 quite steadily and with great mobility and then obviously people will look to how do we get to living 1000 years, which of anyone is religious knows Noah and others lived to that age in a totally different era.
Anyway I'm going off on some tangent but look back 30 years. Now look forward 30 years. It's going to be insane. May God protect us.
[-]
- bityard 19 hours ago
  > We kept shrinking it and shrinking it and it kept getting more and more powerful and here we are with AI and it's only going to be 100x more efficient with time.
  It's definitely an exciting time, but in terms of advancements in the state of the art, there is a lot of low-hanging fruit left to pick. There IS a bottom, however, as you can only encode so much "knowledge" in a small number of parameters.
  This feels to me a lot like what the early days of what radio or aviation must have been like. Or, heck, microcomputers even.
  [-]
  - asim 19 hours ago
    It's definitely a core component of a bigger system. We are effectively trying to recreate intelligence and human life through models and robotics. So the key insights for me, the LLM is the cerebral cortex but we have a lot more to recreate. Once you map in sensory input continuously and give it physical robotics, things start to change. But even before that leaving these things in simulated realities is what will happen, and right now we have things that operate based on our commands, but a complete step function will be the things that act on their own and that will be a very dangerous time but also where we see some very surreal things happening. They might not necessarily be made in the same way either, they might operate on entirely different types of architecture.
- 0xbadcafebee 14 hours ago
  1996 didn't look that different than today, in the US anyway. Biggest difference, besides the electric cars, is everybody has a phone but nobody uses it to talk to people.
- Scrounger 11 hours ago
  > May God protect us.
  Today, data systems and algorithms can be deployed at unprecedented scale and speed. Unintended consequences will affect people with that same scale and speed
  —Michael Chapman
- Flere-Imsaho 18 hours ago
  Yes I've taken the "must optimise longevity" route, taking priority over other things such as my career and hobbies. I want to see the future - all this AI stuff fascinates me.
- btbuildem 12 hours ago
  > which of anyone is religious knows Noah and others lived to that age in a totally different era
  My favourite conspiracy theory lately is that the above isn't a silly fairy tale, that we actually used to live much much longer -- until the common cold came on the scene, and the sequelae dramatically shortened our lifespans. Today we dismiss it as "just a cold" unbeknownst of what it robbed us from.
- ddorian43 4 hours ago
  > people will live to 125 quite steadily
  Only after the current generation(s) of doctor(s) dies. And only if you make this in pill-form. Otherwise people will be people and won't even go to the gym.
  It might be also the reverse, they develop a powerful+personalized drug that brings heaven on earth to your neurons (first time heroin experience + sexual gratification + childhood fulfillment + extremely addicting etc etc etc).
  -----
  Now that I think of it I'm gonna go with the latter.
- ActorNightly 16 hours ago
  Nope, lol.
  Large models still are quite far ahead, don't be fooled that even Gemma:31b (which is better than the 12b overall) is anywhere close to big models.
  There is definitely room for optimization, but fundamentally, for complex tasks, you need visible small gradients for accuracy that allow the model to be trained on (and consequently be followed during inference). For example, if you specify in instructions not to write code but ask coding question, Gemma will still write code. Whereas Gemini/Claude will pick up on that and follow your instructions better.
  [-]
  - mitkebes 14 hours ago
    It doesn't matter if Large models are undeniably better, if a local model is "good enough" to handle the task. With API costs ramping up, I think a lot of companies are going to want to look into what can be run locally instead, possibly only using larger models when the local models fall short.
ethanpil 22 hours ago
What's Google's business case for releasing open models? Don't get me wrong, I am grateful and appreciative of these releases. I'm trying to understand how it fits into their bigger picture as a for profit company? Are they not helping competitors build on the novel technology they have developed?
Is it simply goodwill and/or marketing? Or am I missing something strategic?
[-]
- gen220 21 hours ago
  A big part of the frontier labs abilities to charge 80% gross margins on inference is having the cornered resource of frontier models.
  If that inference becomes popular and valuable enough that those companies make billions of dollars in profit, those companies could use that profit to fund the building of alternative products and platforms that dis-intermediate google's relationship with the customer.
  Google already has an 80% gross margin business, the biggest one in the world. Everybody wants a slice of it.
  By offering frontier inference closer to cost and open-sourcing everything that's sub-frontier, they're commoditizing frontier labs' models, which inhibits their ability to durably make high gross margins on inference.
  It's a strategic play.
  [-]
  - zozbot234 21 hours ago
    A 12B-sized model is a far cry from "frontier inference". That's more like DeepSeek V4 Pro territory which is a 1.6T model. Or for multi-modal models, Kimi 2.6 which is 1T.
    [-]
    - gen220 21 hours ago
      at risk of quoting myself... :)
      > By offering frontier inference closer to cost *and* open-sourcing everything that's sub-frontier
      It's two prongs! One prong is that their frontier inference pricing is significantly cheaper/closer-to-at-cost as Anthropic's.
      The subject of this thread is the other prong: offering compelling models that are sub-frontier and self-hostable.
      Self-hosting models and at-cost frontier models are the high-end and low-end disruptions, respectively, to Ant/OAI/etc.'s business models.
      [-]
      - echelon 21 hours ago
        Google needs an anti-trust breakup about 10 years ago.
        They need one more than ever now.
        This is ridiculously anti-competitive.
        [-]
        airstrike 20 hours ago
        This is literally competition
        [-]
        echelon 17 hours ago
        1. Google is dumping on the market to weaken OpenAI and Anthropic.
        2. Every time you search for Claude or ChatGPT, you get presented with an AdWords bidding war.
        3. Google is deploying its models in Search, Docs/Drive/Office, YouTube, Chrome, ...
        [-]
        airstrike 17 hours ago
        1. This isn't dumping
        2. I'm not sure what this has to do with the case, unless you're arguing Google has an ads monopoly, in which case the best argument would likely not be that adwords lead to bidding wars because that just sounds like they're selling a product people really want to pay for
        3. There's nothing criminal about being a very diversified business
    - boutell 21 hours ago
      You're right that it's not literally frontier. But like recent Qwen releases, it is a lot more capable than anybody thought models of this size could be a year ago, like capable enough to set a ceiling on what you can charge for AI for certain applications. Others still clearly justify a stronger model, but this trend may continue, etc.
  - ActorNightly 16 hours ago
    Don't think its that.
    Basically with upcoming spark laptops, the smaller models will likely get fine tuned to interface with google services. Then, Google can essentially make Chromebook software include those models, which is the same use case as android.
    And you better believe that they will be collecting user data and building advertising models.
- browningstreet 22 hours ago
  This won't replace commercially viable, revenue generating alternatives of their own devising, but it does enable development activity and initiate conversations with enterprises who start with this model but want to do slightly more.
  That's my experience right now... my company is all in on a plethora of platform products. Also, Microsoft just yesterday said their goal was "Unmetered intelligence". There's a lot of things that can be enabled by small local models, and those things are part of stacks that can generate revenue in other layers.
  [-]
  - johnnyApplePRNG 20 hours ago
    re "Unmetered intelligence" goal of Microshaft.
    Of course it is...
    This is Windows-Licensing-Level Money Opportunity 2.0.
    [-]
    - browningstreet 10 hours ago
      I said they “said” that.
      And Google releases another free local model. As did Microsoft.
      The actual facts of the day belie your snort take. At least a little bit.
- Mr_P 21 hours ago
  Android and Chrome need on-device AI capabilities. Google can't lock down those weights like it can with server-side ML.
  So it's easier to just release those models as open source and make it official, since someone would inevitably hack the weights out anyway.
  [-]
  - Aachen 21 hours ago
    Could say the same for camera processing in the Pixel Camera app or any other binary someone wants to re-use that comes included in a software distribution (seemingly for 'free'). They can't lock the instructions up on the server so they might as well make the binary be freely distributable?
    Companies don't commonly give away executable binaries "just because", why'd they start now for these binary blobs that are the models?
    Not that I'm unhappy about it! Yay for open data any day, I'm just not understanding why, at least beyond PR in nerd circles
    [-]
    - lukeschlather 20 hours ago
      Binaries are source code outputs, they are copyrightable and patentable. Weights are not copyrightable so people can freely extract the weights and run them. If Google patents any of the novel algorithms here releasing it all freely isn't an impediment to making people license it.
      [-]
      - Aachen 16 hours ago
        Weights are not copyrightable?!
        Are you sure that isn't about LLMs' outputs? There I know there have been some court cases that say this, but the model itself is a work created in intricate and somewhat creative ways (I hesitate to use the word "creative" here, but would similarly hesitate to label a routine picture of the moon creative whereas pictures basically always have copyright; the bar for creativity is basically an epsilon amount above zero, afaik)
    - jack_pp 21 hours ago
      Because a model like this can't be as easily obfuscated as image processing. Image processing is a bundle of many moving parts, a lot of functions each with it's own inputs and outputs. A model is a single function which can be easily extracted and reused, in comparison
      [-]
      - Aachen 16 hours ago
        Arguably, but that's not the point. Take image (e.g. png) files on a CD-ROM shipped by a game vendor, which can be trivially copied even by my grandma. That doesn't move the game vendor to release them as freely distributable under the Apache license
        [-]
        jack_pp 14 hours ago
        Good point but still, why would Google police this model? If they had a restrictive licence on it do you think it would be worth it for them to enforce it? This way they at least buy some good will and mindshare
        [-]
        Aachen 14 hours ago
        That makes sense to me. Guess one might say the same for game icons and other such files that lay around in disks, but yeah maybe it's as simple as that
        [-]
        jack_pp 14 hours ago
        Not quite the same, understandably Blizzard cares a lot about their IP because otherwise private servers leech their users. Maybe a small game designer cares a lot about the small game they made or whatever since that's all they have. A four trillion market cap company can afford to be "charitable".. where it costs them nothing and might cost them more to enforce their rights.
  - panarky 20 hours ago
    > can't lock down those weights
    They could lock them down legally which would prevent commercial use, but they choose not to, and they boast about how many tens of millions of times Gemma models have been downloaded by developers.
    So there must be more to the rationale than just local model weights getting hacked out of devices.
  - goobatrooba 19 hours ago
    But these can't be the same model - the model is far too demanding to be part of regular chrome for most people.
- onlyrealcuzzo 22 hours ago
  If you're an AI lab, you definitely want research teams in this space - as this is where you can most easily iterate and make improvements which you'll then bake into larger, frontier models.
  The question is: do you want to release your models, or use them purely for R&D?
  Since everyone else is already releasing models of similar qualities, it's hard to say you're shooting yourself in the foot if you join the chorus.
  The added cannibalization of releasing them is effectively zero, so the reputational benefits are likely to be worth it.
  [-]
  - hadlock 20 hours ago
    >The added cannibalization of releasing them is effectively zero, so the reputational benefits are likely to be worth it.
    Nobody would be looking at Qwen if their ~30b class models weren't fantastically good, it's great advertising and builds significant goodwill with developers, who are going to be your biggest advocates.
    The other thing is, all these models are already disposable grade, and in a year they'll all be outclassed by The Next Big Thing. "Open" models are less than 18 months behind SOTA right now and I can't imagine that will slow down much over the next two years, they may even begin to close the gap. Nobody even talks about llama 4 anymore despite only being a year old.
- beambot 21 hours ago
  Google is one of the few verticalized options in AI: Data, models, cloud services, low-level silicon (TPUs), internal use cases, retail use cases, B2B uses, distribution (browser & mobile), etc.
  They rise with the tide of AI adoption. But they gain ground if people opt into Google solutions. And any token sent to a Google model (free or paid) actively punishes their competitors that are then required to spend vast sums to remain bleeding edge.
- rootusrootus 21 hours ago
  Neutering OpenAI and Anthropic would be my guess. Commoditized LLMs won't hurt Google nearly as much as it hurts the LLM-only companies, and so accelerating the inevitable just helps knock out potential future competition in areas where Google -does- make a lot of money now.
  [-]
  - literalAardvark 21 hours ago
    I think this plays a part, but the truth is that Google doesn't need to do that, Chinese open models are already doing that by themselves.
    So perhaps another part is just Google showing that they can indeed play at the big boys table.
    [-]
    - gdiamos 20 hours ago
      There is demand for US open models.
      [-]
      - literalAardvark 19 hours ago
        I sincerely wonder why. Chinese censorship is only really relevant if you're doing anti China stuff, which is to say never, while the Western kind of model censorship ( a combination of copyrights and general fairness ) are something everyone's had to work around at least once, even if just for writing an interesting story.
        [-]
        gdiamos 16 hours ago
        It’s about enterprises who care about supply chain risk and having a throat to choke if they have a problem.
        Here’s a real example.
        I’m in a design meeting talking about a model use case. We have a question about the data pipeline or the prompt format that would benefit from knowing about how the model was trained. The enterprise team lead calls the dev tech engineer from the company who produced the model. He is already in the office and walks into the meeting to answer the question.
- staticman2 21 hours ago
  As long as Chinese firms are releasing good open models I imagine there isn't a huge downside for Google to release state of the art small models to compete in the "free" space.
- schipperai 18 hours ago
  Demis at YCombinator said that they think its best their edge models are open cause once they are put on device they are vulnerable anyways
  https://youtu.be/JNyuX1zoOgU?is=PdzCILyi8SP6cfDr
- baq 20 hours ago
  Demis is on record saying they need models on the edge and if they’ll be there they might as well be properly open as they’ll be dumped anyway.
- estearum 22 hours ago
  It's to destroy possible footholds for competitors and prevent them from making money in segments that Google doesn't care too much about, but can trivially commoditize.
- mchusma 19 hours ago
  I think its even more puzzling because you can't even run Gemma 31b on google cloud, they only let you test it with a rate limit. No way (I can find) to actually pay them to use it.
  We saw great results in our usecase using google direct. Moved to Openrouter because google wouldn't let us use it beyond a test.
  Then Openrouters performance looked worse, not sure if there was a quantized version or something. So we instead looked at Deepseek v4 Flash, and opted to go for that.
  This model would probably be great for a super low cost cloud model, would love to use it in the cloud, Google makes you go elsewhere.
  [-]
  - __mharrison__ 16 hours ago
    I'm using it for one of my use cases (ocr) on openrouter right now.
- bachmeier 19 hours ago
  A strong business case for Gemma includes fine tuning, adding AI to apps that run in the cloud, strengthening Android, shifting unprofitable small AI compute to devices, and harming competitors. The first two would be done using Google's cloud services due to integration with Gemma. I think Google is currently the best positioned company to profit from AI sales to businesses over the next few years, and Gemma is a critical part of the story.
  [-]
  - cknoxrun 17 hours ago
    Google is actively, and directly helping companies continuously train use-case specific models based on Gemma 4 foundation. The company gets a model they fully own, trained on internal, sensitive data, and Google scoops up the profits from the training and ongoing compute spend to keep the model up-to-date.
- ismailmaj 20 hours ago
  Gemini is a huge team while Gemma is relatively small. They can totally do this at a loss with no ulterior motive.
  They remind me a bit of HuggingFace, create something great then make money … maybe.
- ppeetteerr 21 hours ago
  Isn't Apple about to license some variation of this from google for on-device AI? Maybe it’s their sales pitch to Apple and then they will lock it down.
- XzAeRosho 22 hours ago
  Google's MO since always has been to release great products or services for free, position themselves high and then abandon them or just find uses for Enterprise sales.
  I'm pretty sure they are doing it because they get some research experience by shrinking and improving these models, and because they know that by doing this they get some good PR among the dev community.
  [-]
  - Aachen 21 hours ago
    Google's "free" is and was ad-supported, even if some products now have a paid tier. These models don't include ads. Doesn't seem like the same underlying reason
- theturtletalks 22 hours ago
  Maybe they are hedging against a future where local models are just as good as cloud models? Or maybe they can go the Taalas route and start hardcoding Gemma on a chip and hardware manufacturers can use it for local private AI.
- CuriouslyC 22 hours ago
  They're trying to capture the segment of the market that wants to control the model, with the intent of getting you to run them on Vertex.
- stevenhubertron 21 hours ago
  My guess is testing for Apple’s Siri replacement and partnership but that’s a total SWAG
- mmarian 22 hours ago
  Marketing + Pro Serv if I had to take a guess.
- moffkalast 17 hours ago
  The complete Chinese worldwide domination in this sector would be the alternative, since nobody else is releasing anything meaningful.
  Plus every open model undermines their local competition by furthering open research and reduces moats, especially since Gemini as a frontier model isn't really competitive with GPT nor Claude for most applications.
- accountrequired 21 hours ago
  edge compute
- verdverm 19 hours ago
  Competition from Chinese alternatives hopefully forces more openness and efficient models. DeepSeek for example is nearly on par and far more resource efficient, good for the planet imo
- re-thc 20 hours ago
  On-device, e.g. Android.
- dist-epoch 21 hours ago
  Evangelism for AI. Google is one of the big AI providers.
  Eventually the local model is not enough, and you'll upgrade to the big ones.
- mugivarra69 18 hours ago
  [dead]
- superchicken099 22 hours ago
  Gemma overtakes and kills real open-source AI projects, pushing people who would support them towards enterprises like Google
petercooper 20 hours ago
Its image processing is terrible. I ran several tests against it against Qwen 3.5 0.8b (yes, 7% the size) and Qwen beat it every time with Gemma often getting things entirely wrong. I even gave it a plain image saying "This is a test" and it thought for 6 minutes trying to analyze it and failed. Qwen 3.5 0.8b confidently got it in under a second.
It may be that the Q6 quant I got is borked (or my LM Studio is), but either way, the 0.8b's performance is mind boggling in comparison.
[-]
- CMay 15 hours ago
  For Qwen 3.5 0.8B presumably you're running it unquantized, because it's so small. Get at least the Q8 of Gemma 4 12B with the F32 mmproj and use an f16 kv cache.
  Then run it with the latest llama.cpp that contains the Gemma 4 12B unified bug fixes, using --image-min-tokens 560 --image-max-tokens 2240 --batch-size 4096 --ubatch-size 4096 --temp 1.0 --top-p 0.95 --top-k 64 --jinja
  It's understanding far more complex things for me and can reliably handle tiny text, so it should be easily understanding an image that only contains the text "This is a test".
- usef- 16 hours ago
  That sounds like a bug. They're very common for open model releases on the first day. If I wasn't on mobile I'd try it on Google's own app.
- JacobAsmuth 8 hours ago
  Sounds like you're doing it wrong, to be honest.
- ma2kx 19 hours ago
  I guess Google implements more / stronger guard rails than Alibaba and thus confuses these small models. At least this was my impression with Gemma3 models where it often said that the image contains some nudity / sex scenes and therefore it cannot give a description of the image. Never understood the point of this behavior....
  [-]
  - jimmy76615 17 hours ago
    The biggest problem with all the Google models has always been RLHF, particularly safety training. They take a good, smart model and make it behave like a corporate person that has been to far to many forced anti-{sexism, racism...} seminars so that it is now living in fear of saying something that could be construed as wrong by some moral standard.
    [-]
    - staticman2 16 hours ago
      This is almost certainly not true.
      If it was, they wouldn't need to be using the classifiers they are using to warn Gemini about problematic prompts.
    - ai_fry_ur_brain 16 hours ago
      [flagged]
- thot_experiment 19 hours ago
  I've always found the Gemma models to vastly under-perform on vision tasks compared to Qwen so that's nothing new.
  [-]
  - mountainriver 17 hours ago
    The Qwen series adopted vision wayyy earlier than anyone else. No idea why the other labs were sleeping on it but they had about 2 years of experimentation without any competition.
ComputerGuru 21 hours ago
Quite aside from the architectural changes, I suppose this is the answer to why Google had such a glaring hole in the (pretrained) Gemma4 model lineup between the Gemma4 4b and Gemma4 26b models!
A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.
djyde 21 hours ago
What are the use cases for these small models? Is there anyone using models of this scale in their daily life who could share their experience?
[-]
- philipkglass 21 hours ago
  I have vLLM running on a Linux machine in my basement, connected with Tailscale, and I use small models as part of tasks like this:
  - Transcribing scanned documents into formatted text
  - Captioning/describing images and classifying them for audience suitability (includes anti-spam)
  - Matching documents with relevant Wikipedia pages for tagging
  I don't use them like frontier models. I break the work down into micro-tasks with one clear goal for each prompt. I write a lot of glue software to make the complete flow work. I was working on all of these tasks before LLMs appeared on the scene. The LLMs have allowed me to replace a lot of complicated code with less code plus a model, while achieving better results.
  I use local models for reasons of cost and control. I already had the workstation and GPU. The only running cost is electricity. I have used proprietary models from OpenAI and Google for some of these tasks, but I also encountered churn when the models I built my tools around were retired. I don't worry about that when I have the weights saved locally.
- robgough 21 hours ago
  I've got a home-built dictation app that uses a local model to clear up the text and fix grammar. It was super easy to build. I’m extending it to capture meeting notes and summarise too. All on-device.
  I saw a little app the other day, I think someone posted on here, that looks at your screenshot and renames the file based off the contents of the file.
  There's tons of little examples like that. For a lot of use cases, you really don't need the frontier models.
  [-]
  - fittingopposite 9 hours ago
    That's a great user case. Am sorry using parakeet but sometimes it garbles up things. Can you open source it?
    [-]
    - lnenad 3 hours ago
      Handy is open source and works flawlessly for me with Parakeet v3.
- properbrew 21 hours ago
  I think small models have a very good niche for specific tasks. I utilise a fine tuned Phi-4 model (smaller than this one) that fits in about 3.5gb of RAM (not vram) for the document processing side of things for the desktop app I develop (a bit of a shameless plug - whistle-enterprise.com).
  If you have a very specific idea for local model use you can find a way to make it work very well, you don't even need to have a graphics card or NPU chip. You just have to be extremely constrained in how it's used. I think as a generic chatbot they're not great, I'd use a hosted SOTA model and I'm a big fan of local LLMs myself.
  [-]
  - SeriousM 20 hours ago
    Thank you for sharing your usecase! I like your product very much!
    Could you talk a bit how you did the finetuning? Did you use unsloth or any other tool and how went the verification to proof the outcome?
    [-]
    - properbrew 18 hours ago
      Thank you!
      Yea absolutely, but man, where to even start, it is very specific.
      Fundementally I didn't use any wrappers like unsloth or axolotl, although I have used the latter before a year or two back and it was good, but I needed something very very custom. I also wanted the whole fine tuning pipeline to exported OpenVino model to be seamless.
      I heavily leaned on codex, claude and some manual sleuthing around the internet to understand what I needed. I'd played about with QLoRA finetuning with axolotl before and felt most comfortable with that. So I needed to keep everything as stripped down as possible and figured I can just utilise the 3 main huggingface libraries (transformers, peft and datasets) and also bitsandbytes (as suggested by claude to quantize the model to keep this working on my GPU) along with some custom scripts generated by claude/codex (each cross referencing each other) that will do the different stages of the training run.
      The next part was the data. Obviously didn't have access to thousands of meetings and associated output documents but I did have a 3090ti sitting there and a codex subscription. So I set about working out what format I needed the data in (many thanks again, to claude/codex) and started generating hundreds of different transcripts, different amounts of speakers, content, tones, subjects, spelling mistakes - like all the different things you could think a meeting would have. Then it's a case of actually generating a good meeting document off the back of the transcripts and creating the "gold standard" that we'd use.
      I'm going to gloss over a lot here as I'd rather not detail it as it relates to some propriatary stuff that I had to work through, but you basically pair the transcripts together and run the training.
      At the verification stage, there was pretty much 3 things:
      1. "just" do some regex string matching to see if there's any of the source transcript key facts in the output to ensure fact preservation. Same with owner fabrication (who said what), I don't want something attributed to someone when it wasn't them that said it and then finally markdown validation.
      2. Using codex/claude to validate the transcript and output from the model - I used the latest frontier models, probably overkill for my task, but they were good at the job
      3. Finally me going through some actual recordings of myself, groups, meetings and manually verifiying the output
      So a fair bit of work, and for context I'm on version 10 now, so it's been a journey!
- quickthoughts 20 hours ago
  I use small models like Gemma to improve transcriptions from ASR models amongst other micro-tasks. I actually built out a fine-tuning whisper pipeline with all local (smaller) models meaning no cloud/big-tech co is able to train/sell my (private) data.
  Repo is https://github.com/Rebreda/listenr - mainly geared toward Whisper fine-tuning, AMD hardware and local inference
- thot_experiment 19 hours ago
  I don't know about this model, but the next one up, the 31B I've been using as an agentic coding assistant in OpenCode, and basically anything that's easy enough that I'd trust Sonnet to handle, I trust Gemma 4 to handle and it's been doing a great job, it surprises me positively much more often than negatively. I not infrequently run into situations where Gemma 4 fails to do the task and I switch to Opus 4.7 and it fails also.
- mhitza 21 hours ago
  In theory, locally you'd use these where lossiness is acceptable for audio transcription and image labeling (as simple examples).
  In practice I haven't got around to building something around multimodality since I'm primarily using their text generation capabilities.
- Aachen 21 hours ago
  "Small" models are the ones I can run myself on my own terms. LLMs aren't useful enough for me to justify spending hundreds of euros on a GPU with 16GB VRAM or something, and that's assuming I have the rest of the desktop just laying around. Back when I checked (before the RAM price hike), these models weren't meaningfully better than 4-8GB ones anyway, you'd have to go for the top tier cards at 24 or 32 GB iirc to get something vaguely in the direction of the SaaS versions, and that was absolutely out of my budget. Even if that changed, so have hardware prices so it'd probably still work out the same
- OtherShrezzing 19 hours ago
  I use them for research on new features. If my feature is going to interact with a frontier language model in prod, I start with these free local ones which are all competent enough to produce structured output, make tool calls, interact with mcp etc. I don’t care much for the content at the early phase of engineering, I care about the schema & failure modes.
  Then when I’m getting close to feature-complete, I’ll move to a hosted frontier model for the final integration.
  Cost savings are enormous if you’re making dozens of calls to language models a minute.
- SwellJoe 16 hours ago
  I've used Gemma for reviewing and categorizing my writing online over several years (~5 million words across a forum for an OSS project I work on, HN, reddit, etc.), experimenting with training LoRAs (again, on my own writing, since I don't have to worry about ethically sourcing the data if it's all mine), and I'm currently using it to perform web searches and extract data about a specific type of business. It's plenty smart to use a web search MCP to find all the businesses of the right type in a given city, read their website, extract business address, phone number, etc. among other things, and de-dupe and cross-check other sources.
  I found Gemma 4 to be better, or at least more nuanced, than Gemini 2.5 Flash. And, the new Gemini 3.5 Flash is very good but is unrealistically expensive (ten times more expensive than DeepSeek or MiMo). So, since I don't need extremely fast performance, a self-hosted Gemma 4 wins for a bunch of stuff.
  I've also found Qwen 3.6 27B to be shockingly good at finding security bugs for its size. It beats several larger models, and is close to Gemini Pro 3.1 (but Gemini 3.5 Flash surprisingly beats it soundly). Since it only costs electricity, and my electricity is cheap and 100% renewable, I can use it more broadly than I might otherwise use a hosted model.
  All that said, the smart money is still on buying the subsidized tokens from the providers that offer them, rather than buying the hardware needed to run models that are 30+GB in size, as all of the ones I've been using regularly are (8-bit quantization, as they get a little dumber for every bit you drop below that). A $100 subscription to Claude or Codex currently provides access to the best models at a heavily discounted rate. And, DeepSeek/MiMo are extremely cheap, one or more orders of magnitude cheaper than the top models from Anthropic or OpenAI, if you need an API for automated usage. I spent about $4000 on my two inference machines (a Strix Halo with 128GB unified RAM, and a new desktop build based around two cheap old 32GB AMD data center GPUs), which buys a lot of tokens for tiny models like this...probably a couple/few years worth. But, I like tinkering, so having an excuse to play with hardware is its own reward. If it happens to pay me back some of that money, that's a bonus.
  Of course, as the major providers decide they need to ring the cash register and stop burning money on subsidized tokens, that math may change, and I may find I'm grateful to have already bought this stuff before the RAM prices made everything 2-3x more expensive.
  But, I think if you're not interested in learning about the technology and doing your own training experiments and such, you should probably not try to run stuff locally most of the time.
  [-]
  - ai_fry_ur_brain 16 hours ago
    So one of thr things you're using it for is to generate leads to spam businesses with unwanted LLM produced marketing materials it sounds like.
    Wow LLMs are changing the world, what a utopia.
    [-]
    - SwellJoe 16 hours ago
      > So one of thr things you're using it for is to generate leads to spam businesses with unwanted LLM produced marketing materials it sounds like.
      You don't know me. And, no.
- pilooch 16 hours ago
  Yes, all my emails gyer sorted out by a finetuned gemma. There are turned into images passes to the model, as multimodal is so practical.
- Xiol 21 hours ago
  I've yet to see someone answer a question like this with a decent, useful answer.
- sureglymop 13 hours ago
  I moreso run other small special purpose models like Whisper, SAM, Matcha, CLIP etc. and then do contextual correction passes with models like this.
  Think almost like unix pipelines, have used it for many workflows.
- airstrike 20 hours ago
  This is one https://post.bot/
  [-]
  - bensyverson 19 hours ago
    What model is it using?
    [-]
    - airstrike 17 hours ago
      I do not know which model specifically, but I saw the founder answering a question about how it's a small model that's focused on just this one specific requirement.
      I expect it to be something like https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B-GGUF
  - ai_fry_ur_brain 16 hours ago
    Why would I want an AI receptionist. A human receptionist is about 1000x more careful, caring and intentional.
    They are charging $15.00 an hour for an llm powered assistant. Like wtf, how do these people think that's a valid business model. This will 1000% annoy every customer that uses it. I hate this timeline so much.
    [-]
    - airstrike 16 hours ago
      No, this is a phone service. They charge $0.25 per minute on the phone on a call that would otherwise not connect.
      Can you call a receptionist at 10pm and book an appointment? Or ask for directions? What if it's 10am and she's already on the line with someone else and you just want to ask if there's parking?
      [-]
      - ai_fry_ur_brain 13 hours ago
        Please tell me what 0.25c x 60 is.
        Yes, they're called after hours answering services and they're exponentially better because I get to talk to a human.
        If my doctors office replaced a receptionist with this I would switch and leave bad reviews across every platform possible.
        Ive already switched doctors once because they used an LLM transcription service during my appoitment that influenced the doctors recommendations for care. Sorry technology does not belong everywhere.
        AI produces low quality work and will turn your business to shit.
        [-]
        gnabgib 9 hours ago
        Are you, perhaps, missing that $0.25/minute is only minutes on call? An agent not answering the phone for an hour is $0 (not $15).. for after-hours calls (rare) this is a meagre rate, compared to pay-per-hour (no matter the call volume) answering services.
briansm 3 hours ago
Strange that they are feeding raw audio in. Even in humans, there is a hardware transform to the frequency domain (the cochlea) before data is fed to the brain, effectively doing this part in the LLM seems inefficient.
baalimago 2 hours ago
I don't understand why Google does this. If I can run this locally, why would I need a subscription or use any inference provider, including Google..?
Scorched earth tactics to make anthropic and openai IPO fail?
[-]
- tgtweak 2 hours ago
  It's a disruption game - releasing competent open models disrupts smaller labs trying to release their own or commercialize their own. It's a similar rationale behind the Chinese labs releasing near-frontier open-weighted models, the goal is to disrupt and lift the barrier of entry for would-be competitors.
nickandbro 22 hours ago
Wow Google is becoming the new pre Llama 4 Meta when it comes to releasing open weights models.
[-]
- embedding-shape 22 hours ago
  I dunno, feels a bit unfair to companies that actually do FOSS releases (Gemma 4 being released under Apache 2.0 license) to compare them to a company that never done any FOSS releases, and mostly done proprietary "available to download" releases.
  [-]
  - seba_dos1 21 hours ago
    Note that a binary released under Apache 2.0 license does not yet make it FOSS.
    [-]
    - embedding-shape 21 hours ago
      Agreed, miles ahead though from "proprietary" which is what Meta been using for most model releases.
      Ideally companies would share the fucking datasets and training code already, but no, no one wants to talk about the source of those or even share the ones they have as then who knows what comes out of Pandora's box...
      [-]
      - jimmy76615 17 hours ago
        NVIDIA does a pretty good job on that front.
- redman25 22 hours ago
  IDK this model release is a bit disappointing considering the community has been chomping at the bit for the 124ba4b model. There was some leaked info about it but people suspect it was not released because it was too close to gemini flash in performance.
- brianwawok 21 hours ago
  Every other Google model I have tried felt very weak compared to qwen models. I dont have a ton of use case for multimodal though, so its very possible this is a fantastic multimodal model.
  [-]
  - wongarsu 21 hours ago
    Gemma 4 27b and 32b feel pretty capable for text and visionn. Comparable with qwen, maybe a bit better on tool calling heavy tasks
    I am not overly impressed with the smaller gemma models. And gemma 3 was a bit of a mixed bag, great at some things, bad at most others
  - thot_experiment 18 hours ago
    Hard disagree, Qwen multimodal is way better than google's, but Gemma 31b runs laps around Qwen 27B in complex engineering tasks. Maybe Qwen is better at slopcoding web framework CRUD, but for embedded dev there's no comparison.
  - verdverm 19 hours ago
    qwen3.6 was my favorite, then I tried the deepseek-v4-{flash,pro}
    still making my way through deep dives on the chinese open weights, they are all pretty good and way more cost / resource effective
outageroom 2 hours ago
I really like the idea of small models that you can get the most out of. If I weren't a programmer, I wouldn't even know what I would use Opus 4.8 or GPT 5.5 models for.
dwa3592 21 hours ago
This is a pretty good update. The demo video is a bit funny though - the tester asks to turn the release into bullet points. okay, the model obliges. then the tester says draft an email with this content. BAM! the LLM turns the content from bullets to passages even though it was not asked and it undid the last good thing that it did. i am not sure if it's an email etiquette to not put bullets in the email.
julianlam 20 hours ago
Last time I tried Gemma 4 (26B-A4B) its memory usage would balloon and consume all of my swap until my machine died.
Qwen 3.6 on the other hand barely uses any memory at all for its KV cache.
[-]
- verdverm 19 hours ago
  Turns out when you block people from the best and biggest hardware, they get innovative. It reminds me of the Pentium days when everyone was shipping inefficient programs because the processor would be better next year.
  [-]
  - iknowstuff 16 hours ago
    we never stopped doing that!
kristianp 15 hours ago
What quantisation do the creators intend this to be run at? They talk about 16GB of ram, so should it be run at 8 bit? People here are talking about using q4, but I would have thought a smaller model like this wouldn't perform well at such low bits per parameter. Edit, it looks like their bechmarks would have been done at 16 bit float, as the hugging face release is that size: https://huggingface.co/google/gemma-4-12B . Which is a little deceptive: they're advertising an 8 bit size will fit on 16GB laptops, while releasing a 16bit size.
I guess we have to wait for someone to produce perplexity curves at different Q's.
[-]
- easygenes 15 hours ago
  They haven't made one for this new model, but Unsloth has a comprehensive quant KLD map of Gemma 4 26B A4B here: https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-p...
dgacmu 14 hours ago
I was excited about this until I fed it one of my local test problems: coin identification. I then spent 10 minutes arguing with it that a photo of a 1998 washington quarter was not, in fact, a Morgan Silver Dollar. I mean, I wish it was.
It went into a crash loop on a british columbia 1 dollar coin. This happened with both Q4_1 and Q8. Maybe I'm holding it wrong or it's just really bad for this task.
In contrast, gemma4 gets the british columbia coin right though it also mis-identifies the quarter. gemini 3.1-flash-lite nails them both.
Was getting about 50 t/s output on a 3090 with Q8 which seems ok.
[-]
- sureglymop 13 hours ago
  Why would you expect it to be good for this particular highly specific task? Curious.
  [-]
  - dgacmu 13 hours ago
    Ah! Good question: Google's non-open-weights models (Gemini, etc) have almost always outperformed on image recognition tasks compared to any other models. I use a mix of in-house and Gemini for image classification tasks for $startup. No other models have done as well, and I had hoped that some of that would spill over into their open source models. It does to a degree - bigger Gemma models are okay.
scirob 18 hours ago
Quickly deployed it to check some benchmarks relevant for German language. These are results for CohereLabs/include-base-44 german only : Gemma 4 12B %61.9
```
  Gemma 4 26B (a4b MoE)    0.647
  Qwen 3 14B               0.621 
  Gemma 4 12B              0.618
  Ministral 14B 2512       0.604 
  Gemma 3 12B              0.547
```
The quwen 3 14B vs Gemma 4 12B difference is within random variance they same in some repeat runs they actually got the exact same score. Next step up Gemma 4 31B gets 0.676 on this. Or let in some reasoning Qwen 3 14B (reasoning) 0.676.
I'll run some cheat-proof benchmarks ones tomorrow see if qwen is still on top.
[-]
- kordlessagain 14 hours ago
  I just ran a short tool use test and it's doing pretty well.
benbojangles 7 hours ago
I run gemma-4-26b-bf16 in mtp mode and it runs very smooth, spitting out answers in seconds and outputting text 30x faster than i can read.
lxgr 21 hours ago
Am I missing something or are the Ollama versions of this (https://ollama.com/library/gemma4/tags) text-only for now?
[-]
- philipkglass 21 hours ago
  Since ollama has diverged from llama.cpp, it will take a bit of time for ollama to support multi-modality. If you're using plain llama.cpp it looks like a PR has already merged for this model with vision and audio support:
  https://github.com/ggml-org/llama.cpp/pull/24077
  [-]
  - zozbot234 21 hours ago
    They've actually gone back to (a lightly patched) llama.cpp with the 0.30 release a few weeks ago, and have now vendored-in an up to date release. Needless to say this is great news for both projects!
- satvikpendem 20 hours ago
  Just use llama.cpp or Unsloth Studio which wraps it, I don't know why anyone use Ollama anymore.
  [-]
  - verdverm 19 hours ago
    I switched from llama.cpp to vLLM because of prompt cache bugs in qwen/gemma models
    This is a good starting issue with a bunch of linked/related
    https://github.com/ggml-org/llama.cpp/issues/22746
- lxgr 17 hours ago
  To anybody else wondering: Seems like the models supporting image input are just starting to show up. https://ollama.com/library/gemma4:12b-mlx now shows as supporting it, but curiously the overview on https://ollama.com/library/gemma4/tags still lists it as text only. Cache invalidation remains difficult :)
  [-]
  - kordlessagain 14 hours ago
    Yup, the new version of Ollama dropped. Time to update.
- Jabrov 18 hours ago
  Stop using ollama
- thot_experiment 18 hours ago
  Ollama is a shitty project that steals from the open source community, don't use it, use llama.cpp instead.
wuyunhuo 11 hours ago
The optimal small-model solution, delivering multimodal, reasoning, and coding experiences on affordable hardware that were remarkably close to those of mid-to-large models at the time.
Zambyte 21 hours ago
Is this Mac only? Or is that an Ollama issue that it only supports this release of models on Mac? It seems like every tag with the MLX badge is only supported on Mac[0], and that includes all of the tags in this release.
[0] https://ollama.com/library/gemma4/tags
Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.
[-]
- embedding-shape 21 hours ago
  MLX is quite literally macOS-specific technology, for other platforms you want non-MLX.
  I was sure "MLX" stood for "Metal-something-something" but can't find any reference to that somehow, anywho, "Metal" is hardware-accelerated graphics on Apple platforms FWIW.
  Edit: about the actual release on Ollama, if you're on non-Apple hardware you probably want the NVFP4 variant ("gemma4:12b-nvfp4") which was uploaded 45 minutes ago, especially if you're with a recent nvidia GPU.
  [-]
  - Patrick_Devine 18 hours ago
    I realize this is a little confusing; we're working w/ the MLX team to bring MLX to other platforms, but we're not quite there yet. The `gemma4:12b-nvfp4` model is specifically for the MLX engine.
    For the GGUF 4bit variant (i.e. non-macs) you'll need `gemma4:12b-it-q4_K_M` which I just pushed. You'll also need to upgrade to version 0.30.4 which we're just about to release (it's in prerelease and we're running through our last regression tests).
    [-]
    - spicySpy 3 hours ago
      Would you mind to share the link to `gemma4:12b-it-q4_K_M`?
    - embedding-shape 16 hours ago
      I gotta say, having both "gemma4:12b-mlx-bf16" and "gemma4:12b-nvfp4" be MLX-specific, and not labeling all of the MLX-specific ones as such, is a bit different than "little confusing" and more "set up to be confusing" :)
      > You'll also need to upgrade to version 0.30.4 which we're just about to release
      Interesting, wasn't Google coordinating today's release with you? Considering the blog post seems to have gone out way before the release even been cut.
      [-]
      - Patrick_Devine 16 hours ago
        Given the model was just republished by Google 15 minutes ago and we're going to have to redo everything (and everyone will have to redownload for all platforms -- not just Ollama), I'll just say that sometimes things don't work out exactly the way you want them to. :-D
        That said, I think the gemma4:12b-nvfp4 model is pretty solid. It's been tuned with Nvidia's model optimizer. I've been waiting on the results for MMLU-Pro, but I'll have to retrigger that after reconverting.
        [-]
        embedding-shape 15 hours ago
        > Given the model was just republished by Google 15 minutes ago
        Hah, missed that! Guess that's slightly neat though, you get a second chance ;) NVFP4 been a blast to use across a wide range of models, seems to work really well, at least with vLLM and a nvidia card.
  - sambaumann 21 hours ago
    I still get "this model requires macOS" when trying to pull that one
    [-]
    - embedding-shape 21 hours ago
      I don't use Ollama myself anymore, but seems others been having similar issues for quite some time, maybe one of these fit your environment exactly? https://github.com/ollama/ollama/issues?q=is%3Aissue%20state...
- jw1224 21 hours ago
  MLX is Apple’s own machine learning framework, designed for Apple Silicon: https://opensource.apple.com/projects/mlx/
- Zambyte 18 hours ago
  The non-MLX versions just dropped on Ollama. gemma4:12b-it-q8_0, gemma4:12b-it-bf16, etc.
- accountrequired 20 hours ago
  https://huggingface.co/ggml-org/gemma-4-12B-it-GGUF/tree/mai...
- jasonjmcghee 21 hours ago
  There's a CUDA backend for MLX now. Not sure about the maturity.
thomasjb 19 hours ago
Unfortunately there's no gguf quants of the assistant model yet: https://huggingface.co/models?other=base_model:quantized:goo...
[-]
- kristjansson 19 hours ago
  I think MTP Gemma4 support is still WIP https://github.com/ggml-org/llama.cpp/pull/23398 ?
  [-]
  - dofm 19 hours ago
    This has been my impression.
    The underlying LiteRT-LM framework used in the edge gallery does support the MTP drafters for the smaller models, but according to:
    https://developers.google.com/edge/litert-lm/models/gemma-4
    > Note: LiteRT-LM supports E2B and E4B models today, with support for larger models coming soon.
    So even Google aren't shipping MTP support for the 26B and 31B models yet.
  - thot_experiment 19 hours ago
    [dead]
christina97 18 hours ago
It seems worse in all aspects to the 26B A4B? I would have thought dense models beat MoE still on many benchmarks?
Is the entire point of this model then that it runs if you don’t have enough GPU memory to load the 26B? That one runs faster anyway due to lower active params.
jamwise 16 hours ago
"Small enough to run locally with just 16GB of VRAM or unified memory"
With many laptops dropping back down to 8GB because of the memory shortage there's some interesting pressures building in the industry.
benbojangles 6 hours ago
why combine audio & image analysis into an llm though, why not allow the user to choose their own audio & image analysis alongside their own llm choice?
RandyOrion 20 hours ago
A small dense multimodal model with audio support, interesting.
Wait, *Excluding Chinese language.
This is ... curious.
P.S. Where is gemma 4 124b?
[-]
- kylehotchkiss 20 hours ago
  Where are the computers we could purchase to run 124b models :’(
  [-]
  - thot_experiment 18 hours ago
    You can get SXM V100s for like $100 off ebay, if you're willing to do the troubleshooting work to get em running with adapters you can build a computer capable of fitting a Q4 quant of a 120b model in VRAM for something like fifteen hundred dollars. (assuming you already have some RAM sticks laying around T___T)
randomNumber7 21 hours ago
> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.
I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog)
[-]
- spott 21 hours ago
  https://newsletter.maartengrootendorst.com/p/a-visual-guide-... (in a link from here: https://developers.googleblog.com/gemma-4-12b-the-developer-..., which was linked in the text of the post, but not the linkdump at the end).
- toldnotmywrath 20 hours ago
  My understanding is that early (and most extant) visual language models have a component module (called the image encoder) that transforms images into representations (called embeddings) the model's inner layers can process.
  This is often a separate module grafted onto the main model, and further pre-trained (e.g. OpenAI's CLIP, SigLIP used in the Gemma 3 and PaliGemma series).
  The image encoder approach has a few problems.
  One problem is that many like Gemma 3's encoder have fixed image resolution constraints and inputs must be resized with all the attendant distortions that causes with spatial understanding. However, the Gemma 4 series image encoders overcame this and can handle variable-dimension inputs.
  Two, these image encoders are somewhat large (ranging from 300-500M parameters) requiring extra memory and FLOPs to run.
  Three, say we need to fine-tune a vision language model, updates to its weights, may affect its understanding of the representations generated by the image encoder if we don't fine-tune both together.
  The new Gemma-4-12B replaces the encoder (with its many attention layers and large parameter count) with a simple linear projection to generate the embeddings for images. That reduces the computational requirements and simplifies the input pipelines for image processing.
  I don't have any expertise on the topic though and might very well be wrong on some details.
anonova 20 hours ago
Do Gemma 4 models compete with Gemini 3.1 Flash-Lite? I would assume even the smallest Gemini model would outperform even Gemma 4 31B, but I can't really get a sense of performance or output quality difference.
[-]
- mchusma 19 hours ago
  Gemma 4 31b outperformed Gemini 3.1 Flash-Lite in our app benchmarks (agentic tool use via api in our application as a part of various workflows). But google won't let you pay to use Gemma models, you have to go elsewhere, I think this may be because it would cannabilize Flash-lite.
  [-]
  - dTal 15 hours ago
    Curious logic. Does Google want you to use it or not? Do they want to be paid for tokens or not? why segregate open and closed?
    It's not parameter size - there is apparently such a thing as "Gemini Nano", which famously is downloaded automatically by Chrome. How similar is it to Gemma E4B? And how strange - you have the weights, but you don't "have" them?
  - verdverm 19 hours ago
    You can actually get the gemma-4 models on a per-token API basis, you just have to click some extra buttons (in GCP). Not the same for other open weight models. For those they make you run your own hardware.
    Use OpenCode Go instead: https://opencode.ai/go
    [-]
    - _puk 18 hours ago
      That doesn't have the Gemma models by the looks
      [-]
      - verdverm 15 hours ago
        They only host models they have evaluated and found good at coding
Havoc 21 hours ago
Quite a niche release. The MoE outperforms it on score and will likely be faster thanks to lower active weights. So this really only makes sense for specific ram constrained applications that can’t fit a quantized MoE
[-]
- dist-epoch 21 hours ago
  The un-quantized MoE outperforms it.
  But between same (V)RAM requirement 4 bit 26B-A3B and 8 bit 12B it's unclear which one will win, especially given one is MoE and the other dense.
  All the launch benchmarks are at 16 bit.
__natty__ 19 hours ago
It’s fascinating for me to see how small language models grow recently in capabilities while still consumer friendly in size to run on their machines
[-]
spott 21 hours ago
Is there a paper on this?
I'm curious how they pre-trained it... I feel like it must have had audio/image output that they chopped off.
I wonder how hard it would be to add it back on.
[-]
- joaogui1 21 hours ago
  I mean Claude is multimodal on input but not output, why couldn't this also be?
foota 16 hours ago
It feels like this would be beneficial to give the model more of a deep understanding of visual knowledge.
SubiculumCode 18 hours ago
"Laptop ready: Small enough to run locally with just 16GB of VRAM or unified memory." I wish. I just have 12.
zuminator 22 hours ago
How does it compare with e4b, aside from being larger?
[-]
- anonova 21 hours ago
  There's a comparison of all the Gemma 4 models (+ Gemma 3 27B) on the Huggingface model card: https://huggingface.co/google/gemma-4-12B-it#benchmark-resul...
- thomasjb 21 hours ago
  That's what I want to know too. A smarter E4B that's happy in opencode would be a good selfhosted model for me
zkmon 20 hours ago
It's quite interesting to see the quants pour into the HF page. I keep refreshing it and see many new quants every few mins.
BiraIgnacio 21 hours ago
using an embedder instead of a decoder is quite clever. Not sure who came up with that first but it's a cool idea.
semiinfinitely 20 hours ago
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away
comma_at 20 hours ago
Are there qwen or minimax or other open weight models of same hardware requirements that outperform this?
4k4 16 hours ago
I'm actually thinking how much this is bett3r (besides multimedia) over prismml's 1.5bit model based on qwen2.5 or sth.
zkmon 20 hours ago
I'm waiting for FP8 quant, preferably from Google.
[-]
- mdp2021 16 hours ago
  If you accept the "ggml-org":
  https://huggingface.co/ggml-org/gemma-4-12B-it-GGUF/tree/mai...
  https://huggingface.co/ggml-org/gemma-4-12B-it-GGUF/blob/mai...
  [-]
  - zkmon 9 hours ago
    Do they run well on vLLM?
easygenes 15 hours ago
I want to like the vision capabilities of the model. However, when I gave it an image which Gemma 26B A4B and Qwen 3.6 35B A3B has no problem correctly describing in detail, including identifying the Taj Mahal in the background it utterly failed. Its sense of the image was that it was a "distorted wide panorama" and even when I asked directly if it was the Taj Mahal it said no. The reference models saw it correctly as a normal square image taken from a fairly rectilinear lens (iPhone main camera).
[-]
- easygenes 15 hours ago
  I have now also tried it on this scatter plot: https://3215535692-files.gitbook.io/~/files/v0/b/gitbook-x-p...
  Similarly, the 26B A4B Gemma 4 and the 35B A3B Qwen 3.6 identify it clearly, give me the title and trends analysis fairly accurately. While this 12B spits out gobbledygook about it having something to do with hard-drive capacity. It's like it can barely see, gets the very broad strokes (knows it's looking at some kind of chart), but can't identify any details clearly.
adt 18 hours ago
https://lifearchitect.ai/models-table/
SuperV1234 19 hours ago
How does this compare to frontier models?
alienjesus 15 hours ago
good one, wanna try on Cerebras inference in Agentic Browsing
claysmithr 21 hours ago
I don’t see the download in lm studio
[-]
- corgihamlet 20 hours ago
  https://lmstudio.ai/models/gemma-4
  [-]
  - claysmithr 20 hours ago
    Thanks looks like they just added it 1 hour ago
- deckar01 21 hours ago
  It also says it is supposed to be available in their own Edge Gallery app and it’s not there (on iOS).
dyauspitr 17 hours ago
Just tried this out. Jesus Christ. Google does some things so well.
[-]
- SwellJoe 16 hours ago
  I mean, they did invent the technology. It's actually kinda surprising they're not the leader in the space. They kinda got Kodaked (though the story is still playing out, and I guess they're still somewhat competitive in the space even if Anthropic and OpenAI are the leaders).
keyle 7 hours ago
Not terribly impressed with this one. I asked it for recommendation between Paris to Berlin and option 3 was Rome... and option 4 was Tokyo.
mmmkay.
powera 19 hours ago
I'm seeing very low quality results on LMStudio with this model. Worse than Gemma 3 12B.
It is getting questions like "David has 18 apples and Ivan has 7 apples. How many apples do they have together?" wrong half the time, while Gemma3 12B could very consistently answer that. Other smoke tests (like Chinese translation, and the infamous "Rs in Strawberry" test) also show poor results.
I don't know if it is a quantization/release issue, if the parameters needed for accurate responses have changed (i.e. it needs "thinking" tokens to handle its base error rate), or if the model has been so focused on audio/video that the text processing is bad.
t0lo 10 hours ago
Asked it to name the director who wears a rolex and likes submarines. It said christopher nolan.
jdelman 21 hours ago
I can’t help but wonder if this is the basis of the model they’ve helped tune for Apple.
mlmonkey 21 hours ago
Is there some place where we can try it before downloading the gigabytes of weights?
kordlessagain 20 hours ago
Cool!
Miles_Stone 4 hours ago
[flagged]
Lapsa 19 hours ago
[dead]
digdugdirk 21 hours ago
I do enjoy the immediate out of touch signaling with the "runs on your 16gb vram laptop" line. Because everyone has a laptop with 16gb vram, or can just pop out and buy a new one, right?
[-]
- vehemenz 21 hours ago
  This comment has me a bit confused.
  Consumers were complaining about the standard 8GB with the early 2020 refresh of MacBook Pros, many OSes ago. Sure, it might be workable for many tasks (as evidenced by the recent sales of the MacBook Neo), but users with a mere 8GB shouldn't have expectations of LLM performance. Even 16GB feels like a stretch.
  [-]
  - NekkoDroid 21 hours ago
    I think you are mixing up RAM and VRAM.
    [-]
    - Schiendelman 21 hours ago
      On a Mac they are the same thing; they're shared. Of course you need some amount for the OS, but if you have an Apple Silicon Mac with 24GB of RAM, you can likely run a 16GB model.
    - crims0n 20 hours ago
      They are effectively one and the same on Apple Silicon.
      [-]
      - NekkoDroid 20 hours ago
        Which most people as a matter of fact don't use. A majority of people with laptop have separate memory pools and the VRAM of them is nowhere near that and even on most gaming laptops you aren't getting 16GB VRAM.
        [-]
        fredzel 19 hours ago
        > A majority of people with laptop have separate memory pools
        Majority of people with laptop have RAM and igpu using some of that as VRAM.
        mrkstu 20 hours ago
        I would say on this forum it wouldn’t be suprising for commentors to be near or above 50% that have access to an M Series Mac…
  - utternerd 21 hours ago
    Unified Memory or VRAM, not just RAM.
- SwellJoe 16 hours ago
  They already provide E2B and E4B that run on (much) smaller devices, including tablets and phones. This fills the gap in the middle. The bigger Gemma 4 models are excellent for their size, but at 8-bit quantization they need about 64GB of VRAM or unified memory. 48GB for 6-bit. Any lower quantization than that, they start to get notably dumber. So, a 12B is interesting for that middle ground.
- claysmithr 21 hours ago
  I have 24 gb unified memory so it’s a good model for me
- mdp2021 16 hours ago
  Surely they must know the current hurdles, but clearly they know that all the relevant people are monitoring the market for the proper hardware to get and 16GB will be an entry point.