3 comments

  • DiabloD3 34 minutes ago
    I suspect this person didn't read the README or the `--help` at all.

    For example, `--cpu-moe` makes it not offload MoE layers to the GPU. That drops performance by about a quarter, but it keeps only the dense and important layers on the GPU, so you can run MoE models bigger than VRAM almost for free and also free up room in VRAM for more KV cache. It does nothing on CPU-only.

    `--no-kv-offload` also does nothing here: it makes it not offload the KV cache to VRAM, but he doesn't have a GPU to offload to, and on CPU-only this is already the default.

    Again, `-sm` is only for multi-GPU. No GPUs here.

    `--mla-use` is for models that use DeepSeek's Multi-Head Latent Attention. Gemma 4 is not one of them.

    `--merge-up-gate-experts` reduces matrix math complexity around ffn_up and ffn_gate tensors; CPUs do not have tensor units and this is unlikely to actually help.

    MTP is also never faster on CPU-only, and this is documented. ngram-mod, however, may help, and it doesn't look like he tried it.

    This whole screed also reads like it was written by AI.

  • usernamed7 1 hour ago
    > I am telling you the count because the count is the point.

    > The honest caveat, because it matters:

    > This one I got right in the original, and now I have the number to back it.

    Thanks Claude.

  • cafkafk 2 days ago
    [dead]