> The really neat thing about GGUF is that it's just one file. Compare this to a typical safetensors repo on huggingface, where there's a pile of necessary JSON files scattered around [...]
Funny, to me AI models have "always" been single files, as that's what has been the norm in the local image gen business. Safetensors files allow stuffing all kinds of stuff inside them too, no GGUF needed for that. Though given that the text encoders of modern models are multi-gigabyte language models themselves, nobody includes redundant copies of those in every checkpoint.
> not to be confused with the somewhat baffling llama_chat_apply_template exposed in the libllama API, which hardcodes a handful of chat formats directly in C++
As someone who is tinkering with a desktop-based inference app in FLTK[0], i wish this used the actual Jinja2 template parser llama.cpp uses (or there was another C function that did that since AFAICT for "proper" parsing you need to be able to pass a bunch of data to the template so it knows if you, e.g., do tool calling). Currently i'm using this adhocky function, but i guess i'll either write a Jinja2 interpreter or copy/paste the one from llama.cpp's code (depending on how i feel at the time :-P).
But yeah, GGUF's "all-in-one" approach is very convenient. And i agree that it feels odd to have the projection models as separate files - i remember when i first download a vision-capable model, i just grabbed whatever GGUF looked appropriate, then llama.cpp told me it couldn't do model and took me a bit to realize that i had to download an extra file. Literally my thought once i did was "wasn't GGUF supposed to contain everything?" :-P
I love mistral, but that model is... not the best. Maybe try out Gemma 4 e4b, it's a similar size to Mistral 7B, and should run great on your 4070 ("E4B" is slightly misleading naming).
7b mistral is quite outdated. On a 12gb 4070 you can run qwen 3.5 9b q4km or qwen 3.6 35b, the latter will be a lot smarter but also a lot slower due to ram offload.
Try both in lm studio, they really are surprisingly capable
Yeah, TheBloke era of local LLMs were good times. TBF Unsloth are doing a fantastic job of publishing quants of the major models quickly - they just don't have nearly the volume of "weird" models as TheBloke did.
What do you use it for? I'm still trying to use agents, I barely use copilot, only at work when I have to.
I didn't want to get personal with an LLM unless it was local so that's why I was setting this up but yeah. So far just research is what I was looking at.
Funny, to me AI models have "always" been single files, as that's what has been the norm in the local image gen business. Safetensors files allow stuffing all kinds of stuff inside them too, no GGUF needed for that. Though given that the text encoders of modern models are multi-gigabyte language models themselves, nobody includes redundant copies of those in every checkpoint.
As someone who is tinkering with a desktop-based inference app in FLTK[0], i wish this used the actual Jinja2 template parser llama.cpp uses (or there was another C function that did that since AFAICT for "proper" parsing you need to be able to pass a bunch of data to the template so it knows if you, e.g., do tool calling). Currently i'm using this adhocky function, but i guess i'll either write a Jinja2 interpreter or copy/paste the one from llama.cpp's code (depending on how i feel at the time :-P).
But yeah, GGUF's "all-in-one" approach is very convenient. And i agree that it feels odd to have the projection models as separate files - i remember when i first download a vision-capable model, i just grabbed whatever GGUF looked appropriate, then llama.cpp told me it couldn't do model and took me a bit to realize that i had to download an extra file. Literally my thought once i did was "wasn't GGUF supposed to contain everything?" :-P
[0] https://i.imgur.com/GiTBE1j.png
it’s good at writing, coding, decently intelligent
you can try it on nvidia nim
Try both in lm studio, they really are surprisingly capable
Tried all the stuff bios, volting
I love TheBloke I wish he still made stuff
I didn't want to get personal with an LLM unless it was local so that's why I was setting this up but yeah. So far just research is what I was looking at.
hmmm...