Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

(github.com)

112 points | by yu3zhou4 9 hours ago

10 comments

yu3zhou4 8 hours ago
Author here. In my opinion the README is the most interesting part. I wrote it to help others build a useful mental model, so that you can recreate the project yourself without even needing to read my code.
- janalsncm 2 hours ago
  Really practical teaching approach. I clicked in to see how safetensors are loaded and just kept reading. Thanks for sharing.
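For anyone curious about the safetensors loading mentioned above, here is a minimal C++ sketch of the first step. It is not taken from the Tiny-vLLM code. It assumes only the published safetensors layout: an 8-byte little-endian header length, then a JSON header of that many bytes, then the raw tensor bytes. The names `SafetensorsHeader` and `read_safetensors_header` are hypothetical, and the sketch works on a blob already in memory rather than on a file or mmap.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper type for illustration; not from the Tiny-vLLM repository.
struct SafetensorsHeader {
    std::string json;         // JSON text describing each tensor (dtype, shape, data_offsets)
    std::size_t data_offset;  // byte offset in the blob where raw tensor data begins
};

// Splits an in-memory safetensors blob into its JSON header and the
// offset of the tensor data that follows it.
inline SafetensorsHeader read_safetensors_header(const std::string& blob) {
    if (blob.size() < 8) {
        throw std::runtime_error("blob too small for the 8-byte length prefix");
    }
    // Decode the length prefix byte by byte so the result does not
    // depend on the host's endianness.
    std::uint64_t n = 0;
    for (int i = 0; i < 8; ++i) {
        n |= static_cast<std::uint64_t>(static_cast<unsigned char>(blob[i])) << (8 * i);
    }
    if (n > blob.size() - 8) {
        throw std::runtime_error("header length exceeds blob size");
    }
    const std::size_t len = static_cast<std::size_t>(n);
    return SafetensorsHeader{blob.substr(8, len), 8 + len};
}
```

The returned JSON still has to be parsed to find each tensor's dtype, shape, and byte range. Those ranges are relative to `data_offset`, which is where the "lot of float numbers" actually live.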
GoldenJade 1 hour ago
Thanks for sharing this. As someone currently researching LLMs, I'm sure I'll be referencing this quite a bit going forward.
xuanlin314 2 hours ago
The lesson-style README is a great approach. Breaking down LLM inference into digestible steps makes the codebase approachable even for people who haven't touched CUDA before.
dwa3592 6 hours ago
Very nice job on the README.
>>Physically, LLM is a file which contains a lot of float numbers.
aka atoms of the LLM.
- cyanydeez 6 hours ago
  the universe is just atomic if statements
juancn 7 hours ago
Looks interesting. It reminds me of the first llama.cpp, but better documented.
nazgulsenpai 8 hours ago
I love the documentation formatted in lessons. I can't wait to read through it.
cookiengineer 6 hours ago
Wanted to add that the author has an amazing blog with lots of interesting papers: https://jedrzej.maczan.pl/
einpoklum 6 hours ago
It seems the author believes checking the return values of CUDA API calls is not "tiny" enough :-(