This sounds super interesting and relevant. I run a small cluster with H100s (often research projects with vLLM) and being able to see not just usage but efficiency would be great.
I don't fully get the 100% utilisation vs. 1-10% real compute. Given you rely on telemetry from users to add new models, are you trying to predict how fast a model should be on vLLM, compared to how it runs in practice? What if users tweak some hyperparameters?
There are a few dimensions you can look at for GPU load. Probably the easiest indirect metric to watch is power usage.
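For example, a minimal polling sketch through NVML -- this assumes the `pynvml` bindings (`pip install nvidia-ml-py`) and a single GPU at index 0:

```python
# Poll board power draw as a rough load proxy via NVML.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # mW -> W

try:
    while True:
        draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        print(f"power: {draw_w:6.1f} W / {limit_w:.0f} W ({100 * draw_w / limit_w:.0f}%)")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```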
But if you really care about this, you should actually profile your application. Nsight Systems makes this pretty simple to do. Dunno how many people actually care about having a TUI.
Power is useful as a second-order metric and can help catch drastic underutilization, but it has similar problems to SM Active (DCGM) -- it tends to overestimate utilization and doesn't distinguish between useful compute and memory traffic. It's very possible for a memory-bound workload to draw high power while the compute units sit mostly idle. Our goal was to separate these bottlenecks out so there's more visibility into where to optimize.
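To make the overestimation concrete: NVML's coarse "utilization" counter only reports the fraction of time any kernel was resident, so even a tiny kernel looping forever reads 100%. Here's a minimal sketch that reads it side by side with memory-controller activity (assuming the `pynvml` bindings; DCGM's profiling fields like SM_ACTIVE and DRAM_ACTIVE are the finer-grained version of the same idea):

```python
# Compare the coarse "kernel resident" counter with memory-controller activity.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    # util.gpu: % of time any kernel was running (saturates at 100% easily)
    # util.memory: % of time device memory was being read or written
    print(f"kernel-resident: {util.gpu:3d}%  mem-controller: {util.memory:3d}%")
    time.sleep(1)

pynvml.nvmlShutdown()
```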
On nsys, agreed it's great, but we wanted something that could run continuously instead of an offline analysis tool. We think there's room for both to be useful.
Are there removal instructions (or an uninstall command) for utilyze, beyond manually deleting the utilyze & utlz binaries from ~/.local/bin & /usr/local/bin and cleaning up the PATH entries in ~/.profile? In particular, how do I revoke the CAP_SYS_ADMIN capability and reverse any other changes it made?
One small suggestion: add more GPU stats to your tool.
At the moment (v0.1.3) it is most helpful for compute visualization, but the lack of memory usage/processes/temperature/fan speed/etc. prevents it from becoming a full-on drop-in replacement for `nvidia-smi` for me.
We agree! We are planning a "process" or "advanced" view with temp/power usage and per-process breakdowns. Would a separate full-page view or fitting everything onto one view be more useful for your workflows? We're just thinking about how to fit everything in, because it's a lot.
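For what it's worth, each of those stats is a single NVML query, so the data side is cheap and the work is mostly in the layout. A rough sketch of the queries involved (assuming the `pynvml` bindings; fan speed is typically unsupported on passively cooled server boards, hence the guard):

```python
# One NVML query per requested stat: memory, temperature, fan, processes.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"memory: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")

temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
print(f"temperature: {temp} C")

try:
    print(f"fan: {pynvml.nvmlDeviceGetFanSpeed(handle)}%")
except pynvml.NVMLError_NotSupported:
    print("fan: n/a (passively cooled)")

for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    used = proc.usedGpuMemory or 0  # may be None without sufficient permissions
    print(f"pid {proc.pid}: {used / 2**30:.1f} GiB")

pynvml.nvmlShutdown()
```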
Currently just server GPUs, but in theory it should be easy to link against the ARM64 CUDA libraries for Jetson/Orin. The only challenge would be checking whether those boards support all the metrics we're sampling, though anything Ampere or newer should have reasonable support.
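The check itself could be as simple as probing each metric and catching NVMLError_NotSupported -- a hedged sketch using the `pynvml` bindings (not our internal sampler):

```python
# Probe which NVML metrics the local board/driver actually supports.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

probes = {
    "power": lambda: pynvml.nvmlDeviceGetPowerUsage(handle),
    "temperature": lambda: pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU),
    "utilization": lambda: pynvml.nvmlDeviceGetUtilizationRates(handle),
    "fan": lambda: pynvml.nvmlDeviceGetFanSpeed(handle),
}

for name, probe in probes.items():
    try:
        probe()
        print(f"{name}: supported")
    except pynvml.NVMLError_NotSupported:
        print(f"{name}: not supported on this board")

pynvml.nvmlShutdown()
```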