The sanitised optimism problem mentioned upthread is the real gap here. Event stream logging tells you what tools were called and in what order, but it doesn't tell you whether the agent's self-reported outcome matches reality.
The hooks performance finding matches what I've seen. I run multiple Claude Code agents in parallel on a remote VM and the first thing I learned was that anything blocking in the agent's critical path kills throughput. Even a few hundred milliseconds per hook call compounds fast when you have agents making dozens of tool calls per minute.
The docker-based service pattern is smart too. I went a different direction for my own setup -- tmux sessions with worktree isolation per agent, which keeps things lightweight but means I have zero observability into what each agent is actually doing beyond tailing logs manually. This solves that gap in a way that doesn't add overhead to the agent itself, which is the right tradeoff.
Curious about one thing -- how does the dashboard handle the case where a sub-agent spawns its own sub-agents? Does it track the full tree or just one level deep?
Sub-agent trees are fully tracked by the dashboard. Every spawned agent has a parent agent ID, which claude sends in the hooks payload. When you mouse over an agent in the dashboard, it shows which agent spawned it. There currently isn't a tree view of agents in the UI, but it would be easy to add since the data is all there.
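Since every event carries the spawning agent's ID, reconstructing the full spawn tree is just grouping. A minimal sketch, assuming hypothetical `agent_id` / `parent_agent_id` field names (the real hook payload schema may differ):

```python
from collections import defaultdict

def build_agent_tree(events):
    """Group hook events into a parent -> children map.

    Assumes each event dict carries illustrative 'agent_id' and
    'parent_agent_id' fields, not the exact hook payload schema.
    """
    children = defaultdict(list)
    for ev in events:
        parent = ev.get("parent_agent_id")
        child = ev["agent_id"]
        if parent is not None and child not in children[parent]:
            children[parent].append(child)
    return dict(children)

def print_tree(children, root, depth=0):
    """Render the spawn tree to any depth, not just one level."""
    print("  " * depth + root)
    for child in children.get(root, []):
        print_tree(children, child, depth + 1)
```

Because the recursion follows whatever parent links appear in the data, sub-agents of sub-agents fall out for free.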
[Edit] When claude spawns sub-agents, they inherit the parent's hooks. So all sub-agent activity gets logged by default.
I tried using hooks to set up a DIY version of what is now Channels in Claude. I had Claude writing them without really looking at the results because the vibes were strong. It struggled with odd behaviors around them. Nice to see some of the possible reasons -- I ended up killing that branch of work, so I never figured out exactly what was happening.
Now I'm regretting not going deeper on these. This is the type of interface that I think will be perfect for some things I want to demonstrate to a greater audience.
Now that we have the actual internals I have so many things I want to dig through.
Right on. Good luck! You might also want to play around with https://github.com/simple10/agent-super-spy if you want to see the raw prompts claude is sending. It was really helpful for me to see the system prompts and how tool calls and message threads are handled.
Are you guys spending hundreds (or thousands) of dollars a day on Claude tokens? Holy crap. I can't get more than one or two agents to do anything useful for very long before I'm hitting my usage limits.
I'm in a great situation where I've been piloting Claude for the company among a small group of others. I've been obsessed with pushing the limits of how many sessions and agents I can have working at a time. We threw some work at Gas Town and another orchestrator, but they felt too rigid and opinionated for my liking. Then again I'm biased -- I want to build my own eventually.
When I go home to my $20 plan I am sad and annoyed, but I don't want to pay more for what is good enough to let me work a bit at a time -- a good pomodoro timer for me, personally.
Something like this is perfect for some of the problems I've wanted to solve -- a command-and-control tool with malleable visuals.
I hit a lot of limits on the Pro plan. Upgraded to the Max $200/mo plan and haven't hit limits for a while.
It's super important to check your plugins or use a proxy to inspect raw prompts. If you have a lot of skills and plugins installed, you'll burn through tokens 5-10x faster than normal.
Also have claude use sub-agents and agent teams. They're significantly lighter on token usage when they're spawned with fresh context windows. You can see in Agents Observe dashboard exactly what prompt and response claude is using for spawning sub-agents.
This is what I've been missing running multi-agent ops through OpenClaw.
The opacity problem is the one I hit hard: when a coordinator spawns 3-4 agents in parallel (builder, reviewer, tester, each with their own tool calls), the only visibility you have is what they choose to report back. Which is often sanitised and … dangerously optimistic.
The role separation / independent verification structure I run helps catch bad outputs, but it doesn't give me the live timeline of HOW an agent got to a conclusion. That's why I find this genuinely useful.
Noticed OpenClaw is already on the roadmap - had my hands tingling to fork and adapt it. Starring it for now and added to my watchlist. The hook architecture should translate … OpenClaw fires session events that could feed the same pipeline. Looking forward to seeing that happen.
How are you handling the gap between what an agent reports and what it actually did? The sanitised optimism problem you mention is something I keep running into -- agents will confidently say they fixed something when they actually just suppressed the error. Are you doing any diff-level verification or is it mostly the reviewer agent catching it?
The structural fix is the obsession about separating roles: the agent that builds is never the one that verifies. I run a reviewer agent (I call her Iris), and a tester (Rex) — they live in separate sessions with no shared context with the builder. Iris' brief explicitly says "we require a live browser test, code review is not enough" — and that is where role separation was key; agents reviewing their own output tend to confirm what they already believe.
The explicit result/verdict format helps too. Each acceptance criterion gets a PASS/FAIL/UNKNOWN verdict with evidence attached. UNKNOWN is the one with gravitas: you force the agent to say "I could not verify this" rather than quietly pretending it was a PASS.
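A minimal sketch of that verdict format (the schema and names are my own illustration, not the commenter's actual tooling):

```python
from dataclasses import dataclass

VERDICTS = ("PASS", "FAIL", "UNKNOWN")

@dataclass
class Verdict:
    """One acceptance criterion with an explicit three-way verdict.

    Making UNKNOWN a first-class value (rather than letting unverified
    items default to PASS) is the whole point: the reviewer must state
    what it could not verify.
    """
    criterion: str
    verdict: str          # one of VERDICTS
    evidence: str = ""    # e.g. a log excerpt or screenshot path

    def __post_init__(self):
        if self.verdict not in VERDICTS:
            raise ValueError(f"verdict must be one of {VERDICTS}")
```

Rejecting anything outside the three allowed values keeps an agent from inventing softer labels like "MOSTLY PASS".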
But diff-level verification is where it still leaks. I don't have a systematic diff check yet. It's mostly Iris catching "agent replaced the whole file rather than extending it" by noticing the git diff is suspiciously clean. That's still more pattern matching than proper instrumentation — room for improvement... when I figure out how. Not there yet, to be honest.
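That "suspiciously clean diff" pattern can be turned into a crude automated check. A sketch under stated assumptions -- this wraps `git diff --numstat` and is my own hypothetical heuristic, not the commenter's instrumentation:

```python
import subprocess

def diff_stats(ref="HEAD"):
    """Return (insertions, deletions) for the working tree vs. ref,
    parsed from `git diff --numstat` (binary files report '-' and
    are skipped)."""
    out = subprocess.run(
        ["git", "diff", "--numstat", ref],
        capture_output=True, text=True, check=True,
    ).stdout
    added = removed = 0
    for line in out.splitlines():
        a, r, _path = line.split("\t", 2)
        if a.isdigit():
            added += int(a)
        if r.isdigit():
            removed += int(r)
    return added, removed

def looks_like_full_rewrite(added, removed, min_lines=50, balance=0.25):
    """Crude signal that an agent replaced a file instead of extending
    it: a wholesale rewrite tends to show up as large, roughly equal
    insertion and deletion counts."""
    if added < min_lines or removed < min_lines:
        return False
    return abs(added - removed) / max(added, removed) <= balance
```

It's still pattern matching, just pattern matching a machine can run on every turn instead of relying on a reviewer noticing.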
The sanitised optimism problem is deep — it's not always dishonesty, but quite often a genuine model confusion about whether a suppressed error counts as a fix. The agent believes... voila, success. The only way around it I've found is that the verifier has to be skeptical by default, not reviewing in good faith.
This tool's live timeline is the missing piece in that loop. Being able to see the actual tool calls rather than the curated (and falsely optimistic) summary could change verdict quality rather significantly.
Good to know background hooks make that much of a difference. How are you handling the case where multiple agent teams are writing to the same jsonl files simultaneously?
I'm not actually reading the jsonl files. Agents Observe just uses hooks and sends all hook data to the server (running as a docker container by default).
Basic flow:
1. Plugin registers hooks that call a dump-pipe script, which sends hook event data to the API server
2. Server parses events and stores them in sqlite by session and agent id -- mostly just raw storage, minimal processing
3. Dashboard UI uses websockets to get real-time events from the server
4. UI does most of the heavy lifting: parsing events, grouping by agent / sub-agent, extracting tool calls to dynamically create filters, etc.
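Step 1 of the flow above can be sketched as a tiny relay: parse the JSON the hook receives on stdin and POST it onward. The endpoint URL and route here are placeholders, not Agents Observe's actual API:

```python
import json
import urllib.request

# Hypothetical endpoint -- the real server URL and route may differ.
SERVER_URL = "http://localhost:8080/api/events"

def build_event_request(event: dict, url: str = SERVER_URL) -> urllib.request.Request:
    """Build the POST that relays one hook payload to the API server."""
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def forward_hook_event(raw: bytes, url: str = SERVER_URL) -> int:
    """Parse the JSON piped to the hook on stdin and ship it.

    The server (step 2) can then store it keyed by the session and
    agent ids present in the payload.
    """
    event = json.loads(raw)  # validate it parses before sending
    with urllib.request.urlopen(build_event_request(event, url), timeout=2) as resp:
        return resp.status
```

Keeping the hook side this thin is what makes the "minimal processing" split work: anything expensive happens on the server or in the UI, off the agent's critical path.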
It took a lot of iterations to keep things simple and performant.
You can easily modify the app/client UI code to fully customize the dashboard. The API app/server is intentionally unopinionated about how events are rendered -- by design, so support for other agents' events can be added soon.
The hooks approach seems much cleaner for real-time. Did you run into any issues with the blocking hooks degrading performance before you switched to background?
Sort of. It wasn't really noticeable until I did a deliberate performance audit and saw the speed improvements.
Node has a 30-50ms cold start overhead. Then there's overhead in the hook script to read local config files, make the HTTP request to the server, and check for callbacks. In practice, this was about 50-60ms per hook.
The background hook shim reduces latency to around 3-5ms, roughly a 10x improvement. It was noticeable when using agent teams with 5+ sub-agents running in parallel.
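The shim idea can be sketched as fork-and-forget: the hook process hands the payload to a detached worker and exits immediately, so the agent never waits on config reads or the HTTP round trip. This is my own minimal sketch of the pattern, not the actual shim; `worker_cmd` is a hypothetical placeholder for whatever really ships the event:

```python
import subprocess

def dispatch_in_background(payload: bytes, worker_cmd: list) -> None:
    """Hand the hook payload to a detached worker and return at once.

    The blocking version pays ~50-60ms of I/O before the agent can
    continue; here the hook exits in a few milliseconds and the worker
    does the slow part on its own time.
    """
    proc = subprocess.Popen(
        worker_cmd,                  # e.g. ["node", "send-event.js"] (illustrative)
        stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,      # detach so the hook can exit immediately
    )
    proc.stdin.write(payload)
    proc.stdin.close()               # hand off; do not wait for the child
```

The tradeoff is that a fire-and-forget hook can no longer block or modify the agent's action, which is fine for pure observability.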
But the real speed-up was disabling all the other plugins I had been collecting. It piles up fast, and it's easy for me to forget what's installed globally.
I've also started periodically asking claude to analyze its prompts to look for conflicts. It's shockingly common for plugins and skills to end up with contradictory instructions. Opus works around it just fine, but it's unnecessary overhead for every turn.
The blocking hooks observation matches what I would expect -- anything synchronous in the critical path has a multiplicative effect when agents run 20-30 tool calls per task. Even a 100ms write per call adds 2-3 seconds to a task, and that compounds fast across parallel agents.
Thanks! This was step one in my daily driver stack - better observability. I also bundled up a bunch of other observability services in https://github.com/simple10/agent-super-spy so I can see the raw prompts and headers.
The next big layer for my personal stack is full orchestration. Something like Paperclip but much more specialized for my use cases.
OP: This is cool, thank you for sharing.