
Agent Overview

Most tools that ship a model ask you to trust them without looking under the hood. You get a chat interface, the model responds, and if the output looks fine you assume the system is working. The failure mode shows up when the system isn’t working: a model regression silently drops GameMaster’s success rate from 92% to 71%, Hawken’s latency creeps up for days, or Lagan fails silently on research requests because of a missing API key. You don’t find out until you feel it in the output quality, and by then you’ve already shipped a chapter that wasn’t as good as it should have been. Agent Overview is the panel that makes the agent pipeline’s own performance visible, so you can catch regressions in the numbers before they show up in the prose.

Every agent tracks its own operations. Every operation records success or failure, duration, a short summary, and an error message if something went wrong. The data is persistent — across restarts, across sessions, across months — and the Agent Overview panel is the dashboard that reads it.

The panel is one row per agent — Ishvana, Hawken, Lagan, GameMaster, WorldKnowledge, Lorekeeper — plus a few system-level entries for background services. Each row shows:

  • Agent name and status. “Awake” means the agent is loaded and ready to accept requests. “Asleep” means the agent has been initialized at some point but isn’t currently loaded.
  • Operation count. Total operations for the selected time scale (current session, today, last 7 days, or last 30 days), toggleable.
  • Success rate. Percentage of operations that completed without error.
  • Average duration. How long the average operation takes.
  • Last activity. Timestamp of the most recent operation.
  • Trend indicator. Improving, steady, or degrading: a short-term trend comparing the success rate of the last N operations with the success rate of the last N*5.
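A minimal sketch of what one logged operation and one panel row might look like, assuming a simple in-memory list of records. `OperationRecord`, `AgentRow`, and `summarize_agent` are hypothetical names chosen for illustration, not the app’s actual schema, and the records are assumed to be filtered to the selected time scale before they are summarized.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class OperationRecord:
    # Hypothetical record shape; field names are illustrative, not the app's schema.
    agent: str
    op_type: str
    success: bool
    duration_s: float
    summary: str
    finished_at: datetime
    error: Optional[str] = None  # populated only when success is False


@dataclass(frozen=True)
class AgentRow:
    agent: str
    op_count: int
    success_rate: Optional[float]  # fraction in [0, 1]; None if no operations
    avg_duration_s: Optional[float]
    last_activity: Optional[datetime]


def summarize_agent(agent: str, records: List[OperationRecord]) -> AgentRow:
    """Collapse one agent's records (already filtered to a time scale) into a row."""
    mine = [r for r in records if r.agent == agent]
    if not mine:
        return AgentRow(agent, 0, None, None, None)
    successes = sum(1 for r in mine if r.success)
    return AgentRow(
        agent=agent,
        op_count=len(mine),
        success_rate=successes / len(mine),
        avg_duration_s=sum(r.duration_s for r in mine) / len(mine),
        last_activity=max(r.finished_at for r in mine),
    )


t0 = datetime(2024, 5, 1, 9, 0)
log = [
    OperationRecord("Hawken", "full_generation", True, 12.0, "chapter draft", t0),
    OperationRecord("Hawken", "quick_generation", False, 3.0, "scene sketch",
                    t0.replace(hour=10), error="Request timed out"),
    OperationRecord("Lagan", "research", True, 8.0, "source lookup", t0),
]
row = summarize_agent("Hawken", log)
print(row.op_count, row.success_rate, row.avg_duration_s)  # 2 0.5 7.5
```

Keeping the row as a pure function of the records means the same summary code can serve every time scale; only the filtering step changes.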

Click any row and the detail view opens with a breakdown by operation type. Instead of “Hawken did 47 operations,” you see “Hawken did 18 full-mode generations, 12 quick-mode generations, 9 style analyses, 6 rewrite transforms, and 2 task analyses.” Each operation type gets its own success rate, average duration, and last-executed timestamp.

A single agent can do multiple distinct jobs, and those jobs can have very different reliability profiles. Hawken, for example, runs both “full mode” generations (which go through the multi-stage workflow with lore validation) and “quick mode” generations (which bypass the workflow). Full mode might be at 95% success while quick mode is at 70%, but if you only look at Hawken’s overall success rate, you see a blended figure (about 82% if the two modes run equally often) and can’t tell which one is the problem.
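The arithmetic behind that blended number can be sketched as a group-by over operation type. This assumes a hypothetical minimal `Op` record carrying only a type and an outcome. With equal volumes in each mode, 95% and 70% blend to 82.5%; a different mix of modes would move the overall figure, which is exactly why it hides the problem.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Op:
    # Hypothetical minimal record: just what the breakdown needs.
    op_type: str
    success: bool


def success_by_type(ops: List[Op]) -> Dict[str, float]:
    """Success rate per operation type, as a fraction in [0, 1]."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for op in ops:
        totals[op.op_type] += 1
        wins[op.op_type] += op.success  # a bool adds as 0 or 1
    return {t: wins[t] / totals[t] for t in totals}


def overall_success(ops: List[Op]) -> float:
    """Single blended success rate across every operation type."""
    return sum(op.success for op in ops) / len(ops)


# 20 full-mode runs at 95% and 20 quick-mode runs at 70%.
ops = (
    [Op("full", True)] * 19 + [Op("full", False)] * 1
    + [Op("quick", True)] * 14 + [Op("quick", False)] * 6
)
print(success_by_type(ops))  # {'full': 0.95, 'quick': 0.7}
print(overall_success(ops))  # 0.825
```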

The per-operation-type breakdown is the thing that makes the panel diagnostic rather than just informational. “Hawken’s success rate dropped” is vague. “Hawken’s quick-mode generation success rate dropped from 88% to 62% starting three days ago” is actionable — you can go look at what changed three days ago and figure out what broke.

Every agent row shows a short-term trend computed from recent operations versus the longer-term average. The indicator is qualitative (improving, steady, or degrading) and it’s derived from a simple comparison: the success rate of the last N operations versus the success rate of the last N*5 operations. If the recent rate is noticeably worse than the longer-term rate, the trend is degrading; noticeably better reads as improving, and anything in between is steady.
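The comparison above might look like this as a function. The window size `n` and the `margin` that decides what counts as “noticeably” different are illustrative assumptions, since the panel’s actual values aren’t documented here. Outcomes are ordered oldest to newest, and the longer window of `5 * n` operations includes the recent one.

```python
from typing import Literal, Sequence

Trend = Literal["improving", "steady", "degrading"]


def compute_trend(outcomes: Sequence[bool], n: int = 20,
                  margin: float = 0.05) -> Trend:
    """Compare the success rate of the last n outcomes with the last 5 * n.

    `outcomes` is ordered oldest to newest. `n` and `margin` are illustrative
    defaults, not documented values from the panel.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    recent = outcomes[-n:]
    longer = outcomes[-5 * n:]  # the longer window includes the recent one
    if not recent:
        return "steady"  # no history yet, nothing to compare
    recent_rate = sum(recent) / len(recent)
    longer_rate = sum(longer) / len(longer)
    if recent_rate < longer_rate - margin:
        return "degrading"
    if recent_rate > longer_rate + margin:
        return "improving"
    return "steady"


# 100 operations: 80 successes, then 10 failures, then 10 successes.
history = [True] * 80 + [False] * 10 + [True] * 10
print(compute_trend(history))  # degrading (0.5 recent vs 0.9 longer-term)
```

The margin keeps the indicator from flickering between states on a single failure; a larger `n` smooths it further at the cost of reacting more slowly.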

The trend matters because it’s the early warning system for model regressions and configuration drift. If you switched Etherforce to a different model yesterday and today GameMaster’s trend reads degrading, the two are probably related and worth checking. The same goes for Lagan’s trend flipping to degrading after you changed an API key. The trend won’t tell you why; it only tells you that something changed and it’s worth investigating.

When an operation fails, the error message is captured. Open any agent’s “recent failures” view to see the list of errors with their timestamps, operation types, short summaries, and full error messages. This is where you diagnose why things are breaking, not just that they’re breaking.

A few common things you’ll see in the failure view:

  • Timeouts. The agent was still working when the operation hit its timeout. Usually means the model is slow or the context is too large. Fix is either waiting longer or reducing the context.
  • Context too large. The request exceeded the model’s context window. Fix is compressing or trimming context, or switching to a model with a larger window.
  • API errors. The upstream LLM provider rejected the request — rate-limited, authentication failed, quota exhausted. Fix is usually in the Etherforce settings or the provider’s own dashboard.
  • Parse failures. The model returned output that couldn’t be parsed as the expected structure (stat block, JSON response, etc.). Usually a sign that the selected model is worse at structured output than you expected. Fix is switching models for that operation type.
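Grouping failures into the categories above could be done with simple pattern matching on the captured error message. This is a hypothetical sketch: the patterns are examples only, because real provider error text varies, and the panel may not classify failures this way at all. The first matching category wins, so the order of the list matters.

```python
import re

# Hypothetical patterns for illustration; real provider error text varies.
# Order matters: the first category whose pattern matches wins.
CATEGORIES = [
    ("timeout", re.compile(r"timed? ?out|deadline exceeded", re.IGNORECASE)),
    ("context_too_large",
     re.compile(r"context (length|window)|too many tokens", re.IGNORECASE)),
    ("api_error",
     re.compile(r"\b(401|403|429)\b|rate limit|quota|unauthori[sz]ed", re.IGNORECASE)),
    ("parse_failure", re.compile(r"parse|json|unexpected token", re.IGNORECASE)),
]


def classify_failure(error_message: str) -> str:
    """Map a captured error message to a failure category, or 'other'."""
    for name, pattern in CATEGORIES:
        if pattern.search(error_message):
            return name
    return "other"


for msg in [
    "Request timed out after 120s",
    "Error 429: rate limit exceeded",
    "Failed to parse stat block as JSON",
]:
    print(classify_failure(msg))  # timeout, api_error, parse_failure
```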

Most failures are self-explanatory once you see the error message. The value of the failure view is that the messages are captured in one place instead of being buried in a log file somewhere.

The panel can display data at four time scales:

  • Current session — everything since you opened the app. Good for spot-checking during a writing session.
  • Today — last 24 hours. Good for “how have the agents been doing today?”
  • Last 7 days — the short-term view. Good for catching trends that take a few days to develop.
  • Last 30 days — the long-term view. Good for month-over-month comparisons and catching slow drift.

Switching between time scales doesn’t re-query the backend — the data is already loaded for all four scales, and switching is instant. This matters because it lets you look at the same question at multiple zoom levels without waiting between each view.
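One way to make switching instant is to compute the aggregates for every time scale once, when the data loads, so a toggle becomes a dictionary lookup. This sketch assumes a hypothetical `Op` record and treats “Today” as a rolling 24-hour window, per the list above; the scale keys and function name are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Op:
    # Hypothetical minimal record for this example.
    finished_at: datetime
    success: bool


@dataclass(frozen=True)
class WindowStats:
    count: int
    success_rate: Optional[float]  # None when the window is empty


def precompute_scales(ops: List[Op], session_start: datetime,
                      now: datetime) -> Dict[str, WindowStats]:
    """Aggregate every time scale up front so switching needs no new query."""
    cutoffs = {
        "session": session_start,
        "today": now - timedelta(hours=24),  # rolling 24 hours
        "7d": now - timedelta(days=7),
        "30d": now - timedelta(days=30),
    }
    stats: Dict[str, WindowStats] = {}
    for scale, cutoff in cutoffs.items():
        window = [op for op in ops if cutoff <= op.finished_at <= now]
        ok = sum(op.success for op in window)
        stats[scale] = WindowStats(len(window), ok / len(window) if window else None)
    return stats


now = datetime(2024, 5, 10, 12, 0)
ops = [
    Op(now - timedelta(minutes=30), True),
    Op(now - timedelta(hours=5), False),
    Op(now - timedelta(days=3), True),
    Op(now - timedelta(days=20), False),
]
stats = precompute_scales(ops, session_start=now - timedelta(hours=2), now=now)
print(stats["today"])  # WindowStats(count=2, success_rate=0.5)
```

The trade-off is doing four passes over the records at load time instead of one per click, in exchange for toggles that never wait on the backend.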

A few things Agent Overview deliberately doesn’t do:

  • It’s not a leaderboard. There’s no ranking of “best agent.” Each agent does a different job, and comparing them against each other doesn’t make sense. Hawken’s 94% success rate on prose generation doesn’t make Hawken “better” than Lagan at 89% on research pipelines; they’re measuring different things.
  • It’s not a billing view. Operation count doesn’t translate to dollars spent. For that, use the Etherforce Observability panel, which tracks per-operation token cost and aggregates by provider.
  • It’s not real-time. The panel updates on a refresh interval (or on manual refresh), not live. An operation that just finished might take a few seconds to show up in the panel. This is deliberate — polling the metrics every second would be wasteful for a diagnostic view that’s usually only open for a few minutes at a time.
  • It’s not a replacement for real error logs. The failure view is a summary. If you need the full stack trace of a specific failure, that lives in Error Tracking.

This panel exists because Ishvana treats agent transparency as a product feature rather than a developer setting. You’re paying for a tool that runs agents on your behalf, and you deserve to know whether those agents are doing their job. The Agent Overview panel is how that promise gets kept in practice — not as marketing copy, but as real numbers you can check any time.

The more subtle benefit is that it changes how you talk about problems. “Something felt off today” is a vague complaint you can’t act on. “GameMaster’s stat generation success rate dropped 15% after I switched to a cheaper model yesterday” is a specific claim backed by data, and now you know exactly what to do about it. The panel turns vibes into metrics.