Legendry Bench

Legendry Bench is the benchmark surface for Ishvana’s local handlers. It no longer compares cloud models or token usage. As of v1.2.0 on May 1, 2026, the benchmark runner executes Divinity Engine handlers against controlled lore fixtures so you can see whether a code change improved, preserved, or regressed deterministic behavior.

The goal is practical: if a Lorekeeper or Hawken handler used to catch a contradiction and a new revision misses it, the benchmark should make that drift visible before the release ships.

How it works

The workflow has four phases.

Phase 1: Fixture selection. Pick the handler and fixture set you want to run. Fixtures are small, known-good scenarios built from lore, manuscript, or Magic System examples.

Phase 2: Baseline capture. Save the current handler output as the comparison point. Baselines include findings, items, metadata, and the handler result envelope.
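
As a rough illustration of what a captured baseline might hold, the sketch below models the four parts named above (findings, items, metadata, and the result envelope) as a plain record and writes it out as key-sorted JSON. The `Baseline` class, its field names, and both helper functions are hypothetical and exist only for this example; the actual on-disk format used by Legendry Bench is not described here.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Baseline:
    """Hypothetical baseline record; fields mirror the parts listed above."""
    handler: str
    fixture: str
    findings: list = field(default_factory=list)
    items: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    envelope: dict = field(default_factory=dict)


def serialize_baseline(baseline: Baseline) -> str:
    # sort_keys makes the text stable, so two captures of the same
    # result serialize identically and diff cleanly.
    return json.dumps(asdict(baseline), sort_keys=True, indent=2)


def load_baseline(text: str) -> Baseline:
    # Rebuild the record from its serialized form.
    return Baseline(**json.loads(text))
```

Sorting keys is a deliberate choice in this sketch: deterministic handlers deserve deterministic baseline files, otherwise key-order noise would show up as spurious changes.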

Phase 3: Execution. Run the selected fixtures against the current handler implementation. Progress streams so you can watch each fixture finish.
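
One way to picture streamed execution is a generator that yields a record the moment each fixture finishes. This is a minimal sketch under stated assumptions: `run_fixtures`, the record keys, and the idea that a handler is a plain callable taking a fixture payload are all hypothetical, not the Divinity Engine's actual interface.

```python
import time
from typing import Any, Callable, Dict, Iterator


def run_fixtures(
    handler: Callable[[Any], Any],
    fixtures: Dict[str, Any],
) -> Iterator[dict]:
    """Yield one record per fixture as soon as that fixture finishes."""
    for name, payload in fixtures.items():
        start = time.perf_counter()
        try:
            output, error = handler(payload), None
        except Exception as exc:
            # Keep the original error text so the report can show it.
            output, error = None, str(exc)
        yield {
            "fixture": name,
            "output": output,
            "error": error,
            "seconds": time.perf_counter() - start,
        }
```

Because the function is a generator, a caller can print or render each record inside a `for` loop while later fixtures are still waiting to run.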

Phase 4: Scoring. The scorer compares current output to the baseline and marks each fixture as pass, improved, regressed, or changed for review. The report shows what changed and where.
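
The four statuses can be sketched as a set comparison between baseline and current findings. The rules below are assumed for illustration only: the real scorer's criteria are not specified here, and treating "only new findings" as improved is a simplification. `score_fixture` is a hypothetical name, and findings are assumed to be hashable identifiers such as strings.

```python
def score_fixture(baseline_findings, current_findings):
    """Classify one fixture by set difference against its baseline."""
    baseline, current = set(baseline_findings), set(current_findings)
    missing = baseline - current  # findings the baseline had but this run lost
    added = current - baseline    # findings this run produced that are new

    if not missing and not added:
        status = "pass"        # output identical to the baseline
    elif missing and not added:
        status = "regressed"   # something that used to be caught is gone
    elif added and not missing:
        status = "improved"    # assumed rule: nothing lost, something gained
    else:
        status = "changed"     # mixed result, needs human review
    return {"status": status, "missing": sorted(missing), "added": sorted(added)}
```

For example, `score_fixture(["c1", "c2"], ["c1"])` reports `regressed` with `c2` listed as missing, which is the drift case described at the top of this page.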

What it tests

Legendry Bench is built for deterministic engine behavior:

Contradiction detection — scene and lore consistency handlers still flag the expected conflicts.
Entity extraction — entity handlers keep identifying the expected people, places, factions, and concepts.
Relationship inference — relationship-oriented handlers preserve expected links and warnings.
ProseGuard compliance — prose lint handlers keep flagging configured style rules.
Factual lookup shape — WorldKnowledge and Wikipedia handlers preserve structured result shape.
Magic System adherence — GameMaster handlers preserve stat, formula, and ability constraints.

The exact fixture set depends on the handler you choose.

Context tiers

Fixture suites can still be small, medium, or large, but the tiers now describe local workload size rather than context-window cost:

Small — a handful of entries or a short scene. Fast sanity checks.
Medium — a chapter or representative lore slice. Useful before merging handler changes.
Large — full-project or manuscript-scale fixtures. Best for release validation.

What the results show

Each run produces:

Per-fixture status — pass, improved, regressed, or changed.
Finding and item diffs — what the handler added, removed, or changed.
Latency distribution — mean, median, and p95 runtime for local execution.
Failure details — original error text when a handler fails.
Baseline links — the captured result you compared against.

Reports are saved to the project so later runs can be compared against known-good engine behavior.
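
For readers who want to reproduce the latency figures from raw per-fixture runtimes, here is a small sketch. It assumes runtimes are collected in seconds and uses the nearest-rank definition of p95; `latency_summary` is a hypothetical helper, and Legendry Bench may compute its percentiles differently.

```python
import statistics


def latency_summary(seconds):
    """Return mean, median, and nearest-rank p95 for a list of runtimes."""
    if not seconds:
        raise ValueError("need at least one runtime")
    ordered = sorted(seconds)
    # Nearest-rank percentile: ceil(0.95 * n), computed with integer
    # arithmetic to avoid floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }
```

Nearest-rank always returns a runtime that was actually observed, which keeps the p95 figure easy to trace back to a specific fixture.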

When to run it

Run Legendry Bench when:

You changed a handler and want to catch output drift.
You updated authored-library packs that affect rendered findings.
You changed lore, Magic System, or manuscript analysis primitives.
A release is close and you want confidence that key handlers still behave.

Most authors won’t need this daily. It is most useful for serious projects that rely heavily on consistency checks, or for release validation.

What it doesn’t test

It doesn’t compare external models. There are no external providers in v1.2.0.
It doesn’t test creative prose quality. It checks handler behavior and structured output, not whether a sentence is beautiful.
It doesn’t rank by cost. There is no usage billing to optimize.
It doesn’t replace human review. A pass or improved status means the handler matched or improved on its baseline for that fixture; you still decide what belongs in the book.

What’s next

Etherforce Observability: the panel that hosts handler dispatches, analytics, and benchmarks.

Lorekeeper: the agent whose consistency handlers are common benchmark targets.

Agent Overview: the complementary view, showing how agents perform on real requests rather than controlled fixtures.