Legendry Bench

Generic LLM benchmarks tell you how a model does on a canned test — HumanEval, MMLU, GSM8K, the usual suspects. Those benchmarks are useful if you’re building a general-purpose tool, but they don’t tell you the thing a fiction author actually wants to know: which model is best at catching contradictions in my specific project’s lore? A model that aces a history quiz might be terrible at remembering that the Crimson Hyenas operate in Middels Valley and not in Lowplek. A model that writes beautiful prose might fail to notice when a character’s hometown quietly changes between chapters. Generic benchmarks can’t predict any of this, because they don’t know your project.

Legendry Bench is a benchmarking framework that tests models on your actual lore. It mutates real lore entries with tracked, deliberate changes — shifting an attribute, flipping a relationship, inserting a contradiction — then runs the mutated lore through the model you’re evaluating and scores how well the model catches the changes. The result is a score that’s specific to your project, not a score from someone else’s test set. If you’re picking between models for your Ishvana setup, this is the benchmark that tells you the real answer.

The workflow has four phases.

Phase 1: Mutation. The engine picks a set of real lore entries from your Legendry and applies mutations to them. Each mutation is tracked with its before state, its after state, and a description of what changed. Mutations are deliberate and small, typically one change per entry, so that scoring stays unambiguous.
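As a rough illustration of what a tracked mutation could look like, here is a minimal Python sketch. The `MutationRecord` class, the `attribute_swap` helper, and the dictionary shape of a lore entry are all assumptions made for this example; they are not Legendry Bench's actual data model.

```python
import copy
from dataclasses import dataclass


@dataclass(frozen=True)
class MutationRecord:
    """Hypothetical answer-key entry: what changed, and from what to what."""
    entry_id: str
    mutation_type: str
    field: str
    before: str
    after: str
    description: str


def attribute_swap(entry: dict, field: str, new_value: str) -> tuple[dict, MutationRecord]:
    """Return a mutated copy of the entry plus the record describing the change."""
    mutated = copy.deepcopy(entry)  # never modify the original lore
    before = mutated[field]
    mutated[field] = new_value
    record = MutationRecord(
        entry_id=entry["id"],
        mutation_type="attribute_swap",
        field=field,
        before=before,
        after=new_value,
        description=f"{field} changed from {before!r} to {new_value!r}",
    )
    return mutated, record


# Assumed entry shape, using the faction example from the introduction.
entry = {"id": "crimson-hyenas", "type": "faction", "territory": "Middels Valley"}
mutated, record = attribute_swap(entry, "territory", "Lowplek")
print(record.description)
```

Keeping the original entry untouched and returning the record alongside the mutated copy means the answer key used later for scoring is built at the same moment the change is made.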

Phase 2: Scenario assembly. The mutated lore gets wrapped in a prompt template. The template varies by benchmark category — a contradiction-detection scenario is wrapped in “identify any contradictions in the following lore,” a relationship-inference scenario is wrapped in “what is the relationship between these two entities,” and so on. The scenarios are assembled into test suites ready to run.
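Scenario assembly can be pictured as a lookup from category to template followed by string formatting. The sketch below assumes a hypothetical `PROMPT_TEMPLATES` table and entry shape; only the two instruction phrases come from the description above, and the real templates are likely more elaborate.

```python
# Hypothetical template table keyed by benchmark category.
PROMPT_TEMPLATES = {
    "contradiction_detection": "Identify any contradictions in the following lore.\n\n{lore}",
    "relationship_inference": "What is the relationship between these two entities?\n\n{lore}",
}


def assemble_scenario(category: str, lore_entries: list[dict]) -> dict:
    """Wrap (possibly mutated) lore entries in the template for one category."""
    template = PROMPT_TEMPLATES[category]
    lore = "\n\n".join(f"{e['name']}: {e['text']}" for e in lore_entries)
    return {"category": category, "prompt": template.format(lore=lore)}


scenario = assemble_scenario(
    "contradiction_detection",
    [
        {"name": "Crimson Hyenas", "text": "A faction operating in Lowplek."},
        {"name": "Middels Valley", "text": "Home territory of the Crimson Hyenas."},
    ],
)
print(scenario["prompt"])
```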

Phase 3: Execution. The scenarios run against the model you’re evaluating. Each scenario sends the prompt, captures the model’s response, and records latency and token usage. The execution is streamed so you can watch progress in real time.

Phase 4: Scoring. The scoring engine parses each response, matches the model’s output against the answer key (the mutations that were actually applied), and computes precision, recall, F1, and per-category breakdowns. The final report shows how the model did overall and how it did on each category of test.
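The scoring metrics follow the standard definitions of precision, recall, and F1. The sketch below assumes that each applied mutation and each change reported by the model can be reduced to a comparable identifier; how the scoring engine actually parses and matches responses is not described here, so the `score` function and the `m1`-style identifiers are illustrative only.

```python
def score(expected: set[str], reported: set[str]) -> dict[str, float]:
    """Compare reported changes against the answer key of applied mutations."""
    true_positives = len(expected & reported)
    # Precision: what fraction of the reported changes were real mutations.
    precision = true_positives / len(reported) if reported else 0.0
    # Recall: what fraction of the real mutations were reported.
    recall = true_positives / len(expected) if expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Four mutations were applied; the model reported two of them plus one false alarm.
expected = {"m1", "m2", "m3", "m4"}
reported = {"m1", "m2", "m9"}
result = score(expected, reported)
print(result)
```

In this example precision is 2/3 (two of three reports were correct) and recall is 1/2 (two of four mutations were caught), which is why both numbers are reported alongside F1 rather than a single accuracy figure.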

The engine supports several kinds of mutation, each testing a different capability of the model being evaluated.

  • Attribute swap. Change a specific attribute on an entry — swap a character’s hair color, swap a city’s founding date, swap a faction’s primary territory. Tests whether the model catches direct factual substitutions.
  • Timeline shift. Move an event’s date or duration. Tests whether the model can hold the timeline in memory and notice when it drifts.
  • Relationship conflict. Flip a relationship — ally becomes enemy, family member becomes stranger, mentor becomes rival. Tests whether the model can track social structures.
  • Name substitution. Replace an entity name with a similar-but-wrong one. “Kent Musa” becomes “Keith Musa.” Tests whether the model is paying attention to the exact names or just pattern-matching.
  • Section contradiction. Add a new section to a lore entry that directly contradicts an existing section. Tests whether the model catches internal contradictions within a single entry, not just across entries.
  • Faction membership. Reassign a character to a different faction. Tests whether the model can track membership structures and flag inconsistencies.
  • Personality inversion. Flip a character’s stated personality (cheerful becomes dour, cautious becomes reckless). Tests whether the model catches softer, more narrative-level contradictions.

Each mutation type is registered per entry type, so a “character” mutation set is different from a “location” mutation set. The engine picks mutations that are appropriate for the entry type being mutated.
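Per-entry-type registration could be as simple as a mapping from entry type to the mutation types that make sense for it. The mapping below is a guess made for illustration; the entry types and which mutations apply to each are assumptions, not the engine's real registry.

```python
import random

# Hypothetical registry: which mutation types apply to which entry types.
MUTATIONS_BY_ENTRY_TYPE = {
    "character": [
        "attribute_swap",
        "name_substitution",
        "relationship_conflict",
        "faction_membership",
        "personality_inversion",
        "section_contradiction",
    ],
    "location": ["attribute_swap", "name_substitution", "section_contradiction"],
    "event": ["timeline_shift", "section_contradiction"],
}


def pick_mutation(entry_type: str, rng: random.Random) -> str:
    """Choose a mutation type that is registered for the given entry type."""
    candidates = MUTATIONS_BY_ENTRY_TYPE.get(entry_type, [])
    if not candidates:
        raise ValueError(f"no mutations registered for entry type {entry_type!r}")
    return rng.choice(candidates)


rng = random.Random(42)  # seeded so a run can be reproduced
choice = pick_mutation("location", rng)
print(choice)
```

With a registry like this, a location can never receive a personality inversion, which keeps the generated scenarios plausible.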

The tests fall into seven categories, and you pick which ones to run for any given benchmark. Running all seven is a thorough audit; running just contradiction detection is a quick sanity check.

  1. Contradiction detection. Can the model spot direct contradictions between pieces of lore? The core test — if a model can’t do this, it can’t serve as a reliable Lorekeeper.
  2. Entity extraction. Can the model correctly identify the entities in a piece of lore? Tests the foundation that everything else is built on.
  3. Relationship inference. Given two entities, can the model infer their relationship correctly from the lore context?
  4. ProseGuard compliance. Given a set of style rules (from ProseGuard), can the model respect them when generating or transforming text?
  5. Factual recall. Given a factual question about the lore, can the model answer correctly from long context?
  6. Context stress. How does the model’s accuracy degrade as the context grows? Tests whether the model can maintain quality on large projects.
  7. Mechanics adherence. Given a ruleset from the Magic System, can the model respect the stat definitions, formulas, and abilities when generating output?

Scenarios run at three context sizes to test how the model scales:

  • Small — around 5 lore entries, roughly 2,000 tokens of context. Fast, cheap, tests basic capability.
  • Medium — around 50 lore entries, roughly 20,000 tokens. Tests whether the model can handle a typical mid-project context load.
  • Large — 200+ lore entries, the full Legendry of a serious project. Tests whether the model can maintain accuracy under real-world load.

Running a benchmark at all three tiers tells you something a single-tier run can’t: whether a model that looks great at Small degrades badly at Large. Some models are consistent across tiers; others drop sharply once context exceeds a threshold. The tier comparison is the thing that tells you which models will actually scale to your full project.
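A tier comparison can be reduced to measuring how far each tier's score falls below the Small baseline. In the sketch below, the function names, the 0.15 threshold, and the F1 values are all invented for illustration; they are not measured results or built-in settings.

```python
def tier_drop(f1_by_tier: dict[str, float]) -> dict[str, float]:
    """Drop in F1 at each tier relative to the Small tier."""
    baseline = f1_by_tier["small"]
    return {tier: round(baseline - f1, 4) for tier, f1 in f1_by_tier.items()}


def degrades_sharply(f1_by_tier: dict[str, float], max_drop: float = 0.15) -> bool:
    """True if any tier falls more than max_drop below the Small baseline."""
    return max(tier_drop(f1_by_tier).values()) > max_drop


# Made-up scores for two hypothetical models.
model_a = {"small": 0.90, "medium": 0.88, "large": 0.86}
model_b = {"small": 0.92, "medium": 0.85, "large": 0.60}

print(tier_drop(model_a), degrades_sharply(model_a))
print(tier_drop(model_b), degrades_sharply(model_b))
```

Here the second model scores higher at Small but loses far more at Large, which is exactly the pattern a single-tier run would hide.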

Before any benchmark runs, the engine checks whether your Legendry has enough data for meaningful testing. Each category has minimum requirements — relationship inference needs a minimum number of defined relationships, mechanics adherence needs an active ruleset, contradiction detection needs enough entries to create realistic mutations. If your project doesn’t meet the requirements for a category, that category is grayed out and the benchmark runner refuses to pretend it can measure something it can’t.

This is part of why the benchmark is trustworthy. You don’t get meaningless scores on under-populated categories.
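The readiness check described above amounts to comparing project statistics against per-category minimums. The thresholds and statistic names in this sketch are placeholders chosen for the example; the actual minimums are not stated here.

```python
# Hypothetical minimums; the real thresholds are not documented in this section.
MIN_REQUIREMENTS = {
    "contradiction_detection": {"entries": 10},
    "relationship_inference": {"relationships": 5},
    "mechanics_adherence": {"active_rulesets": 1},
}


def available_categories(project_stats: dict[str, int]) -> dict[str, bool]:
    """Map each category to whether the project has enough data to test it."""
    return {
        category: all(project_stats.get(key, 0) >= minimum for key, minimum in minimums.items())
        for category, minimums in MIN_REQUIREMENTS.items()
    }


# A project with plenty of entries but few relationships and no ruleset.
stats = {"entries": 40, "relationships": 2, "active_rulesets": 0}
availability = available_categories(stats)
print(availability)
```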

Each benchmark run produces a report with:

  • Overall score. A composite across all categories and scenarios.
  • Per-category scores. Precision, recall, and F1 for each category you ran.
  • Per-tier scores. How the model did at Small, Medium, and Large.
  • Latency distribution. Mean, median, and p95 latency for the scenarios.
  • Success rate. Percentage of scenarios where the model produced parseable output.
  • Failure examples. Actual scenarios where the model got it wrong, so you can see what kinds of mistakes it made.
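For readers who want to reproduce the latency and success-rate figures from raw results, here is a small sketch. It assumes the nearest-rank method for p95, which is one of several common percentile definitions and may not be the one the report uses; the sample latencies are invented.

```python
import math
import statistics


def latency_summary(latencies: list[float]) -> dict[str, float]:
    """Mean, median, and nearest-rank 95th percentile of scenario latencies."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(95 * len(ordered) / 100))  # nearest-rank p95
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


def success_rate(parseable: list[bool]) -> float:
    """Percentage of scenarios whose response could be parsed."""
    return 100.0 * sum(parseable) / len(parseable) if parseable else 0.0


# Invented latencies in seconds; one slow outlier dominates the p95.
latencies = [0.8, 1.0, 1.2, 1.1, 0.9, 1.0, 1.3, 0.7, 1.0, 4.0]
summary = latency_summary(latencies)
rate = success_rate([True, True, True, False])
print(summary, rate)
```

The example shows why the report lists p95 next to the mean and median: a single slow scenario barely moves the median but defines the tail.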

The reports are saved to the project and can be compared across runs. Run the same benchmark against a different model and the comparison view shows you side-by-side results — Model A vs. Model B, category by category, with deltas highlighted.
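The side-by-side comparison boils down to subtracting per-category scores. The sketch below assumes each saved report can be reduced to a mapping from category name to F1; the `compare_reports` function and the scores are illustrative, not the saved report format.

```python
def compare_reports(model_a: dict[str, float], model_b: dict[str, float]) -> dict[str, float]:
    """Per-category delta (B minus A) for the categories both runs covered."""
    shared = sorted(model_a.keys() & model_b.keys())
    return {category: round(model_b[category] - model_a[category], 4) for category in shared}


# Invented F1 scores for two hypothetical runs.
run_a = {"contradiction_detection": 0.81, "factual_recall": 0.90, "context_stress": 0.70}
run_b = {"contradiction_detection": 0.86, "factual_recall": 0.88}

deltas = compare_reports(run_a, run_b)
for category, delta in deltas.items():
    print(f"{category}: {delta:+.2f}")
```

Restricting the comparison to shared categories avoids reporting a delta for a category that only one of the runs actually tested.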

A few specific moments where running Legendry Bench is worth the time:

When you’re setting up Ishvana for a new project. The default model assignments might not be optimal for your specific lore. Run a benchmark on the defaults, then on a couple of alternatives, and pick the one that scores best on the categories you actually care about.

When you’re considering switching models. A new model just came out on OpenRouter. Before you switch your reasoning agent to it, run a benchmark and compare it against your current model. Sometimes a model that is “better” on generic benchmarks is worse on your specific lore, and a benchmark run against that lore is how you find out.

When you feel like agent quality is slipping. Your Agent Overview is showing a success-rate dip, and you’re not sure why. Run a benchmark on the current model and compare it against a historical benchmark result from when things were working. If the scores have dropped, something about the current model or the ruleset has regressed.

When you add a lot of new lore. The shape of your Legendry has changed since the last time you benchmarked. Re-run to see if your current model still handles the new scale.

Most projects don’t need to run Legendry Bench frequently. Running it once when you set up, once when you’re about to reach a major project milestone, and any time you change models is usually enough.

A few deliberate limits:

  • It doesn’t test creative writing quality. Legendry Bench is about accuracy and consistency, not about prose. A model that scores well on Legendry Bench might still write boring sentences. For creative quality, you need human review or Hawken’s style analysis.
  • It doesn’t test speed. Latency is captured as a side effect, but the benchmark’s primary output is correctness. A fast-but-wrong model still fails. A slow-but-right model still passes.
  • It doesn’t test cost. Token usage is recorded but not optimized for. If you want to pick a model based on cost-per-correct-answer, you can compute that from the benchmark results, but the tool itself doesn’t rank models on cost.
  • It doesn’t test behaviors that require multi-turn interaction. Legendry Bench runs single-turn scenarios. An agent that performs badly on single-turn queries but excels in multi-turn conversations won’t get credit for the multi-turn skill.
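Since the tool records token usage but does not rank models on cost, cost-per-correct-answer has to be computed by hand. The sketch below is one simple way to do it, assuming a single blended price per million tokens; real pricing usually distinguishes input and output tokens, and the numbers used here are made up.

```python
def cost_per_correct(total_tokens: int, price_per_million_tokens: float, correct_answers: int) -> float:
    """Total run cost divided by the number of scenarios answered correctly."""
    if correct_answers == 0:
        return float("inf")  # no correct answers: cost per correct answer is unbounded
    total_cost = total_tokens / 1_000_000 * price_per_million_tokens
    return total_cost / correct_answers


# Invented example: 2M tokens at a blended price of 3.00 per million, 150 correct answers.
cost = cost_per_correct(total_tokens=2_000_000, price_per_million_tokens=3.0, correct_answers=150)
print(f"{cost:.4f} per correct answer")
```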