Cache & Performance
Performance in a long-session writing app isn’t about being fast on the first click. It’s about staying fast on click number 2,000, when you’ve been working for five hours, the Legendry has been queried hundreds of times, the editor has autosaved dozens of documents, and the cache has filled up with useful data. At that point the app should feel exactly as responsive as it did when you first opened it.
Most desktop apps get slower over a long session because they accumulate memory, cache stale data, or hit background services that hold locks. Ishvana is built to avoid that, not because it’s magic, but because the performance architecture is deliberate about what gets cached, for how long, and under what conditions. This page explains how that works: not to make you worry about it, but to give you the mental model for why the app feels the way it feels and, when something does go wrong, where to look.
The cache: what’s actually cached
Ishvana’s backend runs a cache service as a first-class subsystem. The main cache is in-memory rather than disk-backed (a separate disk cache holds expensive computations that should survive restarts), and it stores frequently accessed results so the same computation doesn’t have to run twice in a session.
What gets cached:
- Lore entries by ID. When a document references a character, Ishvana looks up that character once and caches the entry for the rest of the session. Subsequent references in the same document — or in other documents opened later — hit the cache instead of querying the database.
- Lore search results. Semantic search queries return a ranked list of lore entries. The results get cached keyed by the query, so running the same search again returns the same results instantly.
- Entity extraction results. Running the entity extractor on a paragraph is moderately expensive. The cache stores the extracted entity list keyed by the paragraph’s content hash, so re-running on unchanged text is free.
- Formatted document content. When a document is loaded and parsed for display, the parsed structure is cached. Reopening the same document in the same session is instant.
- Stat block computations. Once a stat block’s formulas have been evaluated, the computed values are cached until the stat block changes.
- ML pipeline intermediate results. Clustering, similarity, and graph analysis pipelines cache their intermediate data so a second run on the same corpus can skip expensive steps.
- LLM response cache (short-lived). Identical LLM requests made within a very short window (seconds) can be served from a small cache to prevent accidental duplicate calls. This is more of a safety net than an optimization.
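The short-lived LLM cache in the last item can be pictured as a small time-bounded map. The sketch below is illustrative only: the class name `ShortLivedCache`, the five-second window, and the `call` callback are assumptions made for the example, not Ishvana’s actual API.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Tuple


class ShortLivedCache:
    """Hypothetical helper: serve identical requests from memory for a few seconds."""

    def __init__(self, ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds          # ASSUMPTION: the real window is not documented
        self._clock = clock              # injectable so the behaviour can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Identical requests must produce identical keys, so serialise deterministically.
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str,
                    call: Callable[[str, str], str]) -> str:
        key = self._key(model, prompt)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                # duplicate inside the window: no second call
        # Drop expired entries so the map stays small, then make the real call.
        self._entries = {k: v for k, v in self._entries.items()
                         if now - v[0] < self._ttl}
        response = call(model, prompt)
        self._entries[key] = (now, response)
        return response
```

Because the window is only seconds long, a cache like this catches double-clicks and retries without ever serving a noticeably old response.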
What doesn’t get cached:
- Document content being edited. The editor has its own live state. The cache doesn’t interfere with it.
- Write operations. Every save goes to disk. The cache is read-through — it stores the result of a read, not a write.
- Real-time data. Desktop activity feed, recent edit timestamps, session analytics — these are computed fresh on every request because their freshness is part of their value.
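To make the read-through idea concrete, here is a minimal sketch of a cache keyed by content hash, in the spirit of the entity extraction cache described above. The names `ExtractionCache` and `content_key` are hypothetical, and the extractor is passed in as a plain function; the real service may be structured quite differently.

```python
import hashlib
from typing import Callable, Dict, List


def content_key(text: str) -> str:
    """Hash the paragraph text; unchanged text always maps to the same key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ExtractionCache:
    """Hypothetical read-through cache: stores results of reads, never writes."""

    def __init__(self, extract: Callable[[str], List[str]]):
        self._extract = extract                  # the expensive computation
        self._results: Dict[str, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def entities_for(self, paragraph: str) -> List[str]:
        key = content_key(paragraph)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        # Miss: run the extractor once and remember the result under the hash.
        self.misses += 1
        result = self._extract(paragraph)
        self._results[key] = result
        return result
```

Editing a paragraph changes its hash, so the edited text simply misses and is extracted again, while the entry for the old text is never returned for the new text.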
Cache invalidation
The hard problem with caching is knowing when to throw the cache away because the data has changed. Ishvana handles this through scope-aware invalidation:
- Lore entry cache gets invalidated when you edit that specific entry. Only the one entry’s cached data is cleared; other entries stay cached.
- Lore search cache gets invalidated when the Legendry’s collection of entries changes in a way that might affect search results. Adding a new entry or deleting an entry clears all search caches; editing an existing entry only clears caches whose queries might have matched the edited entry.
- Entity extraction cache is keyed by content hash, so it never needs explicit invalidation. The same text always produces the same extraction, and edited text hashes to a new key, so there’s no “stale” state to worry about.
- Stat block cache invalidates when the stat block or its source ruleset changes.
The invalidation is granular enough that heavy work in one area (editing a single lore entry repeatedly) doesn’t invalidate caches in unrelated areas (document search, ML pipeline state). This is the reason the cache stays useful over a long session instead of getting thrown away constantly.
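The scoping rules above can be sketched in a few lines. This is a simplified model, not the actual implementation: `LoreCaches` and its method names are hypothetical, and the test for whether a query “might have matched” an edited entry is an assumption (the entry appears in the cached results, or a caller-supplied predicate says the query could match it now).

```python
from typing import Callable, Dict, List, Optional


class LoreCaches:
    """Hypothetical holder for two of the caches, to show scoped invalidation."""

    def __init__(self) -> None:
        self.entries: Dict[str, dict] = {}         # entry id -> cached entry
        self.searches: Dict[str, List[str]] = {}   # query -> ranked entry ids

    def on_entry_edited(self, entry_id: str,
                        might_match: Optional[Callable[[str], bool]] = None) -> None:
        # Only the edited entry is dropped; other entries stay cached.
        self.entries.pop(entry_id, None)
        # ASSUMPTION: a cached search is stale if it returned the entry, or if the
        # optional predicate says the edited entry could now match the query.
        stale = [query for query, ids in self.searches.items()
                 if entry_id in ids or (might_match is not None and might_match(query))]
        for query in stale:
            del self.searches[query]

    def on_entry_added_or_deleted(self, entry_id: str) -> None:
        # The collection itself changed, so every cached ranking is suspect.
        self.entries.pop(entry_id, None)
        self.searches.clear()
```

Note what is absent: nothing here touches document search or ML pipeline state, which is the point of keeping invalidation scoped.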
Cache size and eviction
The cache has a configurable size limit (default is reasonable for consumer hardware, usually a few hundred megabytes of in-memory data). When it fills up, the cache evicts least-recently-used entries to make room for new ones.
LRU eviction is good enough for most workflows. The exceptions:
- Cold cache at startup. When the app starts, the cache isn’t warm yet, so the first operations of a session are slower. A configurable preload can warm common entries at startup, but it’s off by default because most users don’t want to pay a startup cost to speed up the first few operations of a session.
- Working-set mismatch. If your working set genuinely exceeds the cache size (you’re actively using 1,000 lore entries in a session), LRU can’t keep up. The solution is to raise the cache size limit or accept the cache misses. On modern hardware, the default limit is almost always plenty.
You can see cache statistics in the Analysis workspace if you’re curious about hit rates and eviction frequency. Most users never look at it because the cache just works.
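For readers who want to see the eviction policy spelled out, the following is a minimal LRU cache with a byte budget, built on Python’s `collections.OrderedDict`. The class name and the byte-length sizing are assumptions for illustration; Ishvana’s cache may measure size and track statistics differently.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Illustrative least-recently-used cache bounded by total value size."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.evictions = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        value = self._items.get(key)
        if value is not None:
            # A read counts as use: move the entry to the most-recent end.
            self._items.move_to_end(key)
        return value

    def put(self, key: str, value: bytes) -> None:
        if key in self._items:
            self.used_bytes -= len(self._items.pop(key))
        self._items[key] = value
        self.used_bytes += len(value)
        # Evict from the least-recent end until the budget is respected.
        while self.used_bytes > self.max_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)
            self.evictions += 1
```

The working-set mismatch described above falls out of this directly: if the entries you keep touching add up to more than `max_bytes`, every `put` evicts something you will need again shortly, and the hit rate collapses.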
Hardware tier detection
Ishvana is built to run across a wide range of hardware, from modest laptops to high-end workstations. Instead of assuming everyone has a 16-core CPU and 64 GB of RAM, the backend detects your machine’s capabilities at startup and picks a performance tier:
- High-end. GPU + 16 GB+ RAM + 8+ CPU cores. Maximum optimization enabled. Larger cache sizes, more parallel workers, richer ML pipeline configurations.
- Balanced. 16 GB+ RAM + 6+ CPU cores, no GPU. High optimization with most features enabled at normal levels. The target tier for most serious authors working on consumer hardware.
- Standard. Moderate RAM and cores. Conservative configurations. Reduced parallelism, smaller cache, simpler ML pipeline defaults.
- Basic. Below-standard hardware. Minimal concurrency, small caches, conservative model sizes.
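The tier rules can be read as a simple decision ladder. In the sketch below, the High-end and Balanced thresholds come from the list above; the Standard cutoff (8 GB and 4 cores) is an assumption invented for the example, since the list only says “moderate”. The sketch also assumes a machine with a GPU that misses the High-end core count lands in Balanced. All names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HIGH_END = "high-end"
    BALANCED = "balanced"
    STANDARD = "standard"
    BASIC = "basic"


@dataclass(frozen=True)
class Hardware:
    ram_gb: float
    cpu_cores: int
    has_gpu: bool


def pick_tier(hw: Hardware) -> Tier:
    # Documented: GPU + 16 GB+ RAM + 8+ CPU cores.
    if hw.has_gpu and hw.ram_gb >= 16 and hw.cpu_cores >= 8:
        return Tier.HIGH_END
    # Documented: 16 GB+ RAM + 6+ CPU cores (GPU not required).
    if hw.ram_gb >= 16 and hw.cpu_cores >= 6:
        return Tier.BALANCED
    # ASSUMPTION: "moderate RAM and cores" is taken to mean 8 GB and 4 cores.
    if hw.ram_gb >= 8 and hw.cpu_cores >= 4:
        return Tier.STANDARD
    return Tier.BASIC
```

Checking the strictest tier first means each machine gets the highest tier it qualifies for, and anything that fails every test falls through to Basic.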
The tier affects things you might not expect:
- Thread pool size. Higher tiers run more parallel workers, which makes ML pipelines and entity extraction finish faster.
- Maximum concurrent tasks. Higher tiers allow more operations to run simultaneously without blocking.
- Memory limits for caches and buffers. Higher tiers allocate more memory to each subsystem.
- GPU usage. When a GPU is detected, the local embedding model and any other GPU-capable operations use it. Without a GPU, they fall back to CPU.
- Recommended LLM model sizes. Lower tiers prefer smaller local models; higher tiers default to larger ones if using Ollama.
Tier detection runs once at startup and caches the result. You can’t override the tier directly; the detection is considered authoritative. What you can override are specific tunables (cache sizes, alert thresholds, cache TTLs, retention policies) in Settings → Performance, which gives you fine-grained control while the detected tier stays in place.
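One way to picture the relationship between the tier and the tunables is as defaults plus overrides. The sketch below is an assumption-heavy illustration: the field names, the numbers in `TIER_DEFAULTS`, and the function `effective_settings` are all invented for the example and do not reflect Ishvana’s real defaults.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Mapping


@dataclass(frozen=True)
class PerformanceSettings:
    cache_size_mb: int
    cache_ttl_seconds: int
    cpu_alert_percent: int
    metrics_retention_days: int


# ASSUMPTION: illustrative numbers only, not the shipped defaults.
TIER_DEFAULTS = {
    "high-end": PerformanceSettings(1024, 3600, 90, 30),
    "balanced": PerformanceSettings(512, 3600, 85, 30),
    "standard": PerformanceSettings(256, 1800, 80, 14),
    "basic": PerformanceSettings(128, 900, 75, 7),
}


def effective_settings(tier: str, overrides: Mapping[str, Any]) -> PerformanceSettings:
    """Start from the detected tier's defaults and apply individual tunables."""
    known = {f.name for f in fields(PerformanceSettings)}
    unknown = set(overrides) - known
    if unknown:
        # The tier itself is not a tunable, so a key like "tier" is rejected here.
        raise KeyError(f"unknown tunables: {sorted(unknown)}")
    return replace(TIER_DEFAULTS[tier], **overrides)
```

For example, `effective_settings("standard", {"cache_size_mb": 400})` raises the cache limit while every other value stays at the Standard tier’s default.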
The performance tab in settings
Settings → Performance is where you see the detected hardware tier and can tune the specific knobs without forcing a different tier. The visible settings include:
- CPU and memory alert thresholds. When to warn you that the system is under pressure.
- Bottleneck detection. Enable or disable the optimization recommendations that come out of the intelligence subsystem.
- Cache TTL and size. How long cached data stays valid and how much memory the cache can use.
- Metrics retention. How long historical performance metrics are kept.
- Undo depth. How many undo steps the editor keeps in memory.
- Toast duration. UI timing for notifications (not technically a performance setting, but it lives next to the other knobs).
- GPU probe. An explicit test button that runs a GPU check. Shows pass / fail / unavailable / crash, with a retest button.
Most users never touch these. The tier detection’s defaults are tuned to work well on each tier. The knobs exist for the occasional author with an unusual hardware setup or a specific complaint about how the app feels.
Real-time responsiveness guarantees
A few specific operations are guaranteed to be fast regardless of what else is happening:
- Keystrokes in the editor. Typing should never feel like it has lag. The editor runs in the Electron renderer, not the backend, so backend performance doesn’t affect it.
- Navigation between tabs. Switching from the editor to the Legendry to the Research panel should be instant because the views are kept alive in memory.
- Saves. Document saves complete within milliseconds because the save path is direct — the editor writes to the DOCX file, the backend updates the metadata row, and that’s it. No intermediate queues or transactions.
If any of these ever feel slow, it’s worth opening Error Tracking to see if a background service is logging errors that might be blocking the responsive path. Normal operation should not produce lag on the fast paths even on the most modest hardware.