Content Extract
Sometimes you don’t want Lagan to summarize a page. You want the raw text: the actual article content, cleaned of navigation, sidebars, and ads, and copyable to wherever you need it. The Content Extract tab shows you that raw output. It displays whatever the browser’s content extraction cascade pulled from the current page as plain text, with a word count, a character count, a truncation indicator if the content hit the 50,000 character cap, and action buttons to copy the content or send it to Lore. It’s the least glamorous tab in the Research module and also the one authors use most often, because “give me the text of this page so I can do something with it” is a more common need than anyone acknowledges.
The tab is populated automatically when a page finishes loading in the browser. You don’t have to click Extract — the content extraction runs on every page load (using the cascade described in Embedded Browser) and the result lives in this tab. You open the tab to see what the extraction produced.
What the tab shows
Four things:
- The extracted content. The raw text of whatever the extraction cascade pulled from the current page. Formatted as plain paragraphs, with paragraph breaks preserved. No HTML, no CSS, no images.
- Word count and character count. Visible at the top. Useful for knowing roughly how much content is in the page without reading it.
- Truncation indicator. If the extraction hit the 50,000 character cap, a visible badge says “truncated — 50,000/~60,000 characters.” Rare in practice — most article pages are well under the cap — but important when it happens because it tells you the content you’re looking at isn’t the full thing.
- Action buttons.
  - Copy to clipboard. Copies the full extracted content in one click.
  - Send to Lore. Creates a new Legendry entry in the Reference category with the page’s title and extracted content as the entry’s body, and the URL as source metadata. Turns a research page into a citable lore entry in one click.
  - Open in editor. Some authors want to paste the content into the editor to work with it there. This action opens a new document with the content pre-populated.
  - Save as research note. A lightweight save that stores the content as a plain text note without creating a full bookmark.
Why raw text matters
Three specific use cases where raw text is more useful than a generated summary:
Direct quotation. You’re writing a historical essay (in your manuscript) and you want to quote a specific sentence from a primary source. You need the exact text, not a summary. The Extract tab gives you the sentence verbatim.
Manual note-taking. You prefer to read and take your own notes rather than trust Lagan’s summary. The Extract tab gives you the raw content to work with. You read, you highlight mentally, you type your notes elsewhere.
Copy-to-prompt. You want to feed the page’s content into Hawken or Ishvana as context for a generation or a question. The Extract tab gives you the cleanest version of the content to paste into a chat.
In all three cases, the generated summary is either insufficient or unnecessary. Raw text is the unit.
The extraction cascade (refresher)
Content extraction runs whenever a page finishes loading in the browser. The cascade is:
- Look for <article>, usually the main content on a well-structured site.
- Look for <main> or [role="main"].
- Look for common content class selectors (.post-content, .entry-content, .article-body, and several others).
- Fall back to the largest text-heavy <div> or <section> (excluding blocks under 500 characters or where more than 50% of the text is links).
- Last resort: document.body.innerText.
The cascade is deliberately defensive. Most modern sites have <article> or <main>, which produces clean extractions. Sites that have neither fall through to the lower fallbacks, and the result is still usually good, just sometimes with some navigation text mixed in.
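The cascade can be sketched as a single function. This is a minimal illustration under stated assumptions, not Lagan’s actual script: the names extractContent and PRIMARY_SELECTORS are hypothetical, the selector list contains only the three class selectors named on this page, and the choice to fall through when a matched element has no text is a guess.

```javascript
// Hypothetical sketch of the extraction cascade; names are illustrative.
const PRIMARY_SELECTORS = [
  "article",                                      // step 1
  'main, [role="main"]',                          // step 2
  ".post-content, .entry-content, .article-body", // step 3 (partial list)
];

function extractContent(doc) {
  // Steps 1-3: first selector that yields non-empty text wins.
  for (const selector of PRIMARY_SELECTORS) {
    const el = doc.querySelector(selector);
    const text = el ? (el.innerText || "").trim() : "";
    if (text) return text; // assumption: an empty match falls through
  }

  // Step 4: largest text-heavy <div> or <section>.
  let best = "";
  for (const el of doc.querySelectorAll("div, section")) {
    const text = (el.innerText || "").trim();
    if (text.length < 500) continue; // skip blocks under 500 characters
    let linkChars = 0;
    for (const a of el.querySelectorAll("a")) {
      linkChars += (a.innerText || "").length;
    }
    if (linkChars / text.length > 0.5) continue; // skip link-heavy blocks
    if (text.length > best.length) best = text;
  }
  if (best) return best;

  // Step 5: last resort.
  return doc.body.innerText;
}
```

Passing the document in as a parameter keeps the function easy to exercise outside a browser. Inside a page it would be called as extractContent(document).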
The extraction runs as a small JavaScript snippet injected into the webview via executeJavaScript after did-stop-loading fires. It’s fast, typically a few milliseconds, and doesn’t affect the rest of the page’s behavior.
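A rough sketch of that wiring, assuming the browser is an Electron <webview> element (which exposes executeJavaScript and emits did-stop-loading). The function name attachExtractor, its parameters, and the error handling are hypothetical and only illustrate the shape of the hookup.

```javascript
// Hypothetical wiring; `webview` is assumed to be an Electron <webview> element.
// `extractionScript` is a string of JavaScript that evaluates to the page text.
function attachExtractor(webview, extractionScript, onExtracted) {
  webview.addEventListener("did-stop-loading", async () => {
    try {
      // executeJavaScript resolves with the result of the injected code.
      const text = await webview.executeJavaScript(extractionScript);
      onExtracted(typeof text === "string" ? text : "");
    } catch (err) {
      // Assumption: a failed injection reports empty content rather than throwing.
      onExtracted("");
    }
  });
}
```

Because the listener is attached once and fires on every load, the tab’s content can be refreshed without any click from the user, which matches the behavior described above.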
The 50,000 character cap
There’s a hard cap on how much text the extraction returns, set to 50,000 characters (roughly 8,000-10,000 words depending on language and formatting). Almost no single web page hits this: a very long-form essay might be 5,000 words, and a typical article is 800-2,000 words. The cap is a safety net, not a constraint.
When the cap does get hit (very long pages, single-page books, multi-chapter articles), the extraction returns the first 50,000 characters and flags the truncation in the UI. You know you’re looking at partial content. If you need the full thing, you’d have to use a different tool — browser print, PDF download, or external content extraction.
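The cap and the counts shown at the top of the tab could be derived along these lines. This is a sketch only: summarizeExtraction and its field names are hypothetical, it assumes the counts describe the returned (possibly truncated) text, and it measures length in JavaScript string units, so a cut could in principle land inside a multi-unit character.

```javascript
const CHAR_CAP = 50000; // the cap stated on this page

// Hypothetical helper: applies the cap and derives the stats the tab displays.
function summarizeExtraction(fullText, cap = CHAR_CAP) {
  const truncated = fullText.length > cap;
  const content = truncated ? fullText.slice(0, cap) : fullText;
  // Split on runs of whitespace; filter drops the empty token from "".
  const words = content.trim().split(/\s+/).filter(Boolean);
  return {
    content,
    charCount: content.length,
    wordCount: words.length,
    truncated,
    originalLength: fullText.length, // lets the UI show how much was cut
  };
}
```

Keeping originalLength alongside the truncated flag is what would let a badge report both the cap and the approximate full size of the page.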
Sending to Lore
The “Send to Lore” action is the most useful workflow from this tab. It’s specifically for the case where you find research content that should become part of your project’s canonical knowledge (a detailed historical account, a scientific explanation, a real-world reference you want to cite) and you want to turn it into a Legendry entry without leaving the Research module.
Clicking the action:
- Creates a new Legendry entry of type “Reference” (or another configurable type).
- Populates the entry’s title from the page’s <title> element.
- Populates the entry’s main content from the extracted text.
- Adds the page URL as source metadata.
- Saves the entry to your Legendry.
- Shows a confirmation and optionally opens the new entry for editing.
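The mapping from page to entry in the steps above might look like the following. The real Legendry schema is not documented on this page, so the function name buildReferenceEntry and every field name here (type, title, body, source) are hypothetical. Saving and the confirmation step are left out.

```javascript
// Hypothetical entry shape; the actual Legendry schema may differ.
function buildReferenceEntry(page, entryType = "Reference") {
  return {
    type: entryType,            // "Reference" by default, configurable
    title: page.title.trim(),   // from the page's <title> element
    body: page.extractedText,   // the extracted text, unchanged
    source: { url: page.url },  // page URL as source metadata
  };
}
```

For example, buildReferenceEntry({ title: "Roman roads", url: "https://example.com/roads", extractedText: "..." }) would produce a Reference entry whose body is the extracted text and whose source points back at the page.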
After the send, the entry is searchable from the Legendry, referenceable from your prose (via entity detection), and available to the Lorekeeper for consistency checking. It’s a legitimate lore entry, not a bookmark. The distinction matters: bookmarks are your research library, lore entries are your project’s canon.
What the Extract tab isn’t
- Not a web scraper. The tab works on whatever’s currently in the browser, not on URLs you haven’t visited. For bulk extraction across multiple URLs, you’d need a dedicated scraping tool.
- Not a content archiver. The tab shows you what’s currently extracted. It doesn’t preserve old extractions if you re-visit a page later. For archival, use Smart Bookmarks — bookmarks store the extracted content at save time, so the snapshot persists even if the page changes.
- Not a format converter. The tab gives you plain text. If you need formatted HTML, Markdown with structure preserved, or any other format, you’d have to convert elsewhere.
- Not a read-later queue. For reading lists, save as Smart Bookmarks. The Extract tab is for immediate work on the current page, not for queueing pages to come back to.