DAPT and precleaning for DAPT

This commit is contained in:
Joey Eamigh 2026-03-29 20:33:39 -04:00
parent c4d7732c87
commit 9d41dd199f
No known key found for this signature in database
GPG Key ID: CE8C05DFFC53C9CB
31 changed files with 7350 additions and 61 deletions

# Codebook Rationale & Interpretive Guide
Companion to `LABELING-CODEBOOK.md`. Covers the "why" behind design decisions and common interpretive pitfalls that aren't obvious from the codebook itself.
---
## Category Design: Mapping to SEC Regulation S-K Item 106
The six substantive categories map directly to the structure of the SEC's cybersecurity disclosure rule (adopted July 2023):
| Codebook Category | SEC Basis | What the SEC is asking |
|---|---|---|
| Board Governance | Item 106(c)(1) | How does the board oversee cyber risk? |
| Management Role | Item 106(c)(2) | Who in management is responsible, and what qualifies them? |
| Risk Management Process | Item 106(b) | What processes do you use to assess, identify, and manage cyber risk? |
| Third-Party Risk | Item 106(b) | How do you handle vendor/supply chain cyber risk? |
| Strategy Integration | Item 106(b)(2) | Has cyber risk materially affected your business or financials? |
| Incident Disclosure | 8-K Item 1.05 | What happened in an actual cybersecurity incident? |
| None/Other | N/A | Classifier catch-all for non-substantive content |
### Editorial choice: Third-Party Risk as a separate category
The SEC does not give Third-Party Risk its own subsection — vendor/supply chain oversight is part of 106(b) alongside general risk management. The codebook carves it out as a distinct class because it represents a sufficiently different disclosure pattern to be analytically useful.
### "Risk Management" is broader than it sounds
The SEC's 106(b) definition of risk management encompasses the full lifecycle: assessing, identifying, **and managing** cybersecurity risks. Under frameworks like NIST CSF (which the SEC references), "managing" includes Respond and Recover functions — not just preventive controls.
This means incident response **procedures** (escalation chains, playbooks, notification workflows, materiality determination processes) are Risk Management Process, not Incident Disclosure. The test:
| What the paragraph describes | Category |
|---|---|
| Pre-established process for handling incidents (playbooks, escalation chains, "in the event of...") | **Risk Management Process** |
| An actual incident that occurred (dates, scope, remediation of a real event) | **Incident Disclosure** |
Conditional language ("in the event of," "if necessary," "if and when") is a strong signal that the paragraph describes a process, not an event.
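As a rough illustration, the conditional-language signal can be mechanized with a small pattern check. This is a sketch, not part of the labeling pipeline; the cue list below is only the phrases named above, and the function name is illustrative:

```python
import re

# Illustrative cue list: the phrases the guide names as strong
# process signals ("in the event of," "if necessary," "if and when").
CONDITIONAL_CUES = re.compile(
    r"\b(in the event of|if necessary|if and when)\b", re.IGNORECASE
)

def looks_like_process(paragraph: str) -> bool:
    """True when conditional language suggests a pre-established process
    (Risk Management Process) rather than an actual incident
    (Incident Disclosure)."""
    return CONDITIONAL_CUES.search(paragraph) is not None
```

A real annotator would treat this as one signal among several, not a decision rule on its own.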
### "Strategy Integration" is narrower than it sounds
Strategy Integration does not mean "strategic approach to cybersecurity." It specifically covers the **business and financial consequences** of cyber risk — the SEC 106(b)(2) question of whether cyber risk hit the bottom line or changed business strategy.
What qualifies:
- Materiality assessments ("have not materially affected our business strategy, results of operations, or financial condition")
- Cybersecurity spending and investment (budgets, dollar amounts, year-over-year changes)
- Insurance coverage (carriers, limits, deductibles)
- Financial impact of incidents (costs, revenue loss, insurance claims)
What does not qualify:
- Describing a sophisticated incident response process (that's Risk Management Process even though it's "strategic" in the colloquial sense)
- Describing a materiality **determination process** (the process for deciding if something is material is Risk Management Process; the actual materiality **conclusion** is Strategy Integration)
---
## Specificity Scale: Design Rationale
### The four levels measure disclosure quality progression
| Level | What it tells you |
|---|---|
| 1 — Generic Boilerplate | Company said nothing substantive. Could paste into any filing unchanged. |
| 2 — Sector-Adapted | Company name-dropped a recognized standard (NIST, ISO 27001, SOC 2, etc.) but nothing unique to their organization. |
| 3 — Firm-Specific | Company disclosed at least one fact unique to their organization. |
| 4 — Quantified-Verifiable | Company disclosed two or more independently verifiable hard facts. |
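The four levels reduce to a simple precedence rule. As a sketch (the boolean and count inputs are illustrative abstractions, not fields in the codebook):

```python
def specificity_level(named_standard: bool, firm_specific_facts: int,
                      verifiable_facts: int) -> int:
    """Map the four-level specificity scale to simple counts.
    Illustrative decision rule, not the annotation procedure itself."""
    if verifiable_facts >= 2:
        return 4  # Quantified-Verifiable: two or more hard facts
    if firm_specific_facts >= 1:
        return 3  # Firm-Specific: at least one unique fact
    if named_standard:
        return 2  # Sector-Adapted: named a standard, nothing firm-specific
    return 1      # Generic Boilerplate
```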
### "Sector-Adapted" refers to the cybersecurity sector, not the company's industry
The name is misleading. "Sector-Adapted" does not mean "the company adapted its disclosure to its industry" (e.g., a bank discussing financial-sector cyber risks). It means the company referenced a recognized **cybersecurity** standard or framework — NIST CSF, ISO 27001, SOC 2, PCI DSS, HIPAA, etc. The "sector" is cybersecurity itself. A utility company mentioning NERC CIP and a retailer mentioning PCI DSS both qualify for Level 2 the same way — they named a standard. The company's own industry is irrelevant to the specificity score.
### Level 2 is intentionally narrow
Level 2 requires naming a recognized standard but having zero firm-specific facts. In practice this is uncommon — most filings either say nothing specific (Level 1) or name a framework alongside a CISO or named committee in the same paragraph (Level 3).
This is a feature, not a bug. The analytically interesting distinction is between Level 1 (boilerplate box-checking) and Level 3/4 (substantive disclosure). Level 2 is a real but thin middle ground. A mushier middle would make the classifier's job harder without adding research value.
### The research contribution is the specificity dimension itself
The SEC requires cybersecurity disclosure but does not grade its quality. The 1-4 specificity scale measures something the SEC doesn't: how much substance is actually in the disclosure versus boilerplate. The core research question is whether companies are genuinely disclosing or just filling the regulatory box.
### Common specificity pitfalls
**Generic practices are not specific.** Penetration testing, vulnerability scanning, tabletop exercises, phishing simulations, security awareness training, encryption, logging and monitoring — all Level 1. These are standard activities that appear in nearly every filing.
**Long paragraphs can still be Level 1.** A paragraph can list ten generic security practices and still be boilerplate. Length and detail are not the same as specificity.
**Cross-references and section titles don't add specificity.** Quoting a long Risk Factors section title with specific-sounding language ("collaborators, contract research organizations, third-party logistics providers") is just metadata, not disclosure substance.
**The materiality boilerplate is Level 1.** The phrase "have not materially affected, and are not reasonably likely to materially affect, our business strategy, results of operations, or financial condition" appears nearly verbatim in thousands of filings. It is Strategy Integration (it makes a materiality assessment) but Specificity 1 (the assessment is template language).

docs/DAPT-PROCEDURE.md
# DAPT/TAPT Training Procedure
**Date:** 2026-03-29
**Hardware:** NVIDIA RTX 3090 (24GB VRAM), CUDA driver 13.2, PyTorch 2.10.0+cu128
---
## Pre-flight Checklist
| Check | Status |
|-------|--------|
| PyTorch 2.10.0+cu128, CUDA available | Verified |
| RTX 3090, 25.3 GB VRAM, bf16 supported | Verified |
| CUDA driver 13.2 / runtime 12.8 forward compatible | Verified (GPU matmul test passed) |
| ModernBERT-large loads: 396M params, max_position_embeddings=8192 | Verified |
| Corpus: 14,756 docs, ~1.06B tokens, 15 shards | Verified |
| After <10K filter: 14,568 docs, ~1.056B tokens (0.027% loss) | Verified |
| Tokenize+chunk pipeline: 10 docs -> 85 sequences of 8192 tokens | Verified |
| Config: seq_len=8192, batch=1, grad_accum=32, 1 epoch, lr=5e-5, mlm=0.30 | Set |
## DAPT Corpus Summary
- **14,568 documents** (after filtering 188 cover pages <10K chars)
- **~1.056 billion tokens** (ModernBERT tokenizer, 4.72 chars/token)
- **~136K training sequences** at seq_len=8192
- **Median document: ~73K tokens** (347K chars) — 90.6% of docs exceed 8192 tokens
- Cleaned: XBRL data blobs stripped, exhibit listings stripped, URLs removed, F-N page numbers removed
- Source: 14,759 cached 10-K HTML filings, FY2023-FY2025, processed by `ts/scripts/dapt-corpus-prep.ts`
## Training Configuration
**Config file:** `python/configs/dapt/modernbert.yaml`
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| `max_seq_length` | 8192 | Match ModernBERT's pre-training context length |
| `per_device_train_batch_size` | 1 | Memory-limited at 8192 seq_len on 24GB |
| `gradient_accumulation_steps` | 32 | Effective batch size = 32 |
| `num_train_epochs` | 1 | Single pass per Gururangan et al. (2020) and Ponnock (2025) |
| `learning_rate` | 5e-5 | Standard for continued pre-training |
| `mlm_probability` | 0.30 | ModernBERT's pre-training masking rate |
| `warmup_ratio` | 0.05 | ~213 warmup steps |
| `gradient_checkpointing` | true | Required for 8192 seq_len on 24GB |
| `bf16` | true | Native RTX 3090 support |
| `save_steps` | 1000 | Checkpoint every ~1000 steps |
| `eval_steps` | 1000 | Evaluate every ~1000 steps |
| `save_total_limit` | 3 | Keep last 3 checkpoints |
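The table above corresponds to a config along these lines. This is a sketch of `python/configs/dapt/modernbert.yaml`, not the file itself; key names follow HuggingFace `TrainingArguments` conventions and may differ from the actual schema:

```yaml
model_name: answerdotai/ModernBERT-large
max_seq_length: 8192
mlm_probability: 0.30
per_device_train_batch_size: 1
gradient_accumulation_steps: 32   # effective batch size = 32
num_train_epochs: 1
learning_rate: 5.0e-5
warmup_ratio: 0.05
gradient_checkpointing: true
bf16: true
save_steps: 1000
eval_steps: 1000
save_total_limit: 3
```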
### Epoch Decision Justification
We train for 1 epoch (single pass over the corpus), following the empirical consensus:
- **Gururangan et al. (2020), "Don't Stop Pretraining" (ACL 2020):** Trained DAPT for "12.5K steps, which amounts to a single pass on each domain dataset" across corpora ranging from 2-8B tokens. A single pass was sufficient for consistent downstream gains across all four domains and eight tasks.
- **Ponnock (2025), "The Data Efficiency Frontier of Financial Foundation Models" (arXiv:2512.12384):** Found that SEC-specific DAPT exhibits diminishing marginal returns beyond ~250M tokens within a single epoch: "Both models exhibit their largest improvements in the early stages of continued pretraining: loss drops noticeably between 50M and 200M tokens, after which the rate of improvement slows." Our ~1B token corpus is already well past the diminishing-returns threshold.
Additional epochs risk overfitting to the domain corpus without proportional downstream benefit, while general-domain capability remains stable through a single pass.
### Sequence Length Decision
ModernBERT was pre-trained with an 8192-token context. We match this during DAPT so that adaptation exercises the model's full context window. ModernBERT uses rotary position embeddings rather than learned absolute positions, so there are no per-position weights to go stale, but at seq_len=2048 the attention behavior over positions 2048-8191 would never be exercised (or receive gradients) during DAPT.
The tradeoff is memory: batch_size drops from 4 (at 2048) to 1 (at 8192), compensated by gradient_accumulation=32 to maintain effective batch size of 32. Training time is comparable because 4x fewer steps offset the slower per-step time.
For our downstream task (paragraph classification at ~50-400 tokens), the long-context benefit is modest — the primary DAPT benefit is vocabulary and domain language patterns, which transfer at any sequence length. But there is no cost to using 8192, so we preserve the model's full capability.
## Step 1: DAPT
### Command
```bash
cd python
bun run py:train dapt --config configs/dapt/modernbert.yaml
```
Equivalent to: `uv run main.py dapt --config configs/dapt/modernbert.yaml`
### What happens
1. Loads ModernBERT-large from HuggingFace (cached after first download)
2. Loads 14,756 docs from `data/dapt-corpus/`, filters 188 < 10K chars
3. Tokenizes all text, concatenates, chunks into ~136K sequences of 8192 tokens
4. Splits 2% validation (~2,700 sequences), 98% train (~133K sequences)
5. Trains 1 epoch of MLM with 30% masking, bf16, gradient checkpointing
6. ~4,257 steps total, logging every 50, checkpoint+eval every 1,000
7. Saves final model + tokenizer to `checkpoints/dapt/modernbert-large/final/`
8. Reports final eval loss and perplexity
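Step 3 is the standard concatenate-then-chunk MLM packing scheme. A minimal sketch with toy token ids (the actual implementation lives in `python/src/data/corpus.py`; the function name here is illustrative):

```python
def pack_sequences(docs: list[list[int]], seq_len: int) -> list[list[int]]:
    """Concatenate tokenized docs into one token stream, then slice it
    into fixed-length chunks; the trailing remainder is dropped."""
    stream = [tok for doc in docs for tok in doc]
    n_chunks = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_chunks)]
```

Packing across document boundaries is what lets 90%+ of docs longer than 8,192 tokens contribute every token, rather than being truncated per document.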
### Expected duration
~4-8 hours on RTX 3090 (depends on actual seconds/step at 8192 with gradient checkpointing).
### Resume if interrupted
HuggingFace Trainer auto-saves checkpoints every 1,000 steps. Re-run the same command — it detects existing checkpoints and resumes automatically.
### Output
```
checkpoints/dapt/modernbert-large/
checkpoint-1000/
checkpoint-2000/
checkpoint-3000/
final/ <- final model + tokenizer
config.json
model.safetensors
tokenizer.json
...
```
## Step 2: TAPT
After DAPT completes, continue MLM on the 72K Item 1C paragraphs specifically.
### Command
```bash
bun run py:train dapt --config configs/dapt/modernbert.yaml \
--model-path ../checkpoints/dapt/modernbert-large/final \
--data-path ../data/paragraphs/paragraphs-clean.patched.jsonl \
--output-dir ../checkpoints/tapt/modernbert-large \
--stage tapt
```
### What happens
1. Loads the DAPT checkpoint (not the base ModernBERT)
2. Loads 72,045 patched paragraphs from `paragraphs-clean.patched.jsonl`
3. Tokenizes, concatenates, chunks (much smaller corpus — ~10M tokens)
4. Trains MLM with same hyperparameters
5. Saves to `checkpoints/tapt/modernbert-large/final/`
### Expected duration
~2-3 hours (much smaller corpus).
### Output
```
checkpoints/tapt/modernbert-large/
final/ <- SEC-cyBERT-large (DAPT + TAPT)
```
## Step 3: Ablation Checkpoints
The training pipeline produces clean ablation rows for the paper:
| Model | Checkpoint | Description |
|-------|-----------|-------------|
| Base | `answerdotai/ModernBERT-large` | Off-the-shelf, no domain adaptation |
| +DAPT | `checkpoints/dapt/modernbert-large/final` | After domain pre-training on 14.5K filings |
| +DAPT+TAPT | `checkpoints/tapt/modernbert-large/final` | After task pre-training on 72K paragraphs |
Each checkpoint can be independently fine-tuned with classification heads to isolate the contribution of each pre-training stage.
## Monitoring
During training, the Trainer logs to stderr every 50 steps:
- `loss` — training MLM loss (cross-entropy on masked tokens)
- `learning_rate` — current LR (ramps up during warmup, then decays)
- `epoch` — progress through the epoch
Every 1,000 steps, it also reports:
- `eval_loss` — validation MLM loss
- Perplexity can be computed as `exp(eval_loss)` (the Trainer's MLM loss is natural-log cross-entropy)
**What to watch for:**
- Training loss should decrease steadily from ~2.5-3.0 to ~1.5-2.0
- Eval loss should track training loss (if eval loss diverges upward, the model is overfitting — but this is unlikely in 1 epoch)
- If loss spikes or goes to NaN, the learning rate may be too high
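Since the reported loss is natural-log cross-entropy, perplexity is just its exponential. A one-liner for post-processing the Trainer's log history (helper name illustrative):

```python
import math

def perplexity(eval_loss: float) -> float:
    """HuggingFace MLM losses are natural-log cross-entropy,
    so perplexity = e^loss."""
    return math.exp(eval_loss)
```

For example, an eval loss of 1.6 corresponds to a perplexity of about 4.95.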
## Artifacts
| File | Purpose |
|------|---------|
| `python/configs/dapt/modernbert.yaml` | DAPT config |
| `python/configs/dapt/neobert.yaml` | NeoBERT config (if needed) |
| `python/main.py` | CLI entrypoint |
| `python/src/dapt/train.py` | Training loop |
| `python/src/data/corpus.py` | Corpus loading + tokenization |
| `python/src/common/config.py` | Typed YAML config |
| `ts/scripts/dapt-corpus-prep.ts` | Corpus preparation from HTML |
| `ts/scripts/dapt-corpus-analytics.ts` | Corpus analytics |
| `data/dapt-corpus/shard-*.jsonl` | Cleaned corpus (15 shards) |

docs/DATA-QUALITY-AUDIT.md
# Data Quality Audit — SEC-cyBERT Corpus
**Date:** 2026-03-29
**Scope:** Full audit of DAPT corpus (14,756 docs) and paragraph data (72,045 paragraphs)
**Method:** 6 automated agents + manual investigation
---
## 1. Executive Summary
The data is in better shape than initially feared, but two significant issues were uncovered:
1. **Inlined section headings affect ~22% of paragraphs** across all generators. These are section titles ("Risk Management and Strategy", "Board Oversight") prepended to paragraph body text with no separator. Because the rate is consistent across generators, the cause is our extraction pipeline's heading detection, not a generator HTML quirk.
2. **EFiling/EDGAR Agent (GoFiler/Novaworks XDX)** produces severely degraded extraction quality: 36.8% orphan word rate (8x corpus average), 5.9% fragment rate, lowest paragraphs-per-filing. This generator was hidden in a 45% "UNKNOWN" bucket until we identified it. It affects 1,014 filings and 5,779 paragraphs.
**Decision:** Strip inlined headers from fine-tuning data. Expand orphan word patching to cover EFiling/XDX paragraphs. Tag all paragraphs with generator metadata for quality-aware training.
---
## 2. Generator Landscape
### Identification
We identified **14 distinct filing generators** covering 99.99% of all 14,759 HTML files. Only 2 files remain unidentified (both 0-byte empty files). Detection used a combination of HTML meta tags, comments, namespace declarations, CSS class patterns, and CIK-based filing agent identification.
Full reference: `docs/EDGAR-FILING-GENERATORS.md`
### Generator Distribution
| Generator | Files | % | Paragraphs | Quality Tier |
|-----------|-------|---|------------|-------------|
| Workiva | 3,592 | 24.3% | 22,407 | Clean |
| Inline XBRL (unattributed) | 2,417 | 16.4% | 15,233 | Clean |
| Donnelley Financial Solutions | 2,327 | 15.8% | 13,153 | Clean |
| EFiling/EDGAR Agent (XDX) | 1,997 | 13.5% | 5,779 | **Bad** |
| Toppan Merrill | 1,378 | 9.3% | 7,332 | OK |
| CompSci Transform | 879 | 6.0% | 3,287 | **Degraded** |
| SEC Publisher | 793 | 5.4% | — | — |
| ThunderDome | 732 | 5.0% | 3,581 | OK |
| Broadridge PROfile | 465 | 3.2% | 772 | OK |
| Certent | 86 | 0.6% | — | — |
| SGML-wrapped | 58 | 0.4% | — | — |
| IRIS Carbon | 20 | 0.1% | — | — |
| RDG Portal | 12 | 0.1% | — | — |
| PDF to EDGAR | 1 | <0.1% | — | — |
Note: Not all HTML files produced paragraphs (some lack Item 1C, some are 8-Ks or amendments).
### Quality Metrics by Generator
| Generator | Orphan% | Fragment% | Trunc% | InlHdr% | AvgWC | Paras/Filing |
|-----------|---------|-----------|--------|---------|-------|-------------|
| Workiva | 0.6% | 1.2% | 0.5% | 21.9% | 99.7 | 8.4 |
| Donnelley | 0.5% | 1.4% | 0.5% | 21.8% | 92.7 | 7.9 |
| Inline XBRL | 0.9% | 1.5% | 0.6% | 21.8% | 98.4 | 8.1 |
| Toppan Merrill | 3.2% | 3.0% | 1.4% | 23.1% | 84.7 | 8.1 |
| ThunderDome | 3.0% | 4.3% | 1.8% | 24.4% | 83.0 | 7.7 |
| Broadridge | 3.4% | 3.5% | 2.1% | 21.5% | 84.4 | 7.8 |
| **CompSci Transform** | **14.8%** | **5.8%** | 1.7% | 15.4% | 72.1 | 5.6 |
| **EFiling/XDX** | **36.8%** | **5.9%** | **2.1%** | 16.5% | 69.8 | 5.7 |
| *Corpus average* | *4.7%* | *2.3%* | *0.9%* | *21.5%* | *91.9* | *7.7* |
**Bold** = >2x corpus average.
Key observations:
- Inlined headers (~22%) are consistent across ALL generators → extraction pipeline issue, not generator-specific
- Orphan words are highly concentrated: EFiling/XDX (36.8%) and CompSci Transform (14.8%) account for the vast majority
- Workiva and Donnelley produce the cleanest output (>70% of paragraphs)
- EFiling/XDX also has the lowest paragraphs-per-filing (5.7 vs 7.7 avg), suggesting extraction misses content
- CompSci Transform was acquired by Broadridge in July 2024; newer filings may appear as Broadridge PROfile
---
## 3. Issue Inventory
### 3.1 Inlined Section Headings (~22% of paragraphs)
**What:** Section headings like "Risk Management and Strategy", "Board Oversight", "Cybersecurity Governance" are prepended to paragraph body text with no separator.
**Example:**
```
Risk Management and Strategy We have designed our cybersecurity risk management program to identify,
assess, and manage risks from cybersecurity threats...
```
**Cause:** The `extractItem1C()` function in `fast-reparse.ts` extracts the full Item 1C text including sub-section headings, and the paragraph segmenter doesn't strip them. The headings become the first "sentence" of the paragraph.
**Impact on classification:**
- The heading is a near-perfect predictor of `content_category` — creates shortcut learning risk
- The heading tells you nothing about `specificity_level` — model still has to read body text
- At inference time, heading presence will be inconsistent across filings
- **Decision: Strip from fine-tuning data.** Headings are consistent across generators, so a single detection heuristic works.
**Detection heuristic:**
- Common Item 1C sub-headings: "Risk Management and Strategy", "Risk Management", "Board Oversight", "Governance", "Management('s) Role", "Cybersecurity Governance", "Incident Detection", "Incident Response", "Strategy", "Third Party", "Third-Party"
- Structural: 2-5 title-cased words at paragraph start, followed by sentence text starting with "We", "Our", "The", a pronoun, or an article
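The structural half of this heuristic can be sketched as follows (word lists illustrative; the actual patches used explicit heading patterns plus this kind of sentence-start validation):

```python
SENTENCE_STARTERS = {"We", "Our", "The", "This", "A", "An", "It"}

def strip_inlined_heading(text: str) -> str:
    """Try heading lengths of 2-5 title-cased words (allowing 'and'/'of');
    strip the heading only if the remainder starts with a
    sentence-starting word."""
    words = text.split()
    for n in range(2, 6):
        if len(words) <= n:
            break
        head, rest = words[:n], words[n:]
        if all(w[0].isupper() or w in ("and", "of") for w in head) \
                and rest[0] in SENTENCE_STARTERS:
            return " ".join(rest)
    return text
```

The sentence-start check is what prevents false positives on paragraphs that legitimately open with several title-cased words (e.g., a company name).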
### 3.2 Orphan Words (4.7% overall, concentrated in 2 generators)
**What:** The first word of a paragraph is dropped during extraction, leaving a paragraph that starts with lowercase mid-sentence.
**Example:**
```
sole executive officer and director is responsible for assessing and managing cybersecurity risks...
```
(should be: "Our sole executive officer...")
**Cause:** HTML source wraps text at fixed column width. The `<span>` opening tag consumes most of a line, so only the first word fits before a source newline. `stripHtml()` preserves that newline, and downstream processing drops the single-word fragment.
**Scope by generator:**
- EFiling/XDX: 36.8% of its paragraphs (2,127 affected)
- CompSci Transform: 14.8% (487 affected)
- All others: <3.5%
- Total: ~3,400 paragraphs corpus-wide
**Already patched:** 215 paragraphs were surgically patched in `paragraphs-clean.patched.jsonl`. The remaining ~3,185 need the same treatment.
**Impact on classification:** Meaning is preserved — annotators and models can infer the missing word from context. But systematically missing subjects ("We", "Our") could subtly bias specificity assessment.
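The HTML-lookback repair later used for patching can be sketched like this (simplified; the real `patch-orphan-words.ts` also filters "Table of Contents" artifacts and legitimate lowercase list items):

```python
def restore_orphan_word(paragraph: str, stripped_html: str) -> str:
    """If a paragraph starts lowercase, locate it in the stripped HTML
    and prepend the single word that immediately precedes it."""
    if not paragraph or not paragraph[0].islower():
        return paragraph
    idx = stripped_html.find(paragraph)
    if idx <= 0:
        return paragraph
    preceding = stripped_html[:idx].split()
    if not preceding:
        return paragraph
    candidate = preceding[-1]
    # Only accept a capitalized word (e.g. "We", "Our") as the lost subject.
    if candidate[0].isupper():
        return f"{candidate} {paragraph}"
    return paragraph
```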
### 3.3 Orphaned Fragments (2.3% overall)
**What:** List items split from their parent paragraph, creating very short standalone paragraphs.
**Example:**
```
the use of external service providers, where appropriate, to assess, test or otherwise assist with
aspects of our security controls;
```
**Cause:** Semicolon-terminated list items are treated as paragraph boundaries by the segmenter.
**Scope:** 250 paragraphs identified in the narrower audit; ~1,660 total with <25 words.
**Impact:** These are classifiable in isolation (the content is clear) but lack the framing context of the parent list. Likely annotated correctly but may have lower model confidence.
### 3.4 Truncated Paragraphs (0.37%)
**What:** Paragraphs ending mid-sentence without terminal punctuation.
**Two patterns:**
1. Paragraph absorbed the start of the next section's heading (ends with "Governance", "Identify")
2. True truncation — a cross-reference sentence cut off mid-phrase (e.g., ending with `..."Risk Factors" in this`)
**Scope:** 264 paragraphs.
**Impact:** Low — 0.37% and meaning is usually recoverable from context.
### 3.5 Cross-Filing Boilerplate (53.6%)
**What:** Paragraphs with identical text appearing in multiple filings. Driven by law firms and compliance consultants providing template language.
**Scope:** 38,601 paragraphs share text with at least one other filing. 1,705 unique boilerplate texts appear in 3+ filings. The most-duplicated text appears in 138 filings across 84 companies.
**Impact:** This IS the construct being measured. Boilerplate paragraphs should be classified as Specificity Level 1 (Generic Boilerplate). Not a quality issue — it's the signal.
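Cross-filing duplication of this kind is cheap to measure. A sketch using exact-text matching (the audit may additionally normalize or hash text; the function name is illustrative):

```python
def boilerplate_stats(paragraphs: list[tuple[str, str]]) -> tuple[int, int]:
    """paragraphs: (filing_id, text) pairs.
    Returns (# paragraphs whose text appears in 2+ filings,
             # unique texts appearing in 3+ filings)."""
    filings_per_text: dict[str, set[str]] = {}
    for filing_id, text in paragraphs:
        filings_per_text.setdefault(text, set()).add(filing_id)
    shared = sum(1 for _, text in paragraphs
                 if len(filings_per_text[text]) >= 2)
    heavy = sum(1 for filings in filings_per_text.values()
                if len(filings) >= 3)
    return shared, heavy
```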
---
## 4. DAPT Corpus Audit
### 4.1 Corpus Stats
- **14,756 documents**, 15 shards
- **~1.06 billion tokens** (ModernBERT tokenizer; chars/4.72, not chars/4.0)
- **Median doc length:** 347K chars (~73K tokens)
- **90.8% of docs exceed 8,192 tokens** — chunking is mandatory (handled by training pipeline)
### 4.2 Issues Found
| Issue | Scope | Verdict |
|-------|-------|---------|
| 188 docs < 10K chars (cover pages) | 0.04% of tokens | Filter out |
| XBRL preambles (8% of docs) | 0.18% of chars | Negligible |
| Financial table fragments (~25% of lines) | Widespread | Acceptable — SEC domain includes numbers |
| URLs in 80% of docs (~4 per doc) | Low | Optional cleanup |
| 64 8-K filings mixed in | Tiny | Keep — domain-relevant |
| 1,470 amendments (median 94K chars) | Substantial content | Keep |
| 2 single-block docs (no paragraph breaks) | 2 docs | Filter out |
| 242 near-duplicate cross-year filings | 1.6% | Keep — different content |
| 0 garbled text, 0 HTML artifacts | | Clean |
| 0 sentence boundary violations | | Clean |
### 4.3 Decision
Filter <10K char docs and 2 structureless docs. Everything else is acceptable for unsupervised MLM. The model will learn SEC language including financial notation, legal boilerplate, and cybersecurity terminology.
---
## 5. Patch History
### Patch 1: Orphan Word Fix (2026-03-29)
- **Scope:** 215 paragraphs, 77 filings
- **Method:** Detect orphan word in raw HTML, prepend to paragraph text
- **Validation:** All prefix additions, 0 boundary changes, 0 text shrinkages
- **Files:** `paragraphs-clean.patched.jsonl`, `training.patched.jsonl`
- **Annotation impact:** 142 annotated paragraphs affected (0.28%), meaning preserved
### Patch 2: Expanded Orphan Word Fix (2026-03-29)
- **Scope:** 2,233 paragraphs (includes Patch 1's 215; net 2,026 new)
- **Method:** HTML lookback — find paragraph text in stripped HTML, extract preceding word
- **Top orphan words:** We (632), Our (403), As (152), The (91), To (84), In (78), Cybersecurity (64)
- **Validation:** 0 false positives after filtering "Table of Contents" artifacts. 1,122 candidates rejected (legitimate list items starting with lowercase).
- **Annotation impact:** 1,400 annotated paragraphs affected. Label bias detected: Strategy Integration 1.55x over-represented, Management Role 0.49x under-represented in orphan-word paragraphs. **Recommended: re-run Stage 1 on patched text (~$15-20, may resolve conflicts).**
- **Script:** `ts/scripts/patch-orphan-words.ts`
- **Patch file:** `data/paragraphs/patches/orphan-word-patches.jsonl`
### Patch 3: Heading Stripping (2026-03-29)
- **Scope:** 7,514 paragraphs (10.4%)
- **Method:** Explicit pattern matching against known Item 1C sub-section headings (71 unique headings). Validated by confirming body text starts with sentence-starting word.
- **Top headings stripped:** Risk Management and Strategy (2,453), Cybersecurity Risk Management and Strategy (1,281), Cybersecurity Governance (1,208), Governance (301), Third-Party Risk Management (224)
- **Annotation impact:** 5,013 annotated paragraphs. Heading removal eliminates shortcut learning risk (heading was near-perfect predictor of content_category).
- **Script:** Inline Python (see audit process notes)
- **Patch file:** `data/paragraphs/patches/heading-strip-patches.jsonl`
### Patch 4: Colon-Headed Paragraphs (2026-03-29)
- **Scope:** 370 paragraphs
- **Method:** Regex match for "Heading Text: Sentence..." patterns. Only fires when colon is followed by known sentence-starting word.
- **Top headings stripped:** Education and Awareness (97), Safeguards (18), Management (15), Approach (13), Training (11)
- **Annotation impact:** 227 annotated paragraphs.
- **Patch file:** `data/paragraphs/patches/colon-heading-patches.jsonl`
### Patch 5: Extended Separator Headings (2026-03-29)
- **Scope:** 184 paragraphs
- **Method:** Detect headings with period, dash/em-dash, semicolon, or ALL-CAPS separators that Patches 3-4 missed.
- **Annotation impact:** 133 annotated paragraphs.
- **Patch file:** `data/paragraphs/patches/heading-strip-v2-patches.jsonl`
### Patch 6: HTML-Confirmed Headings (2026-03-29)
- **Scope:** 343 paragraphs
- **Method:** Extract bold/underline/h-tag styled text from source HTML (cached in `filing-headings.jsonl`), match against paragraph starts, validate with sentence-start check. Zero false positives — if the HTML says it's bold, it's a heading.
- **855 ambiguous cases rejected** where styled text was a sentence subject (e.g., bold "Cybersecurity" starting "Cybersecurity is a critical component...")
- **Annotation impact:** 270 annotated paragraphs.
- **Scripts:** `ts/scripts/extract-html-headings.ts` (1.7s for 6,341 filings with 32 workers)
- **Patch file:** `data/paragraphs/patches/heading-strip-html-patches.jsonl`
- **Cache:** `data/paragraphs/quality/filing-headings.jsonl`
### Cumulative Heading Strip Summary
| Pass | Method | Count | Cumulative |
|------|--------|-------|-----------|
| Patch 3 | Explicit heading patterns (space separator) | 7,514 | 7,514 |
| Patch 4 | Colon separator | 370 | 7,884 |
| Patch 5 | Period/dash/caps/semicolon | 184 | 8,068 |
| Patch 6 | HTML bold/underline confirmed | 343 | 8,411 |
| **Total** | | **8,411** | **11.7% of corpus** |
---
## 6. Data Integrity Rules
1. **`paragraphs-clean.jsonl` is FROZEN.** Never modify. It is the original extraction output and the source of truth for reproducibility.
2. **All fixes go through `.patched.jsonl` files.** The patched file has the same schema and IDs as the original. Text may differ. TextHash is updated.
3. **Annotations link by paragraph `id` (UUID).** This linkage is stable across patches — IDs never change.
4. **Never re-run extraction from HTML.** Cascade effects from merge logic changes cause thousands of ripple-effect text changes (documented in `docs/SEC-HTML-CLEANING.md`). Surgical JSONL patching is the only safe approach.
5. **Every patch is documented** with scope, method, validation, and annotation impact.
6. **Quality metadata is separate from text data.** Per-paragraph quality scores live in a separate file, not embedded in the paragraph data. This keeps the data schema stable.
---
## 7. Quality Tier System
Each paragraph gets a quality tier based on detected issues:
| Tier | Criteria | Count | % | Training Action |
|------|----------|-------|---|-----------------|
| **clean** | No detected issues | 58,165 | 80.7% | Full weight (1.0) |
| **headed** | Had inlined section heading (now stripped) | 7,402 | 10.3% | Full weight (1.0) — heading removed |
| **degraded** | Embedded bullets (1,941), invisible merges (222), fragments, truncations, no-cyber | 4,331 | 6.0% | Downweight (0.5) — content preserved but structure degraded |
| **minor** | Had orphan word (now fixed) | 2,147 | 3.0% | Full weight (1.0) — word restored |
Note: Tiers reflect the most severe issue. A paragraph can have multiple issues. All "headed" and "minor" paragraphs have been patched — the tier records what WAS wrong, not what IS wrong.
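Because tiers record the most severe issue, assignment is a precedence cascade. One possible sketch (flag names illustrative; the relative ordering of the fully patched "headed" and "minor" tiers is immaterial to training since both carry full weight):

```python
TIER_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def assign_tier(has_structural_issue: bool, had_heading: bool,
                had_orphan_word: bool) -> str:
    """Most severe issue wins: structural degradation outranks
    already-patched headings and orphan words."""
    if has_structural_issue:
        return "degraded"
    if had_heading:
        return "headed"
    if had_orphan_word:
        return "minor"
    return "clean"
```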
### Sample Weighting Strategy
During fine-tuning, each training sample is weighted by quality tier to reduce the influence of structurally degraded paragraphs without discarding them entirely:
- **clean + headed + minor (1.0 weight):** Content is correct and text is clean (after patching). These form the reliable training signal.
- **degraded (0.5 weight):** Content is present but structural issues (concatenated list items, fragments, truncations) may cause the text to misrepresent paragraph-level semantics. The labels are likely correct (models can infer meaning despite structural noise), but the text doesn't match what the model will see at inference time on clean filings. Downweighting reduces overfitting to degraded patterns without losing the content signal.
Sample weighting is applied by overriding the HuggingFace Trainer's `compute_loss` with a custom loss that multiplies the per-sample cross-entropy by the tier weight (the Trainer has no built-in per-sample weighting).
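The weighting arithmetic itself is framework-independent. A minimal sketch (function name illustrative):

```python
def weighted_mean_loss(per_sample_losses: list[float],
                       weights: list[float]) -> float:
    """Batch loss with per-sample quality weights: each sample's
    cross-entropy is scaled by its tier weight, then averaged
    over the total weight."""
    assert len(per_sample_losses) == len(weights)
    total = sum(l * w for l, w in zip(per_sample_losses, weights))
    return total / sum(weights)
```

For example, in a batch with two full-weight samples and one degraded (0.5) sample, the degraded sample contributes half as much to the gradient as either clean sample.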
### Additional Findings (from anomaly detection)
| Finding | Count | Concern |
|---------|-------|---------|
| Embedded bullet points mid-text | 1,941 (flagged degraded) | MEDIUM — semicolon-separated list items without bullet markers |
| Invisible merges (no separators) | 222 (flagged degraded) | MEDIUM — list items concatenated with no trace of structure (e.g., Bancorp 34) |
| No cybersecurity keywords at all | 528 (348 annotated) | LOW — investigated, keyword filter was too narrow, labels correct |
| Cross-references to other SEC items | 5,750 | LOW — mostly legitimate "see Item 1A" refs |
| Dollar amounts in text | 46 | LOW — mostly legitimate incident costs |
| Paragraphs >400 words | 149 | LOW — possible failed splits |
| Repeated sentences within paragraph | 9 | LOW — copy-paste artifacts |
---
## 8. Annotation Impact (Quantified)
Of 49,795 annotated paragraphs:
### Annotated set by generator
| Generator | Annotated Paras | % of Annotated Set |
|-----------|----------------|-------------------|
| Inline XBRL | ~10,500 | 21.1% |
| Workiva | ~15,300 | 30.7% |
| Donnelley | ~9,000 | 18.1% |
| Toppan Merrill | ~5,900 | 11.8% |
| EFiling/XDX | 3,562 | 7.2% |
| ThunderDome | ~2,500 | 5.0% |
| CompSci Transform | 2,288 | 4.6% |
| Others | ~700 | 1.4% |
### Orphan words in annotated set
**2,178 annotated paragraphs (4.37%)** start with lowercase (non-list) — orphan word candidates.
| Generator | Orphan Paras | % of Generator's Annotated | % of All Orphans |
|-----------|-------------|---------------------------|-----------------|
| EFiling/XDX | 1,389 | 39.0% | 63.8% |
| CompSci Transform | 401 | 17.5% | 18.4% |
| All others | 388 | <5% each | 17.8% |
EFiling/XDX alone accounts for 63.8% of all orphan-word paragraphs in the annotated set.
### Label bias in orphan-word paragraphs
- **Strategy Integration** is over-represented at 1.55x base rate (16.1% of orphan paras vs 10.4% overall)
- **Board Governance** and **Management Role** are under-represented (0.60x and 0.49x) — likely because governance headings/lead-in sentences get split off, leaving the orphan fragment lacking governance context
This suggests orphan words may cause subtle category misclassification, not just missing text.
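The over/under-representation figures reduce to a one-line lift ratio. A sketch (the orphan count for Strategy Integration is back-derived from the reported 16.1% of 2,178, so treat it as illustrative):

```python
def lift(count_in_subset, subset_size, base_rate):
    """Over/under-representation: category rate inside a subset vs the full set."""
    return (count_in_subset / subset_size) / base_rate

# Strategy Integration: 16.1% of the 2,178 orphan paragraphs vs 10.4% overall.
# The count 351 is inferred from the reported rate, not taken from the data files.
strategy_lift = lift(351, 2178, 0.104)
```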
### Inlined headers in annotated set
**4,513 annotated paragraphs (9.06%)** have section headings merged into text. Relatively uniform across generators (~9-10%), but notably lower for EFiling/XDX (5.3%) and CompSci Transform (5.6%) — these generators split at headers rather than merging them.
### Combined impact
**6,691 annotated paragraphs (13.44%)** have either orphan-word OR inlined-header issues.
Per generator:
- EFiling/XDX: 1,577 of 3,562 (44.3%) affected
- CompSci Transform: ~600 of 2,288 (~26%) affected
- All others: <15% affected
---
## 9. Summary of Changes to Annotated Data
| Change | Annotated Paragraphs Affected | Semantic Impact |
|--------|------------------------------|----------------|
| Orphan word restored | 1,400 | Label bias detected (Strategy 1.55x, Management 0.49x) |
| Heading stripped (all passes) | ~5,643 | Removes shortcut learning signal |
| No-cyber flagged as degraded | 348 | May want to exclude from training |
| **Total modified** | **~7,100 of 49,795 (14.3%)** | |
## 10. Remaining Questions / Next Steps
- **Re-run Stage 1 on orphan-word paragraphs** (~$15-20 for 1,400 paragraphs). Label bias suggests some misclassification. May resolve conflicts and save Stage 2 judge costs.
- **Heading-stripped paragraphs:** Existing labels are likely still valid — annotators classified the body text, not the heading. But could re-run if budget allows.
- **Exclude 348 no-cyber-keyword annotated paragraphs?** If labeled "None/Other" they're fine; if other categories, they're noise from section bleed.
- **855 ambiguous HTML heading cases** — bold/underline text at paragraph start but also a valid sentence subject. Would need manual review to resolve.
- **Run DAPT** — filter <10K char docs from DAPT corpus, then start training.
---
## 11. Artifacts Produced
### Data Files
```
data/paragraphs/
├── paragraphs-clean.jsonl ← FROZEN original (72,045 paragraphs)
├── paragraphs-clean.patched.jsonl ← All 6 patches applied (orphan + heading)
├── training.patched.jsonl ← Training subset, all patches applied (49,795)
├── patches/
│ ├── orphan-word-patches.jsonl ← 2,233 orphan word recovery records
│ ├── heading-strip-patches.jsonl ← 7,514 heading strip records (space sep)
│ ├── colon-heading-patches.jsonl ← 370 colon-heading strip records
│ ├── heading-strip-v2-patches.jsonl ← 184 period/dash/caps/semicolon headings
│ └── heading-strip-html-patches.jsonl← 343 HTML bold/underline confirmed headings
└── quality/
├── generator-tags.jsonl ← 14,759 accession → generator mappings
├── quality-scores.jsonl ← 72,045 per-paragraph quality metadata
├── filing-headings.jsonl ← Cached styled headings from HTML (3,459 filings)
└── ambiguous-filings.txt ← Filing list used for HTML heading extraction
```
### Scripts
| Script | Purpose |
|--------|---------|
| `ts/scripts/patch-orphan-words.ts` | Detect and recover orphan words from HTML source |
| `ts/scripts/tag-generators.ts` | Identify filing generator from HTML signatures |
| `ts/scripts/extract-html-headings.ts` | Extract bold/underline headings from HTML (32-worker parallel, 1.7s) |
| `ts/scripts/dapt-corpus-prep.ts` | DAPT corpus preparation (HTML → clean JSONL, 32-worker parallel) |
| `scripts/detect_generators.py` | Python generator detection (initial analysis) |
| `scripts/generator_quality_analysis.py` | Generator × quality metrics cross-reference |
| `scripts/analyze_generator_quality.py` | Annotation impact analysis by generator |
| `scripts/find_heading_candidates.py` | Creative heading pattern hunt (7 approaches) |
| `scripts/data_quality_audit.py` | Statistical anomaly detection (content, structure, outliers) |
| `scripts/audit_corpus.py` | Text corruption checks |
| `scripts/audit_paragraphs.py` | Boundary audit (per-filing stats, coherence, duplicates) |
### Documentation
| Doc | Content |
|-----|---------|
| `docs/DATA-QUALITY-AUDIT.md` | This document — full audit findings, patch history, quality tiers |
| `docs/EDGAR-FILING-GENERATORS.md` | Generator reference — 14 vendors, signatures, market share, quality issues |
| `docs/SEC-HTML-CLEANING.md` | HTML cleaning lessons and pitfalls |

# SEC EDGAR Filing Generator Reference
Reference for identifying which software generated a given SEC 10-K HTML filing.
Built from direct inspection of EDGAR filings and market research (March 2026).
---
## 1. Major Vendors and HTML Signatures
### Workiva (Wdesk) -- Market Leader for 10-K/10-Q
**Filing agent CIK:** `0001628280`
**HTML comment signature (lines 1-3):**
```html
<?xml version='1.0' encoding='ASCII'?>
<!--XBRL Document Created with the Workiva Platform-->
<!--Copyright 2025 Workiva-->
<!--r:{uuid},g:{uuid},d:{hex-id}-->
```
**Detection heuristics:**
- HTML comment: `XBRL Document Created with the Workiva Platform`
- HTML comment: `Copyright \d{4} Workiva`
- Third comment line contains `r:`, `g:`, `d:` UUIDs (document/generation tracking)
- `xml:lang="en-US"` attribute on `<html>` tag
- Body uses inline styles exclusively (no CSS classes on content elements)
- Heavy use of `<span>` with inline styles containing `background-color`, `font-family`, `font-size`, `font-weight`, `line-height` in every span
- Div IDs follow pattern: `i{hex32}_{number}` (e.g., `id="i56b78781f7c84a038f6ae0f6244f7dd8_1"`)
- Tables use `display:inline-table` and `vertical-align:text-bottom`
- iXBRL fact IDs follow pattern: `F_{uuid}` (e.g., `id="F_d8dc1eb1-109d-445d-a55a-3dde1a81ca63"`)
- No `<meta name="generator">` tag
- No CSS classes on body content (purely inline styles)
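The div-ID heuristic above can be checked mechanically. A sketch, with the pattern transcribed from the example ID in this section:

```python
import re

# Workiva div IDs: "i" + 32 hex chars + "_" + sequence number.
WORKIVA_DIV_ID = re.compile(r"i[0-9a-f]{32}_\d+")

def looks_like_workiva_div(div_id: str) -> bool:
    """True if a div id matches the Workiva i{hex32}_{number} pattern."""
    return WORKIVA_DIV_ID.fullmatch(div_id) is not None
```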
**Structural patterns:**
- Span-heavy: nearly every text fragment wrapped in `<span style="...">`
- Font specified as `font-family:'Times New Roman',sans-serif` (note: sans-serif fallback, unusual)
- Line-height specified on every span (e.g., `line-height:120%`)
- Background color explicitly set: `background-color:#ffffff`
**Known quality issues:**
- Extremely verbose HTML; simple paragraphs become deeply nested span trees
- Text extraction is clean because span boundaries align with word boundaries
- Large file sizes due to inline style repetition
---
### DFIN / Donnelley Financial Solutions (ActiveDisclosure)
DFIN operates under **two distinct CIKs** with **two different HTML output formats**.
#### DFIN "New" ActiveDisclosure (primary)
**Filing agent CIK:** `0000950170` (also `0000950130`)
**HTML comment signature:**
```html
<?xml version='1.0' encoding='ASCII'?>
<!-- DFIN New ActiveDisclosure (SM) Inline XBRL Document - http://www.dfinsolutions.com/ -->
<!-- Creation Date :2025-02-18T12:36:24.4008+00:00 -->
<!-- Copyright (c) 2025 Donnelley Financial Solutions, Inc. All Rights Reserved. -->
```
**Detection heuristics:**
- HTML comment: `DFIN New ActiveDisclosure`
- HTML comment: `http://www.dfinsolutions.com/`
- HTML comment: `Copyright (c) \d{4} Donnelley Financial Solutions`
- HTML comment: `Creation Date :` with ISO timestamp
- Body style: `padding:8px;margin:auto!important;`
- Inline styles use `font-kerning:none;min-width:fit-content;` on most spans
- Extensive use of `white-space:pre-wrap` on spans
- CSS class `item-list-element-wrapper` and `page-border-spacing` present
- iXBRL fact IDs follow pattern: `F_{uuid}`
**Structural patterns:**
- Every text span carries `min-width:fit-content` (distinctive)
- Uses `&#160;` for spacing extensively
- Uses `<p>` tags with inline margins for all paragraphs
- Tables use explicit `padding-top:0in;vertical-align:top;padding-bottom:0in` cell styles
#### DFIN Legacy (RR Donnelley heritage)
**Filing agent CIK:** `0001193125`
**HTML signature:**
```html
<?xml version='1.0' encoding='ASCII'?>
<html xmlns:link="..." xmlns:xbrldi="..." ...>
<head>
<title>10-K</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
</head>
<body style="line-height:normal;background-color:white;">
<h5 style="font-size:10pt;font-weight:bold"><a href="#toc">Table of Contents</a></h5>
```
**Detection heuristics:**
- No identifying HTML comments (no generator/copyright comment)
- Accession number prefix `0001193125` is definitive
- `<body style="line-height:normal;background-color:white;">`
- Immediately starts with `<h5>` Table of Contents link
- Uses deprecated namespace aliases: `xmlns:xl`, `xmlns:xbrll`, `xmlns:deprecated`
- iXBRL fact IDs follow pattern: `Fact_{large_number}` (e.g., `id="Fact_129727210"`)
- Uses `<FONT>` tags (HTML 3.2 style) in some documents
- Uppercase HTML tags in older filings (`<P>`, `<B>`, `<DIV>`)
**Structural patterns:**
- Cleaner HTML than ActiveDisclosure New
- Uses semantic `<h5>` for table of contents
- Inline styles are simpler and more standard
- File description filenames follow pattern: `d{number}d10k.htm`
---
### Toppan Merrill (Bridge)
**Filing agent CIKs:** `0001104659` (primary), `0001558370` (secondary)
**HTML comment signature:**
```html
<?xml version='1.0' encoding='ASCII'?>
<!-- iXBRL document created with: Toppan Merrill Bridge iXBRL 10.9.0.3 -->
<!-- Based on: iXBRL 1.1 -->
<!-- Created on: 2/21/2025 8:11:11 PM -->
<!-- iXBRL Library version: 1.0.9062.16423 -->
<!-- iXBRL Service Job ID: {uuid} -->
```
**Detection heuristics:**
- HTML comment: `iXBRL document created with: Toppan Merrill Bridge iXBRL`
- HTML comment: `iXBRL Library version:`
- HTML comment: `iXBRL Service Job ID:`
- Includes version number in comment (e.g., `10.9.0.3`)
- `<title>` tag contains company name + period end date (e.g., `Sunstone Hotel Investors,&#160;Inc._December 31, 2024`)
- Uses `xmlns:xs` alongside `xmlns:xsi` (both XML Schema namespaces)
- Body starts with `<div style="margin-top:30pt;"></div>` (distinctive)
- iXBRL hidden div uses `display:none;` (no additional styles on the div)
**Structural patterns:**
- Context IDs use descriptive names with GUIDs: `As_Of_12_31_2024_{base64-like}`, `From_01_01_2024_to_12_31_2024_{guid}`
- Hidden fact IDs follow pattern: `Hidden_{base64-like}`
- Unit ref IDs follow pattern: `Unit_Standard_USD_{base64-like}`
- No CSS classes used on content elements
- Relatively clean HTML structure
---
### RDG Filings (ThunderDome Portal)
**Filing agent CIK:** `0001437749`
**HTML signature:**
```html
<?xml version='1.0' encoding='ASCII'?>
<html xmlns:thunderdome="http://www.RDGFilings.com" ...>
<head>
<title>avpt20241231_10k.htm</title>
<!-- Generated by ThunderDome Portal - 2/27/2025 6:06:48 PM -->
<meta http-equiv="Content-Type" content="text/html"/>
</head>
<body style="cursor: auto; padding: 0in 0.1in; font-family: &quot;Times New Roman&quot;, Times, serif; font-size: 10pt;">
```
**Detection heuristics:**
- XML namespace: `xmlns:thunderdome="http://www.RDGFilings.com"`
- HTML comment: `Generated by ThunderDome Portal`
- `<title>` contains the filing filename
- Body style includes `cursor: auto; padding: 0in 0.1in`
- iXBRL fact IDs prefixed with `thunderdome-` (e.g., `id="thunderdome-EntityCentralIndexKey"`)
- Context ref IDs use simple date ranges: `d_2024-01-01_2024-12-31`
- Other fact IDs follow `ixv-{number}` or `c{number}` pattern
**Market presence:** ~14,000 filings/year, rank #9 among filing agents. About 5% of annual filings.
---
### Broadridge Financial Solutions (PROfile)
**Filing agent CIKs:** `0001140361` (primary), `0001133228` (secondary)
**HTML comment signature:**
```html
<!-- Licensed to: Broadridge
Document created using Broadridge PROfile 25.1.1.5279
Copyright 1995 - 2025 Broadridge -->
```
**Detection heuristics:**
- HTML comment: `Licensed to: Broadridge`
- HTML comment: `Document created using Broadridge PROfile` with version number
- HTML comment: `Copyright 1995 - \d{4} Broadridge`
- CSS classes with `BRPF` prefix: `BRPFPageBreak`, `BRPFPageBreakArea`, `BRPFPageFooter`, `BRPFPageHeader`, `BRPFPageNumberArea`
- CSS class: `DSPFListTable`
- CSS class: `cfttable`
- CSS class: `Apple-interchange-newline` (suggests Mac/WebKit origin)
- Context ref IDs use XBRL-standard descriptive format: `c20240101to20241231_AxisName_MemberName`
**Note:** Broadridge acquired CompSci Resources LLC in July 2024 and is integrating CompSci's Transform platform. Filings may transition to Broadridge branding over time.
---
### CompSci / Novaworks (Transform and GoFiler)
CompSci Resources produces two tools that leave distinct signatures.
#### CompSci Transform (now Broadridge)
**Filed via:** EdgarAgents LLC (`0001213900`) or other agents
**HTML comment signature:**
```html
<?xml version='1.0' encoding='ASCII'?>
<!-- Generated by CompSci Transform (tm) - http://www.compsciresources.com -->
<!-- Created: Mon Mar 17 19:46:10 UTC 2025 -->
```
**Detection heuristics:**
- HTML comment: `Generated by CompSci Transform`
- HTML comment: `http://www.compsciresources.com`
- XML namespace: `xmlns:compsci="http://compsciresources.com"`
- Body wrapped in: `<div style="font: 10pt Times New Roman, Times, Serif">`
- Uses `<!-- Field: Rule-Page -->` and `<!-- Field: /Rule-Page -->` HTML comments as structural markers
- Empty `<div>` tags used as spacers between paragraphs
- iXBRL context refs use simple sequential IDs: `c0`, `c1`, `c2`, ...
- iXBRL fact IDs follow `ixv-{number}` pattern
- Uses shorthand CSS: `font: 10pt Times New Roman, Times, Serif` (combined property)
- Margin shorthand: `margin: 0pt 0`
**Known quality issues:**
- Words can be broken across `<span>` tags mid-word
- Heavy use of `&#160;` for spacing
- Empty divs between every paragraph create parsing noise
- `<!-- Field: ... -->` comments interspersed throughout document body
#### Novaworks GoFiler (XDX format)
**Filed via:** SECUREX Filings (`0001214659`) or self-filed
**HTML signature:**
```html
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html"/>
</head>
<!-- Field: Set; Name: xdx; ID: xdx_021_US%2DGAAP%2D2024%2D... -->
<!-- Field: Set; Name: xdx; ID: xdx_03B_... -->
```
**Detection heuristics:**
- HTML comments with pattern: `<!-- Field: Set; Name: xdx; ID: xdx_{code}_{data} -->`
- XDX comments appear between `</head>` and `<body>` (unusual placement)
- Body style: `font: 10pt Times New Roman, Times, Serif` (same shorthand as CompSci)
- Empty `<title></title>` tag
- iXBRL fact IDs use `xdx2ixbrl{number}` pattern (e.g., `id="xdx2ixbrl0102"`)
- Standard fact IDs use `Fact{number:06d}` pattern (e.g., `id="Fact000003"`)
- Context refs use `From{date}to{date}` or `AsOf{date}` format (no separators within date)
**XDX explained:** XDX (XBRL Data Exchange) is GoFiler's proprietary format that uses HTML tag ID attributes ("engrams") to embed XBRL metadata. The `xdx_` comments carry taxonomy, entity, period, and unit definitions that GoFiler uses to generate the final iXBRL.
---
### Discount EDGAR / NTDAS (XBRLMaster / EDGARMaster)
**Filing agent CIK:** `0001477932`
**HTML signature:**
```html
<head>
<title>crona_10k.htm</title>
<!--Document Created by XBRLMaster-->
<meta http-equiv="Content-Type" content="text/html"/>
</head>
<body style="text-align:justify;font:10pt times new roman">
```
**Detection heuristics:**
- HTML comment: `Document Created by XBRLMaster`
- Body style: `text-align:justify;font:10pt times new roman`
- Hidden iXBRL div has `id="XBRLDIV"`
- Additional body styles include `margin-left:7%;margin-right:7%`
- Uses lowercase `times new roman` (no capitalization)
- iXBRL fact IDs use `ixv-{number}` pattern
---
### EdgarAgents LLC
**Filing agent CIK:** `0001213900`
EdgarAgents is a filing agent service, not a document creation tool. The HTML they submit is typically generated by CompSci Transform, GoFiler, or other tools. Check the HTML comments to identify the actual generator.
---
### DFIN Legacy (pre-iXBRL / SGML-era)
**Filing agent CIK:** `0001193125`
Older filings (pre-2019) from this CIK may appear in `<DOCUMENT>` SGML wrapper format:
```html
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d913213d10k.htm
<DESCRIPTION>10-K
<TEXT>
<HTML><HEAD>
<TITLE>10-K</TITLE>
</HEAD>
<BODY BGCOLOR="WHITE" STYLE="line-height:Normal">
<Center><DIV STYLE="width:8.5in" align="left">
```
**Detection heuristics:**
- Uppercase HTML tags: `<HTML>`, `<HEAD>`, `<BODY>`, `<P>`, `<B>`
- `BGCOLOR="WHITE"` attribute (deprecated HTML)
- `<Center>` tag with capital C
- `<DIV STYLE="width:8.5in"` (page-width container)
- `<FONT>` tags for styling
- Filename pattern: `d{number}d10k.htm`
---
## 2. Filing Agent Market Share
Based on [secfilingdata.com](https://www.secfilingdata.com/top-filing-agents/) total filings across all form types:
| Rank | Filing Agent | CIK | 2025 Filings | Total (All Time) |
|------|-------------|-----|-------------|-----------------|
| 1 | Donnelley Financial (DFIN) | 0001193125 | 65,180 | 1,872,890 |
| 2 | EdgarAgents LLC | 0001213900 | 48,021 | 367,211 |
| 3 | Quality Edgar (QES) | 0001839882 | 38,017 | 151,031 |
| 4 | Toppan Merrill | 0001104659 | 48,260 | 988,715 |
| 5 | WallStreetDocs Ltd | 0001918704 | 22,387 | 56,431 |
| 6 | Workiva (Wdesk) | 0001628280 | 21,606 | 141,795 |
| 7 | M2 Compliance LLC | 0001493152 | 13,810 | 164,603 |
| 8 | Davis Polk & Wardwell LLP | 0000950103 | 16,231 | 326,359 |
| 9 | RDG Filings (ThunderDome) | 0001437749 | 14,209 | 187,270 |
| 10 | Morgan Stanley | 0001950047 | 12,822 | 56,468 |
| 11 | Broadridge | 0001140361 | -- | 597,664 |
| 14 | SECUREX Filings | 0001214659 | -- | 115,218 |
| 19 | Blueprint | 0001654954 | -- | 62,250 |
| 20 | FilePoint | 0001398344 | -- | 76,218 |
| 38 | Discount EDGAR | 0001477932 | -- | 37,422 |
**For 10-K/10-Q specifically (estimated from biotech IPO data and market research):**
- DFIN: ~40-50% of annual/quarterly filings
- Workiva: ~25-35% (has been gaining share from DFIN since ~2010)
- Toppan Merrill: ~10-15%
- RDG Filings: ~5%
- Broadridge/CompSci: ~5%
- Others (law firms, self-filed, smaller agents): ~5-10%
---
## 3. XBRL/iXBRL Tool Signatures
The iXBRL tagging tool is often the same as the filing generator, but not always. Key distinguishing patterns in the iXBRL layer:
| Tool | Context Ref Pattern | Fact ID Pattern | Unit Ref Pattern |
|------|-------------------|----------------|-----------------|
| Workiva | `C_{uuid}` | `F_{uuid}` | `U_{uuid}` |
| DFIN New | `C_{uuid}` | `F_{uuid}` | Standard names |
| DFIN Legacy | `Fact_{large_int}` | `Fact_{large_int}` | Standard names |
| Toppan Merrill | `As_Of_{date}_{guid}` / `From_{date}_to_{date}_{guid}` | `Hidden_{guid}` | `Unit_Standard_USD_{guid}` |
| ThunderDome | `d_{date_range}` / `i_{date}` | `thunderdome-{name}` or `ixv-{n}` or `c{n}` | Standard names |
| CompSci Transform | `c0`, `c1`, `c2` ... | `ixv-{number}` | Standard names |
| GoFiler (XDX) | `From{date}to{date}` / `AsOf{date}` | `xdx2ixbrl{number}` | Standard names |
| XBRLMaster | `From{date}to{date}` | `ixv-{number}` | Standard names |
| Broadridge PROfile | `c{date}to{date}_{axis}_{member}` | Descriptive | Standard names |
---
## 4. Detection Priority (Recommended Heuristic Order)
For maximum reliability, check signatures in this order:
1. **HTML comments** (first 10 lines) -- most generators embed identifying comments
- `Workiva Platform` --> Workiva
- `DFIN New ActiveDisclosure` --> DFIN New
- `Toppan Merrill Bridge` --> Toppan Merrill
- `ThunderDome Portal` --> RDG Filings
- `CompSci Transform` --> CompSci/Broadridge
- `Broadridge PROfile` --> Broadridge
- `XBRLMaster` --> Discount EDGAR / NTDAS
2. **XML namespaces** on `<html>` tag
- `xmlns:thunderdome="http://www.RDGFilings.com"` --> RDG
- `xmlns:compsci="http://compsciresources.com"` --> CompSci
3. **XDX comments** between head and body --> GoFiler/Novaworks
4. **Accession number prefix** (first 10 digits) --> identifies filing agent CIK
5. **Body style patterns** as fallback
6. **iXBRL fact ID patterns** as secondary confirmation
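The comment and namespace checks (steps 1-3) can be sketched as an ordered signature scan. The patterns are condensed from this document's heuristics; real filings may need the fuller per-generator checks, with steps 4-6 as fallback:

```python
import re

# Ordered (pattern, generator) pairs mirroring the priority list above.
SIGNATURES = [
    (r"Workiva Platform", "Workiva"),
    (r"DFIN New ActiveDisclosure", "DFIN New"),
    (r"Toppan Merrill Bridge", "Toppan Merrill"),
    (r"ThunderDome Portal", "RDG Filings"),
    (r"CompSci Transform", "CompSci/Broadridge"),
    (r"Broadridge PROfile", "Broadridge"),
    (r"XBRLMaster", "Discount EDGAR"),
    (r'xmlns:thunderdome="http://www\.RDGFilings\.com"', "RDG Filings"),
    (r'xmlns:compsci="http://compsciresources\.com"', "CompSci/Broadridge"),
    (r"Field: Set; Name: xdx", "GoFiler/Novaworks"),
]

def detect_generator(html_head: str) -> str:
    """Return the first matching generator, or UNKNOWN to fall through to
    accession-prefix / body-style / fact-ID checks (steps 4-6)."""
    for pattern, name in SIGNATURES:
        if re.search(pattern, html_head):
            return name
    return "UNKNOWN"
```

Only the head of the file needs scanning, since every generator that identifies itself does so in the first few lines.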
---
## 5. Known Quality Issues by Generator
### CompSci Transform
- **Words broken across spans**: Text is split at arbitrary character boundaries, not word boundaries. A single word like "cybersecurity" may be split across 2-3 `<span>` tags. This breaks naive text extraction that operates per-element.
- **Empty div spacers**: `<div>\n\n</div>` between every paragraph adds noise.
- **Field comments in body**: `<!-- Field: Rule-Page -->` markers interspersed with content.
### Workiva
- **Extreme span nesting**: Every text run gets its own `<span>` with full inline style. A simple bold sentence may have 5+ spans.
- **Large file sizes**: Inline style repetition causes 10-K files to be 2-5x larger than equivalent DFIN filings.
- **Clean word boundaries**: Despite heavy span usage, spans align with word/phrase boundaries, making text extraction reliable.
### DFIN New ActiveDisclosure
- **`min-width:fit-content` everywhere**: Unusual CSS property on every span; may cause rendering inconsistencies in older browsers.
- **`font-kerning:none`**: Explicit kerning disable on all text spans.
- **Generally clean**: Text extraction works well; word boundaries respected.
### DFIN Legacy
- **Uppercase HTML tags**: Older filings use `<P>`, `<B>`, `<FONT>` -- need case-insensitive parsing.
- **Mixed HTML versions**: Some documents mix HTML 3.2 and 4.0 constructs.
- **SGML wrappers**: Some filings wrapped in `<DOCUMENT>` SGML envelope.
### GoFiler / Novaworks
- **XDX comment noise**: Multiple `<!-- Field: Set; ... -->` comments that must be stripped.
- **Generally clean HTML**: Body content is straightforward.
### Toppan Merrill Bridge
- **Clean output**: Among the cleanest generators. Minimal inline style bloat.
- **GUID-heavy IDs**: Context and unit refs use base64-like GUIDs that are less human-readable.
---
## 6. Self-Filed / In-House Filings
Some large filers submit directly using their own CIK as the accession number prefix. These filings have **no generator comment** and variable HTML quality.
**Detection:** Accession number prefix matches the filer's own CIK (e.g., Halliburton CIK `0000045012` files with accession `0000045012-25-000010`).
**However:** Even self-filed companies typically use a commercial tool. Halliburton's self-filed 10-K contains the Workiva comment signature, indicating they use Workiva but submit directly rather than through a filing agent.
**Truly in-house HTML** (no commercial tool) is rare among 10-K filers. When it occurs:
- No identifying comments
- No consistent structural patterns
- May use Word-to-HTML conversion (look for `mso-` CSS prefixes from Microsoft Office)
- May have minimal or no iXBRL tagging
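The self-filed check above is a prefix comparison. A sketch, assuming the usual 10-digit zero-padded CIK convention:

```python
def is_self_filed(accession: str, filer_cik: str) -> bool:
    """The accession prefix (first 10 digits) is the submitting agent's CIK;
    if it matches the filer's own CIK, the filing was submitted directly."""
    return accession.replace("-", "")[:10] == filer_cik.zfill(10)
```

For the Halliburton example above, `is_self_filed("0000045012-25-000010", "45012")` is true, even though the document itself carries a Workiva signature.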
---
## 7. Law Firm Filings
Several large law firms act as filing agents:
- Davis Polk & Wardwell (`0000950103`) -- 326K total filings
- Paul Weiss (`0000950142`) -- 56K total filings
- Foley & Lardner (`0000897069`) -- 30K total filings
- Sidley Austin (`0000905148`) -- 39K total filings
- Seward & Kissel (`0000919574`) -- 107K total filings
Law firms typically file transactional documents (S-1, proxy, 8-K) rather than periodic 10-K filings. The HTML in law-firm-filed documents often comes from Word conversion and lacks commercial generator signatures.
---
## 8. Summary: Quick Detection Regex Table
```
Pattern | Generator
-----------------------------------------------------|------------------
/Workiva Platform/ | Workiva
/DFIN New ActiveDisclosure/ | DFIN (New)
/Donnelley Financial Solutions/ | DFIN (New)
/Toppan Merrill Bridge/ | Toppan Merrill
/ThunderDome Portal/ | RDG Filings
/CompSci Transform/ | CompSci/Broadridge
/Broadridge PROfile/ | Broadridge
/XBRLMaster/ | Discount EDGAR
/xmlns:thunderdome="http:\/\/www\.RDGFilings\.com"/ | RDG Filings
/xmlns:compsci="http:\/\/compsciresources\.com"/ | CompSci
/Field: Set; Name: xdx/ | GoFiler/Novaworks
/dfinsolutions\.com/ | DFIN
/min-width:fit-content/ | DFIN (New)
/BRPFPage/ | Broadridge PROfile
/id="XBRLDIV"/ | XBRLMaster
```
---
## Sources
- Direct inspection of SEC EDGAR filings (March 2026)
- [secfilingdata.com/top-filing-agents](https://www.secfilingdata.com/top-filing-agents/) -- filing agent rankings
- [newstreetir.com -- Top SEC Filing Agents for Biotech IPOs](https://newstreetir.com/2025/05/14/who-are-the-top-sec-filing-agents-for-biotech-ipos/) -- biotech IPO market share
- [houseblend.io -- SEC Filing Software Platforms](https://www.houseblend.io/articles/sec-filing-software-platforms-pricing-compliance) -- vendor comparison
- [novaworkssoftware.com/inlinexbrl](https://www.novaworkssoftware.com/inlinexbrl.php) -- XDX format documentation
- [rdgfilings.com/thunderdome](https://rdgfilings.com/thunderdome-client-portal/) -- ThunderDome Portal
- [toppanmerrill.com/bridge](https://www.toppanmerrill.com/bridge/) -- Toppan Merrill Bridge
- [edgarmaster.com](https://edgarmaster.com/) -- EDGARMaster / XBRLMaster by NTDAS
- [pernasresearch.com -- DFIN analysis](https://pernasresearch.com/research-vault/donnelley-financial-initiation/) -- market share dynamics

Despite touching RMP (no program), Board Governance (board is responsible), and Strategy Integration (no incidents), the paragraph contains no substantive disclosure. The company explicitly has no program, and the board mention is perfunctory ("generally responsible... if any"). The absence of a program is not a program description. → **None/Other, Specificity 1.**
### Case 9: Generic regulatory compliance language
> *"Regulatory Compliance: The Company is subject to various regulatory requirements related to cybersecurity, data protection, and privacy. Non-compliance with these regulations could result in financial penalties, legal liabilities, and reputational damage."*
This acknowledges that regulations exist and non-compliance would be bad — a truism for every public company. It does not describe any process, program, or framework the company uses to comply. It does not make a materiality assessment. It names no specific regulation. → **None/Other, Specificity 1.**
The key distinctions:
- If the paragraph names a specific regulation (GDPR, HIPAA, PCI DSS, CCPA) but still describes no company-specific program → **Risk Management Process, Specificity 2** (named standard triggers Sector-Adapted)
- If the paragraph assesses whether regulatory non-compliance has "materially affected" the business → **Strategy Integration** (materiality assessment per Rule 6)
- If the paragraph describes what the company *does* to comply (audits, controls, certifications) → **Risk Management Process** at appropriate specificity
---
## Dimension 2: Specificity Level

- **Bullet list merging.** Disclosures frequently use bullet lists ("Our program includes: • risk assessment • vulnerability scanning"). Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless.
- **Continuation line detection.** Sentences split across HTML block elements need rejoining. Heuristic: if the previous block lacks terminal punctuation and the next starts lowercase or with a continuation phrase (`and`, `or`, `including`, `such as`), merge.
- **Length boundaries.** Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries to keep annotation units manageable.
- **Table-based bullet lists and the cascade failure.** Some generators (notably EFiling/XDX) render bullet lists as HTML tables with one `<td>` per bullet item, and use `&#183;` (middle dot in Symbol font) instead of the standard `&#8226;` bullet character. Since `stripHtml()` doesn't decode `&#183;` as a bullet marker, the bullet-aware merge logic never fires. Each bullet item starts lowercase ("establishing...", "maintaining..."), so the segmenter treats them as continuation fragments and merges them with the preceding block. This cascades: a Bancorp 34 filing had three separate elements — two bullet items about risk management processes and a standalone paragraph disclosing a $25,000 cybersecurity incident — concatenated into a single 114-word run-on sentence. The HTML structure was completely unambiguous (separate `<td>` and `<p>` elements with spacers), but the information was lost during text extraction. The data quality audit found 2,210 paragraphs with embedded bullet points across the corpus — most from this class of failure. These paragraphs are still classifiable (the models unanimously labeled this example as Incident Disclosure / Specificity 4), but the text quality is degraded.
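A sketch of the missing check, assuming entities are decoded with the stdlib `html.unescape` before the bullet-aware merge logic runs:

```python
import html

# Bullet characters to treat as list markers after entity decoding.
# \u2022 (&#8226;) is the standard bullet; \u00b7 (&#183;) is the Symbol-font
# middle dot that EFiling/XDX-style table bullets use.
BULLET_CHARS = "\u2022\u00b7"

def is_bullet_item(block: str) -> bool:
    """True if a text block starts with a bullet marker once entities decode."""
    text = html.unescape(block).lstrip()
    return bool(text) and text[0] in BULLET_CHARS
```

With this check in place, "&#183; establishing..." is recognized as a list item to merge with its intro sentence, instead of a lowercase fragment to glue onto the preceding block.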
### 8-K Extraction
--- ---
## Phase 10: Data Quality Audit and Corpus Remediation
### The Discovery
While preparing the DAPT corpus, we discovered that the paragraph data was less clean than we assumed. The extraction pipeline had been built to handle the worst HTML artifacts (word splits, XBRL tags, page breaks), but two systematic issues had been silently corrupting the training data:
1. **Orphan words.** HTML source wraps text at fixed column width. When a `<span>` tag consumes most of a line, only the first word fits before the source newline. `stripHtml()` preserved that newline, and the paragraph segmenter dropped the single-word fragment. Result: paragraphs like "sole executive officer and director is responsible for..." instead of "Our sole executive officer..." — 4.7% of all paragraphs.
2. **Inlined section headings.** The paragraph segmenter didn't strip sub-section headings ("Risk Management and Strategy", "Board Oversight") from paragraph body text. These headings became the first "sentence" of the paragraph. Result: 22% of paragraphs had section titles prepended to body text — a near-perfect predictor of `content_category` that creates shortcut learning risk.
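The heading-strip remediation described later boils down to a prefix match against known sub-headings. A sketch; the tuple here is a tiny illustrative subset, not the full heading list the patches use:

```python
# Illustrative subset of known Item 1C sub-headings.
KNOWN_HEADINGS = (
    "Risk Management and Strategy",
    "Board Oversight",
    "Governance",
)

def strip_inlined_heading(paragraph: str) -> str:
    """Remove a known sub-heading fused onto the start of a paragraph body."""
    for heading in KNOWN_HEADINGS:
        if paragraph.startswith(heading + " "):
            return paragraph[len(heading) + 1:]
    return paragraph
```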
### The Generator Investigation
Initial quality metrics showed 45% of filings in an "UNKNOWN" generator bucket. This felt wrong — SEC HTML comes from identifiable tools. We investigated and identified **14 distinct filing generators** covering 99.99% of 14,759 HTML files using meta tags, comments, namespace declarations, CSS patterns, and CIK-based filing agent lookup.
The investigation revealed that the worst-quality generator, **EFiling/EDGAR Agent (GoFiler/Novaworks XDX)**, had been hidden in the UNKNOWN bucket. It accounts for 13.5% of all filings but produces 36.8% orphan word rate (8x corpus average), the lowest paragraphs-per-filing (5.7 vs 7.7 avg), and 5.9% fragment rate. The second worst, **CompSci Transform** (6% of filings), had a 14.8% orphan word rate.
By contrast, the clean generators — Workiva (24.3%), Donnelley (15.8%), and Inline XBRL (16.4%) — all had <1% orphan word rates. Over 70% of paragraphs came from clean generators. The problem was concentrated, not uniform.
Full generator reference: `docs/EDGAR-FILING-GENERATORS.md`. Full audit findings: `docs/DATA-QUALITY-AUDIT.md`.
### Six Surgical Patches
All fixes follow the same principle: `paragraphs-clean.jsonl` is **frozen** — never modified. All fixes go through separate `.patched.jsonl` files. Annotations link by paragraph UUID, which never changes. Every patch is documented with scope, method, and validation.
| Patch | Method | Paragraphs | Annotated |
|-------|--------|-----------|-----------|
| 1-2. Orphan word restoration | HTML lookback: find paragraph text in stripped HTML, extract preceding word | 2,233 | 1,537 |
| 3. Heading strip (space separator) | Pattern match against 71 known Item 1C sub-headings | 7,514 | 5,013 |
| 4. Heading strip (colon separator) | "Heading Text: Sentence..." patterns | 370 | 227 |
| 5. Heading strip (period/dash/caps) | Extended separator detection | 184 | 133 |
| 6. HTML-confirmed headings | Bold/underline/h-tag extraction from source HTML, validated against paragraph starts | 343 | 270 |
| **Total** | | **8,411 headings + 2,233 orphans** | **~7,100 of 49,795 (14.3%)** |
The heading detection required five progressive passes because no single heuristic caught all separator styles. The HTML-confirmed pass (Patch 6) used a 32-worker parallel extraction script to scan 6,341 filings in 1.7 seconds, caching styled headings per filing for reuse.
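The separator-style passes can be sketched as a single pattern matcher. This is a simplified sketch: the heading list here is an illustrative subset, not the project's actual 71-entry list, and the real pipeline runs the separator styles as separate progressive passes.

```python
import re

# Illustrative subset; the production pipeline matches 71 known Item 1C sub-headings.
KNOWN_HEADINGS = [
    "Risk Management and Strategy",
    "Board Oversight",
    "Governance",
]

def strip_inlined_heading(text: str) -> str:
    """Remove a known sub-section heading prepended to paragraph body text.

    Covers the space, colon, and period/dash separator styles. Returns the
    text unchanged if no known heading matches at the start."""
    for heading in KNOWN_HEADINGS:
        # Heading, an optional separator, whitespace, then a capitalized sentence start.
        pattern = re.compile(
            r"^" + re.escape(heading) + r"\s*[:.\u2013\u2014-]?\s+(?=[A-Z\u201c\"])"
        )
        stripped = pattern.sub("", text, count=1)
        if stripped != text:
            return stripped
    return text
```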
### Orphan Word Re-Annotation
The orphan word patches weren't just cosmetic. Analysis revealed **label bias** in orphan-word paragraphs:
- Strategy Integration 1.55x over-represented (16.1% vs 10.4% baseline)
- Management Role 0.49x under-represented
- Board Governance 0.60x under-represented
Missing subject words like "Our", "We", "The" strip governance context that models rely on for classification. This suggested the original annotations on these paragraphs might be systematically wrong.
**Decision: re-run Stage 1 on patched text.** Cost: $3.30 for 4,611 annotations (1,537 paragraphs × 3 models), completed in ~9 minutes at 60 concurrency with zero failures.
**Results:**
- **119 paragraphs (7.7%)** changed consensus category — confirming the bias was real
- **37 paragraphs (2.4%)** changed consensus specificity
- **152 total (9.9%)** changed on at least one dimension
- mimo-v2-flash was most sensitive (14.6% category changes); gemini least affected (6.0%)
- 18 original conflicts resolved, 22 new conflicts introduced — roughly a wash on Stage 2 savings
- Top transitions: Management Role ↔ Risk Management Process (55/51 each direction), Strategy Integration → None/Other (46), Third-Party Risk → Risk Management Process (34)
The re-run annotations are stored separately in `data/annotations/stage1-orphan-rerun.jsonl` — the original `stage1.jsonl` is untouched. For training, the re-run annotations replace the originals for the affected 1,537 paragraphs.
### No-Cyber-Keyword Paragraphs: A False Alarm
The quality audit flagged 528 paragraphs (348 annotated) with no cybersecurity keywords at all — suspicious for Item 1C content. Initial expectation: these are section bleed from adjacent filing sections, probably labeled None/Other.
**Actual finding:** 65.2% (227 paragraphs) were labeled as real categories — mostly Risk Management Process (44.8%) and Management Role (10.6%). And the labels were **correct.** The paragraphs discuss security topics using synonymous terms: "risk assessment", "access to systems", "theft of intellectual property", "safeguards", "internal notifications" — all legitimate cybersecurity content that doesn't use the literal word "cybersecurity." The keyword filter was too narrow, not the paragraphs. All 348 are kept.
### Heading-Stripped Paragraphs: Labels Still Valid
For the ~5,643 annotated paragraphs where headings were stripped, existing labels are retained without re-annotation. The heading was a shortcut learning signal (near-perfect predictor of category), but annotators classified the body text, not the heading. Stripping the heading from training data removes a leaky feature without invalidating the label.
### Embedded Bullet Lists: The Cascade Failure
A spot-check of a Bancorp 34, Inc. paragraph revealed a class of structural corruption we hadn't detected. The paragraph read as a 114-word run-on:
> establishing and maintaining a comprehensive program to oversee and manager external connections and third-party relationships with access to the institution's technology assets maintaining an incident response program intended to enable us to mitigate the impact of, and recover from, any cyberattacks, and facilitate communication to internal and external experienced a single cybersecurity event in June of 2023...
The source HTML (filed via EFiling/XDX) had three clearly separate elements: two `<td>` bullet items about risk management processes, and a standalone `<p>` disclosing a $25,000 cybersecurity incident. The HTML structure was unambiguous — separate table rows with spacers between them.
**Root cause: a three-part cascade failure in the extraction pipeline.**
1. **Bullet character not recognized.** The HTML used `&#183;` (middle dot in Symbol font) instead of `&#8226;` (standard bullet). `stripHtml()` doesn't decode it, so the bullet-aware merge logic in the segmenter never fires.
2. **Lowercase continuation merge.** Each bullet starts lowercase ("establishing...", "maintaining..."), so the segmenter treats them as continuation fragments of the previous block.
3. **Short-block append.** Individual bullets fall below the 20-word minimum, so they get appended to the previous paragraph.
The result: two process-description bullet items and an incident disclosure fused into one incoherent paragraph. Despite this, all 3 Stage 1 models unanimously labeled it Incident Disclosure / Specificity 4 — the $25K incident detail dominated the merged text.
We identified two classes of this failure:
1. **Semicolon-separated merges (1,941 paragraphs):** The semicolons from the original list survived, but the bullet characters were stripped. Detectable by heuristic (3+ semicolons, lowercase after each, no bullet markers).
2. **Invisible merges (222 paragraphs):** Even the semicolons were stripped, leaving text that simply runs together with no trace of the original list structure. The Bancorp 34 example falls in this category — "to internal and external experienced a single cybersecurity event" is an impossible English sentence that a regex cannot distinguish from legitimate prose. These were detected by a secondary heuristic (lowercase-start, not orphan-patched, 60+ words), but this is an undercount — some invisible merges start with uppercase text.
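The class-1 heuristic can be sketched as follows (the bullet character set is illustrative; the production detector lives in the TypeScript pipeline):

```python
import re

BULLET_CHARS = ("\u2022", "\u00b7", "\u25cf", "\u25aa")  # •, ·, ●, ▪

def looks_like_merged_list(text: str) -> bool:
    """Heuristic for semicolon-separated bullet merges: 3+ semicolons, each
    followed by a lowercase continuation, and no surviving bullet characters
    (paragraphs with bullet markers are handled by the bullet-aware merge logic)."""
    if any(ch in text for ch in BULLET_CHARS):
        return False
    return len(re.findall(r";\s+[a-z]", text)) >= 3
```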
All 2,163 were reclassified to "degraded" tier. These aren't worth patching — splitting merged bullets requires per-paragraph HTML structure analysis and re-annotation of every resulting fragment. Instead, they'll be downweighted (0.5x) during fine-tuning to reduce overfitting to degraded text patterns while preserving their content signal.
### Sample Weighting for Fine-Tuning
The quality tier system maps directly to training sample weights:
| Tier | Weight | Rationale |
|------|--------|-----------|
| clean | 1.0 | No issues |
| headed | 1.0 | Heading removed, body text intact |
| minor | 1.0 | Orphan word restored |
| degraded | 0.5 | Labels likely correct, but text structure doesn't match clean inference-time inputs |
This is implemented via a `sample_weight` column in the training dataset. The HuggingFace Trainer supports per-sample loss weighting — each sample's cross-entropy loss is multiplied by its tier weight before backpropagation. Degraded paragraphs still contribute to learning, but their influence is halved relative to clean data.
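The weighting arithmetic, shown framework-free as a minimal sketch (the actual implementation multiplies each sample's loss inside the HuggingFace Trainer before reduction; this just makes the math concrete):

```python
import math

TIER_WEIGHTS = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def weighted_cross_entropy(probs: list[list[float]],
                           labels: list[int],
                           tiers: list[str]) -> float:
    """Mean cross-entropy with each sample's loss scaled by its quality-tier
    weight. probs[i] is the predicted class distribution for sample i."""
    total = 0.0
    for p, y, tier in zip(probs, labels, tiers):
        total += -math.log(p[y]) * TIER_WEIGHTS[tier]
    return total / len(labels)
```

A degraded sample thus contributes exactly half the gradient signal of a clean sample with the same prediction error.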
### Data Integrity Framework
The audit produced a formal data integrity framework:
1. `paragraphs-clean.jsonl` is frozen — the reproducibility anchor
2. All fixes go through `.patched.jsonl` — same schema, same IDs, updated text and hash
3. Annotations link by UUID — stable across patches
4. Never re-run extraction from HTML — cascade effects from merge logic cause thousands of ripple-effect changes
5. Every patch is documented with scope, method, validation, and annotation impact
6. Quality metadata is separate from text data — per-paragraph quality scores in a separate file
### Quality Tier System
Each paragraph gets a quality tier based on detected issues:
| Tier | Criteria | Count | % |
|------|----------|-------|---|
| clean | No detected issues | 58,165 | 80.7% |
| headed | Had inlined heading (now stripped) | 7,402 | 10.3% |
| degraded | Embedded bullets, invisible merges, fragments, truncations | 4,331 | 6.0% |
| minor | Had orphan word (now fixed) | 2,147 | 3.0% |
All "headed" and "minor" paragraphs have been patched — the tier records what *was* wrong for traceability. "Degraded" paragraphs are downweighted (0.5x) during fine-tuning.
---
## Phase 11: DAPT Corpus Preparation
### Corpus Cleaning
The DAPT corpus is built from 14,759 cached 10-K HTML filings processed through `stripHtml()` + `cleanForDapt()`. Three rounds of cleaning were required:
**Round 1** revealed XBRL data blobs (8.7% of docs, up to 33% of document text), page number artifacts, and exhibit listing boilerplate. Added targeted stripping for `iso4217:`, `xbrli:`, CIK-number sequences, and `F-N` page markers.
**Round 2** removed URLs (39% of docs → 0.3%) and XBRL exhibit listing lines ("Inline XBRL Taxonomy Extension Calculation Linkbase Document" — present in 85% of filings). Initial investigation claimed these were "legitimate prose mentions of XBRL." Spot-checking showed every single remaining match was exhibit index boilerplate. Stripped any line containing "XBRL" unless it also contained cybersecurity/risk/governance terms.
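The Round 2 rule can be sketched as a simple line filter. The keep-term list below is illustrative, not the production list:

```python
KEEP_TERMS = ("cybersecurity", "cyber", "security", "risk", "governance")

def filter_xbrl_lines(text: str) -> str:
    """Drop lines mentioning XBRL unless they also contain a cybersecurity/
    risk/governance term (a legitimate prose co-occurrence)."""
    kept = []
    for line in text.splitlines():
        lower = line.lower()
        if "xbrl" in lower and not any(t in lower for t in KEEP_TERMS):
            continue  # exhibit-index boilerplate
        kept.append(line)
    return "\n".join(kept)
```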
**Round 3** was a verification pass confirming the remaining 7.4% of docs with "XBRL" traces are legitimate prose co-occurrences with security terms.
The page number regex initially had a branch matching `[- ]\d{1,3}[- ]` that produced 100% false positives — it was matching negative financial figures (`-1%`) in sensitivity analysis tables. Only the `F-\d+` pattern was genuine. The false-positive branch was removed.
### Corpus Statistics (Final)
| Metric | Value |
|--------|-------|
| Documents | 14,756 (14,568 after <10K filter) |
| Total tokens | ~1.056 billion (ModernBERT tokenizer) |
| Median document | ~73K tokens (347K chars) |
| Training sequences (seq_len=8192) | ~136K |
| Steps per epoch (eff. batch=32) | ~4,257 |
| Estimated training time | ~4-8 hours per epoch (RTX 3090) |
### Sequence Length Decision
ModernBERT was pre-trained at 8192 tokens. We match this during DAPT to ensure all positional embedding and attention weights receive gradient updates. At seq_len=2048, positions 2048-8191 would get no updates. The tradeoff — batch_size drops from 4 to 1, compensated by gradient_accumulation=32 — results in comparable training time because 4x fewer steps offset slower per-step throughput.
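The batch arithmetic behind these numbers (corpus figures are approximate):

```python
import math

def steps_per_epoch(num_sequences: int, per_device_batch: int,
                    grad_accum: int) -> int:
    """Optimizer steps per epoch at a given effective batch size."""
    return math.ceil(num_sequences / (per_device_batch * grad_accum))

# seq_len=8192 fits batch_size=1 on the 3090; grad_accum=32 keeps the
# effective batch at 32. ~136K sequences / 32 gives roughly the ~4,257
# steps-per-epoch figure in the table above.
print(steps_per_epoch(136_000, 1, 32))  # 4250
```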
### Epoch Decision
We train for 1 epoch (single pass), following the empirical consensus:
- **Gururangan et al. (2020), "Don't Stop Pretraining" (ACL):** Used a single pass over 2-8B token domain corpora. Sufficient for consistent downstream gains across all four domains tested.
- **Ponnock (2025), arXiv:2512.12384:** Found SEC-specific DAPT shows "diminishing marginal returns beyond roughly 250M tokens" within a single epoch. Our 1B token corpus is well past the diminishing-returns threshold.
Full procedure documented in `docs/DAPT-PROCEDURE.md`.
---
## Cost and Time Ledger
### Tooling
| Stage 1 run #1 (with nano) | $112.42 | 150,009 | Full production run with gpt-5.4-nano. Completed, but nano's quality was unacceptable (0 reasoning tokens 64% of the time). Gemini+grok annotations ($91.18) preserved in `stage1-gemini-grok.jsonl`; only nano's annotations ($21.24) were discarded. Full original in `stage1.jsonl.bak`. |
| Stage 1 run #2 (mimo only) | $24.69 | 50,003 | Ran only mimo to replace nano. Merged with preserved gemini+grok annotations to form final `stage1.jsonl` ($115.88 total value, $24.69 new spend). |
| Judge model bench (8 candidates) | $5.97 | 505 | GLM-5 (4 configs), gpt-5.4-mini, gpt-5.4, sonnet-4.6, gemini-3-flash, grok-4.20, mimo-v2-pro, kimi-k2.5 |
| Orphan word re-annotation | $3.30 | 4,611 | Re-ran Stage 1 on 1,537 patched paragraphs × 3 models. 7.7% changed consensus category. |
| **Total API spend** | **$159** | **~218K unique** | Nano waste: $21.24 |
Only nano's portion ($21.24) of the first run was wasted — the gemini and grok annotations were preserved and merged with the new mimo annotations. Still, $21.24 thrown away on a model that wasn't thinking. The lesson: benchmark model candidates rigorously *before* committing to a production run. The 40-sample pilots showed nano was the weakest link but were misleadingly optimistic about the magnitude of the problem.
| Stage 1 annotation run #2 (mimo) | ~1h | Only needed mimo annotations at higher concurrency (gemini+grok reused). |
| Prompt iteration + model benchmarking | ~4h | 12+ prompt versions, 6 model candidates, pilot analysis |
| Post-Stage 1 analysis + Stage 2 planning | ~5h | Distributional analysis, model bias discovery, codebook v3.0 rulings, judge benchmarking, strategy revision |
| Data quality audit + remediation | ~4h | Generator investigation, 6 patches, orphan re-annotation, quality tier system, docs |
| Documentation + narrative | ~2h | Codebook updates, narrative writing, technical guide updates |
| Labelapp build + infrastructure | ~8h | Monorepo restructure, Next.js app, quiz/warmup/labeling flows, BIBD assignment, sampling, Docker deployment, timer + migration infrastructure |
| **Total to date** | **~35h** | |
### Remaining Work (estimated)
| Human labeling (1,200 paragraphs, 6 annotators) | ~6-8h | $0 (team labor) |
| Stage 2 judge production run (~3-5K paragraphs) | ~1h | ~$20-40 |
| Training data assembly | ~2h | $0 |
| DAPT pre-training (1 epoch) | ~4-8h GPU | $0 (own 3090) |
| TAPT pre-training | ~2-3h GPU | $0 |
| Fine-tuning + ablations (7 experiments) | ~12-20h GPU | $0 |
| Full GenAI benchmark on 1,200 holdout (9 models) | ~1h | ~$30-50 |
| Gold adjudications | `data/bench/judges/gold-adjudicated.json` | 11 detailed adjudication decisions with reasoning |
| Stage 1 prompt | `ts/src/label/prompts.ts` | SYSTEM_PROMPT (v2.5) + buildJudgePrompt() |
| Annotation runner | `ts/scripts/stage1-run.ts` | Resume-safe, configurable concurrency |
| Orphan re-annotation | `ts/scripts/rerun-orphan-stage1.ts` | Re-ran 1,537 patched paragraphs, $3.30 |
| Re-annotation diff | `ts/scripts/diff-orphan-annotations.ts` | Category/specificity change analysis |
| No-cyber analysis | `ts/scripts/analyze-no-cyber.ts` | Label distribution on 348 flagged paragraphs |
| Data quality audit | `docs/DATA-QUALITY-AUDIT.md` | Full audit: generators, patches, quality tiers |
| Generator reference | `docs/EDGAR-FILING-GENERATORS.md` | 14 vendors with signatures and quality profiles |
| Analysis scripts | `ts/scripts/stage1-analyze.ts`, `segment-analysis.ts`, `model-bias-analysis.ts`, `dispute-crosstab.ts`, `sample-disputes.ts` | Deep analytics on annotation data |
| Judge benchmarking | `ts/scripts/judge-bench.ts` | Supports structured/tool modes, gold label comparison |
| Judge diagnostics | `ts/scripts/judge-diag.ts`, `judge-diag-batch.ts` | GLM-5 failure investigation |
- Systematic model biases are quantifiable and predictable. Use them as signal, not noise.
- Codebook ambiguity causes more disagreement than model limitations. Three codebook rulings resolved more disputes than any prompt change.
- Not all labels need the same treatment. Confidence-stratified assembly beats uniform labeling.
- **Freeze originals, patch separately.** The single best data integrity decision was never modifying `paragraphs-clean.jsonl`. All fixes go through `.patched.jsonl` with the same UUIDs. This makes every change auditable, reversible, and safe to apply incrementally. Without this, the 6-patch iteration would have been terrifying.
- **Tag everything you can.** Generator metadata, quality tiers, and anomaly flags cost almost nothing to compute but make targeted remediation possible. Without generator tags, the 36.8% orphan rate in EFiling/XDX would have been invisible — diluted into a 4.7% corpus average.
- **Re-annotation is cheap and validating.** Re-running Stage 1 on 1,537 patched paragraphs cost $3.30 and took 9 minutes. It confirmed that 7.7% of consensus labels were wrong due to the data issue — an empirical validation that the patch was necessary, not just cosmetic.
`docs/SEC-HTML-CLEANING.md` (new file)
# SEC Filing HTML Cleaning — Lessons & Pitfalls
Everything we've learned about cleaning SEC EDGAR HTML for text extraction, specifically for Item 1C (Cybersecurity) from 10-K filings. These lessons likely apply to any SEC filing text extraction pipeline.
## The HTML landscape
SEC filings come from thousands of different filers using dozens of different tools (Workiva/Toppan Merrill, Donnelley Financial, various legal/accounting software). There is no standard HTML structure. The same semantic content — a paragraph of body text — can appear as:
- `<p><span style="...">Text here</span></p>`
- `<div><font face="..." size="...">Text here</font></div>`
- Nested XBRL inline tags: `<ix:nonNumeric><p><span>Text</span></p></ix:nonNumeric>`
- Table-based layouts: `<table><tr><td><span>Text</span></td></tr></table>`
- Deeply nested `<div>` structures with inline styles
The only constant: it will be ugly.
## Inline element newlines (the orphan word problem)
**The bug:** Many filing generators produce HTML where the first word of a paragraph is on its own line within a `<span>` tag:
```html
<p><span style="font-family: Times New Roman; font-size: 10pt">Our
sole executive officer and director is responsible for assessing and
managing cybersecurity risks...</span></p>
```
When this is stripped to plain text, `Our` ends up on its own line. If downstream processing splits on newlines and filters short lines (< 20 words), `Our` is silently dropped. The paragraph becomes `sole executive officer and director is responsible...` missing its subject.
**Prevalence:** ~1.4% of filings (156/11,299) have this pattern in their Item 1C section. It produces ~2,500 affected paragraphs across the corpus.
**Common orphaned words:** `We` (73), `Our` (37), `The` (5), `To` (17), `As` (15), `In` (13), `Cybersecurity` (10), `Management` (6), `Following` (6). Basically any sentence-starting word.
**Why it happens:** The filing generator wraps text at a fixed column width in the HTML source. If the `<span>` opening tag + attributes eat most of a line, only the first word fits before the line break. The browser renders this identically (HTML treats source newlines as whitespace), but text extraction that preserves newlines from inline elements breaks.
**Detection (for patching existing data):** Match the pattern `<span...>Word\nlowercase continuation...` directly in the raw HTML. Three validation layers are needed:
1. **Same-tag check:** The orphan word and continuation must be within the same inline element (`<span>`, `<a>`, `<font>`, etc.). This distinguishes orphan first-words from section headings above paragraphs. Critically, exclude `<ix:...>` XBRL tags — these are structural, not inline, and their first text is often a section title.
2. **Bold/underline filter:** Skip matches inside `<b>`, `<strong>`, or `text-decoration: underline`. These are section headings that happen to have a line break mid-heading (e.g., `<b>Risk\nManagement and Strategy</b>`). Without this filter, headings get inlined into body text.
3. **Stripped-text validation:** After finding an orphan word in the raw HTML, confirm it exists as a standalone word in the `stripHtml()` output. This catches mid-word splits across adjacent spans (see below).
**Case-sensitivity matters:** If using a regex with the `i` (case-insensitive) flag for tag name matching, the `[a-z]` check on the continuation text becomes meaningless — it will match uppercase too, letting headings through. Either drop the `i` flag (and match tags as `[Ss][Pp][Aa][Nn]` etc.) or validate continuation case separately.
**Prevention (for future extractions):** In the paragraph segmenter, buffer single-word blocks that would otherwise be dropped (below minimum word count) and prepend them to the next block when it starts lowercase. This must happen at the segmentation stage, not in the extraction merge logic — changes to merge behavior cascade through downstream paragraph boundary decisions.
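The prevention rule can be sketched as a pre-pass over segmenter blocks. This is a Python simplification of the idea (the real segmenter is TypeScript and also applies minimum-word filters and merge passes after this point):

```python
def merge_orphan_blocks(blocks: list[str]) -> list[str]:
    """Prepend single-word blocks to the next block when it starts lowercase,
    instead of silently dropping them (the orphan word problem above)."""
    out: list[str] = []
    pending = ""  # buffered single-word block awaiting its continuation
    for block in blocks:
        if len(block.split()) == 1:
            pending = (pending + " " + block).strip()
            continue
        if pending and block[:1].islower():
            block = pending + " " + block   # restore the orphaned sentence start
        elif pending:
            out.append(pending)             # uppercase follow-on: likely a heading, keep separate
        pending = ""
        out.append(block)
    if pending:
        out.append(pending)
    return out
```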
## Mid-word splits across adjacent spans
**The bug:** Some filing generators split a single word across multiple `<span>` tags, sometimes with empty formatting spans between them:
```html
<span style="font-size: 10pt">B</span>
<span style="font-size: 8pt"></span>
<span style="font-size: 10pt">lackrock
maintains a comprehensive cybersecurity risk management program...</span>
```
The HTML cleaner's adjacent-inline-boundary collapse correctly joins `B` + `lackrock` into `Blackrock` in the stripped text. But if a patching script operates on raw HTML (to find orphan patterns), it sees `<span>lackrock\nmaintains...` and incorrectly treats `lackrock` as an orphan word, prepending it to produce `lackrock maintains...` instead of the correct `Blackrock maintains...`.
**Detection:** After finding a candidate orphan word in raw HTML, verify it exists as a standalone word (surrounded by whitespace or at line boundaries) in the stripped text. If `stripHtml()` produces `Blackrock` (not `lackrock`), the candidate is a word fragment, not an orphan.
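The validation check, sketched in Python (the production pipeline is TypeScript):

```python
import re

def is_standalone_word(word: str, stripped_text: str) -> bool:
    """Validation layer 3: the candidate orphan must appear as a whole word
    in the stripHtml() output; otherwise it is a mid-word span fragment."""
    return re.search(r"(?<!\S)" + re.escape(word) + r"(?!\S)", stripped_text) is not None
```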
**Root cause:** The filing generator uses separate spans for styling changes (font-size) that happen to fall at character boundaries within words. The empty `<span style="font-size: 8pt"></span>` is a zero-width formatting artifact.
## Adjacent inline element boundaries
**The bug:** Different formatting applied to adjacent text creates word-joining when tags are stripped:
```html
<span style="color: black">word</span><span style="color: blue">The next word</span>
```
Naively stripping tags produces `wordThe next word`. The words at the span boundary merge.
**Fix:** Before stripping tags, collapse adjacent inline element boundaries to spaces:
```js
.replace(/<\/(span|a|b|i|u|em|strong|font)>(\s*)<(?:span|a|b|i|u|em|strong|font)[^>]*>/gi,
(_m, _tag, ws) => ws.length > 0 ? " " : "")
```
This replaces `</span><span>` (and similar) with a space, preventing word joins. The whitespace check (`ws.length > 0`) handles cases where whitespace already exists between tags.
Same treatment needed for XBRL inline tags (`</ix:nonNumeric><ix:nonNumeric>`).
## Source newlines vs block-element breaks
**The issue:** HTML source files contain newlines in two semantically different roles:
1. **Block-element breaks:** `</p>`, `</div>`, `<br>` — these are paragraph boundaries
2. **Source line wrapping:** Newlines within inline elements from the filing generator's line-length limit — these are meaningless whitespace
Both become `\n` in the stripped text. The extraction pipeline relies on newlines to separate paragraphs, so collapsing all newlines breaks paragraph detection. But preserving all newlines creates the orphan word problem.
**The tradeoff:** We chose to preserve newlines (they're needed for paragraph boundary detection in the extraction pass). The orphan word problem is handled downstream in the segmenter. An alternative (sentinel-based) approach — using `\x00` for block breaks, collapsing source newlines to spaces, then restoring sentinels — was tested but caused too many changes to paragraph segmentation across the corpus (18,589 paragraphs changed text in regression testing).
## XBRL inline tags (iXBRL / `ix:` namespace)
**What they are:** Starting in 2024, SEC filings use Inline XBRL to tag structured data directly in HTML. The `cyd:` taxonomy covers cybersecurity disclosures. Tags like `<ix:nonNumeric name="cyd:CybersecurityRiskManagementProcessesIntegratedTextBlock">` wrap entire sections.
**Pitfalls:**
- **Not inline formatting:** Despite being inline XML elements, `ix:` tags are structural — they wrap paragraphs, sections, even entire Items. Treating them like `<span>` for orphan detection will match section headings.
- **XBRL metadata leaks into text:** CIK numbers (`0000123456`), namespace URIs (`xbrli:`, `fasb.org`), ticker-date identifiers (`ae-20231231`) can appear in the text stream. Filter lines where >50% of tokens look like XBRL metadata.
- **`continuedAt` chains:** Long sections are split across multiple `ix:continuation` blocks. These can interrupt the visual flow of text.
## Running headers/footers and page artifacts
SEC HTML often retains print-formatting artifacts:
| Pattern | Example | Detection |
|---------|---------|-----------|
| Page numbers | `17`, `- 17 -`, `Page 17` | Regex: `/^[-–—\s]*[A-Za-z]?[-–—]?\s*\d+[-–—\s]*$/` |
| Running headers | `ACME CORP FORM 10-K` | Short line + company name + form type |
| Table of contents markers | `Table of Contents` | Exact match, strip trailing content |
| Back-to-top links | `(Back to Index)` | Regex: `/back\s+to\s+(index|top|toc)/i` |
| Part headings | `PART II` | Short line, roman numerals |
These appear mid-text because they're print-layout remnants. Filter them in the extraction pass, before paragraph segmentation.
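A sketch of the filter, using the table's patterns. Running-header detection is omitted here because it needs per-filing company-name context:

```python
import re

# Approximations of the detection rules in the table above.
ARTIFACT_PATTERNS = [
    re.compile(r"^[-\u2013\u2014\s]*[A-Za-z]?[-\u2013\u2014]?\s*\d+[-\u2013\u2014\s]*$"),  # page numbers
    re.compile(r"^table of contents$", re.IGNORECASE),
    re.compile(r"back\s+to\s+(index|top|toc)", re.IGNORECASE),
    re.compile(r"^part\s+[ivxlc]+$", re.IGNORECASE),  # PART II etc.
]

def is_page_artifact(line: str) -> bool:
    line = line.strip()
    return any(p.search(line) for p in ARTIFACT_PATTERNS)
```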
## Subsidiary headers in combined filings
Holding companies file combined 10-Ks covering multiple subsidiaries. Each subsidiary section repeats a header:
```
ENTERGY ARKANSAS, LLC AND SUBSIDIARIES
```
These are ALL-CAPS, contain entity suffixes (LLC, INC, CORP, L.P.), and include "AND SUBSIDIARIES". Filter with:
```js
/^[A-Z][A-Z\s,.'&-]{5,}(?:LLC|INC|CORP|COMPANY|L\.?P\.?)\b.*\bAND\s+SUBSIDIARIES\b/
```
## PDF extraction artifacts
Some filings are PDF-converted-to-HTML, producing:
- **Missing spaces:** `word.Next` → fix with `/([a-z])\.([A-Z])/g`
- **CamelCase joins:** `wordThe next` → fix common English words: `/([a-z])(The|Our|We|This|...)\b/g`
- **Orphaned punctuation:** `Director ,` → fix with `/ ([,;:.!?)])/g`
- **Colon joins:** `word:Word` → fix with `/([a-z]):([A-Z])/g`
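These four fixes chain as ordered substitutions. The starter-word list below is an illustrative subset of the "common English words" heuristic, not the full list:

```python
import re

COMMON_STARTERS = r"(The|Our|We|This|These|In|As|For)"  # illustrative subset

def fix_pdf_artifacts(text: str) -> str:
    """Apply the four repair patterns above in order."""
    text = re.sub(r"([a-z])\.([A-Z])", r"\1. \2", text)                  # missing spaces
    text = re.sub(r"([a-z])" + COMMON_STARTERS + r"\b", r"\1 \2", text)  # camelCase joins
    text = re.sub(r" ([,;:.!?)])", r"\1", text)                          # orphaned punctuation
    text = re.sub(r"([a-z]):([A-Z])", r"\1: \2", text)                   # colon joins
    return text
```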
## Entity decoding
SEC HTML uses a mix of named entities, decimal entities, and hex entities. Common ones to handle:
```
&nbsp; &#160; &#xa0; → space
&amp; → &
&mdash; &#8212; &#151; → —
&ndash; &#8211; &#150; → –
&rsquo; &#8217; &#146; → ' (right single quote, used as apostrophe)
&ldquo; &rdquo; → " (curly quotes)
&bull; &#8226; &#149; → •
&#153; → ™
```
Some filings use the Greek question mark (U+037E) instead of a semicolon — looks identical but breaks regex.
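In Python, `html.unescape` covers all three entity styles, including the Windows-1252 numeric range (e.g. `&#151;`), because it follows the HTML5 rules for invalid character references. The Greek question mark needs an explicit extra pass:

```python
import html

def decode_entities(text: str) -> str:
    """Decode named, decimal, and hex entity references per HTML5 rules,
    then normalize the Greek question mark (U+037E) to a real semicolon."""
    return html.unescape(text).replace("\u037e", ";")
```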
## Truncation detection
The extraction pipeline caps output at 50 blocks / 15,000 words. Filings that hit this cap may be truncated. Detection: check if the last paragraph of each filing ends with terminal punctuation (`[.!?;")]\s*$`). If not, the filing was likely cut mid-sentence — remove all its paragraphs from the training corpus.
**Limitation:** This only catches truncation at sentence boundaries. If the cap happens to fall at a sentence end, the filing appears complete even though content was lost. No fix for this without comparing against the full filing length.
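Sketch of the check, using the terminal-punctuation pattern above:

```python
import re

TERMINAL = re.compile(r'[.!?;")]\s*$')

def looks_truncated(last_paragraph: str) -> bool:
    """True if a filing's final paragraph lacks terminal punctuation,
    i.e. it was likely cut mid-sentence by the 50-block / 15,000-word cap."""
    return TERMINAL.search(last_paragraph.rstrip()) is None
```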
## Merge logic and cascade effects
The extraction pipeline merges short/broken lines in multiple passes. **Any change to merge logic cascades:** merging two lines changes the resulting line's length, which affects whether subsequent lines trigger length-based merge thresholds, which changes the next merge decision, etc.
In regression testing, a single-word forward-merge change in the extraction pass caused 1,812 ripple-effect text changes across the corpus. Moving the fix to the segmentation stage (after all extraction merges complete) reduced ripples but still affected ~800 paragraphs.
**Lesson:** For retroactive data fixes, prefer surgical data patching (find-and-prepend on the JSONL) over re-running extraction. For future extraction, place fixes as late in the pipeline as possible to minimize cascade.
## Testing extraction changes
When modifying the HTML cleaner, extraction, or segmentation code, regression test against the full corpus:
1. Re-extract all cached HTML files with the modified code
2. Compare against existing paragraphs by `(accessionNumber, paragraphIndex)`
3. Classify changes:
- **Clean prefix** (new text ends with old text) — orphan word recovered
- **Clean suffix** (new text starts with old text) — fragment absorbed
- **Re-merge** (text differs in other ways) — cascade/ripple effect
- **Paragraph count change** — boundary shift, highest-risk regression
4. Investigate any paragraph count decreases and text shrinkages — these are the most likely regressions
For the orphan word fix, acceptable results were: 215 clean prefix fixes, 0 paragraph count changes, 0 text shrinkages.
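The change taxonomy in step 3 reduces to simple string checks per `(accessionNumber, paragraphIndex)` pair (a sketch; paragraph count changes are detected separately by comparing counts per filing):

```python
def classify_change(old: str, new: str) -> str:
    """Classify a paragraph-text diff per the regression taxonomy above."""
    if new == old:
        return "unchanged"
    if new.endswith(old):
        return "clean prefix"   # orphan word recovered at the start
    if new.startswith(old):
        return "clean suffix"   # trailing fragment absorbed at the end
    return "re-merge"           # cascade/ripple effect; inspect manually
```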

## python/audit_corpus.py (new file, 248 lines)
"""
Quality audit of the SEC-cyBERT DAPT training corpus.
Reads sharded JSONL files and performs qualitative checks on document content.
READ-ONLY: does not modify any files.
"""
import json
import os
import random
import re
import sys
from pathlib import Path
CORPUS_DIR = Path(__file__).resolve().parent.parent / "data" / "dapt-corpus"
SHARDS = sorted(CORPUS_DIR.glob("shard-*.jsonl"))
random.seed(42)
def load_all_docs() -> list[dict]:
"""Load all documents from all shards."""
docs = []
for shard in SHARDS:
with open(shard) as f:
for line in f:
line = line.strip()
if line:
docs.append(json.loads(line))
return docs
def separator(title: str) -> None:
print("\n" + "=" * 80)
print(f" {title}")
print("=" * 80 + "\n")
def audit_smallest(docs: list[dict]) -> None:
separator("1. SMALLEST 20 DOCUMENTS (by chars)")
sorted_docs = sorted(docs, key=lambda d: d["chars"])
for i, doc in enumerate(sorted_docs[:20], 1):
text = doc["text"]
print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} | words={doc['words']} ---")
# Show full text for tiny docs, cap at 2000 chars
display = text if len(text) <= 2000 else text[:2000] + "\n... [TRUNCATED]"
print(display)
print()
def audit_largest(docs: list[dict]) -> None:
separator("2. LARGEST 5 DOCUMENTS (first/last 500 chars)")
sorted_docs = sorted(docs, key=lambda d: d["chars"], reverse=True)
for i, doc in enumerate(sorted_docs[:5], 1):
text = doc["text"]
print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} | words={doc['words']} ---")
print("FIRST 500 CHARS:")
print(text[:500])
print("\n... [GAP] ...\n")
print("LAST 500 CHARS:")
print(text[-500:])
print()
def audit_mid_samples(docs: list[dict]) -> None:
separator("3. RANDOM MID-DOCUMENT SAMPLES (10 docs, 500 chars from 50% point)")
sample = random.sample(docs, 10)
for i, doc in enumerate(sample, 1):
text = doc["text"]
mid = len(text) // 2
start = max(0, mid - 250)
end = min(len(text), mid + 250)
print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} ---")
print(text[start:end])
print()
def audit_xbrl_contamination(docs: list[dict]) -> None:
separator("4. XBRL-CONTAMINATED STARTS (first 200 chars with XBRL patterns)")
xbrl_pattern = re.compile(
r"(0000\d{6}|xbrli:|fasb\.org|us-gaap:|dei:|srt:|^\d{4}-\d{2}-\d{2}\s*$)",
re.MULTILINE,
)
found = []
for doc in docs:
first200 = doc["text"][:200]
if xbrl_pattern.search(first200):
found.append(doc)
if len(found) >= 10:
break
if not found:
print("No XBRL-contaminated documents found in initial scan.")
print("Trying broader pattern...")
# Try a broader search
broad_pattern = re.compile(r"(xmlns|xbrl|0001\d{6})", re.IGNORECASE)
for doc in docs:
first200 = doc["text"][:200]
if broad_pattern.search(first200):
found.append(doc)
if len(found) >= 10:
break
for i, doc in enumerate(found[:10], 1):
text = doc["text"]
print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} ---")
print("FIRST 500 CHARS:")
print(text[:500])
# Find where XBRL junk ends and real text begins
# Look for "UNITED STATES" or "FORM 10-K" as transition marker
for marker in ["UNITED STATES", "FORM 10-K", "FORM 10-k", "ANNUAL REPORT"]:
idx = text.find(marker)
if idx > 0 and idx < 5000:
print(f"\n >> Transition to real text at char {idx} (marker: '{marker}')")
break
print()
def audit_short_lines(docs: list[dict]) -> None:
separator("5. DOCS WITH MOST SHORT LINES (<10 chars, excluding empty)")
scored = []
for doc in docs:
lines = doc["text"].split("\n")
non_empty = [l for l in lines if l.strip()]
short = [l for l in non_empty if 0 < len(l.strip()) < 10]
if non_empty:
ratio = len(short) / len(non_empty)
scored.append((ratio, len(short), len(non_empty), doc, short))
scored.sort(key=lambda x: x[0], reverse=True)
for i, (ratio, n_short, n_total, doc, short_lines) in enumerate(scored[:10], 1):
print(
f"--- #{i} | accession={doc['accession']} | ratio={ratio:.2%} "
f"| {n_short}/{n_total} short lines ---"
)
# Show 20 short lines with surrounding context
text = doc["text"]
lines = text.split("\n")
shown = 0
for j, line in enumerate(lines):
stripped = line.strip()
if 0 < len(stripped) < 10 and shown < 20:
# Show line with 1 line of context on each side
ctx_start = max(0, j - 1)
ctx_end = min(len(lines), j + 2)
for k in range(ctx_start, ctx_end):
prefix = ">>>" if k == j else " "
print(f" {prefix} L{k+1}: {lines[k][:100]}")
print()
shown += 1
print()
def audit_transitions(docs: list[dict]) -> None:
separator("6. TRANSITION ZONES (SEC cover page -> company content)")
# Find docs that have the SEC header
candidates = [d for d in docs if "SECURITIES AND EXCHANGE COMMISSION" in d["text"][:2000]]
sample = random.sample(candidates, min(5, len(candidates)))
for i, doc in enumerate(sample, 1):
text = doc["text"]
idx = text.find("SECURITIES AND EXCHANGE COMMISSION")
if idx < 0:
continue
# Find end of cover page area — look for company-specific content markers
# like "Item 1" or "PART I" or "Table of Contents"
transition_markers = ["Item 1", "ITEM 1", "PART I", "TABLE OF CONTENTS", "Table of Contents"]
transition_idx = -1
for marker in transition_markers:
t = text.find(marker, idx + 100)
if t > 0 and (transition_idx < 0 or t < transition_idx):
transition_idx = t
if transition_idx > 0:
start = max(0, transition_idx - 250)
end = min(len(text), transition_idx + 250)
print(f"--- #{i} | accession={doc['accession']} ---")
print(f"Cover page at char {idx}, transition at char {transition_idx}")
print(f"SHOWING chars {start}-{end}:")
print(text[start:end])
else:
# Just show around the SEC header
start = max(0, idx - 50)
end = min(len(text), idx + 450)
print(f"--- #{i} | accession={doc['accession']} ---")
print(f"Cover page at char {idx}, no clear transition marker found")
print(text[start:end])
print()
def audit_financial_tables(docs: list[dict]) -> None:
separator("7. FINANCIAL TABLE QUALITY (>30% lines with $ or mostly numeric)")
scored = []
dollar_or_numeric = re.compile(r"(\$|^\s*[\d,.\-()]+\s*$)")
for doc in docs:
lines = doc["text"].split("\n")
non_empty = [l for l in lines if l.strip()]
if not non_empty:
continue
matching = sum(1 for l in non_empty if dollar_or_numeric.search(l))
ratio = matching / len(non_empty)
if ratio > 0.30:
scored.append((ratio, doc))
scored.sort(key=lambda x: x[0], reverse=True)
for i, (ratio, doc) in enumerate(scored[:5], 1):
text = doc["text"]
print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} | numeric ratio={ratio:.1%} ---")
# Find a dense numeric section
lines = text.split("\n")
# Find a window of 20 lines with the most dollar/numeric content
best_start = 0
best_count = 0
window = 20
for j in range(len(lines) - window):
count = sum(1 for l in lines[j : j + window] if dollar_or_numeric.search(l))
if count > best_count:
best_count = count
best_start = j
print(f"DENSEST 20-LINE WINDOW (starting at line {best_start + 1}, {best_count}/{window} numeric):")
for l in lines[best_start : best_start + window]:
print(f" | {l[:120]}")
print()
def audit_endings(docs: list[dict]) -> None:
separator("8. END-OF-DOCUMENT QUALITY (last 300 chars of 15 random docs)")
sample = random.sample(docs, 15)
for i, doc in enumerate(sample, 1):
text = doc["text"]
print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} ---")
print(text[-300:])
print()
def main() -> None:
print("Loading all documents from corpus...")
docs = load_all_docs()
print(f"Loaded {len(docs)} documents from {len(SHARDS)} shards.\n")
audit_smallest(docs)
audit_largest(docs)
audit_mid_samples(docs)
audit_xbrl_contamination(docs)
audit_short_lines(docs)
audit_transitions(docs)
audit_financial_tables(docs)
audit_endings(docs)
separator("AUDIT COMPLETE")
print(f"Total documents audited: {len(docs)}")
if __name__ == "__main__":
main()

## DAPT training config (modified)

```diff
@@ -7,7 +7,7 @@ model:
 data:
   corpus_path: ../data/dapt-corpus
   text_field: text
-  max_seq_length: 2048
+  max_seq_length: 8192
   validation_split: 0.02

 training:
@@ -15,8 +15,8 @@ training:
   learning_rate: 5.0e-5
   mlm_probability: 0.30
   num_train_epochs: 1
-  per_device_train_batch_size: 4
-  gradient_accumulation_steps: 8   # effective batch = 32
+  per_device_train_batch_size: 1
+  gradient_accumulation_steps: 32  # effective batch = 32
   warmup_ratio: 0.05
   weight_decay: 0.01
   bf16: true
```

## Training script (modified)

```diff
@@ -47,6 +47,14 @@ def train(config: DAPTConfig) -> None:
     dataset = load_corpus(config.data.corpus_path, config.data.text_field)
     print(f"  Raw documents: {len(dataset):,}")

+    # Filter tiny documents (cover pages, empty filings)
+    min_chars = 10_000
+    before = len(dataset)
+    dataset = dataset.filter(lambda x: len(x[config.data.text_field]) >= min_chars)
+    filtered = before - len(dataset)
+    if filtered > 0:
+        print(f"  Filtered {filtered} docs < {min_chars:,} chars → {len(dataset):,} remaining")
+
     print(f"  Tokenizing and chunking to {config.data.max_seq_length} tokens...")
     chunked = tokenize_and_chunk(
         dataset,
```

## Generator impact analysis script (new file, 334 lines)
#!/usr/bin/env python3
"""
Quantify how EFiling/XDX generator quality issues affect the annotated paragraph set.
READ-ONLY analysis: does not modify any files.
"""
import json
import re
import sys
from collections import Counter, defaultdict
from pathlib import Path
# Reuse detect_generator from the existing script
sys.path.insert(0, str(Path(__file__).parent))
from detect_generators import detect_generator
# Paths
HTML_DIR = Path("/home/joey/Documents/sec-cyBERT/data/raw/html")
PARAGRAPHS_PATH = Path("/home/joey/Documents/sec-cyBERT/data/paragraphs/paragraphs-clean.jsonl")
ANNOTATIONS_PATH = Path("/home/joey/Documents/sec-cyBERT/data/annotations/stage1.jsonl")
SEP = "=" * 100
def load_paragraphs():
"""Load paragraphs, return dict: id -> paragraph dict."""
paragraphs = {}
with open(PARAGRAPHS_PATH) as f:
for line in f:
p = json.loads(line)
paragraphs[p["id"]] = p
return paragraphs
def load_annotations():
"""Load annotations, return dict: paragraphId -> annotation dict."""
annotations = {}
with open(ANNOTATIONS_PATH) as f:
for line in f:
a = json.loads(line)
pid = a["paragraphId"]
# Keep the first annotation per paragraph (or overwrite — doesn't matter for counts)
annotations[pid] = a
return annotations
def detect_all_generators():
"""Detect generators for all HTML files. Return dict: accession -> generator."""
accession_to_gen = {}
files = sorted(HTML_DIR.glob("*.html"))
total = len(files)
for i, fp in enumerate(files):
accession = fp.stem
gen, _evidence = detect_generator(str(fp))
accession_to_gen[accession] = gen
if (i + 1) % 3000 == 0:
print(f" Scanned {i + 1}/{total} HTML files...", file=sys.stderr)
print(f" Scanned {total}/{total} HTML files.", file=sys.stderr)
return accession_to_gen
def starts_lowercase(text: str) -> bool:
"""True if text starts with a lowercase letter (orphan word candidate)."""
if not text:
return False
return text[0].islower()
def is_list_item(text: str) -> bool:
"""True if text looks like a list item (starts with bullet, dash, number+period, etc.)."""
stripped = text.strip()
if not stripped:
return False
# Common list patterns: "- ", "• ", "* ", "1. ", "a) ", "(a) ", "(i) "
if re.match(r'^[-•*▪◦]\s', stripped):
return True
if re.match(r'^\d+[.)]\s', stripped):
return True
if re.match(r'^\([a-z0-9ivx]+\)\s', stripped, re.I):
return True
if re.match(r'^[a-z][.)]\s', stripped):
return True
return False
def looks_like_inlined_header(text: str) -> bool:
"""
True if text starts with a section heading run into body text, e.g.:
"Risk Management and Strategy We recognize the importance..."
"Cybersecurity Governance Our Board of Directors oversees..."
Key distinction from normal sentences: the heading portion is a noun phrase
(not a full sentence subject like "Our Board" or "The Company"), and is
immediately followed by a new sentence that starts a different thought.
We look for known SEC cybersecurity section heading patterns followed by
body text starting with a capital letter (new sentence) with no punctuation
separating them (no period, colon, or newline; just a space).
"""
# Known heading patterns for SEC Item 1C disclosures
heading_patterns = [
r'(?:Cybersecurity\s+)?Risk\s+Management(?:\s+and\s+Strategy)?',
r'(?:Cybersecurity\s+)?Governance(?:\s+and\s+Risk\s+Management)?',
r'Cybersecurity\s+Governance',
r'Cybersecurity\s+Risk\s+Management\s+and\s+Strategy',
r'Board\s+Oversight(?:\s+of\s+(?:Risks?\s+from\s+)?Cybersecurity(?:\s+(?:Threats?|Risks?))?)?',
r'Management(?:\'s)?\s+Role\s+in\s+(?:Managing\s+)?Cybersecurity',
r'Governance\s+(?:Related\s+to|Oversight\s+of)\s+Cybersecurity(?:\s+Risks?)?',
r'Impact\s+of\s+Cybersecurity\s+(?:Risks?|Threats?)',
r'Cybersecurity\s+(?:Strategy|Overview|Program)',
r'(?:Management\s+and|Management|Governance)\s+(?:Strategy|Overview)',
r'Risk\s+Factors?',
r'Oversight\s+of\s+Cybersecurity\s+Risk\s+Management',
]
for pat in heading_patterns:
# Heading immediately followed by body text (capital letter starting new sentence)
m = re.match(rf'^({pat})\s+([A-Z])', text)
if m:
return True
# Also catch heading followed by lowercase (rarer but possible)
m = re.match(rf'^({pat})\s+([a-z])', text)
if m:
return True
return False
def main():
print("Loading data...")
paragraphs = load_paragraphs()
annotations = load_annotations()
print(f" Paragraphs: {len(paragraphs):,}")
print(f" Annotations: {len(annotations):,}")
# Unique annotated paragraph IDs
annotated_ids = set(annotations.keys()) & set(paragraphs.keys())
print(f" Annotated paragraphs with matching paragraph data: {len(annotated_ids):,}")
print("\nDetecting generators for all HTML files...")
accession_to_gen = detect_all_generators()
print(f" HTML files scanned: {len(accession_to_gen):,}")
# Map each paragraph to its generator
para_to_gen = {}
missing_accessions = set()
for pid, p in paragraphs.items():
acc = p["filing"]["accessionNumber"]
gen = accession_to_gen.get(acc)
if gen is None:
missing_accessions.add(acc)
gen = "NO_HTML_FILE"
para_to_gen[pid] = gen
if missing_accessions:
print(f"\n WARNING: {len(missing_accessions)} accession numbers in paragraphs have no HTML file")
# =====================================================================
# SECTION 1: Annotated paragraphs by generator
# =====================================================================
print(f"\n{SEP}")
print("SECTION 1: Annotated paragraphs by generator")
print(SEP)
ann_gen_counts = Counter()
for pid in annotated_ids:
ann_gen_counts[para_to_gen[pid]] += 1
total_ann = len(annotated_ids)
print(f"\n{'Generator':<50} {'Count':>7} {'%':>7}")
print("-" * 70)
for gen, count in ann_gen_counts.most_common():
pct = count / total_ann * 100
print(f"{gen:<50} {count:>7} {pct:>6.1f}%")
print("-" * 70)
print(f"{'TOTAL':<50} {total_ann:>7} {100.0:>6.1f}%")
# =====================================================================
# SECTION 2: Lowercase-start (orphan word) analysis for annotated set
# =====================================================================
print(f"\n{SEP}")
print("SECTION 2: Lowercase-start paragraphs in annotated set")
print(SEP)
# All annotated lowercase-start
ann_lc = {pid for pid in annotated_ids if starts_lowercase(paragraphs[pid]["text"])}
ann_lc_nonlist = {pid for pid in ann_lc if not is_list_item(paragraphs[pid]["text"])}
print(f"\nAnnotated paragraphs starting with lowercase: {len(ann_lc):,} / {total_ann:,} ({len(ann_lc)/total_ann*100:.2f}%)")
print(f" Of those, excluding list items: {len(ann_lc_nonlist):,} ({len(ann_lc_nonlist)/total_ann*100:.2f}%)")
# Breakdown by generator for lowercase-start non-list
lc_by_gen = Counter()
for pid in ann_lc_nonlist:
lc_by_gen[para_to_gen[pid]] += 1
print(f"\n{'Generator':<50} {'LC-start':>9} {'Total ann':>10} {'% of gen':>9}")
print("-" * 85)
for gen, _ in ann_gen_counts.most_common():
lc_count = lc_by_gen.get(gen, 0)
gen_total = ann_gen_counts[gen]
pct = lc_count / gen_total * 100 if gen_total else 0
if lc_count > 0:
print(f"{gen:<50} {lc_count:>9} {gen_total:>10} {pct:>8.1f}%")
# Specific callouts
efiling_gens = {"EFiling/EDGAR Agent", "EFiling XDX"}
efiling_ann = {pid for pid in annotated_ids if para_to_gen[pid] in efiling_gens}
efiling_lc = {pid for pid in ann_lc_nonlist if para_to_gen[pid] in efiling_gens}
compsci_ann = {pid for pid in annotated_ids if para_to_gen[pid] == "CompSci Transform"}
compsci_lc = {pid for pid in ann_lc_nonlist if para_to_gen[pid] == "CompSci Transform"}
print(f"\n--- Specific callouts ---")
print(f"EFiling/XDX annotated paragraphs starting lowercase (non-list): {len(efiling_lc):,} / {len(efiling_ann):,} ({len(efiling_lc)/len(efiling_ann)*100:.1f}% of EFiling/XDX)" if efiling_ann else "EFiling/XDX: 0 annotated paragraphs")
print(f"CompSci Transform annotated paragraphs starting lowercase (non-list): {len(compsci_lc):,} / {len(compsci_ann):,} ({len(compsci_lc)/len(compsci_ann)*100:.1f}% of CompSci)" if compsci_ann else "CompSci Transform: 0 annotated paragraphs")
print(f"\nTotal affected annotated paragraphs (LC non-list): {len(ann_lc_nonlist):,} / {total_ann:,} = {len(ann_lc_nonlist)/total_ann*100:.2f}%")
# =====================================================================
# SECTION 3: Orphan-word paragraphs detail
# =====================================================================
print(f"\n{SEP}")
print("SECTION 3: Orphan-word paragraph details (LC-start, non-list, annotated)")
print(SEP)
# Breakdown by generator
print(f"\nBreakdown by generator:")
print(f"{'Generator':<50} {'Count':>7} {'% of orphan':>12}")
print("-" * 75)
for gen, count in lc_by_gen.most_common():
pct = count / len(ann_lc_nonlist) * 100
print(f"{gen:<50} {count:>7} {pct:>11.1f}%")
# 10 example texts with labels
print(f"\n10 example orphan-word annotated paragraphs:")
print("-" * 100)
examples = sorted(ann_lc_nonlist)[:10]
for pid in examples:
text = paragraphs[pid]["text"][:150]
ann = annotations[pid]
label = ann.get("label", {})
cat = label.get("content_category", "?")
spec = label.get("specificity_level", "?")
gen = para_to_gen[pid]
print(f" [{gen}] cat={cat}, spec={spec}")
print(f" \"{text}...\"")
print()
# Category distribution in orphan-word paragraphs vs overall
print(f"\nCategory distribution: orphan-word vs overall annotated set")
print("-" * 80)
orphan_cats = Counter()
for pid in ann_lc_nonlist:
cat = annotations[pid].get("label", {}).get("content_category", "Unknown")
orphan_cats[cat] += 1
overall_cats = Counter()
for pid in annotated_ids:
cat = annotations[pid].get("label", {}).get("content_category", "Unknown")
overall_cats[cat] += 1
all_cats = sorted(set(orphan_cats.keys()) | set(overall_cats.keys()))
print(f"{'Category':<40} {'Orphan':>7} {'Orphan%':>8} {'Overall':>8} {'Overall%':>9} {'Over-rep':>9}")
print("-" * 85)
for cat in all_cats:
o_count = orphan_cats.get(cat, 0)
a_count = overall_cats.get(cat, 0)
o_pct = o_count / len(ann_lc_nonlist) * 100 if ann_lc_nonlist else 0
a_pct = a_count / total_ann * 100
ratio = (o_pct / a_pct) if a_pct > 0 else 0
flag = " <<<" if ratio > 1.5 else ""
print(f"{cat:<40} {o_count:>7} {o_pct:>7.1f}% {a_count:>8} {a_pct:>8.1f}% {ratio:>8.2f}x{flag}")
# =====================================================================
# SECTION 4: Inlined headers analysis
# =====================================================================
print(f"\n{SEP}")
print("SECTION 4: Inlined headers in annotated paragraphs")
print(SEP)
ann_inlined = set()
for pid in annotated_ids:
text = paragraphs[pid]["text"]
if looks_like_inlined_header(text):
ann_inlined.add(pid)
print(f"\nAnnotated paragraphs with inlined headers: {len(ann_inlined):,} / {total_ann:,} ({len(ann_inlined)/total_ann*100:.2f}%)")
inlined_by_gen = Counter()
for pid in ann_inlined:
inlined_by_gen[para_to_gen[pid]] += 1
print(f"\n{'Generator':<50} {'Inlined':>8} {'Total ann':>10} {'% of gen':>9}")
print("-" * 85)
for gen, _ in ann_gen_counts.most_common():
ih_count = inlined_by_gen.get(gen, 0)
gen_total = ann_gen_counts[gen]
pct = ih_count / gen_total * 100 if gen_total else 0
if ih_count > 0:
print(f"{gen:<50} {ih_count:>8} {gen_total:>10} {pct:>8.1f}%")
# Show some examples
print(f"\n10 example inlined-header paragraphs:")
print("-" * 100)
examples_ih = sorted(ann_inlined)[:10]
for pid in examples_ih:
text = paragraphs[pid]["text"][:150]
gen = para_to_gen[pid]
cat = annotations[pid].get("label", {}).get("content_category", "?")
print(f" [{gen}] cat={cat}")
print(f" \"{text}...\"")
print()
# =====================================================================
# SECTION 5: Combined impact summary
# =====================================================================
print(f"\n{SEP}")
print("SECTION 5: Combined impact summary")
print(SEP)
affected = ann_lc_nonlist | ann_inlined
print(f"\nOrphan-word (LC non-list): {len(ann_lc_nonlist):>6} ({len(ann_lc_nonlist)/total_ann*100:.2f}%)")
print(f"Inlined headers: {len(ann_inlined):>6} ({len(ann_inlined)/total_ann*100:.2f}%)")
print(f"Either issue (union): {len(affected):>6} ({len(affected)/total_ann*100:.2f}%)")
print(f"Total annotated set: {total_ann:>6}")
# EFiling/XDX specifically
efiling_affected = {pid for pid in affected if para_to_gen[pid] in efiling_gens}
print(f"\nEFiling/XDX affected (either issue): {len(efiling_affected):,} / {len(efiling_ann):,}")
if __name__ == "__main__":
main()

## scripts/audit_corpus.py (new file, 435 lines)
#!/usr/bin/env python3
"""Audit sec-cyBERT paragraph corpus for text quality issues."""
import json
import re
import random
import os
from collections import Counter, defaultdict
from pathlib import Path
DATA_FILE = Path("data/paragraphs/paragraphs-clean.jsonl")
HTML_DIR = Path("data/raw/html")
# ── Load all paragraphs ──────────────────────────────────────────────────────
print("Loading paragraphs...")
paragraphs = []
with open(DATA_FILE) as f:
for line in f:
paragraphs.append(json.loads(line))
print(f"Loaded {len(paragraphs):,} paragraphs.\n")
def show(text, limit=200):
"""Truncate text for display."""
if len(text) <= limit:
return text
return text[:limit] + "..."
def header(title):
print("\n" + "=" * 80)
print(f" {title}")
print("=" * 80 + "\n")
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 1: Inlined headers
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 1: Inlined Headers")
inlined_header_examples = []
# Detect heading+body merged into one paragraph.
# A heading is a short (2-10 word) title-case or ALL-CAPS phrase at the start,
# immediately followed (no colon/period separator) by a sentence starting with
# a common sentence-opener like We/Our/The/As/In/This/A/An/Each/Management/For/Since/During.
pat_merged_header = re.compile(
r"^([A-Z][A-Za-z\s,&/\-\']+?)(?<![.;:!\?\)])\s+"
r"(We |Our |The |As |In |This |A |An |Each |To |Management |During |Since |For )"
)
STOP_WORDS = {"and", "of", "the", "for", "in", "to", "on", "with", "our",
"its", "an", "a", "or", "&"}
for p in paragraphs:
text = p["text"]
if len(text) < 50:
continue
m = pat_merged_header.match(text)
if not m:
continue
heading_candidate = m.group(1).strip()
words = heading_candidate.split()
if not (2 <= len(words) <= 10):
continue
# Must look like a heading: title case or all caps
is_title = all(
w[0].isupper() or w.lower() in STOP_WORDS
for w in words if w
)
is_allcaps = heading_candidate == heading_candidate.upper() and len(heading_candidate) > 5
if is_title or is_allcaps:
kind = "ALLCAPS" if is_allcaps else "TITLECASE"
inlined_header_examples.append((kind, p, heading_candidate))
print(f"Found {len(inlined_header_examples):,} paragraphs with potential inlined headers.")
print(f" - ALLCAPS pattern: {sum(1 for t,_,_ in inlined_header_examples if t=='ALLCAPS'):,}")
print(f" - TITLECASE pattern: {sum(1 for t,_,_ in inlined_header_examples if t=='TITLECASE'):,}")
print()
# Show 20 examples, mix of both types
random.seed(42)
sample = random.sample(inlined_header_examples, min(20, len(inlined_header_examples)))
for i, (kind, p, hdr) in enumerate(sample, 1):
print(f" [{i}] ({kind}) Header: \"{hdr}\" [{p['filing']['companyName'][:30]}]")
print(f" {show(p['text'])}")
print()
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 2: Sentence boundary violations
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 2: Sentence Boundary Violations")
boundary_examples = []
# word.Next — period followed immediately by uppercase letter (not abbreviations)
pat_dotcap = re.compile(r"[a-z]\.([A-Z][a-z])")
# word,Next — comma followed immediately by uppercase letter
pat_commacap = re.compile(r"[a-z],([A-Z][a-z])")
# Two words jammed: lowercase then uppercase with no space/punct
pat_jammed = re.compile(r"[a-z]{2}[A-Z][a-z]{2}")
# Common false positives for dot-cap: abbreviations, names
false_pos_dot = re.compile(
r"(?:Mr|Mrs|Ms|Dr|Jr|Sr|Inc|Corp|Ltd|Co|No|vs|St|Dept|Gen|Gov|Sec|Vol|Rev|etc|U\.S|U\.K)\."
)
for p in paragraphs:
text = p["text"]
issues = []
for m in pat_dotcap.finditer(text):
start = max(0, m.start() - 10)
context = text[start : m.end() + 10]
# skip if it's a known abbreviation
if not false_pos_dot.search(text[max(0, m.start() - 5) : m.end()]):
issues.append(("dot-cap", context))
for m in pat_commacap.finditer(text):
start = max(0, m.start() - 10)
context = text[start : m.end() + 10]
issues.append(("comma-cap", context))
if issues:
boundary_examples.append((p, issues))
print(f"Found {len(boundary_examples):,} paragraphs with sentence boundary violations.")
print()
random.seed(43)
sample = random.sample(boundary_examples, min(20, len(boundary_examples)))
for i, (p, issues) in enumerate(sample, 1):
print(f" [{i}] [{p['filing']['companyName'][:30]}]")
for kind, ctx in issues[:3]:
print(f" ({kind}) ...{ctx}...")
print(f" Full start: {show(p['text'], 150)}")
print()
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 3: Garbled / nonsensical text
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 3: Garbled / Nonsensical Text")
garbled_examples = []
# Spaced-out characters: single chars separated by spaces
pat_spaced = re.compile(r"(?:\b[a-zA-Z]\s){4,}")
for p in paragraphs:
text = p["text"]
reason = None
# Check spaced-out characters
if pat_spaced.search(text):
reason = "spaced-chars"
# Check long non-ASCII runs
non_ascii = sum(1 for c in text if ord(c) > 127)
if non_ascii > len(text) * 0.15 and len(text) > 20:
reason = f"non-ASCII ({non_ascii}/{len(text)} chars)"
# Check mostly numbers/symbols (>50% non-alpha)
alpha = sum(1 for c in text if c.isalpha())
if len(text) > 20 and alpha < len(text) * 0.4:
reason = f"low-alpha ({alpha}/{len(text)} = {alpha/len(text):.0%})"
if reason:
garbled_examples.append((reason, p))
print(f"Found {len(garbled_examples):,} potentially garbled paragraphs.")
reason_counts = Counter(r.split("(")[0].strip() for r, _ in garbled_examples)
for r, c in reason_counts.most_common():
print(f" - {r}: {c}")
print()
random.seed(44)
sample = random.sample(garbled_examples, min(10, len(garbled_examples)))
for i, (reason, p) in enumerate(sample, 1):
print(f" [{i}] ({reason}) [{p['filing']['companyName'][:30]}] wc={p['wordCount']}")
print(f" {show(p['text'], 250)}")
print()
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 4: HTML / markup artifacts
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 4: HTML / Markup Artifacts")
html_examples = []
pat_html_tag = re.compile(r"<[a-zA-Z/][^>]*>")
pat_html_entity = re.compile(r"&(?:amp|lt|gt|nbsp|quot|#\d+|#x[0-9a-fA-F]+);")
pat_xbrl = re.compile(r"\b(?:ix|us-gaap|dei|xbrli):")
pat_css = re.compile(r"(?:font-family|font-size|color:|margin:|padding:|text-align|line-height)", re.IGNORECASE)
for p in paragraphs:
text = p["text"]
reasons = []
if pat_html_tag.search(text):
reasons.append("html-tag")
if pat_html_entity.search(text):
reasons.append("html-entity")
if pat_xbrl.search(text):
reasons.append("xbrl")
if pat_css.search(text):
reasons.append("css")
if reasons:
html_examples.append((reasons, p))
print(f"Found {len(html_examples):,} paragraphs with HTML/markup artifacts.")
reason_counts = Counter()
for reasons, _ in html_examples:
for r in reasons:
reason_counts[r] += 1
for r, c in reason_counts.most_common():
print(f" - {r}: {c}")
print()
random.seed(45)
sample = random.sample(html_examples, min(10, len(html_examples)))
for i, (reasons, p) in enumerate(sample, 1):
print(f" [{i}] ({', '.join(reasons)}) [{p['filing']['companyName'][:30]}]")
print(f" {show(p['text'], 250)}")
print()
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 5: Truncated paragraphs
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 5: Truncated Paragraphs")
truncated = []
# Common abbreviations that end sentences without terminal punct being an issue
abbrevs = {"inc", "corp", "ltd", "co", "mr", "mrs", "ms", "dr", "jr", "sr",
"etc", "al", "eg", "ie", "vs", "no", "approx", "dept", "gov"}
for p in paragraphs:
text = p["text"].rstrip()
if not text:
continue
# Check if ends with terminal punctuation
last_char = text[-1]
if last_char in ".!?:;)\"'""'":
continue
# Check if it's a very short text (likely a heading)
if p["wordCount"] <= 5:
continue
# Check if last word is a common abbreviation
last_word = text.split()[-1].lower().rstrip(".,;:!?")
if last_word in abbrevs:
continue
truncated.append(p)
print(f"Found {len(truncated):,} potentially truncated paragraphs (no terminal punctuation, >5 words).")
print()
random.seed(46)
sample = random.sample(truncated, min(10, len(truncated)))
for i, p in enumerate(sample, 1):
text = p["text"]
print(f" [{i}] [{p['filing']['companyName'][:30]}] wc={p['wordCount']}")
# Show the END of the text
if len(text) > 200:
print(f" ...{text[-200:]}")
else:
print(f" {text}")
print()
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 6: Duplicate text across filings
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 6: Cross-Filing Duplicate Text")
# Group by textHash
hash_to_paras = defaultdict(list)
for p in paragraphs:
hash_to_paras[p["textHash"]].append(p)
# Find hashes that appear in multiple different filings
cross_filing_dupes = {}
for h, ps in hash_to_paras.items():
accessions = set(p["filing"]["accessionNumber"] for p in ps)
if len(accessions) > 1:
cross_filing_dupes[h] = ps
total_dupe_paragraphs = sum(len(ps) for ps in cross_filing_dupes.values())
print(f"Unique textHashes appearing in multiple filings: {len(cross_filing_dupes):,}")
print(f"Total paragraphs involved: {total_dupe_paragraphs:,}")
print()
# Sort by number of filings (most duplicated first)
sorted_dupes = sorted(cross_filing_dupes.items(), key=lambda x: len(set(p["filing"]["accessionNumber"] for p in x[1])), reverse=True)
print("Top 15 most duplicated paragraphs:")
for i, (h, ps) in enumerate(sorted_dupes[:15], 1):
accessions = set(p["filing"]["accessionNumber"] for p in ps)
companies = set(p["filing"]["companyName"] for p in ps)
print(f"\n [{i}] Hash={h}, in {len(accessions)} filings, {len(companies)} companies")
print(f" Companies: {', '.join(list(companies)[:5])}{'...' if len(companies) > 5 else ''}")
print(f" Text: {show(ps[0]['text'], 200)}")
# Check for same-company cross-year dupes vs different-company dupes
same_company_dupes = 0
diff_company_dupes = 0
for h, ps in cross_filing_dupes.items():
companies = set(p["filing"]["companyName"] for p in ps)
if len(companies) == 1:
same_company_dupes += 1
else:
diff_company_dupes += 1
print(f"\n\nBreakdown:")
print(f" Same company, different filings (likely year-over-year boilerplate): {same_company_dupes:,}")
print(f" Different companies (likely industry boilerplate or extraction error): {diff_company_dupes:,}")
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 7: Ground truth spot-check
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 7: Ground Truth Spot-Check (10 random paragraphs vs. source HTML)")
def normalize_html_to_plain(html_text):
"""Convert raw HTML to normalized plain text for comparison."""
plain = re.sub(r"<[^>]+>", " ", html_text)
# Decode common HTML entities
plain = re.sub(r"&nbsp;?", " ", plain)
plain = re.sub(r"&amp;", "&", plain)
plain = re.sub(r"&lt;", "<", plain)
plain = re.sub(r"&gt;", ">", plain)
plain = re.sub(r"&rsquo;|&#8217;|&#x2019;", "\u2019", plain)
plain = re.sub(r"&lsquo;|&#8216;|&#x2018;", "\u2018", plain)
plain = re.sub(r"&rdquo;|&#8221;|&#x201D;", "\u201D", plain)
plain = re.sub(r"&ldquo;|&#8220;|&#x201C;", "\u201C", plain)
plain = re.sub(r"&mdash;|&#8212;", "\u2014", plain)
plain = re.sub(r"&ndash;|&#8211;", "\u2013", plain)
plain = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), plain)
plain = re.sub(r"&#x([0-9a-fA-F]+);", lambda m: chr(int(m.group(1), 16)), plain)
plain = re.sub(r"&\w+;", " ", plain)
plain = re.sub(r"\s+", " ", plain)
return plain
random.seed(99)
spot_check_sample = random.sample(paragraphs, 10)
match_count = 0
partial_count = 0
not_found_count = 0
for i, p in enumerate(spot_check_sample, 1):
acc = p["filing"]["accessionNumber"]
html_path = HTML_DIR / f"{acc}.html"
print(f" [{i}] {p['filing']['companyName'][:40]} | {acc}")
print(f" Paragraph index: {p['paragraphIndex']}, word count: {p['wordCount']}")
corpus_text = p["text"]
corpus_norm = re.sub(r"\s+", " ", corpus_text).strip()
if not html_path.exists():
print(f" *** HTML file not found: {html_path}")
print(f" Corpus text: {show(corpus_text, 150)}")
not_found_count += 1
print()
continue
with open(html_path, "r", errors="replace") as f:
html_content = f.read()
plain_html = normalize_html_to_plain(html_content)
# Check if the entire corpus text appears verbatim in the HTML plain text
if corpus_norm in plain_html:
print(f" VERBATIM MATCH: Corpus text found exactly in HTML source.")
match_count += 1
else:
# Try to find a distinctive substring to locate the paragraph
# Use multiple probes from different positions
found = False
for start_frac in [0.3, 0.5, 0.1, 0.7]:
start_pos = int(len(corpus_norm) * start_frac)
probe = corpus_norm[start_pos:start_pos + 40]
if not probe:
continue
idx = plain_html.find(probe)
if idx >= 0:
found = True
# Show surrounding context from HTML
ctx_start = max(0, idx - 80)
ctx_end = min(len(plain_html), idx + len(corpus_norm) + 80)
html_ctx = plain_html[ctx_start:ctx_end].strip()
print(f" PARTIAL MATCH: Text found in HTML but paragraph boundaries differ.")
print(f" Corpus first 120: {corpus_norm[:120]}")
print(f" HTML context 120: {html_ctx[:120]}")
partial_count += 1
break
if not found:
print(f" NOT FOUND in HTML plain text!")
print(f" Corpus text: {show(corpus_text, 150)}")
not_found_count += 1
print()
print(f"Spot-check results: {match_count} verbatim, {partial_count} partial, {not_found_count} not found")
# ══════════════════════════════════════════════════════════════════════════════
# SUMMARY
# ══════════════════════════════════════════════════════════════════════════════
header("SUMMARY")
print(f"Total paragraphs: {len(paragraphs):,}")
print(f" 1. Inlined headers: {len(inlined_header_examples):,}")
print(f" 2. Sentence boundary violations: {len(boundary_examples):,}")
print(f" 3. Garbled / nonsensical text: {len(garbled_examples):,}")
print(f" 4. HTML / markup artifacts: {len(html_examples):,}")
print(f" 5. Truncated paragraphs: {len(truncated):,}")
print(f" 6. Cross-filing duplicates: {len(cross_filing_dupes):,} unique texts in {total_dupe_paragraphs:,} paragraphs")
print(f" 7. Ground truth spot-check: {match_count} verbatim / {partial_count} partial / {not_found_count} not found")
print()
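An aside on the entity-decoding helper in Check 7: the chain of `re.sub` calls can be collapsed with the stdlib's `html.unescape`, which decodes every named and numeric entity in one pass. A minimal sketch, not part of this commit, and equivalent up to the handling of unknown `&word;` sequences:

```python
import html
import re

def normalize_html_to_plain(html_text: str) -> str:
    """Strip tags, decode entities, collapse whitespace."""
    plain = re.sub(r"<[^>]+>", " ", html_text)   # drop tags first, as in Check 7
    plain = html.unescape(plain)                 # all named + numeric entities at once
    # &nbsp; decodes to U+00A0, which \s matches, so the collapse still normalizes it.
    return re.sub(r"\s+", " ", plain).strip()
```

One behavioral difference worth noting: the hand-rolled version blanks unrecognized `&word;` sequences with a space, while `html.unescape` leaves them verbatim.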

scripts/audit_paragraphs.py

@ -0,0 +1,405 @@
"""
Audit SEC-cyBERT paragraph corpus for boundary errors.
Run from project root: python3 scripts/audit_paragraphs.py
"""
import json
import random
import re
import sys
from collections import Counter, defaultdict
from pathlib import Path
DATA_PATH = Path("data/paragraphs/paragraphs-clean.jsonl")
def load_paragraphs():
paragraphs = []
with open(DATA_PATH) as f:
for line in f:
paragraphs.append(json.loads(line))
return paragraphs
def section_header(title):
bar = "=" * 80
print(f"\n{bar}")
print(f" {title}")
print(bar)
def truncate(text, n):
if len(text) <= n:
return text
return text[:n] + "..."
# ---------------------------------------------------------------------------
# Load
# ---------------------------------------------------------------------------
print("Loading paragraphs...")
paragraphs = load_paragraphs()
print(f"Loaded {len(paragraphs):,} paragraphs")
# Group by accessionNumber
by_filing = defaultdict(list)
for p in paragraphs:
acc = p["filing"]["accessionNumber"]
by_filing[acc].append(p)
print(f"Unique filings: {len(by_filing):,}")
# ---------------------------------------------------------------------------
# 1. Paragraphs-per-filing distribution
# ---------------------------------------------------------------------------
section_header("1. PARAGRAPHS-PER-FILING DISTRIBUTION")
counts = sorted([len(ps) for ps in by_filing.values()])
n = len(counts)
import math
mean = sum(counts) / n
variance = sum((c - mean) ** 2 for c in counts) / n
stdev = math.sqrt(variance)
def percentile(sorted_list, pct):
idx = pct / 100 * (len(sorted_list) - 1)
lo = int(math.floor(idx))
hi = int(math.ceil(idx))
if lo == hi:
return sorted_list[lo]
frac = idx - lo
return sorted_list[lo] * (1 - frac) + sorted_list[hi] * frac
print(f" Min: {counts[0]}")
print(f" P5: {percentile(counts, 5):.1f}")
print(f" P25: {percentile(counts, 25):.1f}")
print(f" Median: {percentile(counts, 50):.1f}")
print(f" P75: {percentile(counts, 75):.1f}")
print(f" P95: {percentile(counts, 95):.1f}")
print(f" Max: {counts[-1]}")
print(f" Stdev: {stdev:.2f}")
print(f" Mean: {mean:.2f}")
# Histogram buckets
buckets = [1, 2, 3, 5, 10, 15, 20, 30, 50, 100, 200]
print("\n Histogram:")
prev = 0
for b in buckets:
c = sum(1 for x in counts if prev < x <= b)
if c > 0:
print(f" ({prev+1}-{b}]: {c:>5} filings")
prev = b
c = sum(1 for x in counts if x > buckets[-1])
if c > 0:
print(f" (>{buckets[-1]}): {c:>5} filings")
# Fewest paragraphs
print("\n --- 10 filings with FEWEST paragraphs ---")
sorted_filings = sorted(by_filing.items(), key=lambda x: len(x[1]))
for acc, ps in sorted_filings[:10]:
company = ps[0]["filing"]["companyName"]
print(f"\n [{acc}] {company} | {len(ps)} paragraph(s):")
for p in sorted(ps, key=lambda x: x["paragraphIndex"]):
print(f" p{p['paragraphIndex']} ({p['wordCount']}w): {truncate(p['text'], 150)}")
# Most paragraphs
print("\n --- 10 filings with MOST paragraphs ---")
for acc, ps in sorted_filings[-10:]:
company = ps[0]["filing"]["companyName"]
print(f"\n [{acc}] {company} | {len(ps)} paragraph(s):")
for p in sorted(ps, key=lambda x: x["paragraphIndex"])[:5]:
print(f" p{p['paragraphIndex']} ({p['wordCount']}w): {truncate(p['text'], 150)}")
if len(ps) > 5:
print(f" ... ({len(ps) - 5} more)")
# ---------------------------------------------------------------------------
# 2. Suspiciously long paragraphs
# ---------------------------------------------------------------------------
section_header("2. SUSPICIOUSLY LONG PARAGRAPHS (top 20 by word count)")
sorted_by_wc = sorted(paragraphs, key=lambda p: p["wordCount"], reverse=True)
for i, p in enumerate(sorted_by_wc[:20]):
acc = p["filing"]["accessionNumber"]
company = p["filing"]["companyName"]
text = p["text"]
first200 = text[:200]
last200 = text[-200:] if len(text) > 400 else ""
print(f"\n #{i+1}: {p['wordCount']} words | p{p['paragraphIndex']} | {company}")
print(f" Acc: {acc}")
print(f" FIRST 200: {first200}")
if last200:
print(f" LAST 200: {last200}")
# Check for signs of merged paragraphs
issues = []
if p["wordCount"] > 300:
issues.append("VERY LONG (>300w)")
# Look for heading-like patterns mid-text (capitalized lines, bold markers)
lines = text.split("\n")
if len(lines) > 1:
issues.append(f"CONTAINS {len(lines)} LINES (possible merge)")
# Look for sentence-ending followed by topic shift
sentences = re.split(r'(?<=[.!?])\s+', text)
if len(sentences) > 8:
issues.append(f"{len(sentences)} sentences")
if issues:
print(f" FLAGS: {', '.join(issues)}")
# ---------------------------------------------------------------------------
# 3. Suspiciously short paragraphs
# ---------------------------------------------------------------------------
section_header("3. SUSPICIOUSLY SHORT PARAGRAPHS (<25 words)")
short = [p for p in paragraphs if p["wordCount"] < 25]
print(f"\n Total paragraphs <25 words: {len(short)} ({100*len(short)/len(paragraphs):.1f}%)")
# Categorize
headings = []
standalone = []
fragments = []
list_items = []
heading_patterns = re.compile(
r"^(risk management|cybersecurity|governance|strategy|board|"
r"oversight|incident|material|information security|"
r"risk factors|item 1c|risk management and strategy|"
r"risk management, strategy|governance, risk management)"
, re.IGNORECASE
)
for p in short:
text = p["text"].strip()
lower = text.lower()
# Heading detection: short, no period at end, title-case-ish
is_heading = False
if len(text.split()) <= 8 and not text.endswith("."):
is_heading = True
if heading_patterns.match(lower):
is_heading = True
if text.isupper() and len(text.split()) <= 10:
is_heading = True
# List item: starts with bullet, dash, number, or letter
is_list = bool(re.match(r"^(\d+[.)]\s|[-•●◦▪]\s|[a-z][.)]\s|\([a-z]\)\s|\(\d+\)\s)", text))
# Fragment: doesn't end with period/question/exclamation and not a heading
is_fragment = not is_heading and not is_list and not re.search(r'[.!?"]$', text.rstrip())
if is_heading:
headings.append(p)
elif is_list:
list_items.append(p)
elif is_fragment:
fragments.append(p)
else:
standalone.append(p)
print(f" Headings: {len(headings)}")
print(f" Standalone sentences: {len(standalone)}")
print(f" Fragments: {len(fragments)}")
print(f" List items: {len(list_items)}")
def show_examples(label, items, count):
sample = items[:count] if len(items) <= count else random.sample(items, count)
print(f"\n --- {label} (showing {len(sample)} of {len(items)}) ---")
for p in sample:
acc = p["filing"]["accessionNumber"]
print(f" [{p['wordCount']}w] p{p['paragraphIndex']} | {truncate(p['text'], 120)}")
print(f" {p['filing']['companyName']} | {acc}")
random.seed(42)
show_examples("Headings", headings, 10)
show_examples("Standalone sentences", standalone, 8)
show_examples("Fragments", fragments, 8)
show_examples("List items", list_items, 4)
# ---------------------------------------------------------------------------
# 4. Sequential paragraph coherence
# ---------------------------------------------------------------------------
section_header("4. SEQUENTIAL PARAGRAPH COHERENCE (20 random filings)")
random.seed(123)
sample_accs = random.sample(list(by_filing.keys()), min(20, len(by_filing)))
mid_sentence_breaks = []
topic_shifts = []
for acc in sample_accs:
ps = sorted(by_filing[acc], key=lambda x: x["paragraphIndex"])
for i in range(len(ps) - 1):
curr = ps[i]
nxt = ps[i + 1]
curr_text = curr["text"].strip()
nxt_text = nxt["text"].strip()
# Check: does current paragraph end mid-sentence?
# Signs: ends with comma, semicolon, conjunction, lowercase word, no terminal punctuation
ends_mid = False
if curr_text and not re.search(r'[.!?:"\)]$', curr_text):
ends_mid = True
if curr_text and re.search(r'(,|;|\band\b|\bor\b|\bbut\b|\bthat\b|\bwhich\b)\s*$', curr_text):
ends_mid = True
# Check: does next paragraph start with lowercase (continuation)?
starts_lower = bool(nxt_text) and nxt_text[0].islower()
if ends_mid or starts_lower:
mid_sentence_breaks.append({
"acc": acc,
"company": curr["filing"]["companyName"],
"curr_idx": curr["paragraphIndex"],
"nxt_idx": nxt["paragraphIndex"],
"curr_end": curr_text[-150:] if len(curr_text) > 150 else curr_text,
"nxt_start": nxt_text[:150] if len(nxt_text) > 150 else nxt_text,
"ends_mid": ends_mid,
"starts_lower": starts_lower,
})
print(f"\n Checked {len(sample_accs)} filings")
print(f" Potential mid-sentence breaks found: {len(mid_sentence_breaks)}")
print("\n --- Examples of mid-sentence / continuation breaks ---")
for ex in mid_sentence_breaks[:5]:
print(f"\n [{ex['acc']}] {ex['company']}")
print(f" p{ex['curr_idx']} ENDS: ...{ex['curr_end']}")
print(f" p{ex['nxt_idx']} STARTS: {ex['nxt_start']}...")
flags = []
if ex["ends_mid"]:
flags.append("no terminal punctuation")
if ex["starts_lower"]:
flags.append("next starts lowercase")
print(f" FLAGS: {', '.join(flags)}")
if len(mid_sentence_breaks) == 0:
print(" (none found)")
# Also check for topic shifts within single paragraphs (long ones in sampled filings)
print("\n --- Checking for intra-paragraph topic shifts ---")
shift_examples = []
for acc in sample_accs:
for p in by_filing[acc]:
if p["wordCount"] < 150:
continue
text = p["text"]
# Look for heading-like substrings mid-text
# e.g., "Risk Management" or "Governance" appearing after a sentence end
matches = list(re.finditer(
r'(?<=[.!?]\s)(Risk Management|Governance|Strategy|Cybersecurity|'
r'Board of Directors|Incident Response|Overview|Third.Party)',
text
))
if matches:
shift_examples.append({
"acc": acc,
"company": p["filing"]["companyName"],
"idx": p["paragraphIndex"],
"wordCount": p["wordCount"],
"match": matches[0].group(),
"context": text[max(0, matches[0].start()-80):matches[0].end()+80],
})
print(f" Paragraphs with possible embedded topic headers: {len(shift_examples)}")
for ex in shift_examples[:5]:
print(f"\n [{ex['acc']}] {ex['company']} p{ex['idx']} ({ex['wordCount']}w)")
print(f" Found '{ex['match']}' mid-paragraph:")
print(f" ...{ex['context']}...")
# ---------------------------------------------------------------------------
# 5. Paragraph index gaps
# ---------------------------------------------------------------------------
section_header("5. PARAGRAPH INDEX GAPS & DUPLICATES")
gap_filings = []
dup_filings = []
for acc, ps in by_filing.items():
indices = sorted(p["paragraphIndex"] for p in ps)
# Check for duplicates
if len(indices) != len(set(indices)):
counter = Counter(indices)
dups = {k: v for k, v in counter.items() if v > 1}
dup_filings.append((acc, ps[0]["filing"]["companyName"], dups))
# Check for gaps (should be 0, 1, 2, ...)
expected = list(range(indices[0], indices[0] + len(indices)))
if indices != expected:
missing = set(expected) - set(indices)
extra = set(indices) - set(expected)
if missing or extra:
gap_filings.append((acc, ps[0]["filing"]["companyName"], sorted(missing), sorted(extra), indices))
print(f"\n Filings with duplicate paragraph indices: {len(dup_filings)}")
for acc, company, dups in dup_filings[:10]:
print(f" [{acc}] {company}: duplicates at indices {dups}")
print(f"\n Filings with index gaps: {len(gap_filings)}")
for acc, company, missing, extra, indices in gap_filings[:10]:
print(f" [{acc}] {company}")
if missing:
print(f" Missing indices: {missing}")
if extra:
print(f" Unexpected indices: {extra}")
print(f" Actual indices: {indices}")
# Check if all start at 0
non_zero_start = [(acc, ps) for acc, ps in by_filing.items()
if min(p["paragraphIndex"] for p in ps) != 0]
print(f"\n Filings not starting at index 0: {len(non_zero_start)}")
for acc, ps in non_zero_start[:5]:
start = min(p["paragraphIndex"] for p in ps)
print(f" [{acc}] {ps[0]['filing']['companyName']}: starts at {start}")
# ---------------------------------------------------------------------------
# 6. Cross-filing duplicate paragraphs
# ---------------------------------------------------------------------------
section_header("6. CROSS-FILING DUPLICATE PARAGRAPHS")
# Group by textHash
by_hash = defaultdict(list)
for p in paragraphs:
by_hash[p["textHash"]].append(p)
# Find hashes appearing in multiple filings
cross_filing_dupes = {}
for h, ps in by_hash.items():
accs = set(p["filing"]["accessionNumber"] for p in ps)
if len(accs) > 1:
cross_filing_dupes[h] = ps
total_dupe_paragraphs = sum(len(ps) for ps in cross_filing_dupes.values())
unique_dupe_texts = len(cross_filing_dupes)
print(f"\n Unique paragraph texts appearing in >1 filing: {unique_dupe_texts}")
print(f" Total paragraphs that are cross-filing duplicates: {total_dupe_paragraphs} ({100*total_dupe_paragraphs/len(paragraphs):.1f}%)")
# Also count same-hash within same filing
within_filing_dupes = 0
for h, ps in by_hash.items():
accs = [p["filing"]["accessionNumber"] for p in ps]
if len(accs) != len(set(accs)):
within_filing_dupes += 1
print(f" Hashes duplicated WITHIN a single filing: {within_filing_dupes}")
# Top 20 most duplicated
sorted_dupes = sorted(cross_filing_dupes.items(), key=lambda x: len(x[1]), reverse=True)
print("\n --- Top 20 most duplicated texts across filings ---")
for i, (h, ps) in enumerate(sorted_dupes[:20]):
n_filings = len(set(p["filing"]["accessionNumber"] for p in ps))
text = ps[0]["text"]
print(f"\n #{i+1}: hash={h} | {n_filings} filings | {ps[0]['wordCount']}w")
print(f" TEXT: {truncate(text, 200)}")
# Boilerplate analysis: texts appearing in 3+ filings
boilerplate_threshold = 3
boilerplate_hashes = {h for h, ps in cross_filing_dupes.items()
if len(set(p["filing"]["accessionNumber"] for p in ps)) >= boilerplate_threshold}
boilerplate_paragraphs = sum(len(by_hash[h]) for h in boilerplate_hashes)
print(f"\n Boilerplate (text in {boilerplate_threshold}+ filings):")
print(f" Unique texts: {len(boilerplate_hashes)}")
print(f" Total paragraphs: {boilerplate_paragraphs} ({100*boilerplate_paragraphs/len(paragraphs):.1f}%)")
print("\n" + "=" * 80)
print(" AUDIT COMPLETE")
print("=" * 80)
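A sanity check on the hand-rolled `percentile` helper above: it interpolates linearly between the two closest ranks, which is the same scheme as `statistics.quantiles` with `method="inclusive"` (and NumPy's default percentile method). A small sketch, with toy counts, confirming the two agree at the quartiles:

```python
import math
import statistics

def percentile(sorted_list, pct):
    # Linear interpolation between the two closest ranks.
    idx = pct / 100 * (len(sorted_list) - 1)
    lo, hi = math.floor(idx), math.ceil(idx)
    if lo == hi:
        return sorted_list[lo]
    frac = idx - lo
    return sorted_list[lo] * (1 - frac) + sorted_list[hi] * frac

counts = [1, 2, 2, 3, 5, 8, 13, 21]  # toy paragraphs-per-filing counts
q1, q2, q3 = statistics.quantiles(counts, n=4, method="inclusive")
assert percentile(counts, 25) == q1
assert percentile(counts, 50) == q2 == statistics.median(counts)
assert percentile(counts, 75) == q3
```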


@ -0,0 +1,539 @@
#!/usr/bin/env python3
"""
Novel data quality audit for paragraphs-clean.jsonl.
READ-ONLY: prints findings to stdout, does not modify any files.
"""
import json
import re
import sys
from collections import Counter, defaultdict
from pathlib import Path
DATA_PATH = Path(__file__).resolve().parent.parent / "data" / "paragraphs" / "paragraphs-clean.jsonl"
# ── Cybersecurity domain keywords (broad) ──────────────────────────────
CYBER_KEYWORDS = {
"cyber", "cybersecurity", "security", "breach", "incident", "threat",
"vulnerability", "malware", "ransomware", "phishing", "firewall",
"encryption", "intrusion", "unauthorized", "attack", "hacker",
"data protection", "information security", "network security",
"access control", "authentication", "risk management", "ciso",
"chief information security", "chief information officer",
"information technology", "it systems", "data privacy", "privacy",
"personally identifiable", "pii", "soc", "nist", "iso 27001",
"penetration test", "disaster recovery", "business continuity",
"third party", "vendor", "supply chain", "cloud", "endpoint",
"monitoring", "detection", "response", "remediation", "patch",
"compliance", "regulatory", "safeguard", "protect", "secure",
"confidential", "integrity", "availability", "resilience",
"governance", "oversight", "board of directors", "audit committee",
"risk factor", "material", "disclosure", "1c", "item 1c",
}
# ── Non-cyber legal boilerplate patterns ────────────────────────────────
BOILERPLATE_PATTERNS = [
re.compile(r"forward[- ]looking\s+statements?", re.I),
re.compile(r"safe\s+harbor", re.I),
re.compile(r"private\s+securities\s+litigation\s+reform\s+act", re.I),
re.compile(r"cautionary\s+statement", re.I),
re.compile(r"except\s+as\s+required\s+by\s+law.*no\s+obligation\s+to\s+update", re.I),
re.compile(r"this\s+(annual\s+)?report\s+(on\s+form\s+10-k\s+)?contains?\s+forward", re.I),
]
# ── SEC item cross-reference pattern ────────────────────────────────────
SEC_ITEM_RE = re.compile(r"\bItem\s+(\d+[A-Z]?)\b", re.I)
# ── Dollar amount pattern ──────────────────────────────────────────────
DOLLAR_RE = re.compile(r"\$[\d,]+(?:\.\d+)?\s*(?:thousand|million|billion|trillion)?", re.I)
# ── Date patterns (unusual formats) ────────────────────────────────────
DATE_PATTERNS = [
# MM/DD/YYYY or MM-DD-YYYY
re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
# Month DD, YYYY
re.compile(r"\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},?\s+\d{4}\b", re.I),
# DD Month YYYY
re.compile(r"\b\d{1,2}\s+(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\b", re.I),
# YYYY-MM-DD (ISO)
re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
]
# ── Bullet point characters ────────────────────────────────────────────
BULLET_RE = re.compile(r"[\u2022\u2023\u25E6\u2043\u2219\u25AA\u25AB\u25CF\u25CB\u25A0\u25A1]")
# ── Helpers ─────────────────────────────────────────────────────────────
def truncate(text: str, max_len: int = 200) -> str:
if len(text) <= max_len:
return text
return text[:max_len] + "..."
def print_section(title: str):
print(f"\n{'=' * 80}")
print(f" {title}")
print(f"{'=' * 80}")
def print_finding(name: str, concern: str, count: int, total: int, examples: list[dict]):
pct = count / total * 100 if total else 0
print(f"\n--- {name} [{concern} CONCERN] ---")
print(f" Count: {count:,} / {total:,} ({pct:.2f}%)")
for i, ex in enumerate(examples[:5]):
filing = ex.get("filing", {})
company = filing.get("companyName", "?")
print(f" Example {i+1} [{company}]:")
print(f" {truncate(ex['text'], 300)}")
if count > 5:
print(f" ... and {count - 5:,} more")
def has_cyber_relevance(text_lower: str) -> bool:
for kw in CYBER_KEYWORDS:
if kw in text_lower:
return True
return False
# ── Load data ──────────────────────────────────────────────────────────
def load_data():
paragraphs = []
with open(DATA_PATH) as f:
for line in f:
paragraphs.append(json.loads(line))
return paragraphs
def main():
print("Loading data...")
paragraphs = load_data()
total = len(paragraphs)
print(f"Loaded {total:,} paragraphs.\n")
# Pre-compute lowercase texts
texts_lower = [p["text"].lower() for p in paragraphs]
# ════════════════════════════════════════════════════════════════════
print_section("1. CHARACTER-LEVEL ANOMALIES")
# ════════════════════════════════════════════════════════════════════
# 1a. High uppercase ratio (>30%)
high_upper = []
for p in paragraphs:
t = p["text"]
alpha = sum(1 for c in t if c.isalpha())
if alpha < 10:
continue
upper = sum(1 for c in t if c.isupper())
ratio = upper / alpha
if ratio > 0.30:
high_upper.append({**p, "_ratio": ratio})
high_upper.sort(key=lambda x: x["_ratio"], reverse=True)
print_finding("High uppercase ratio (>30% of alpha chars)", "MEDIUM",
len(high_upper), total, high_upper)
# 1b. Unusual punctuation density
high_punct = []
for p in paragraphs:
t = p["text"]
if len(t) < 30:
continue
semis = t.count(";")
colons = t.count(":")
dashes = t.count("\u2014") + t.count("\u2013") + t.count("-")  # em dash, en dash, hyphen
punct_count = semis + colons + dashes
density = punct_count / len(t)
if density > 0.05:
high_punct.append({**p, "_density": density, "_semis": semis, "_colons": colons, "_dashes": dashes})
high_punct.sort(key=lambda x: x["_density"], reverse=True)
print_finding("High punctuation density (semicolons/colons/dashes >5% of chars)", "LOW",
len(high_punct), total, high_punct)
# 1c. Non-ASCII characters
non_ascii_paras = []
non_ascii_chars_all = Counter()
for p in paragraphs:
t = p["text"]
non_ascii = [(c, hex(ord(c)), ord(c)) for c in t if ord(c) > 127]
if non_ascii:
chars_found = set((c, h) for c, h, _ in non_ascii)
for c, h, _ in non_ascii:
non_ascii_chars_all[f"{c} ({h})"] += 1
non_ascii_paras.append({**p, "_chars": chars_found})
print_finding("Paragraphs with non-ASCII characters", "MEDIUM",
len(non_ascii_paras), total, non_ascii_paras)
if non_ascii_chars_all:
print("\n Non-ASCII character frequency:")
for char_repr, cnt in non_ascii_chars_all.most_common(20):
print(f" {char_repr}: {cnt:,} occurrences")
# 1d. Unusual whitespace (multiple spaces, tabs)
multi_space_re = re.compile(r" +")
tab_re = re.compile(r"\t")
whitespace_issues = []
for p in paragraphs:
t = p["text"]
multi = len(multi_space_re.findall(t))
tabs = len(tab_re.findall(t))
if multi > 0 or tabs > 0:
whitespace_issues.append({**p, "_multi_spaces": multi, "_tabs": tabs})
print_finding("Unusual whitespace (multiple spaces or tabs)", "MEDIUM",
len(whitespace_issues), total, whitespace_issues)
# ════════════════════════════════════════════════════════════════════
print_section("2. CONTENT ANOMALIES")
# ════════════════════════════════════════════════════════════════════
# 2a. Dollar amounts
dollar_paras = []
for p in paragraphs:
matches = DOLLAR_RE.findall(p["text"])
if matches:
dollar_paras.append({**p, "_amounts": matches})
print_finding("Paragraphs with dollar amounts", "MEDIUM",
len(dollar_paras), total, dollar_paras)
if dollar_paras:
# Show distribution of dollar amounts
all_amounts = []
for dp in dollar_paras:
all_amounts.extend(dp["_amounts"])
print(f"\n Total dollar amount mentions: {len(all_amounts):,}")
amount_counter = Counter(all_amounts)
print(" Most common amounts:")
for amt, cnt in amount_counter.most_common(10):
print(f" {amt}: {cnt:,}")
# 2b. Dates in text
date_paras = []
for p in paragraphs:
t = p["text"]
found_dates = []
for pat in DATE_PATTERNS:
found_dates.extend(pat.findall(t))
if found_dates:
date_paras.append({**p, "_dates": found_dates})
print_finding("Paragraphs containing dates", "LOW",
len(date_paras), total, date_paras)
if date_paras:
all_dates = []
for dp in date_paras:
all_dates.extend(dp["_dates"])
print(f"\n Total date mentions: {len(all_dates):,}")
# 2c. Cross-references to other SEC items
cross_ref_paras = []
for p in paragraphs:
matches = SEC_ITEM_RE.findall(p["text"])
# Filter out Item 1C (that's expected)
other_items = [m for m in matches if m.upper() != "1C"]
if other_items:
cross_ref_paras.append({**p, "_items": other_items})
# Count which items are referenced
item_counts = Counter()
for crp in cross_ref_paras:
for item in crp["_items"]:
item_counts[f"Item {item}"] += 1
print_finding("Cross-references to non-1C SEC items", "HIGH",
len(cross_ref_paras), total, cross_ref_paras)
if item_counts:
print("\n Referenced items:")
for item, cnt in item_counts.most_common():
print(f" {item}: {cnt:,}")
# 2d. Non-cyber legal boilerplate
boilerplate_paras = []
for p in paragraphs:
t = p["text"]
matched = []
for pat in BOILERPLATE_PATTERNS:
if pat.search(t):
matched.append(pat.pattern[:60])
if matched:
boilerplate_paras.append({**p, "_patterns": matched})
print_finding("Non-cybersecurity legal boilerplate", "HIGH",
len(boilerplate_paras), total, boilerplate_paras)
# ════════════════════════════════════════════════════════════════════
print_section("3. STRUCTURAL ANOMALIES")
# ════════════════════════════════════════════════════════════════════
# 3a. Bullet points mid-text
bullet_paras = []
for p in paragraphs:
t = p["text"]
if BULLET_RE.search(t):
bullet_paras.append(p)
elif re.search(r"(?:^|\n)\s*[-*]\s+\w", t):
bullet_paras.append(p)
print_finding("Paragraphs with bullet points mid-text", "MEDIUM",
len(bullet_paras), total, bullet_paras)
# 3b. Embedded newlines
newline_paras = []
for p in paragraphs:
t = p["text"]
nl_count = t.count("\n")
if nl_count > 0:
newline_paras.append({**p, "_newlines": nl_count})
newline_paras.sort(key=lambda x: x["_newlines"], reverse=True)
print_finding("Paragraphs with embedded newlines", "MEDIUM",
len(newline_paras), total, newline_paras)
# 3c. Mid-paragraph headings (ALL CAPS phrase of 3+ words followed by different content)
mid_heading_re = re.compile(r"(?<=\. )([A-Z][A-Z\s]{10,}[A-Z])(?=\.?\s+[A-Z][a-z])")
mid_heading_paras = []
for p in paragraphs:
t = p["text"]
matches = mid_heading_re.findall(t)
if matches:
mid_heading_paras.append({**p, "_headings": matches})
print_finding("Mid-paragraph headings (ALL CAPS phrase mid-sentence)", "MEDIUM",
len(mid_heading_paras), total, mid_heading_paras)
# ════════════════════════════════════════════════════════════════════
print_section("4. OUTLIER DETECTION")
# ════════════════════════════════════════════════════════════════════
# 4a. Extremely high word count (>400)
long_paras = [p for p in paragraphs if p["wordCount"] > 400]
long_paras.sort(key=lambda x: x["wordCount"], reverse=True)
print_finding("Extremely long paragraphs (>400 words)", "HIGH",
len(long_paras), total, long_paras)
if long_paras:
wc_values = [p["wordCount"] for p in long_paras]
print(f"\n Word count range: {min(wc_values)} - {max(wc_values)}")
print(f" Mean: {sum(wc_values)/len(wc_values):.0f}")
# 4b. Low information density
# Common English stopwords
STOPWORDS = {
"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for",
"of", "with", "by", "from", "is", "are", "was", "were", "be", "been",
"being", "have", "has", "had", "do", "does", "did", "will", "would",
"could", "should", "may", "might", "shall", "can", "that", "which",
"who", "whom", "this", "these", "those", "it", "its", "we", "our",
"us", "they", "their", "them", "he", "she", "his", "her", "as",
"if", "not", "no", "nor", "so", "than", "too", "very", "such",
"also", "each", "any", "all", "both", "other", "some", "into",
"through", "during", "before", "after", "about", "between", "under",
"over", "above", "up", "down", "out", "off", "then", "once",
}
low_info_paras = []
for p in paragraphs:
words = re.findall(r"[a-z]+", p["text"].lower())
if len(words) < 20:
continue
stop_ratio = sum(1 for w in words if w in STOPWORDS) / len(words)
if stop_ratio > 0.65:
low_info_paras.append({**p, "_stop_ratio": stop_ratio})
low_info_paras.sort(key=lambda x: x["_stop_ratio"], reverse=True)
print_finding("Low information density (>65% stopwords)", "LOW",
len(low_info_paras), total, low_info_paras)
# 4c. Exact substring matches across filings
print("\n--- Exact substring matches across filings [HIGH CONCERN] ---")
print(" (Checking paragraphs that appear as substrings of others in different filings...)")
# Group by accession number for efficiency
by_accession = defaultdict(list)
for p in paragraphs:
acc = p["filing"]["accessionNumber"]
by_accession[acc].append(p)
# For efficiency, only check paragraphs 50-200 chars (likely fragments/duplicates)
# Sort by length so shorter ones are checked as substrings of longer ones
candidates = [(p["text"], p["filing"]["accessionNumber"], p["filing"]["companyName"], p["id"])
for p in paragraphs if 50 <= len(p["text"]) <= 200]
longer_texts = [(p["text"], p["filing"]["accessionNumber"], p["filing"]["companyName"])
for p in paragraphs if len(p["text"]) > 200]
substring_matches = []
# Use a set for dedup
seen = set()
# Only check a sample for performance
check_limit = min(len(candidates), 3000)
for i in range(check_limit):
cand_text, cand_acc, cand_co, cand_id = candidates[i]
for long_text, long_acc, long_co in longer_texts[:5000]:
if cand_acc == long_acc:
continue # same filing, skip
if cand_text in long_text and cand_id not in seen:
seen.add(cand_id)
substring_matches.append({
"text": cand_text,
"filing": {"companyName": cand_co, "accessionNumber": cand_acc},
"_found_in": long_co,
})
break
print(f" Count (sampled {check_limit:,} short paras against {min(len(longer_texts), 5000):,} long paras): {len(substring_matches):,}")
for i, ex in enumerate(substring_matches[:5]):
print(f" Example {i+1} [{ex['filing']['companyName']}] (also in {ex['_found_in']}):")
print(f" {truncate(ex['text'], 300)}")
if len(substring_matches) > 5:
print(f" ... and {len(substring_matches) - 5:,} more")
# ════════════════════════════════════════════════════════════════════
print_section("5. SEMANTIC COHERENCE")
# ════════════════════════════════════════════════════════════════════
# 5a. Company name mismatch — look for SPECIFIC named companies in text
# that differ from the filing company. Filter out generic refs like "the Company".
company_name_mismatches = []
# Pattern: proper noun(s) + legal suffix at end, NOT preceded by "the "
specific_company_re = re.compile(
r"(?<!\bthe )(?<!\bThe )(?<!\ba )(?<!\bA )"
r"\b([A-Z][A-Za-z&\.']+(?:\s+[A-Z][A-Za-z&\.']+){0,5})"
r",?\s+(Corp(?:oration)?|Inc(?:orporated)?|Company|LLC|Ltd|L\.P\.|Holdings|Partners)\b\.?"
)
# Generic phrases to ignore
GENERIC_COMPANY_REFS = {
"the company", "our company", "a company", "each company",
"any company", "this company", "such company", "parent company",
"holding company", "shell company", "blank check company",
"portfolio company", "operating company", "management company",
"insurance company", "affiliated company",
}
for p in paragraphs:
t = p["text"]
filing_company = p["filing"]["companyName"]
matches = specific_company_re.findall(t)
if not matches:
continue
filing_words = set(w.lower() for w in re.findall(r"[A-Za-z]{3,}", filing_company))
for name_part, suffix in matches:
full = f"{name_part} {suffix}".strip()
if full.lower() in GENERIC_COMPANY_REFS:
continue
mention_words = set(w.lower() for w in re.findall(r"[A-Za-z]{3,}", full))
generic = {"inc", "corp", "corporation", "incorporated", "company", "group",
"holdings", "the", "and", "llc", "ltd", "partners", "new"}
meaningful_filing = filing_words - generic
meaningful_mention = mention_words - generic
if meaningful_mention and not (meaningful_mention & meaningful_filing):
company_name_mismatches.append({
**p,
"_mentioned": full,
"_filing_company": filing_company,
})
break
print_finding("Company name in text doesn't match filing metadata", "HIGH",
len(company_name_mismatches), total, company_name_mismatches)
if company_name_mismatches:
print("\n Sample mismatches (mentioned vs filing):")
for ex in company_name_mismatches[:15]:
print(f" Mentioned: '{ex['_mentioned']}' | Filing: '{ex['_filing_company']}'")
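The meaningful-word overlap test in check 5a can be sketched in isolation. This is a minimal, self-contained version (the company names below are hypothetical; the `GENERIC` set mirrors the one in the script):

```python
import re

# Words too generic to signal identity, mirroring the audit's filter set.
GENERIC = {"inc", "corp", "corporation", "incorporated", "company", "group",
           "holdings", "the", "and", "llc", "ltd", "partners", "new"}

def meaningful_words(name: str) -> set[str]:
    """Lowercased words of 3+ letters, minus generic corporate terms."""
    return {w.lower() for w in re.findall(r"[A-Za-z]{3,}", name)} - GENERIC

def names_mismatch(mentioned: str, filing_company: str) -> bool:
    """True when a mentioned company shares no meaningful words with the
    filing company (a likely cross-contamination signal)."""
    m, f = meaningful_words(mentioned), meaningful_words(filing_company)
    return bool(m) and not (m & f)

print(names_mismatch("Acme Widgets Corp", "Globex Holdings Inc"))   # True
print(names_mismatch("Globex Capital LLC", "Globex Holdings Inc"))  # False
```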
# 5b. No cybersecurity keywords at all
no_cyber = []
for i, p in enumerate(paragraphs):
if not has_cyber_relevance(texts_lower[i]):
no_cyber.append(p)
print_finding("No cybersecurity keywords at all", "HIGH",
len(no_cyber), total, no_cyber)
if no_cyber:
# Show word count distribution of non-cyber paragraphs
wc_dist = Counter()
for p in no_cyber:
bucket = (p["wordCount"] // 50) * 50
wc_dist[f"{bucket}-{bucket+49}"] += 1
print("\n Word count distribution of non-cyber paragraphs:")
for bucket, cnt in sorted(wc_dist.items()):
print(f" {bucket} words: {cnt:,}")
# ════════════════════════════════════════════════════════════════════
print_section("BONUS: ADDITIONAL NOVEL CHECKS")
# ════════════════════════════════════════════════════════════════════
# 6a. Paragraphs that are mostly a URL or contain URLs
url_re = re.compile(r"https?://\S+|www\.\S+")
url_paras = []
for p in paragraphs:
urls = url_re.findall(p["text"])
if urls:
url_ratio = sum(len(u) for u in urls) / len(p["text"])
url_paras.append({**p, "_urls": urls, "_ratio": url_ratio})
url_paras.sort(key=lambda x: x["_ratio"], reverse=True)
print_finding("Paragraphs containing URLs", "MEDIUM",
len(url_paras), total, url_paras)
# 6b. Paragraphs with parenthetical references that look like citations/footnotes
footnote_re = re.compile(r"\(\d+\)|\[\d+\]|(?:footnote|fn\.?)\s*\d+", re.I)
footnote_paras = []
for p in paragraphs:
if footnote_re.search(p["text"]):
footnote_paras.append(p)
print_finding("Paragraphs with footnote/citation references", "LOW",
len(footnote_paras), total, footnote_paras)
# 6c. Paragraphs that look like table data (multiple numeric values separated by whitespace)
table_re = re.compile(r"(?:\d[\d,.]*\s+){3,}")
table_paras = []
for p in paragraphs:
if table_re.search(p["text"]):
table_paras.append(p)
print_finding("Paragraphs that look like table/numeric data", "HIGH",
len(table_paras), total, table_paras)
# 6d. Encoding artifacts (replacement chars, zero-width spaces, BOM, etc.)
encoding_re = re.compile(r"[\ufffd\u200b\u200c\u200d\ufeff\u00a0]")
encoding_paras = []
for p in paragraphs:
matches = encoding_re.findall(p["text"])
if matches:
encoding_paras.append({**p, "_artifacts": Counter(f"U+{ord(c):04X} ({c!r})" for c in matches)})
print_finding("Encoding artifacts (replacement chars, NBSP, zero-width, BOM)", "HIGH",
len(encoding_paras), total, encoding_paras)
if encoding_paras:
all_artifacts = Counter()
for ep in encoding_paras:
all_artifacts.update(ep["_artifacts"])
print("\n Artifact frequency:")
for art, cnt in all_artifacts.most_common():
print(f" {art}: {cnt:,}")
# 6e. Repeated sentences within a paragraph
repeated_sent_paras = []
for p in paragraphs:
t = p["text"]
# Split on sentence boundaries
sentences = re.split(r'(?<=[.!?])\s+', t)
if len(sentences) < 3:
continue
sent_counter = Counter(s.strip().lower() for s in sentences if len(s.strip()) > 20)
dupes = {s: c for s, c in sent_counter.items() if c > 1}
if dupes:
repeated_sent_paras.append({**p, "_dupes": dupes})
print_finding("Paragraphs with repeated sentences", "HIGH",
len(repeated_sent_paras), total, repeated_sent_paras)
# ════════════════════════════════════════════════════════════════════
print_section("SUMMARY")
# ════════════════════════════════════════════════════════════════════
print(f"\n Total paragraphs analyzed: {total:,}")
print(f"\n HIGH concern findings:")
print(f" - Cross-references to non-1C items: {len(cross_ref_paras):,}")
print(f" - Non-cyber legal boilerplate: {len(boilerplate_paras):,}")
print(f" - Extremely long paragraphs (>400 words): {len(long_paras):,}")
print(f" - Company name mismatches: {len(company_name_mismatches):,}")
print(f" - No cybersecurity keywords: {len(no_cyber):,}")
print(f" - Table/numeric data: {len(table_paras):,}")
print(f" - Encoding artifacts: {len(encoding_paras):,}")
print(f" - Repeated sentences: {len(repeated_sent_paras):,}")
print(f" - Exact substring matches (sampled): {len(substring_matches):,}")
print(f"\n MEDIUM concern findings:")
print(f" - High uppercase ratio: {len(high_upper):,}")
print(f" - Non-ASCII characters: {len(non_ascii_paras):,}")
print(f" - Unusual whitespace: {len(whitespace_issues):,}")
print(f" - Dollar amounts: {len(dollar_paras):,}")
print(f" - Bullet points mid-text: {len(bullet_paras):,}")
print(f" - Embedded newlines: {len(newline_paras):,}")
print(f" - Mid-paragraph headings: {len(mid_heading_paras):,}")
print(f" - URLs in text: {len(url_paras):,}")
print(f"\n LOW concern findings:")
print(f" - High punctuation density: {len(high_punct):,}")
print(f" - Date mentions: {len(date_paras):,}")
print(f" - Low information density: {len(low_info_paras):,}")
print(f" - Footnote references: {len(footnote_paras):,}")
if __name__ == "__main__":
main()
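The core of the cross-filing substring check in section 4c reduces to a single predicate. A minimal, self-contained sketch (the paragraphs and accession numbers below are hypothetical; the full audit additionally caps both candidate lists for performance):

```python
def is_cross_filing_substring(short: tuple[str, str], long: tuple[str, str]) -> bool:
    """True when a short paragraph appears verbatim inside a longer one
    from a *different* filing (same accession number is skipped)."""
    s_text, s_acc = short
    l_text, l_acc = long
    return s_acc != l_acc and s_text in l_text

# Hypothetical (text, accessionNumber) pairs:
a = ("We maintain a cybersecurity risk program.", "0001-24-000001")
b = ("As disclosed, We maintain a cybersecurity risk program. It is reviewed annually.",
     "0002-24-000002")
print(is_cross_filing_substring(a, b))  # True: duplicate across filings
print(is_cross_filing_substring(a, a))  # False: same filing is skipped
```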


@ -0,0 +1,537 @@
#!/usr/bin/env python3
"""
Detect HTML generators for all SEC filing HTML files.
Phase 1: Exhaustive signature detection
Phase 2: Cluster remaining unknowns
Phase 3: Summary statistics
"""
import os
import re
import sys
from collections import defaultdict, Counter
from pathlib import Path
HTML_DIR = Path("/home/joey/Documents/sec-cyBERT/data/raw/html")
READ_BYTES = 20_000
# Known SEC filing agent CIKs (accession number prefixes)
FILING_AGENT_CIKS = {
"0000950170": "Donnelley Financial Solutions",
"0001193125": "Donnelley Financial Solutions",
"0001558370": "Toppan Merrill",
"0001654954": "Toppan Merrill",
}
def detect_generator(filepath: str) -> tuple[str, str]:
"""Read first 20KB of file and detect generator. Returns (generator, evidence)."""
with open(filepath, "rb") as f:
raw = f.read(READ_BYTES)
text = raw.decode("utf-8", errors="replace")
text_lower = text.lower()
# --- Explicit generator metadata ---
# 1. <meta name="generator" content="..."> (both attribute orderings)
m = re.search(
r'<meta\s+name\s*=\s*["\']generator["\']\s+content\s*=\s*["\']([^"\']+)["\']',
text, re.I,
)
if not m:
m = re.search(
r'<meta\s+content\s*=\s*["\']([^"\']+)["\']\s+name\s*=\s*["\']generator["\']',
text, re.I,
)
if m:
return _normalize_generator(m.group(1)), f'meta generator: {m.group(1)}'
# 2. <meta name="Creator" content="...">
m = re.search(
r'<meta\s+name\s*=\s*["\']Creator["\']\s+content\s*=\s*["\']([^"\']+)["\']',
text, re.I,
)
if m:
return _normalize_generator(m.group(1)), f'meta Creator: {m.group(1)}'
# 4. <meta name="Producer" content="...">
m = re.search(
r'<meta\s+name\s*=\s*["\']Producer["\']\s+content\s*=\s*["\']([^"\']+)["\']',
text, re.I,
)
if m:
return _normalize_generator(m.group(1)), f'meta Producer: {m.group(1)}'
# 15. ProgId meta tag (Word, Excel converters)
m = re.search(
r'<meta\s+name\s*=\s*["\']ProgId["\']\s+content\s*=\s*["\']([^"\']+)["\']',
text, re.I,
)
if m:
progid = m.group(1)
if "word" in progid.lower():
return "Microsoft Word", f"ProgId: {progid}"
if "excel" in progid.lower():
return "Microsoft Excel", f"ProgId: {progid}"
return _normalize_generator(progid), f"ProgId: {progid}"
# --- HTML comment signatures (search full 20KB) ---
# Workiva / Wdesk
if re.search(r"<!--.*Created with the Workiva Platform.*-->", text, re.I):
return "Workiva", "comment: Created with the Workiva Platform"
if re.search(r"<!--.*Copyright\s+\d{4}\s+Workiva.*-->", text, re.I):
return "Workiva", "comment: Copyright Workiva"
if re.search(r"<!--.*Document created using Wdesk.*-->", text, re.I):
return "Workiva", "comment: Document created using Wdesk"
# Toppan Merrill / Bridge
if re.search(r"<!--.*(?:Toppan\s*Merrill|iXBRL document created with.*Toppan).*-->", text, re.I):
return "Toppan Merrill", "comment: Toppan Merrill"
if re.search(r"<!--.*Merrill\s*Bridge.*-->", text, re.I):
return "Toppan Merrill", "comment: Merrill Bridge"
# Donnelley Financial Solutions / RR Donnelley
if re.search(r"<!--.*Donnelley Financial Solutions.*-->", text, re.I):
return "Donnelley Financial Solutions", "comment: Donnelley Financial Solutions"
if re.search(r"<!--.*RR\s*Donnelley.*-->", text, re.I):
return "Donnelley Financial Solutions", "comment: RR Donnelley"
# Broadridge PROfile
if re.search(r"<!--.*Broadridge\s+PROfile.*-->", text, re.I):
return "Broadridge PROfile", "comment: Broadridge PROfile"
# Also match "Licensed to: ... Document created using Broadridge PROfile"
if "broadridge" in text_lower:
return "Broadridge PROfile", "keyword: broadridge"
# SEC Publisher (in title or comment)
m_title = re.search(r"<title[^>]*>([^<]+)</title>", text, re.I)
title_text = m_title.group(1).strip() if m_title else ""
if "sec publisher" in text_lower or "sec publisher" in title_text.lower():
return "SEC Publisher", "title/keyword: SEC Publisher"
# IRIS Carbon (various filing agents using IRIS Carbon platform)
m = re.search(r"<!--.*Powered by IRIS Carbon.*-->", text, re.I)
if m:
# Extract the filing agent name before "Powered by IRIS Carbon"
m2 = re.search(r"<!--\s*([^,]+),\s*Powered by IRIS Carbon", text, re.I)
agent = m2.group(1).strip() if m2 else "Unknown agent"
return "IRIS Carbon", f"comment: {agent} via IRIS Carbon"
# Certent Disclosure Management
if re.search(r"<!--.*Certent\s+Disclosure\s+Management.*-->", text, re.I):
return "Certent", "comment: Certent Disclosure Management"
if "certent" in text_lower:
return "Certent", "keyword: certent"
# CompSci Resources, LLC
if re.search(r"<!--.*CompSci Resources.*-->", text, re.I):
return "CompSci Transform", "comment: CompSci Resources"
# RDG Portal
if re.search(r"<!--.*RDG Portal.*-->", text, re.I):
return "RDG Portal", "comment: RDG Portal"
# PDF to EDGAR
if title_text.lower() == "pdf to edgar" or "pdf to edgar" in text_lower[:2000]:
return "PDF to EDGAR", "title/keyword: PDF to EDGAR"
# Generic generated/created by comments (but NOT bare dates)
m = re.search(r"<!--\s*Generated\s+by\s+([^-]+?)-->", text, re.I)
if m:
val = m.group(1).strip()
if not re.match(r"^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}", val):
return _normalize_generator(val), f"comment: Generated by {val}"
m = re.search(r"<!--\s*Created\s+(?:by|with)\s+([^-]+?)-->", text, re.I)
if m:
val = m.group(1).strip()
if not re.match(r"^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}", val):
return _normalize_generator(val), f"comment: Created by/with {val}"
# --- Keyword signatures in full text ---
# 5. Workiva
if re.search(r"\bwdesk\b", text_lower):
return "Workiva", "keyword: wdesk"
if re.search(r"\bworkiva\b", text_lower):
return "Workiva", "keyword: workiva"
# 6. Donnelley/DFIN
if re.search(r"\brrdonnelley\b", text_lower):
return "Donnelley Financial Solutions", "keyword: rrdonnelley"
if re.search(r"\bedgar-online\b", text_lower):
return "Donnelley Financial Solutions", "keyword: edgar-online"
# 7. Toppan Merrill
if re.search(r"\btoppan\b", text_lower):
return "Toppan Merrill", "keyword: toppan"
if re.search(r"\bmerrill\b", text_lower) and re.search(r"\b(?:bridge|ixbrl|xbrl)\b", text_lower):
return "Toppan Merrill", "keyword: merrill + bridge/xbrl"
if re.search(r"\bbowne\b", text_lower):
return "Toppan Merrill", "keyword: bowne"
# 8. CompSci Transform
if re.search(r"\bcompsci\b", text_lower):
return "CompSci Transform", "keyword: compsci"
# 9. ThunderDome
if re.search(r"\bthunderdome\b", text_lower):
return "ThunderDome", "keyword: thunderdome"
# 10. GoXBRL
if re.search(r"\bgoxbrl\b", text_lower):
return "GoXBRL", "keyword: goxbrl"
# 16. CSS class naming patterns
if re.search(r'class\s*=\s*["\'][^"\']*\bwk_\w+', text_lower):
return "Workiva", "CSS class prefix: wk_"
# --- SGML document wrapper detection ---
has_sgml = re.search(r"<DOCUMENT>\s*\n?\s*<TYPE>", text, re.I)
if has_sgml:
m_fn = re.search(r"<FILENAME>\s*([\w\-\.]+)", text, re.I)
if m_fn:
filename = m_fn.group(1).lower()
# d + digits = Donnelley Financial Solutions
if re.match(r"d\d+", filename):
return "Donnelley Financial Solutions", f"SGML filename: {m_fn.group(1)}"
# tm + digits = Toppan Merrill
if re.match(r"tm\d+", filename):
return "Toppan Merrill", f"SGML filename: {m_fn.group(1)}"
# ea + digits = EFiling/EDGAR Agent
if re.match(r"ea\d+", filename):
return "EFiling/EDGAR Agent", f"SGML filename: {m_fn.group(1)}"
# SGML-wrapped but no known filename pattern — check for other signals inside
# Rule-Page comments = Broadridge/EFiling variant
if "<!-- field: rule-page" in text_lower or "rule-page" in text_lower[:5000]:
return "Broadridge PROfile", "SGML + Rule-Page field comments"
# Field: Set comments with xdx = EFiling XDX tool
if "field: set; name: xdx" in text_lower:
return "EFiling XDX", "SGML + xdx Field:Set comments"
# <!-- Field: Set --> or <!-- Field: Rule --> without xdx
if "<!-- field:" in text_lower[:5000]:
return "EFiling/EDGAR Agent", "SGML + Field comments"
# Donnelley structural pattern: Center/DIV 8.5in
if re.search(r'<Center><DIV STYLE="width:8\.5in"', text):
return "Donnelley Financial Solutions", "SGML + Center/DIV 8.5in layout"
# Check accession prefix for known filing agents
basename = os.path.basename(filepath)
accession_prefix = basename.split("-")[0]
if accession_prefix in FILING_AGENT_CIKS:
return FILING_AGENT_CIKS[accession_prefix], f"SGML + filing agent CIK {accession_prefix}"
# Remaining SGML-wrapped: classify by structural patterns
font_count = text_lower.count("<font")
if font_count > 5:
return "SGML-wrapped (legacy/font-based)", f"SGML + {font_count} <font> tags"
return "SGML-wrapped (unknown)", "SGML wrapper, no specific generator"
# --- Inline XBRL detection for non-SGML files ---
has_ix_ns = "xmlns:ix=" in text_lower or "<ix:header" in text_lower
# 12. Structural: Donnelley uppercase P STYLE + Center DIV 8.5in
if re.search(r'<P STYLE="[^"]*font-family:Times New Roman"', text) and re.search(
r'<Center><DIV STYLE="width:8\.5in"', text
):
return "Donnelley Financial Solutions", "structural: uppercase P STYLE + Center DIV 8.5in"
# 14. Title tag tool names
if title_text:
title_lower = title_text.lower()
if "workiva" in title_lower or "wdesk" in title_lower:
return "Workiva", f"title: {title_text}"
if has_ix_ns:
# 11. ix:header with tool info / Field comments
if "field: set; name: xdx" in text_lower:
return "EFiling XDX", "iXBRL + xdx Field:Set comments"
if "<!-- field: rule" in text_lower:
return "Broadridge PROfile", "iXBRL + Rule-Page field comments"
if "<!-- field:" in text_lower[:5000]:
return "EFiling/EDGAR Agent", "iXBRL + Field comments"
# Filing agent CIK-based detection
basename = os.path.basename(filepath)
accession_prefix = basename.split("-")[0]
if accession_prefix in FILING_AGENT_CIKS:
agent = FILING_AGENT_CIKS[accession_prefix]
return agent, f"iXBRL + filing agent CIK {accession_prefix}"
# 13. XML declaration encoding as structural signal
if '<?xml version="1.0" encoding="utf-8"' in text_lower[:200]:
return "Inline XBRL (utf-8 toolchain)", "iXBRL + utf-8 XML declaration"
if "<?xml version='1.0' encoding='ascii'?>" in text_lower[:200]:
if re.search(r'<div style="display:none"><ix:header>', text_lower[:3000]):
return "Inline XBRL (SEC/EDGAR standard)", "iXBRL + ASCII XML + hidden ix:header"
return "Inline XBRL (SEC/EDGAR standard)", "iXBRL + ASCII XML declaration"
# Generic inline XBRL with no other signal
return "Inline XBRL (tool unresolved)", "iXBRL namespace only"
# --- Structural fallbacks for non-XBRL files ---
font_count = text_lower.count("<font")
td_count = text_lower.count("<td")
span_count = text_lower.count("<span")
if font_count > 20:
return "Legacy generator (font-based)", f"structural: {font_count} <font> tags"
if td_count > 50 and span_count < 10:
return "Table-based generator", f"structural: {td_count} <td> tags"
data_attr_count = len(re.findall(r"\bdata-\w+", text_lower))
if data_attr_count > 10:
return "Modern web tooling", f"structural: {data_attr_count} data- attributes"
return "Unknown", "no signature detected"
def _normalize_generator(raw: str) -> str:
"""Normalize generator names to canonical forms."""
r = raw.strip().lower()
if "workiva" in r or "wdesk" in r:
return "Workiva"
if "donnelley" in r or "dfin" in r or "rrdonnelley" in r:
return "Donnelley Financial Solutions"
if ("toppan" in r) or ("merrill" in r and "bridge" in r):
return "Toppan Merrill"
if "word" in r and "microsoft" in r:
return "Microsoft Word"
if "excel" in r and "microsoft" in r:
return "Microsoft Excel"
if "thunderdome" in r:
return "ThunderDome"
if "goxbrl" in r:
return "GoXBRL"
if "compsci" in r:
return "CompSci Transform"
if "certent" in r:
return "Certent"
if "iris carbon" in r:
return "IRIS Carbon"
if "broadridge" in r or "profile" in r:
return "Broadridge PROfile"
if "sec publisher" in r:
return "SEC Publisher"
return raw.strip()
def extract_body_snippet(filepath: str) -> str:
"""Extract first 200 bytes after <body> tag."""
with open(filepath, "rb") as f:
raw = f.read(READ_BYTES)
text = raw.decode("utf-8", errors="replace")
m = re.search(r"<body[^>]*>(.*)", text, re.I | re.S)
if m:
body = m.group(1)[:200].strip()
return re.sub(r"\s+", " ", body)
return re.sub(r"\s+", " ", text[:200])
def extract_class_names(filepath: str, max_elements: int = 10) -> list[str]:
"""Extract CSS class names from first N elements."""
with open(filepath, "rb") as f:
raw = f.read(READ_BYTES)
text = raw.decode("utf-8", errors="replace")
classes = re.findall(r'class\s*=\s*["\']([^"\']+)["\']', text, re.I)
return classes[:max_elements]
def main():
files = sorted(HTML_DIR.glob("*.html"))
total = len(files)
print(f"Processing {total} HTML files...\n")
results: dict[str, tuple[str, str]] = {}
generator_examples: dict[str, list[str]] = defaultdict(list)
generator_methods: dict[str, set[str]] = defaultdict(set)
for i, fp in enumerate(files):
accession = fp.stem
gen, evidence = detect_generator(str(fp))
results[accession] = (gen, evidence)
generator_examples[gen].append(accession)
method = evidence.split(":")[0].strip()
generator_methods[gen].add(method)
if (i + 1) % 2000 == 0:
print(f" Processed {i + 1}/{total}...", file=sys.stderr)
# --- Phase 1 output ---
print("=" * 110)
print("PHASE 1: Generator Detection Results")
print("=" * 110)
gen_counts = Counter(gen for gen, _ in results.values())
for gen, count in gen_counts.most_common():
pct = count / total * 100
examples = generator_examples[gen][:3]
methods = ", ".join(sorted(generator_methods[gen]))
print(f"\n {gen}")
print(f" Count: {count:,} ({pct:.1f}%)")
print(f" Methods: {methods}")
print(f" Examples: {', '.join(examples)}")
# --- Phase 2: Cluster unknowns ---
unknowns = [acc for acc, (gen, _) in results.items() if gen == "Unknown"]
print(f"\n\n{'=' * 110}")
print(f"PHASE 2: Clustering {len(unknowns)} Unknown Files")
print("=" * 110)
if unknowns:
fingerprints: dict[str, list[str]] = defaultdict(list)
for acc in unknowns:
fp = HTML_DIR / f"{acc}.html"
with open(fp, "rb") as f:
raw_bytes = f.read(READ_BYTES)
text = raw_bytes.decode("utf-8", errors="replace")
text_lower = text.lower()
has_xml_decl = text.startswith("<?xml")
has_doctype = "<!doctype" in text_lower[:500]
first_tag_m = re.search(r"<(\w+)", text)
first_tag = first_tag_m.group(1).lower() if first_tag_m else ""
td_c = text_lower.count("<td")
span_c = text_lower.count("<span")
div_c = text_lower.count("<div")
p_c = text_lower.count("<p ")
font_c = text_lower.count("<font")
counts = {"td": td_c, "span": span_c, "div": div_c, "p": p_c, "font": font_c}
dominant = max(counts, key=counts.get) if max(counts.values()) > 0 else "empty"
classes = re.findall(r'class\s*=\s*["\']([^"\']+)["\']', text[:5000], re.I)
class_prefix = ""
if classes and classes[0].strip():  # guard against whitespace-only class attributes
fc = classes[0].split()[0]
if "_" in fc:
class_prefix = fc.split("_")[0] + "_"
elif "-" in fc:
class_prefix = fc.split("-")[0] + "-"
else:
class_prefix = fc[:4]
fingerprint = (
f"xml={has_xml_decl}|doctype={has_doctype}|first={first_tag}"
f"|layout={dominant}|cls={class_prefix}"
)
fingerprints[fingerprint].append(acc)
for idx, (fp_key, accs) in enumerate(
sorted(fingerprints.items(), key=lambda x: -len(x[1]))
):
print(f"\n Cluster {idx + 1} ({len(accs)} files): {fp_key}")
for acc in accs[:5]:
filepath = HTML_DIR / f"{acc}.html"
snippet = extract_body_snippet(str(filepath))
cls = extract_class_names(str(filepath), 5)
print(f" {acc}:")
print(f" Snippet: {snippet[:120]}")
if cls:
print(f" Classes: {cls[:5]}")
if len(accs) > 5:
print(f" ... and {len(accs) - 5} more files")
else:
print(" No truly unknown files remain!")
# --- Phase 3: Summary ---
print(f"\n\n{'=' * 110}")
print("PHASE 3: Summary Statistics")
print("=" * 110)
header = (
f"\n{'Generator':<45} {'Count':>7} {'%':>7} "
f"{'Detection Methods':<50} {'Examples (up to 3)'}"
)
print(header)
print("-" * 170)
for gen, count in gen_counts.most_common():
pct = count / total * 100
examples = ", ".join(generator_examples[gen][:3])
methods = ", ".join(sorted(generator_methods[gen]))
if len(methods) > 50:
methods = methods[:47] + "..."
print(f"{gen:<45} {count:>7} {pct:>6.1f}% {methods:<50} {examples}")
print("-" * 170)
print(f"{'TOTAL':<45} {total:>7} {100.0:>6.1f}%")
unknown_count = gen_counts.get("Unknown", 0)
identified = total - unknown_count
print(f"\nIdentified: {identified:,} / {total:,} ({identified / total * 100:.1f}%)")
print(f"Truly unidentified: {unknown_count:,} / {total:,} ({unknown_count / total * 100:.1f}%)")
# Consolidated view: group by parent tool family
print(f"\n\n{'=' * 110}")
print("CONSOLIDATED VIEW (grouped by tool family)")
print("=" * 110)
family_map = {
"Workiva": "Workiva",
"Donnelley Financial Solutions": "Donnelley Financial Solutions",
"Toppan Merrill": "Toppan Merrill",
"CompSci Transform": "CompSci Transform",
"ThunderDome": "ThunderDome",
"EFiling/EDGAR Agent": "EFiling/EDGAR Agent",
"EFiling XDX": "EFiling/EDGAR Agent",
"Broadridge PROfile": "Broadridge PROfile",
"SEC Publisher": "SEC Publisher",
"IRIS Carbon": "IRIS Carbon",
"RDG Portal": "RDG Portal",
"Certent": "Certent",
"PDF to EDGAR": "PDF to EDGAR",
"GoXBRL": "GoXBRL",
"Microsoft Word": "Microsoft Word",
"Microsoft Excel": "Microsoft Excel",
"Inline XBRL (SEC/EDGAR standard)": "Inline XBRL (unattributed)",
"Inline XBRL (utf-8 toolchain)": "Inline XBRL (unattributed)",
"Inline XBRL (tool unresolved)": "Inline XBRL (unattributed)",
"SGML-wrapped (legacy/font-based)": "SGML-wrapped (unattributed)",
"SGML-wrapped (unknown)": "SGML-wrapped (unattributed)",
"Legacy generator (font-based)": "Other/Legacy",
"Table-based generator": "Other/Legacy",
"Modern web tooling": "Other/Legacy",
"Unknown": "Unknown",
}
family_counts: Counter = Counter()
family_examples: dict[str, list[str]] = defaultdict(list)
for gen, count in gen_counts.items():
family = family_map.get(gen, gen)
family_counts[family] += count
family_examples[family].extend(generator_examples[gen][:3])
print(f"\n{'Tool Family':<45} {'Count':>7} {'%':>7}")
print("-" * 65)
for family, count in family_counts.most_common():
pct = count / total * 100
examples = ", ".join(family_examples[family][:3])
print(f"{family:<45} {count:>7} {pct:>6.1f}% {examples}")
print("-" * 65)
print(f"{'TOTAL':<45} {total:>7} {100.0:>6.1f}%")
if __name__ == "__main__":
main()
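The SGML `<FILENAME>` prefix heuristic in `detect_generator` can be isolated into a small lookup. A self-contained sketch (the filenames below are hypothetical; prefixes and agent names are the ones used in the script):

```python
import re

# Prefix -> agent mapping, mirroring the SGML <FILENAME> checks above.
FILENAME_PREFIXES = (
    (r"d\d+", "Donnelley Financial Solutions"),
    (r"tm\d+", "Toppan Merrill"),
    (r"ea\d+", "EFiling/EDGAR Agent"),
)

def agent_from_filename(filename: str):
    """Guess the filing agent from a document filename prefix, or None."""
    name = filename.lower()
    for pattern, agent in FILENAME_PREFIXES:
        if re.match(pattern, name):  # anchored at the start of the name
            return agent
    return None

print(agent_from_filename("d123456d10k.htm"))      # Donnelley Financial Solutions
print(agent_from_filename("tm2412345-1_10k.htm"))  # Toppan Merrill
print(agent_from_filename("exhibit99.htm"))        # None
```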


@ -0,0 +1,511 @@
"""
Heading candidate detection in SEC-cyBERT paragraph data.
Searches for inlined section headings that previous passes missed.
READ-ONLY: does not modify data, prints analysis to stdout.
"""
import json
import re
from collections import Counter, defaultdict
from pathlib import Path
DATA_PATH = Path(__file__).resolve().parent.parent / "data" / "paragraphs" / "paragraphs-clean.jsonl"
# ── Load data ──────────────────────────────────────────────────────────────────
print(f"Loading data from {DATA_PATH} ...")
paragraphs = []
with open(DATA_PATH) as f:
for line in f:
paragraphs.append(json.loads(line))
print(f"Loaded {len(paragraphs):,} paragraphs.\n")
# ── Helpers ────────────────────────────────────────────────────────────────────
def preview(text: str, n: int = 150) -> str:
"""First n chars, single-line."""
return text[:n].replace("\n", " ").strip()
COMMON_SENTENCE_STARTERS = {
"we", "our", "the", "a", "an", "as", "in", "on", "to", "for", "if",
"this", "these", "that", "those", "it", "its", "such", "no", "not",
"with", "from", "at", "by", "or", "and", "all", "any", "each",
"while", "when", "where", "although", "because", "since", "after",
"before", "during", "under", "over", "between", "through", "into",
"upon", "about", "there", "here", "however", "additionally",
"furthermore", "moreover", "also", "finally", "similarly",
"accordingly", "consequently", "therefore", "thus", "nonetheless",
"notwithstanding", "specifically", "generally", "currently",
"recently", "historically", "collectively", "certain",
}
HEADING_KEYWORDS = {
"oversight", "framework", "assessment", "compliance", "integration",
"governance", "strategy", "management", "disclosure", "reporting",
"response", "recovery", "prevention", "detection", "monitoring",
"awareness", "training", "policy", "policies", "procedures",
"controls", "cybersecurity", "information", "security", "risk",
"board", "committee", "audit", "technology", "infrastructure",
"incident", "incidents", "threat", "threats", "vulnerability",
"program", "processes", "overview", "background", "introduction",
"summary", "conclusion", "material", "materiality",
}
HEADING_GERUNDS = {
"protecting", "monitoring", "assessing", "managing", "overseeing",
"implementing", "establishing", "maintaining", "identifying",
"evaluating", "mitigating", "addressing", "enhancing", "ensuring",
"integrating", "reporting", "disclosing", "detecting", "preventing",
"responding", "recovering", "training", "educating", "reviewing",
"governing", "supervising", "coordinating", "leveraging",
"strengthening", "safeguarding", "securing",
}
SEPARATOR_LINE = "=" * 100
def print_section(title: str):
print(f"\n{SEPARATOR_LINE}")
print(f" {title}")
print(SEPARATOR_LINE)
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 1: First-sentence grammatical analysis
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 1: First-clause looks like a heading (title case prefix → sentence body)")
# Pattern: first N words are in title case, then a transition to normal
# sentence text. E.g. "Risk Management and Strategy We have..."
approach1_hits = []
for p in paragraphs:
text = p["text"].strip()
words = text.split()
if len(words) < 6:
continue
# Find the transition point: where title-case words stop
title_words = 0
for w in words:
# Strip punctuation for checking
clean = re.sub(r"[^a-zA-Z]", "", w)
if not clean:
title_words += 1
continue
# "and", "of", "the", "for", "in", "on" can be lowercase in titles
if clean.lower() in {"and", "of", "the", "for", "in", "on", "a", "an", "or", "to", "by", "with"}:
title_words += 1
continue
if clean[0].isupper():
title_words += 1
else:
break
# We want 3+ title-case words at the start, then a transition
if title_words >= 3 and title_words < len(words) - 2:
# Check that the word after the title block starts lowercase (sentence body)
rest_start = words[title_words] if title_words < len(words) else ""
rest_clean = re.sub(r"[^a-zA-Z]", "", rest_start)
if rest_clean and rest_clean[0].islower():
heading_part = " ".join(words[:title_words])
# Skip if heading part is just common sentence starters
if heading_part.lower().split()[0] not in COMMON_SENTENCE_STARTERS:
approach1_hits.append({
"id": p["id"],
"heading_words": title_words,
"heading": heading_part,
"preview": preview(text),
})
# Count heading patterns
heading_counter = Counter(h["heading"] for h in approach1_hits)
print(f"\nFound {len(approach1_hits):,} paragraphs with title-case prefix → lowercase body.")
print(f"Unique heading prefixes: {len(heading_counter):,}")
print(f"\nTOP 30 most common heading prefixes:")
for heading, count in heading_counter.most_common(30):
# Find an example
ex = next(h for h in approach1_hits if h["heading"] == heading)
print(f" [{count:4d}x] \"{heading}\"")
print(f" Example: {ex['preview']}")
print(f"\nSample of UNIQUE (1x) heading prefixes (first 30):")
unique_headings = [(h, next(x for x in approach1_hits if x["heading"] == h)) for h, c in heading_counter.items() if c == 1]
for heading, ex in unique_headings[:30]:
print(f" \"{heading}\"")
print(f"      {ex['preview']}")
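The title-case transition logic of Approach 1 condenses to one function. A minimal, self-contained sketch (the input text is hypothetical; the sentence-starter filter applied in the full loop is omitted here for brevity):

```python
import re

# Words allowed lowercase inside a title, as in the loop above.
MINOR_WORDS = {"and", "of", "the", "for", "in", "on", "a", "an", "or", "to", "by", "with"}

def split_title_prefix(text: str):
    """Return (heading, body) when text opens with 3+ title-case words
    followed by a lowercase sentence body; otherwise None."""
    words = text.split()
    n = 0
    for w in words:
        clean = re.sub(r"[^a-zA-Z]", "", w)
        if not clean or clean.lower() in MINOR_WORDS or clean[0].isupper():
            n += 1
        else:
            break  # first lowercase, non-minor word ends the title block
    if 3 <= n < len(words) - 2:
        return " ".join(words[:n]), " ".join(words[n:])
    return None

print(split_title_prefix("Risk Management and Strategy we assess threats annually"))
# ('Risk Management and Strategy', 'we assess threats annually')
```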
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 2: Capitalization anomalies
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 2: Capitalization anomalies")
# 2a: ALL CAPS at start
allcaps_hits = []
for p in paragraphs:
text = p["text"].strip()
words = text.split()
if len(words) < 4:
continue
# Check first 3+ words are ALL CAPS
caps_count = 0
for w in words:
clean = re.sub(r"[^a-zA-Z]", "", w)
if not clean:
caps_count += 1
continue
if clean.isupper() and len(clean) > 1:
caps_count += 1
else:
break
if caps_count >= 3:
allcaps_hits.append({
"id": p["id"],
"caps_words": caps_count,
"preview": preview(text),
})
print(f"\n2a. ALL CAPS for first 3+ words: {len(allcaps_hits):,} paragraphs")
for h in allcaps_hits[:20]:
print(f" [{h['caps_words']} caps words] {h['preview']}")
# 2b: First word is capitalized but NOT a common sentence starter
# and looks like a heading keyword
heading_start_hits = []
for p in paragraphs:
text = p["text"].strip()
words = text.split()
if len(words) < 4:
continue
first_word = re.sub(r"[^a-zA-Z]", "", words[0]).lower()
if first_word in HEADING_KEYWORDS and first_word not in COMMON_SENTENCE_STARTERS:
heading_start_hits.append({
"id": p["id"],
"first_word": first_word,
"preview": preview(text),
})
heading_start_counter = Counter(h["first_word"] for h in heading_start_hits)
print(f"\n2b. First word is a heading keyword (not a sentence starter): {len(heading_start_hits):,} paragraphs")
print("Breakdown by keyword:")
for kw, count in heading_start_counter.most_common(30):
ex = next(h for h in heading_start_hits if h["first_word"] == kw)
print(f" [{count:4d}x] \"{kw}\" → {ex['preview']}")
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 3: Separator patterns
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 3: Separator patterns (heading followed by separator then sentence)")
separator_patterns = {
"period": re.compile(r"^([A-Z][A-Za-z\s,&]{3,60})\.\s+([A-Z][a-z])"),
"dash/em-dash": re.compile(r"^([A-Z][A-Za-z\s,&]{3,60})\s*[–—-]\s*([A-Z][a-z])"),
"semicolon": re.compile(r"^([A-Z][A-Za-z\s,&]{3,60});\s*([A-Z][a-z])"),
"double space": re.compile(r"^([A-Z][A-Za-z\s,&]{3,60})\s{2,}([A-Z][a-z])"),
"colon": re.compile(r"^([A-Z][A-Za-z\s,&]{3,60}):\s*([A-Z][a-z])"),
"parenthetical prefix": re.compile(r"^\([a-z0-9ivx]+\)\s*([A-Z][A-Za-z\s,&]{3,60})\s+([a-z])"),
"bullet/pipe prefix": re.compile(r"^[•●■▪◦‣|]\s*([A-Z][A-Za-z\s,&]{3,60})\s+([a-z])"),
"tab separator": re.compile(r"^([A-Z][A-Za-z\s,&]{3,60})\t+(.+)"),
}
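As a quick sanity check on the patterns above (the sentence is invented for illustration, not drawn from the corpus), the "colon" pattern captures an inline heading before the colon because `:` is excluded from the heading character class:

```python
import re

# Same regex as the "colon" entry in separator_patterns above.
colon = re.compile(r"^([A-Z][A-Za-z\s,&]{3,60}):\s*([A-Z][a-z])")

# Invented filing-style sentence for illustration only.
m = colon.match("Risk Management: The Company maintains a cybersecurity program")
print(m.group(1))  # prints "Risk Management"
```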
for sep_name, pattern in separator_patterns.items():
hits = []
for p in paragraphs:
text = p["text"].strip()
m = pattern.match(text)
if m:
heading_candidate = m.group(1).strip() if m.lastindex >= 1 else ""
# Filter: heading should have at least 2 words
if len(heading_candidate.split()) >= 2:
hits.append({
"id": p["id"],
"heading": heading_candidate,
"preview": preview(text),
})
heading_counts = Counter(h["heading"] for h in hits)
print(f"\n Separator: {sep_name} → {len(hits):,} hits")
if hits:
for heading, count in heading_counts.most_common(20):
ex = next(h for h in hits if h["heading"] == heading)
print(f" [{count:4d}x] \"{heading}\"")
print(f" {ex['preview']}")
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 4: Repeated first-3-words analysis
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 4: Repeated first-3-word phrases")
first3_counter = Counter()
first3_examples = {}
for p in paragraphs:
text = p["text"].strip()
words = text.split()
if len(words) < 4:
continue
first3 = " ".join(words[:3])
first3_counter[first3] += 1
if first3 not in first3_examples:
first3_examples[first3] = preview(text)
# Filter to phrases appearing 5+ times that look heading-like
# (not common sentence starters)
common_starts = {
"we have implemented", "we have established", "we have adopted",
"we have not", "we do not", "we are not", "we believe that",
"we use a", "we rely on", "we have a", "we also have",
"our board of", "the board of", "the company has",
"the audit committee", "in addition to", "as part of",
"as a result", "in the event", "as of the",
"in accordance with", "with respect to",
}
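The heading-likeness test applied in the loop below can be shown in isolation (phrases invented; the real check additionally strips punctuation before testing the first character):

```python
SMALL_WORDS = {"and", "of", "the", "for", "in", "on", "a", "or", "to"}

def looks_heading_like(phrase: str) -> bool:
    # Every word must be capitalized or be a small connector word.
    return all(w[0].isupper() or w in SMALL_WORDS for w in phrase.split())

print(looks_heading_like("Board of Directors"))  # True
print(looks_heading_like("we have adopted"))     # False
```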
print(f"\nFirst-3-word phrases appearing 5+ times (excluding common sentence starts):")
for phrase, count in first3_counter.most_common(200):
if count < 5:
break
if phrase.lower() in common_starts:
continue
# Check if it looks heading-like: title case or contains heading keywords
words_lower = phrase.lower().split()
is_heading_like = (
all(w[0].isupper() or w in {"and", "of", "the", "for", "in", "on", "a", "or", "to"}
for w in phrase.split() if re.sub(r"[^a-zA-Z]", "", w))
and words_lower[0] not in COMMON_SENTENCE_STARTERS
)
label = " [HEADING-LIKE]" if is_heading_like else ""
print(f" [{count:4d}x] \"{phrase}\"{label}")
print(f" {first3_examples[phrase]}")
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 5: Cross-paragraph heading detection (short para → sentence para)
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 5: Cross-paragraph heading detection (standalone short headings)")
# Group paragraphs by accession number, sorted by index
by_filing = defaultdict(list)
for p in paragraphs:
acc = p["filing"]["accessionNumber"]
by_filing[acc].append(p)
for acc in by_filing:
by_filing[acc].sort(key=lambda x: x["paragraphIndex"])
standalone_headings = []
for acc, pars in by_filing.items():
for i in range(len(pars) - 1):
curr = pars[i]
nxt = pars[i + 1]
curr_text = curr["text"].strip()
curr_words = curr_text.split()
nxt_text = nxt["text"].strip()
# Current paragraph is short (< 10 words)
if len(curr_words) > 10 or len(curr_words) < 2:
continue
# Current paragraph looks like a heading:
# - Title case or all caps
# - No period at end (headings rarely end with period)
# - Not a single common word
if curr_text.endswith(".") and not curr_text.endswith("etc."):
continue
# Check title-case-ish
alpha_words = [w for w in curr_words if re.sub(r"[^a-zA-Z]", "", w)]
if not alpha_words:
continue
title_case_ratio = sum(
1 for w in alpha_words
if re.sub(r"[^a-zA-Z]", "", w)[0].isupper()
or re.sub(r"[^a-zA-Z]", "", w).lower() in {"and", "of", "the", "for", "in", "on", "a", "or", "to", "by", "with"}
) / len(alpha_words)
if title_case_ratio < 0.8:
continue
# Next paragraph should be long enough to read as body text
nxt_words = nxt_text.split()
if len(nxt_words) < 3:
continue
standalone_headings.append({
"id": curr["id"],
"heading_text": curr_text,
"next_preview": preview(nxt_text),
"accession": acc,
"company": curr["filing"]["companyName"],
})
heading_text_counter = Counter(h["heading_text"] for h in standalone_headings)
print(f"\nFound {len(standalone_headings):,} potential standalone heading paragraphs.")
print(f"Unique heading texts: {len(heading_text_counter):,}")
print(f"\nTOP 30 most common standalone headings:")
for heading, count in heading_text_counter.most_common(30):
ex = next(h for h in standalone_headings if h["heading_text"] == heading)
print(f" [{count:4d}x] \"{heading}\"")
print(f" Next para: {ex['next_preview']}")
print(f"\nSample of UNIQUE standalone headings (first 30):")
unique_standalone = [h for h in standalone_headings if heading_text_counter[h["heading_text"]] == 1]
for h in unique_standalone[:30]:
print(f" \"{h['heading_text']}\" ({h['company']})")
print(f" Next: {h['next_preview']}")
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 6: Unusual word patterns at paragraph start
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 6: Unusual starting words (gerunds, heading nouns)")
# 6a: Gerunds at start
gerund_hits = []
for p in paragraphs:
text = p["text"].strip()
words = text.split()
if len(words) < 4:
continue
first_word = re.sub(r"[^a-zA-Z]", "", words[0]).lower()
if first_word.endswith("ing") and len(first_word) > 4:
if first_word in HEADING_GERUNDS or first_word not in COMMON_SENTENCE_STARTERS:
gerund_hits.append({
"id": p["id"],
"first_word": first_word,
"preview": preview(text),
})
gerund_counter = Counter(h["first_word"] for h in gerund_hits)
print(f"\n6a. Paragraphs starting with gerunds: {len(gerund_hits):,}")
print("TOP 20 gerunds:")
for word, count in gerund_counter.most_common(20):
ex = next(h for h in gerund_hits if h["first_word"] == word)
print(f" [{count:4d}x] \"{word}\" → {ex['preview']}")
# 6b: Heading nouns at start (already covered in 2b, but let's look at
# multi-word patterns starting with heading nouns)
noun_phrase_hits = []
for p in paragraphs:
text = p["text"].strip()
words = text.split()
if len(words) < 4:
continue
first_word = re.sub(r"[^a-zA-Z]", "", words[0]).lower()
if first_word in HEADING_KEYWORDS:
# Take the first four words as the candidate heading phrase
first_few = " ".join(words[:min(4, len(words))])
noun_phrase_hits.append({
"id": p["id"],
"first_few": first_few,
"preview": preview(text),
})
noun_counter = Counter(h["first_few"] for h in noun_phrase_hits)
print(f"\n6b. Paragraphs starting with heading keyword nouns: {len(noun_phrase_hits):,}")
print("TOP 20 opening phrases:")
for phrase, count in noun_counter.most_common(20):
ex = next(h for h in noun_phrase_hits if h["first_few"] == phrase)
print(f" [{count:4d}x] \"{phrase}\" → {ex['preview']}")
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 7: Numbers/letters at start (list items / numbered headings)
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 7: Numbered/lettered items at paragraph start")
numbered_patterns = {
"roman_paren": re.compile(r"^\((?:i{1,3}|iv|v|vi{0,3}|ix|x)\)\s"),
"letter_paren": re.compile(r"^\([a-z]\)\s"),
"number_paren": re.compile(r"^\(\d+\)\s"),
"number_dot": re.compile(r"^\d+\.\s"),
"letter_dot": re.compile(r"^[a-z]\.\s"),
"roman_dot": re.compile(r"^(?:i{1,3}|iv|v|vi{0,3}|ix|x)\.\s"),
"bullet_chars": re.compile(r"^[•●■▪◦‣►▸→·]\s"),
"dash_bullet": re.compile(r"^[-–—]\s+[A-Z]"),
}
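To make the list-marker patterns concrete (test strings invented for illustration), `roman_paren` accepts a parenthesized roman numeral followed by whitespace but rejects ordinary parentheticals:

```python
import re

# Same regex as the "roman_paren" entry above; covers (i) through (x).
roman_paren = re.compile(r"^\((?:i{1,3}|iv|v|vi{0,3}|ix|x)\)\s")

print(bool(roman_paren.match("(iv) oversee third-party cyber risk")))  # True
print(bool(roman_paren.match("(see below) for details")))              # False
```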
for pattern_name, pattern in numbered_patterns.items():
hits = []
for p in paragraphs:
text = p["text"].strip()
if pattern.match(text):
hits.append({
"id": p["id"],
"preview": preview(text),
"wordCount": p["wordCount"],
})
print(f"\n Pattern: {pattern_name} → {len(hits):,} hits")
if hits:
# Show word count distribution
short = sum(1 for h in hits if h["wordCount"] < 15)
medium = sum(1 for h in hits if 15 <= h["wordCount"] < 50)
long = sum(1 for h in hits if h["wordCount"] >= 50)
print(f" Length distribution: <15 words: {short}, 15-49: {medium}, 50+: {long}")
print(f" Examples (first 10):")
for h in hits[:10]:
print(f" [{h['wordCount']:3d}w] {h['preview']}")
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 8 (BONUS): Known heading phrases at paragraph start
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 8 (BONUS): Known heading phrases at the start of a paragraph")
# Check for known SEC 1C heading phrases appearing at the start of a paragraph
# even if not perfectly title-cased
known_heading_phrases = [
"risk management", "risk assessment", "risk factors",
"governance", "board oversight", "board of directors",
"incident response", "third party", "third-party",
"cybersecurity program", "cybersecurity risk", "cybersecurity governance",
"information security", "data protection", "data privacy",
"security operations", "security awareness",
"management oversight", "committee oversight",
"risk management and strategy", "risk management, strategy",
"material cybersecurity", "materiality assessment",
"disclosure controls",
]
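Because the matching below lowercases the first 80 characters, an ALL-CAPS variant of a heading phrase is still caught (paragraph text invented for illustration):

```python
# Invented paragraph opening, in the ALL-CAPS style some filings use.
text = "RISK MANAGEMENT AND STRATEGY We assess cybersecurity risks annually."

print(text[:80].lower().startswith("risk management and strategy"))  # True
```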
phrase_hits = defaultdict(list)
for p in paragraphs:
text = p["text"].strip()
# Only look at the first ~80 chars
first_part = text[:80].lower()
for phrase in known_heading_phrases:
if first_part.startswith(phrase):
phrase_hits[phrase].append({
"id": p["id"],
"preview": preview(text),
})
print(f"\nParagraphs starting with known heading phrases:")
for phrase in sorted(phrase_hits.keys(), key=lambda x: -len(phrase_hits[x])):
hits = phrase_hits[phrase]
print(f"\n \"{phrase}\" → {len(hits)} hits")
for h in hits[:5]:
print(f" {h['preview']}")
# ══════════════════════════════════════════════════════════════════════════════
# SUMMARY
# ══════════════════════════════════════════════════════════════════════════════
print_section("SUMMARY")
print(f"""
Approach 1 (title-case prefix body): {len(approach1_hits):,} hits
Approach 2a (ALL CAPS start): {len(allcaps_hits):,} hits
Approach 2b (heading keyword start): {len(heading_start_hits):,} hits
Approach 3 (separator patterns): see above per-separator
Approach 4 (repeated first-3 words):    see above per-phrase
Approach 5 (standalone short headings): {len(standalone_headings):,} hits
Approach 6a (gerund starts): {len(gerund_hits):,} hits
Approach 6b (heading noun starts): {len(noun_phrase_hits):,} hits
Approach 7 (numbered/lettered): see above per-pattern
Approach 8 (known phrase starts): {sum(len(v) for v in phrase_hits.values()):,} hits
""")
print("Done.")


@@ -0,0 +1,471 @@
"""
Investigate whether certain SEC filing generators produce systematically worse
text extraction in the SEC-cyBERT corpus. READ-ONLY analysis.
"""
import json
import os
import random
import re
from collections import Counter, defaultdict
from pathlib import Path
random.seed(42)
HTML_DIR = Path("data/raw/html")
PARAGRAPHS_FILE = Path("data/paragraphs/paragraphs-clean.jsonl")
# ─────────────────────────────────────────────────────────────────────────────
# Helpers
# ─────────────────────────────────────────────────────────────────────────────
def extract_generator(header_bytes: bytes) -> str:
"""Extract generator from first ~5KB of an HTML file."""
text = header_bytes.decode("utf-8", errors="replace")
# 1. <meta name="generator" content="...">
m = re.search(
r'<meta\s+name\s*=\s*["\']generator["\']\s+content\s*=\s*["\']([^"\']+)["\']',
text, re.IGNORECASE
)
if m:
return m.group(1).strip()
# Also try the reversed attribute order (content= before name=)
m = re.search(
r'<meta\s+content\s*=\s*["\']([^"\']+)["\']\s+name\s*=\s*["\']generator["\']',
text, re.IGNORECASE
)
if m:
return m.group(1).strip()
# 2. <!-- Generated by ... -->
m = re.search(r'<!--\s*Generated\s+by\s+([^->]+)', text, re.IGNORECASE)
if m:
return m.group(1).strip()
# 3. Distinctive patterns
if "Workiva" in text or "wkiva" in text.lower():
return "Workiva (pattern)"
if "ix:header" in text.lower() or "ix:hidden" in text.lower():
# iXBRL inline — common but not a specific generator
pass
if "toppanmerrill" in text.lower() or "toppan" in text.lower():
return "Toppan Merrill (pattern)"
if "donnelley" in text.lower() or "edgar online" in text.lower():
return "Donnelley/EDGAR Online (pattern)"
if "GoXBRL" in text:
return "GoXBRL (pattern)"
return "UNKNOWN"
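The primary meta-tag branch of `extract_generator` can be exercised against a minimal synthetic header (HTML invented for illustration):

```python
import re

# Same regex as the first <meta name="generator"> lookup above.
meta_re = re.compile(
    r'<meta\s+name\s*=\s*["\']generator["\']\s+content\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)

html = '<html><head><meta name="generator" content="Workiva"></head>'
m = meta_re.search(html)
print(m.group(1))  # "Workiva"
```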
def normalize_generator(raw: str) -> str:
"""Normalize generator strings to canonical names."""
low = raw.lower()
if "workiva" in low or "wdesk" in low or "wkiva" in low:
return "Workiva"
if "toppan" in low or "merrill" in low:
return "Toppan Merrill"
if "donnelley" in low or "edgar online" in low:
return "Donnelley"
if "goxbrl" in low:
return "GoXBRL"
if "word" in low or "microsoft" in low:
return "Microsoft Word"
if "webfilings" in low:
return "WebFilings"
if "novaworks" in low:
return "Novaworks"
if "ez-xbrl" in low or "ezxbrl" in low:
return "EZ-XBRL"
if "ixbrl" in low or "inline xbrl" in low:
return "iXBRL Generator"
if "vintage" in low:
return "Vintage (Donnelley)"
if "edgar" in low:
return "EDGAR"
if raw == "UNKNOWN":
return "UNKNOWN"
return raw # keep as-is if no match
def read_generator_for_file(filepath: Path) -> str:
"""Read the first 5KB and extract the generator."""
try:
with open(filepath, "rb") as f:
header = f.read(5000)
return normalize_generator(extract_generator(header))
except Exception:
return "ERROR"
# ─────────────────────────────────────────────────────────────────────────────
# Step 0: Load paragraphs
# ─────────────────────────────────────────────────────────────────────────────
print("Loading paragraphs...")
paragraphs = []
filing_paragraphs = defaultdict(list) # accession -> [paragraph dicts]
with open(PARAGRAPHS_FILE) as f:
for line in f:
p = json.loads(line)
paragraphs.append(p)
acc = p["filing"]["accessionNumber"]
filing_paragraphs[acc].append(p)
print(f" Loaded {len(paragraphs):,} paragraphs from {len(filing_paragraphs):,} filings\n")
# ─────────────────────────────────────────────────────────────────────────────
# Step 1: Identify filing generators (500 random HTML files)
# ─────────────────────────────────────────────────────────────────────────────
print("=" * 80)
print("STEP 1: IDENTIFY FILING GENERATORS (500-file sample)")
print("=" * 80)
all_html_files = sorted(HTML_DIR.glob("*.html"))
sample_files = random.sample(all_html_files, min(500, len(all_html_files)))
sample_generators = {} # filename_stem -> generator
raw_generator_strings = []
for f in sample_files:
try:
with open(f, "rb") as fh:
header = fh.read(5000)
raw = extract_generator(header)
raw_generator_strings.append(raw)
gen = normalize_generator(raw)
sample_generators[f.stem] = gen
except Exception:
sample_generators[f.stem] = "ERROR"
gen_counts = Counter(sample_generators.values())
print(f"\nGenerator distribution (500-file sample):\n")
print(f" {'Generator':<30} {'Count':>6} {'%':>7}")
print(f" {'-'*30} {'-'*6} {'-'*7}")
for gen, count in gen_counts.most_common():
print(f" {gen:<30} {count:>6} {count/len(sample_files)*100:>6.1f}%")
print(f"\nRaw generator strings (unique):")
raw_counts = Counter(raw_generator_strings)
for raw, count in raw_counts.most_common(20):
print(f" [{count:>4}] {raw[:80]}")
# ─────────────────────────────────────────────────────────────────────────────
# Step 2: Generator-specific quality metrics
# ─────────────────────────────────────────────────────────────────────────────
print("\n" + "=" * 80)
print("STEP 2: GENERATOR-SPECIFIC QUALITY METRICS")
print("=" * 80)
# Major generators: those with >20 filings in sample
major_gens = {g for g, c in gen_counts.items() if c > 20}
print(f"\nMajor generators (>20 in sample): {sorted(major_gens)}\n")
# For each sampled filing that has paragraphs, compute metrics
gen_metrics = defaultdict(lambda: {
"filing_count": 0,
"para_counts": [],
"word_counts": [],
"lowercase_starts": 0,
"total_paras": 0,
"short_paras": 0, # <25 words
"html_sizes": [],
"text_sizes": [],
})
for stem, gen in sample_generators.items():
if gen not in major_gens:
continue
acc = stem # filename stem is the accession number
paras = filing_paragraphs.get(acc, [])
m = gen_metrics[gen]
m["filing_count"] += 1
m["para_counts"].append(len(paras))
# HTML file size
html_path = HTML_DIR / f"{stem}.html"
try:
html_size = html_path.stat().st_size
except Exception:
html_size = 0
m["html_sizes"].append(html_size)
total_text_len = 0
for p in paras:
wc = p.get("wordCount", len(p["text"].split()))
m["word_counts"].append(wc)
m["total_paras"] += 1
total_text_len += len(p["text"])
if p["text"] and p["text"][0].islower():
m["lowercase_starts"] += 1
if wc < 25:
m["short_paras"] += 1
m["text_sizes"].append(total_text_len)
# Print table
print(f" {'Generator':<22} {'Files':>5} {'Avg ¶':>7} {'Avg WC':>7} {'%lc':>6} {'%short':>7} {'ExtRatio':>9}")
print(f" {'-'*22} {'-'*5} {'-'*7} {'-'*7} {'-'*6} {'-'*7} {'-'*9}")
for gen in sorted(major_gens):
m = gen_metrics[gen]
n = m["filing_count"]
if n == 0:
continue
avg_paras = sum(m["para_counts"]) / n if n else 0
avg_wc = sum(m["word_counts"]) / len(m["word_counts"]) if m["word_counts"] else 0
pct_lc = (m["lowercase_starts"] / m["total_paras"] * 100) if m["total_paras"] else 0
pct_short = (m["short_paras"] / m["total_paras"] * 100) if m["total_paras"] else 0
# Extraction ratio: total text bytes / html bytes
total_html = sum(m["html_sizes"])
total_text = sum(m["text_sizes"])
ext_ratio = (total_text / total_html * 100) if total_html else 0
print(f" {gen:<22} {n:>5} {avg_paras:>7.1f} {avg_wc:>7.1f} {pct_lc:>5.1f}% {pct_short:>6.1f}% {ext_ratio:>8.2f}%")
# ─────────────────────────────────────────────────────────────────────────────
# Step 3: HTML structure analysis — representative snippets
# ─────────────────────────────────────────────────────────────────────────────
print("\n" + "=" * 80)
print("STEP 3: HTML STRUCTURE ANALYSIS (paragraph encoding by generator)")
print("=" * 80)
top5_gens = [g for g, _ in gen_counts.most_common(5)]
for gen in top5_gens:
# Find a sample file for this generator
sample_acc = None
for stem, g in sample_generators.items():
if g == gen:
sample_acc = stem
break
if not sample_acc:
continue
html_path = HTML_DIR / f"{sample_acc}.html"
try:
with open(html_path, "r", errors="replace") as fh:
content = fh.read(50000) # read enough to find a paragraph
# Find a <p> tag or similar paragraph structure
# Look for a <p tag with content
m = re.search(r'(<p\b[^>]*>[^<]{20,})', content, re.IGNORECASE)
if m:
snippet = m.group(1)[:200]
else:
# Try <div> or <span> with text
m = re.search(r'(<(?:div|span)\b[^>]*>[^<]{20,})', content, re.IGNORECASE)
if m:
snippet = m.group(1)[:200]
else:
snippet = "(no paragraph tag found in first 50KB)"
except Exception as e:
snippet = f"(error: {e})"
print(f"\n Generator: {gen}")
print(f" File: {sample_acc}.html")
print(f" Snippet: {snippet}")
print()
# ─────────────────────────────────────────────────────────────────────────────
# Step 4: Generator fingerprinting of problem paragraphs
# ─────────────────────────────────────────────────────────────────────────────
print("=" * 80)
print("STEP 4: GENERATOR FINGERPRINTING OF PROBLEM PARAGRAPHS")
print("=" * 80)
# Identify problem paragraphs
lowercase_paras = []
long_paras = [] # >300 words
short_paras = [] # <25 words
for p in paragraphs:
wc = p.get("wordCount", len(p["text"].split()))
if p["text"] and p["text"][0].islower():
lowercase_paras.append(p)
if wc > 300:
long_paras.append(p)
if wc < 25:
short_paras.append(p)
print(f"\n Problem paragraph counts:")
print(f" Lowercase starts: {len(lowercase_paras):,}")
print(f" Long (>300 words): {len(long_paras):,}")
print(f" Short (<25 words): {len(short_paras):,}")
print(f" Total paragraphs: {len(paragraphs):,}")
# For each category, sample up to 200 and look up generators
# We need a cache of accession -> generator since we may need to read many files
print("\n Building generator cache for problem filings...")
problem_accessions = set()
for p in lowercase_paras:
problem_accessions.add(p["filing"]["accessionNumber"])
for p in long_paras:
problem_accessions.add(p["filing"]["accessionNumber"])
for p in short_paras:
problem_accessions.add(p["filing"]["accessionNumber"])
# Also get generators for ALL filings to compute baseline
print(" Reading generators for ALL filings in the corpus...")
all_accessions = set(filing_paragraphs.keys())
acc_generator = {}
for acc in all_accessions:
html_path = HTML_DIR / f"{acc}.html"
if html_path.exists():
acc_generator[acc] = read_generator_for_file(html_path)
else:
acc_generator[acc] = "FILE_MISSING"
# Baseline distribution
baseline_gen_counts = Counter(acc_generator.values())
print(f"\n Full corpus generator distribution ({len(acc_generator):,} filings):\n")
print(f" {'Generator':<30} {'Count':>6} {'%':>7}")
print(f" {'-'*30} {'-'*6} {'-'*7}")
total_filings = len(acc_generator)
for gen, count in baseline_gen_counts.most_common(15):
print(f" {gen:<30} {count:>6} {count/total_filings*100:>6.1f}%")
def analyze_problem_category(name, problem_list, acc_generator, baseline_gen_counts, total_filings):
"""Analyze which generators are over-represented in a problem category."""
print(f"\n --- {name} ({len(problem_list):,} paragraphs) ---")
# Count generators for problem paragraphs (by paragraph, not by filing)
gen_para_counts = Counter()
for p in problem_list:
acc = p["filing"]["accessionNumber"]
gen = acc_generator.get(acc, "UNKNOWN")
gen_para_counts[gen] += 1
total_problem = len(problem_list)
total_all = len(paragraphs)
print(f" {'Generator':<30} {'# Problem':>9} {'% of Prob':>9} {'% of All':>9} {'Over-rep':>9}")
print(f" {'-'*30} {'-'*9} {'-'*9} {'-'*9} {'-'*9}")
# Compute total paragraphs per generator
gen_all_para_counts = Counter()
for p in paragraphs:
acc = p["filing"]["accessionNumber"]
gen = acc_generator.get(acc, "UNKNOWN")
gen_all_para_counts[gen] += 1
for gen, prob_count in gen_para_counts.most_common(10):
pct_of_problem = prob_count / total_problem * 100 if total_problem else 0
all_count = gen_all_para_counts.get(gen, 1)
pct_of_all = all_count / total_all * 100 if total_all else 0
over_rep = pct_of_problem / pct_of_all if pct_of_all else 0
print(f" {gen:<30} {prob_count:>9,} {pct_of_problem:>8.1f}% {pct_of_all:>8.1f}% {over_rep:>8.2f}x")
# Show a few example problem texts
print(f"\n Example texts:")
for p in problem_list[:3]:
text = p["text"][:120].replace("\n", " ")
acc = p["filing"]["accessionNumber"]
gen = acc_generator.get(acc, "?")
print(f" [{gen}] {text}...")
analyze_problem_category("Lowercase starts (orphan words)", lowercase_paras, acc_generator, baseline_gen_counts, total_filings)
analyze_problem_category("Long paragraphs (>300 words, potential merges)", long_paras, acc_generator, baseline_gen_counts, total_filings)
analyze_problem_category("Short paragraphs (<25 words, potential fragments)", short_paras, acc_generator, baseline_gen_counts, total_filings)
# ─────────────────────────────────────────────────────────────────────────────
# Step 5: Filing size vs extraction quality
# ─────────────────────────────────────────────────────────────────────────────
print("\n" + "=" * 80)
print("STEP 5: FILING SIZE vs EXTRACTION QUALITY")
print("=" * 80)
# Compute HTML size and paragraph count for all filings
size_para_data = []
for acc, paras_list in filing_paragraphs.items():
html_path = HTML_DIR / f"{acc}.html"
try:
html_size = html_path.stat().st_size
except Exception:
continue
size_para_data.append({
"acc": acc,
"html_size": html_size,
"para_count": len(paras_list),
"generator": acc_generator.get(acc, "UNKNOWN"),
})
# Bin by size ranges
size_bins = [
(0, 50_000, "<50KB"),
(50_000, 200_000, "50-200KB"),
(200_000, 500_000, "200-500KB"),
(500_000, 1_000_000, "500KB-1MB"),
(1_000_000, 5_000_000, "1-5MB"),
(5_000_000, float("inf"), ">5MB"),
]
print(f"\n HTML Size vs Extracted Paragraphs:\n")
print(f" {'Size Range':<15} {'Files':>6} {'Avg ¶':>7} {'Med ¶':>7} {'Min ¶':>6} {'Max ¶':>6}")
print(f" {'-'*15} {'-'*6} {'-'*7} {'-'*7} {'-'*6} {'-'*6}")
for lo, hi, label in size_bins:
in_bin = [d for d in size_para_data if lo <= d["html_size"] < hi]
if not in_bin:
continue
counts = sorted([d["para_count"] for d in in_bin])
avg = sum(counts) / len(counts)
med = counts[len(counts) // 2]
print(f" {label:<15} {len(in_bin):>6} {avg:>7.1f} {med:>7} {min(counts):>6} {max(counts):>6}")
# Large HTML files with very few paragraphs — likely extraction failures
print(f"\n Potential extraction failures (HTML >1MB but ≤2 paragraphs):\n")
big_few = [d for d in size_para_data if d["html_size"] > 1_000_000 and d["para_count"] <= 2]
big_few.sort(key=lambda d: d["html_size"], reverse=True)
if not big_few:
# Relax threshold
print(" (None found with >1MB and ≤2 paragraphs. Relaxing to >500KB and ≤3 paragraphs)\n")
big_few = [d for d in size_para_data if d["html_size"] > 500_000 and d["para_count"] <= 3]
big_few.sort(key=lambda d: d["html_size"], reverse=True)
print(f" {'Accession':<30} {'HTML Size':>12} {'Paras':>6} {'Generator':<25}")
print(f" {'-'*30} {'-'*12} {'-'*6} {'-'*25}")
for d in big_few[:10]:
size_str = f"{d['html_size']/1024/1024:.2f} MB" if d['html_size'] > 1_000_000 else f"{d['html_size']/1024:.0f} KB"
print(f" {d['acc']:<30} {size_str:>12} {d['para_count']:>6} {d['generator']:<25}")
# Also show the reverse: small HTML with many paragraphs
print(f"\n Unusual: Small HTML (<50KB) with many paragraphs (>15):\n")
small_many = [d for d in size_para_data if d["html_size"] < 50_000 and d["para_count"] > 15]
small_many.sort(key=lambda d: d["para_count"], reverse=True)
print(f" {'Accession':<30} {'HTML Size':>12} {'Paras':>6} {'Generator':<25}")
print(f" {'-'*30} {'-'*12} {'-'*6} {'-'*25}")
for d in small_many[:10]:
size_str = f"{d['html_size']/1024:.0f} KB"
print(f" {d['acc']:<30} {size_str:>12} {d['para_count']:>6} {d['generator']:<25}")
# ─────────────────────────────────────────────────────────────────────────────
# Summary
# ─────────────────────────────────────────────────────────────────────────────
print("\n" + "=" * 80)
print("SUMMARY")
print("=" * 80)
print("""
Key findings are printed above. Look for:
1. Which generators dominate the corpus
2. Whether any generator has notably worse extraction metrics (low para count,
high % lowercase starts, low extraction ratio)
3. Whether problem paragraphs cluster around specific generators (over-rep > 1.5x)
4. Whether large-HTML / few-paragraph cases cluster on a specific generator
""")


@@ -0,0 +1,627 @@
#!/usr/bin/env python3
"""
Cross-reference SEC filing generators with paragraph quality metrics.
Reuses detection logic from detect_generators.py, then computes quality
metrics per generator from paragraphs-clean.jsonl.
"""
import json
import os
import re
import sys
import statistics
from collections import defaultdict, Counter
from pathlib import Path
HTML_DIR = Path("/home/joey/Documents/sec-cyBERT/data/raw/html")
PARAGRAPHS_FILE = Path("/home/joey/Documents/sec-cyBERT/data/paragraphs/paragraphs-clean.jsonl")
READ_BYTES = 20_000
# ── Generator detection (copied from detect_generators.py) ──
FILING_AGENT_CIKS = {
"0000950170": "Donnelley Financial Solutions",
"0001193125": "Donnelley Financial Solutions",
"0001558370": "Toppan Merrill",
"0001654954": "Toppan Merrill",
}
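The keys above are the leading CIK segments of accession numbers; the fallback at the end of `detect_generator` splits the filename stem on `-` and looks the prefix up in this map (the accession number below is made up for illustration):

```python
# Hypothetical accession-number filename stem.
acc = "0001193125-24-000123"
prefix = acc.split("-")[0]
print(prefix)  # "0001193125"
# FILING_AGENT_CIKS.get(prefix) would then yield "Donnelley Financial Solutions".
```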
def _normalize_generator(raw: str) -> str:
r = raw.strip().lower()
if "workiva" in r or "wdesk" in r:
return "Workiva"
if "donnelley" in r or "dfin" in r or "rrdonnelley" in r:
return "Donnelley Financial Solutions"
if ("toppan" in r) or ("merrill" in r and "bridge" in r):
return "Toppan Merrill"
if "word" in r and "microsoft" in r:
return "Microsoft Word"
if "excel" in r and "microsoft" in r:
return "Microsoft Excel"
if "thunderdome" in r:
return "ThunderDome"
if "goxbrl" in r:
return "GoXBRL"
if "compsci" in r:
return "CompSci Transform"
if "certent" in r:
return "Certent"
if "iris carbon" in r:
return "IRIS Carbon"
if "broadridge" in r or "profile" in r:
return "Broadridge PROfile"
if "sec publisher" in r:
return "SEC Publisher"
return raw.strip()
def detect_generator(filepath: str) -> str:
"""Read first 20KB and return generator name."""
with open(filepath, "rb") as f:
raw = f.read(READ_BYTES)
text = raw.decode("utf-8", errors="replace")
text_lower = text.lower()
# meta generator
m = re.search(r'<meta\s+name\s*=\s*["\']generator["\']\s+content\s*=\s*["\']([^"\']+)["\']', text, re.I)
if not m:
m = re.search(r'<meta\s+content\s*=\s*["\']([^"\']+)["\']\s+name\s*=\s*["\']generator["\']', text, re.I)
if m:
return _normalize_generator(m.group(1))
m = re.search(r'<meta\s+name\s*=\s*["\']Creator["\']\s+content\s*=\s*["\']([^"\']+)["\']', text, re.I)
if m:
return _normalize_generator(m.group(1))
m = re.search(r'<meta\s+name\s*=\s*["\']Producer["\']\s+content\s*=\s*["\']([^"\']+)["\']', text, re.I)
if m:
return _normalize_generator(m.group(1))
m = re.search(r'<meta\s+name\s*=\s*["\']ProgId["\']\s+content\s*=\s*["\']([^"\']+)["\']', text, re.I)
if m:
progid = m.group(1)
if "word" in progid.lower():
return "Microsoft Word"
if "excel" in progid.lower():
return "Microsoft Excel"
return _normalize_generator(progid)
# Comment signatures
if re.search(r"<!--.*Created with the Workiva Platform.*-->", text, re.I):
return "Workiva"
if re.search(r"<!--.*Copyright\s+\d{4}\s+Workiva.*-->", text, re.I):
return "Workiva"
if re.search(r"<!--.*Document created using Wdesk.*-->", text, re.I):
return "Workiva"
if re.search(r"<!--.*(?:Toppan\s*Merrill|iXBRL document created with.*Toppan).*-->", text, re.I):
return "Toppan Merrill"
if re.search(r"<!--.*Merrill\s*Bridge.*-->", text, re.I):
return "Toppan Merrill"
if re.search(r"<!--.*Donnelley Financial Solutions.*-->", text, re.I):
return "Donnelley Financial Solutions"
if re.search(r"<!--.*RR\s*Donnelley.*-->", text, re.I):
return "Donnelley Financial Solutions"
if re.search(r"<!--.*Broadridge\s+PROfile.*-->", text, re.I):
return "Broadridge PROfile"
if "broadridge" in text_lower:
return "Broadridge PROfile"
m_title = re.search(r"<title[^>]*>([^<]+)</title>", text, re.I)
title_text = m_title.group(1).strip() if m_title else ""
if "sec publisher" in text_lower or "sec publisher" in title_text.lower():
return "SEC Publisher"
m = re.search(r"<!--.*Powered by IRIS Carbon.*-->", text, re.I)
if m:
return "IRIS Carbon"
if re.search(r"<!--.*Certent\s+Disclosure\s+Management.*-->", text, re.I):
return "Certent"
if "certent" in text_lower:
return "Certent"
if re.search(r"<!--.*CompSci Resources.*-->", text, re.I):
return "CompSci Transform"
if re.search(r"<!--.*RDG Portal.*-->", text, re.I):
return "RDG Portal"
if title_text.lower() == "pdf to edgar" or "pdf to edgar" in text_lower[:2000]:
return "PDF to EDGAR"
m = re.search(r"<!--\s*Generated\s+by\s+([^-]+?)-->", text, re.I)
if m:
val = m.group(1).strip()
if not re.match(r"^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}", val):
return _normalize_generator(val)
m = re.search(r"<!--\s*Created\s+(?:by|with)\s+([^-]+?)-->", text, re.I)
if m:
val = m.group(1).strip()
if not re.match(r"^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}", val):
return _normalize_generator(val)
# Keyword signatures
if re.search(r"\bwdesk\b", text_lower):
return "Workiva"
if re.search(r"\bworkiva\b", text_lower):
return "Workiva"
if re.search(r"\brrdonnelley\b", text_lower):
return "Donnelley Financial Solutions"
if re.search(r"\bedgar-online\b", text_lower):
return "Donnelley Financial Solutions"
if re.search(r"\btoppan\b", text_lower):
return "Toppan Merrill"
if re.search(r"\bmerrill\b", text_lower) and re.search(r"\b(?:bridge|ixbrl|xbrl)\b", text_lower):
return "Toppan Merrill"
if re.search(r"\bbowne\b", text_lower):
return "Toppan Merrill"
if re.search(r"\bcompsci\b", text_lower):
return "CompSci Transform"
if re.search(r"\bthunderdome\b", text_lower):
return "ThunderDome"
if re.search(r"\bgoxbrl\b", text_lower):
return "GoXBRL"
if re.search(r'class\s*=\s*["\'][^"\']*\bwk_\w+', text_lower):
return "Workiva"
# SGML document wrapper
has_sgml = re.search(r"<DOCUMENT>\s*\n?\s*<TYPE>", text, re.I)
if has_sgml:
m_fn = re.search(r"<FILENAME>\s*([\w\-\.]+)", text, re.I)
if m_fn:
filename = m_fn.group(1).lower()
if re.match(r"d\d+", filename):
return "Donnelley Financial Solutions"
if re.match(r"tm\d+", filename):
return "Toppan Merrill"
if re.match(r"ea\d+", filename):
return "EFiling/EDGAR Agent"
if "<!-- field: rule-page" in text_lower or "rule-page" in text_lower[:5000]:
return "Broadridge PROfile"
if "field: set; name: xdx" in text_lower:
return "EFiling XDX"
if "<!-- field:" in text_lower[:5000]:
return "EFiling/EDGAR Agent"
if re.search(r'<Center><DIV STYLE="width:8\.5in"', text):
return "Donnelley Financial Solutions"
basename = os.path.basename(filepath)
accession_prefix = basename.split("-")[0]
if accession_prefix in FILING_AGENT_CIKS:
return FILING_AGENT_CIKS[accession_prefix]
font_count = text_lower.count("<font")
if font_count > 5:
return "SGML-wrapped (legacy)"
return "SGML-wrapped (unknown)"
# Inline XBRL
has_ix_ns = "xmlns:ix=" in text_lower or "<ix:header" in text_lower
if re.search(r'<P STYLE="[^"]*font-family:Times New Roman"', text) and re.search(
r'<Center><DIV STYLE="width:8\.5in"', text
):
return "Donnelley Financial Solutions"
if title_text:
title_lower = title_text.lower()
if "workiva" in title_lower or "wdesk" in title_lower:
return "Workiva"
if has_ix_ns:
if "field: set; name: xdx" in text_lower:
return "EFiling XDX"
if "<!-- field: rule" in text_lower:
return "Broadridge PROfile"
if "<!-- field:" in text_lower[:5000]:
return "EFiling/EDGAR Agent"
basename = os.path.basename(filepath)
accession_prefix = basename.split("-")[0]
if accession_prefix in FILING_AGENT_CIKS:
return FILING_AGENT_CIKS[accession_prefix]
if '<?xml version="1.0" encoding="utf-8"' in text_lower[:200]:
return "Inline XBRL (utf-8 toolchain)"
if "<?xml version='1.0' encoding='ascii'?>" in text_lower[:200]:
return "Inline XBRL (SEC/EDGAR standard)"
return "Inline XBRL (tool unresolved)"
# Structural fallbacks
font_count = text_lower.count("<font")
td_count = text_lower.count("<td")
span_count = text_lower.count("<span")
if font_count > 20:
return "Legacy generator (font-based)"
if td_count > 50 and span_count < 10:
return "Table-based generator"
data_attr_count = len(re.findall(r"\bdata-\w+", text_lower))
if data_attr_count > 10:
return "Modern web tooling"
return "Unknown"
# ── Consolidate to ~14 families ──
FAMILY_MAP = {
"Workiva": "Workiva",
"Donnelley Financial Solutions": "Donnelley Financial Solutions",
"Toppan Merrill": "Toppan Merrill",
"CompSci Transform": "CompSci Transform",
"ThunderDome": "ThunderDome",
"EFiling/EDGAR Agent": "EFiling/EDGAR Agent",
"EFiling XDX": "EFiling/EDGAR Agent",
"Broadridge PROfile": "Broadridge PROfile",
"SEC Publisher": "SEC Publisher",
"IRIS Carbon": "IRIS Carbon",
"RDG Portal": "RDG Portal",
"Certent": "Certent",
"PDF to EDGAR": "PDF to EDGAR",
"GoXBRL": "GoXBRL",
"Microsoft Word": "Microsoft Word",
"Microsoft Excel": "Microsoft Excel",
"Inline XBRL (SEC/EDGAR standard)": "Inline XBRL (unattributed)",
"Inline XBRL (utf-8 toolchain)": "Inline XBRL (unattributed)",
"Inline XBRL (tool unresolved)": "Inline XBRL (unattributed)",
"SGML-wrapped (legacy)": "SGML-wrapped (unattributed)",
"SGML-wrapped (unknown)": "SGML-wrapped (unattributed)",
"Legacy generator (font-based)": "Other/Legacy",
"Table-based generator": "Other/Legacy",
"Modern web tooling": "Other/Legacy",
"Unknown": "Unknown",
}
# ── Quality metric helpers ──
# Common non-heading start words to exclude from title-case detection
NON_HEADING_STARTS = {
"we", "our", "the", "in", "a", "an", "as", "to", "on", "at", "by",
"for", "it", "is", "if", "or", "no", "so", "do", "its", "this",
"that", "with", "from", "has", "had", "have", "will", "may", "can",
"all", "any", "are", "was", "were", "been", "not", "but", "each",
"such", "these", "those", "also", "when", "there", "their",
"they", "them", "than", "who", "what", "how", "where",
}
# Section name fragments for Item 1C
SECTION_KEYWORDS = [
"risk management", "board oversight", "governance", "incident",
"strategy", "third party", "management role", "cybersecurity",
"risk factors", "material", "overview",
]
RE_ALLCAPS_HEADER = re.compile(r"^[A-Z][A-Z\s,&\-]{10,}[a-z]")
def is_inlined_header(text: str) -> bool:
"""Check if paragraph starts with an inlined heading pattern."""
# ALL-CAPS header followed by body text
if RE_ALLCAPS_HEADER.match(text):
return True
# Title-case heading: 2+ consecutive capitalized words at start (not common sentence starters)
words = text.split()
if len(words) < 4:
return False
cap_count = 0
for w in words:
clean = w.strip(".,;:!?()\"'")
if not clean:
continue
if clean[0].isupper() and clean.lower() not in NON_HEADING_STARTS:
cap_count += 1
else:
break
if cap_count >= 2:
# Check rest of text continues as a sentence (not just a short title)
remaining = " ".join(words[cap_count:])
if len(remaining) > 20:
return True
# Section keyword match at start
text_lower = text[:80].lower()
for kw in SECTION_KEYWORDS:
if text_lower.startswith(kw):
# Must have more text after the heading
if len(text) > len(kw) + 10:
return True
return False
def is_orphan_word(text: str) -> bool:
"""Check if paragraph starts with lowercase (excluding list patterns)."""
if not text:
return False
first_char = text[0]
if not first_char.islower():
return False
# Exclude list pattern starters
list_starters = ["and ", "or ", "including ", "i.e.", "e.g."]
text_lower = text[:15].lower()
for starter in list_starters:
if text_lower.startswith(starter):
return False
# Exclude bullet-like patterns
if text[0] in "•·-–—":
return False
return True
RE_TERMINAL = re.compile(r'[.!?;")]\s*$')
def is_truncated(text: str) -> bool:
"""Paragraph NOT ending with terminal punctuation."""
return not RE_TERMINAL.search(text)
def is_fragment(text: str) -> bool:
return len(text.split()) < 25
def main():
# ── Step 1: Detect generators for all HTML files ──
print("Step 1: Detecting generators for all HTML files...", file=sys.stderr)
accession_to_generator = {}
files = sorted(HTML_DIR.glob("*.html"))
for i, fp in enumerate(files):
accession = fp.stem
gen_raw = detect_generator(str(fp))
gen_family = FAMILY_MAP.get(gen_raw, gen_raw)
accession_to_generator[accession] = gen_family
if (i + 1) % 3000 == 0:
print(f" {i+1}/{len(files)} files processed...", file=sys.stderr)
print(f" Done: {len(files)} files, {len(set(accession_to_generator.values()))} generator families", file=sys.stderr)
# ── Step 2: Load paragraphs and compute per-filing stats ──
print("Step 2: Loading paragraphs...", file=sys.stderr)
# Per-filing data
filing_paragraphs = defaultdict(list) # accession -> list of paragraph dicts
text_hash_counts = Counter() # textHash -> count of filings containing it
# First pass: collect all textHashes and their filing counts
text_hash_filings = defaultdict(set) # textHash -> set of accessions
all_paragraphs = []
with open(PARAGRAPHS_FILE) as f:
for line in f:
p = json.loads(line)
acc = p["filing"]["accessionNumber"]
all_paragraphs.append(p)
filing_paragraphs[acc].append(p)
text_hash_filings[p["textHash"]].add(acc)
print(f" {len(all_paragraphs)} paragraphs across {len(filing_paragraphs)} filings", file=sys.stderr)
# Boilerplate: textHash appearing in 3+ filings
boilerplate_hashes = {h for h, accs in text_hash_filings.items() if len(accs) >= 3}
print(f" {len(boilerplate_hashes)} boilerplate hashes (in 3+ filings)", file=sys.stderr)
# ── Step 3: Compute metrics per generator ──
print("Step 3: Computing metrics...", file=sys.stderr)
# Per-generator aggregate
gen_stats = defaultdict(lambda: {
"total_paragraphs": 0,
"total_filings": 0,
"paragraphs_per_filing": [],
"word_counts": [],
"inlined_header": 0,
"orphan_word": 0,
"fragment": 0,
"truncated": 0,
"boilerplate": 0,
})
# Per-filing issue rates for "most problematic" analysis
filing_issue_rates = {} # accession -> {metrics..., combined_rate}
# Filings not in HTML dir (no generator detected)
missing_gen = 0
for acc, paragraphs in filing_paragraphs.items():
gen = accession_to_generator.get(acc)
if gen is None:
missing_gen += 1
gen = "(no HTML file)"
stats = gen_stats[gen]
stats["total_filings"] += 1
stats["total_paragraphs"] += len(paragraphs)
stats["paragraphs_per_filing"].append(len(paragraphs))
# Per-filing counters for issue rate
f_inlined = 0
f_orphan = 0
f_fragment = 0
f_truncated = 0
f_boilerplate = 0
for p in paragraphs:
text = p["text"]
wc = p.get("wordCount", len(text.split()))
stats["word_counts"].append(wc)
if is_inlined_header(text):
stats["inlined_header"] += 1
f_inlined += 1
if is_orphan_word(text):
stats["orphan_word"] += 1
f_orphan += 1
if is_fragment(text):
stats["fragment"] += 1
f_fragment += 1
if is_truncated(text):
stats["truncated"] += 1
f_truncated += 1
if p["textHash"] in boilerplate_hashes:
stats["boilerplate"] += 1
f_boilerplate += 1
n = len(paragraphs)
if n > 0:
filing_issue_rates[acc] = {
"generator": gen,
"n_paragraphs": n,
"inlined_header_rate": f_inlined / n,
"orphan_word_rate": f_orphan / n,
"fragment_rate": f_fragment / n,
"truncation_rate": f_truncated / n,
"boilerplate_rate": f_boilerplate / n,
"combined_rate": (f_inlined + f_orphan + f_fragment + f_truncated + f_boilerplate) / (5 * n),
}
if missing_gen:
print(f" Note: {missing_gen} filings had no matching HTML file", file=sys.stderr)
# ── Step 4: Output ──
# Compute corpus-wide averages for flagging
corpus_total = sum(s["total_paragraphs"] for s in gen_stats.values())
corpus_inlined = sum(s["inlined_header"] for s in gen_stats.values())
corpus_orphan = sum(s["orphan_word"] for s in gen_stats.values())
corpus_fragment = sum(s["fragment"] for s in gen_stats.values())
corpus_truncated = sum(s["truncated"] for s in gen_stats.values())
corpus_boilerplate = sum(s["boilerplate"] for s in gen_stats.values())
corpus_avg_wc = statistics.mean(
wc for s in gen_stats.values() for wc in s["word_counts"]
) if corpus_total > 0 else 0
avg_rates = {
"inlined_header": corpus_inlined / corpus_total if corpus_total else 0,
"orphan_word": corpus_orphan / corpus_total if corpus_total else 0,
"fragment": corpus_fragment / corpus_total if corpus_total else 0,
"truncated": corpus_truncated / corpus_total if corpus_total else 0,
"boilerplate": corpus_boilerplate / corpus_total if corpus_total else 0,
}
print()
print("=" * 180)
print("GENERATOR QUALITY CROSS-REFERENCE: SEC-cyBERT CORPUS")
print("=" * 180)
print(f"\nCorpus totals: {corpus_total:,} paragraphs across {sum(s['total_filings'] for s in gen_stats.values()):,} filings")
print(f"Corpus averages: InlinedHdr={avg_rates['inlined_header']:.1%} Orphan={avg_rates['orphan_word']:.1%} "
f"Fragment={avg_rates['fragment']:.1%} Truncated={avg_rates['truncated']:.1%} "
f"Boilerplate={avg_rates['boilerplate']:.1%} AvgWC={corpus_avg_wc:.1f}")
print("(Cells marked with ** are >2x the corpus average)")
# Sort by total paragraphs descending
sorted_gens = sorted(gen_stats.items(), key=lambda x: x[1]["total_paragraphs"], reverse=True)
# Header
print()
hdr = (
f"{'Generator':<35} {'Files':>6} {'Paras':>7} {'Mean/F':>7} {'Med/F':>6} "
f"{'AvgWC':>6} {'InlHdr%':>8} {'Orphan%':>8} {'Frag%':>8} {'Trunc%':>8} {'Boiler%':>8}"
)
print(hdr)
print("-" * len(hdr))
for gen, s in sorted_gens:
n = s["total_paragraphs"]
if n == 0:
continue
nf = s["total_filings"]
mean_ppf = n / nf if nf else 0
med_ppf = statistics.median(s["paragraphs_per_filing"]) if s["paragraphs_per_filing"] else 0
avg_wc = statistics.mean(s["word_counts"]) if s["word_counts"] else 0
inl_r = s["inlined_header"] / n
orp_r = s["orphan_word"] / n
fra_r = s["fragment"] / n
tru_r = s["truncated"] / n
boi_r = s["boilerplate"] / n
# Flag if >2x corpus average
def fmt_rate(val, avg_key):
pct = f"{val:.1%}"
if avg_rates[avg_key] > 0 and val > 2 * avg_rates[avg_key]:
return f"{pct:>6}**"
return f"{pct:>8}"
row = (
f"{gen:<35} {nf:>6} {n:>7} {mean_ppf:>7.1f} {med_ppf:>6.0f} "
f"{avg_wc:>6.1f} {fmt_rate(inl_r, 'inlined_header')} {fmt_rate(orp_r, 'orphan_word')} "
f"{fmt_rate(fra_r, 'fragment')} {fmt_rate(tru_r, 'truncated')} {fmt_rate(boi_r, 'boilerplate')}"
)
print(row)
print("-" * len(hdr))
# Corpus average row
corpus_med_ppf = statistics.median(
ppf for s in gen_stats.values() for ppf in s["paragraphs_per_filing"]
)
corpus_mean_ppf = corpus_total / sum(s["total_filings"] for s in gen_stats.values())
print(
f"{'CORPUS AVERAGE':<35} "
f"{sum(s['total_filings'] for s in gen_stats.values()):>6} "
f"{corpus_total:>7} "
f"{corpus_mean_ppf:>7.1f} {corpus_med_ppf:>6.0f} "
f"{corpus_avg_wc:>6.1f} "
f"{avg_rates['inlined_header']:>7.1%} "
f"{avg_rates['orphan_word']:>7.1%} "
f"{avg_rates['fragment']:>7.1%} "
f"{avg_rates['truncated']:>7.1%} "
f"{avg_rates['boilerplate']:>7.1%}"
)
# ── 10 Most Problematic Filings ──
print()
print("=" * 180)
print("10 MOST PROBLEMATIC FILINGS (highest combined issue rate across all 5 metrics)")
print("=" * 180)
# Only consider filings with at least 3 paragraphs to avoid noisy tiny filings
eligible = {acc: fr for acc, fr in filing_issue_rates.items() if fr["n_paragraphs"] >= 3}
worst = sorted(eligible.items(), key=lambda x: x[1]["combined_rate"], reverse=True)[:10]
print()
hdr2 = (
f"{'Accession':<30} {'Generator':<35} {'Paras':>5} "
f"{'InlHdr':>7} {'Orphan':>7} {'Frag':>7} {'Trunc':>7} {'Boiler':>7} {'Combined':>8}"
)
print(hdr2)
print("-" * len(hdr2))
for acc, fr in worst:
print(
f"{acc:<30} {fr['generator']:<35} {fr['n_paragraphs']:>5} "
f"{fr['inlined_header_rate']:>6.1%} {fr['orphan_word_rate']:>6.1%} "
f"{fr['fragment_rate']:>6.1%} {fr['truncation_rate']:>6.1%} "
f"{fr['boilerplate_rate']:>6.1%} {fr['combined_rate']:>7.1%}"
)
# ── Per-metric worst generators summary ──
print()
print("=" * 180)
print("GENERATORS >2x CORPUS AVERAGE (flagged metrics)")
print("=" * 180)
metric_names = {
"inlined_header": "Inlined Header",
"orphan_word": "Orphan Word",
"fragment": "Fragment",
"truncated": "Truncation",
"boilerplate": "Boilerplate",
}
for metric_key, metric_label in metric_names.items():
flagged = []
for gen, s in sorted_gens:
n = s["total_paragraphs"]
if n < 10:
continue
rate = s[metric_key] / n
if avg_rates[metric_key] > 0 and rate > 2 * avg_rates[metric_key]:
flagged.append((gen, rate, s[metric_key], n))
if flagged:
print(f"\n {metric_label} rate (corpus avg: {avg_rates[metric_key]:.1%}, threshold >2x = {2*avg_rates[metric_key]:.1%}):")
for gen, rate, count, total in sorted(flagged, key=lambda x: -x[1]):
print(f" {gen:<35} {rate:.1%} ({count}/{total})")
else:
print(f"\n {metric_label}: No generators >2x corpus average")
if __name__ == "__main__":
main()
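The paragraph-quality heuristics above (`is_orphan_word`, `is_truncated`, `is_fragment`) are plain string checks. A standalone sketch of the same logic, with illustrative inputs, shows how a typical mid-sentence split gets flagged:

```python
import re

RE_TERMINAL = re.compile(r'[.!?;")]\s*$')

def is_truncated(text: str) -> bool:
    # A paragraph that does not end with terminal punctuation
    # was probably cut off by the extractor.
    return not RE_TERMINAL.search(text)

def is_fragment(text: str) -> bool:
    # Under 25 words is too short to be a full disclosure paragraph.
    return len(text.split()) < 25

def is_orphan_word(text: str) -> bool:
    # A lowercase first character suggests the paragraph was split
    # mid-sentence, unless it is a legitimate list continuation.
    if not text or not text[0].islower():
        return False
    starters = ("and ", "or ", "including ", "i.e.", "e.g.")
    return not text[:15].lower().startswith(starters)

para = "the Board receives quarterly briefings from the CISO."
assert is_orphan_word(para)    # lowercase start, not a list continuation
assert not is_truncated(para)  # ends with a period
assert is_fragment(para)       # only 8 words
assert not is_orphan_word("e.g. encryption standards")  # list-style continuation
```

Note that one paragraph can trip several flags at once, which is why the per-filing `combined_rate` averages across all five metrics rather than counting distinct "bad" paragraphs.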


@@ -0,0 +1,164 @@
/**
* Analyze the 348 annotated paragraphs with no cybersecurity keywords.
* Reports label distribution to decide: keep or exclude from training.
*
* Usage: bun ts/scripts/analyze-no-cyber.ts
*/
import { readFileSync } from "node:fs";
const DATA_DIR = new URL("../../data", import.meta.url).pathname;
const QUALITY_PATH = `${DATA_DIR}/paragraphs/quality/quality-scores.jsonl`;
const ANNOTATIONS_PATH = `${DATA_DIR}/annotations/stage1.jsonl`;
const TRAINING_PATH = `${DATA_DIR}/paragraphs/training.patched.jsonl`;
interface QualityScore {
id: string;
issues: string[];
quality_tier: string;
}
interface Annotation {
paragraphId: string;
label: {
content_category: string;
specificity_level: number;
category_confidence: string;
specificity_confidence: string;
reasoning: string;
};
provenance: { modelId: string };
}
// Load quality scores — find no-cyber paragraphs
const noCyberIds = new Set<string>();
for (const line of readFileSync(QUALITY_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const q = JSON.parse(line) as QualityScore;
if (q.issues.includes("no_cyber_keywords")) {
noCyberIds.add(q.id);
}
}
console.error(`No-cyber paragraphs (all): ${noCyberIds.size}`);
// Load training set IDs
const trainingIds = new Set<string>();
for (const line of readFileSync(TRAINING_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const p = JSON.parse(line) as { id: string };
trainingIds.add(p.id);
}
// Filter to annotated no-cyber paragraphs
const annotatedNoCyber = new Set([...noCyberIds].filter((id) => trainingIds.has(id)));
console.error(`No-cyber paragraphs (annotated): ${annotatedNoCyber.size}`);
// Load annotations for these paragraphs
const annotations = new Map<string, Annotation[]>();
for (const line of readFileSync(ANNOTATIONS_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const ann = JSON.parse(line) as Annotation;
if (annotatedNoCyber.has(ann.paragraphId)) {
if (!annotations.has(ann.paragraphId)) annotations.set(ann.paragraphId, []);
annotations.get(ann.paragraphId)!.push(ann);
}
}
console.error(`Paragraphs with annotations: ${annotations.size}\n`);
// Majority vote per paragraph
function majority<T>(items: T[]): { value: T; count: number } {
const counts = new Map<T, number>();
for (const item of items) counts.set(item, (counts.get(item) ?? 0) + 1);
let best: T = items[0]!;
let bestCount = 0;
for (const [v, c] of counts) {
if (c > bestCount) { best = v; bestCount = c; }
}
return { value: best, count: bestCount };
}
// Category distribution (consensus)
const catDist = new Map<string, number>();
const specDist = new Map<number, number>();
const confDist = new Map<string, number>();
let conflicts = 0;
// Per-paragraph details for interesting cases
const nonOther: { pid: string; cat: string; spec: number; anns: Annotation[] }[] = [];
for (const [pid, anns] of annotations) {
const catVote = majority(anns.map((a) => a.label.content_category));
const specVote = majority(anns.map((a) => a.label.specificity_level));
catDist.set(catVote.value, (catDist.get(catVote.value) ?? 0) + 1);
specDist.set(specVote.value, (specDist.get(specVote.value) ?? 0) + 1);
if (catVote.count < 2) conflicts++;
// Track confidence
for (const ann of anns) {
confDist.set(ann.label.category_confidence, (confDist.get(ann.label.category_confidence) ?? 0) + 1);
}
if (catVote.value !== "None/Other") {
nonOther.push({ pid, cat: catVote.value, spec: specVote.value, anns });
}
}
// ── Report ──────────────────────────────────────────────────────────────
console.log("═══ NO-CYBER-KEYWORD PARAGRAPH ANALYSIS ═══\n");
console.log(`Total annotated no-cyber paragraphs: ${annotations.size}`);
console.log(`Conflicts (no majority): ${conflicts}\n`);
console.log("─── Category Distribution (Consensus) ───");
for (const [cat, count] of [...catDist.entries()].sort((a, b) => b[1] - a[1])) {
console.log(` ${cat.padEnd(30)} ${count} (${((count / annotations.size) * 100).toFixed(1)}%)`);
}
console.log("\n─── Specificity Distribution (Consensus) ───");
for (const level of [1, 2, 3, 4]) {
const count = specDist.get(level) ?? 0;
console.log(` Level ${level}: ${count} (${((count / annotations.size) * 100).toFixed(1)}%)`);
}
console.log("\n─── Confidence Distribution (All Models) ───");
for (const conf of ["high", "medium", "low"]) {
const count = confDist.get(conf) ?? 0;
const total = [...confDist.values()].reduce((a, b) => a + b, 0);
console.log(` ${conf}: ${count} (${((count / total) * 100).toFixed(1)}%)`);
}
console.log(`\n─── Non-"None/Other" Paragraphs: ${nonOther.length} ───`);
if (nonOther.length > 0) {
console.log("These are the concerning ones — labeled as real categories but have no cyber keywords.\n");
// Load actual paragraph text for these
const textMap = new Map<string, string>();
const noCyberPidSet = new Set(nonOther.map((n) => n.pid));
for (const line of readFileSync(TRAINING_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const p = JSON.parse(line) as { id: string; text: string };
if (noCyberPidSet.has(p.id)) textMap.set(p.id, p.text);
}
// Show samples
for (const item of nonOther.slice(0, 10)) {
const text = textMap.get(item.pid) ?? "(text not found)";
const modelVotes = item.anns.map((a) => `${a.provenance.modelId.split("/")[1]}: ${a.label.content_category}`).join(", ");
console.log(` [${item.cat} / Spec ${item.spec}] ${item.pid}`);
console.log(` Models: ${modelVotes}`);
console.log(` Text: ${text.substring(0, 150)}...`);
console.log();
}
}
// Summary recommendation
const noneOtherCount = catDist.get("None/Other") ?? 0;
const noneOtherPct = ((noneOtherCount / annotations.size) * 100).toFixed(1);
console.log("─── RECOMMENDATION ───");
if (nonOther.length < 50) {
console.log(` ${noneOtherPct}% labeled None/Other. Only ${nonOther.length} labeled as real categories.`);
console.log(` → EXCLUDE ${nonOther.length} non-None/Other paragraphs from training (likely section bleed).`);
console.log(` → KEEP ${noneOtherCount} None/Other paragraphs (correct labels for non-cyber content).`);
} else {
console.log(` WARNING: ${nonOther.length} paragraphs labeled as real categories — investigate further.`);
}
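The consensus logic in `majority()` tallies per-label counts and keeps the first value that reaches the highest count, so a 1-1 tie between two annotators resolves deterministically to the label encountered first. A Python sketch of the same behavior (label names are illustrative):

```python
from collections import Counter

def majority(votes):
    # Tally the votes, then keep the first value (in encounter order)
    # that reaches the highest count; this mirrors the Map-insertion-order
    # tie behavior of the TypeScript majority() above.
    counts = Counter()
    for v in votes:
        counts[v] += 1
    best, best_count = votes[0], 0
    for v, c in counts.items():  # insertion order == first-occurrence order
        if c > best_count:
            best, best_count = v, c
    return best, best_count

assert majority(["Board Governance", "Board Governance", "None/Other"]) == ("Board Governance", 2)
# 1-1 tie: the label seen first wins
assert majority(["Incident Disclosure", "Risk Management Process"])[0] == "Incident Disclosure"
```

Because ties yield `count == 1`, the `catVote.count < 2` check above is what surfaces them as conflicts rather than silently trusting the first annotator.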


@@ -0,0 +1,203 @@
/**
* DAPT corpus analytics: document length distribution, token estimates,
* quality checks, and filter candidates.
*
* Usage: bun ts/scripts/dapt-corpus-analytics.ts
*
* Input: data/dapt-corpus/shard-*.jsonl
*/
import { readFileSync, readdirSync } from "node:fs";
const CORPUS_DIR = new URL("../../data/dapt-corpus", import.meta.url).pathname;
const CHARS_PER_TOKEN = 4.72; // empirical from ModernBERT tokenizer
interface Doc {
accession: string;
text: string;
}
// ── Load all documents ──────────────────────────────────────────────────
console.error("Loading corpus...");
const shards = readdirSync(CORPUS_DIR)
.filter((f) => f.endsWith(".jsonl"))
.sort();
const docs: { accession: string; chars: number; lines: number; words: number }[] = [];
let totalChars = 0;
for (const shard of shards) {
const path = `${CORPUS_DIR}/${shard}`;
for (const line of readFileSync(path, "utf-8").split("\n")) {
if (!line.trim()) continue;
const doc = JSON.parse(line) as Doc;
const chars = doc.text.length;
const lines = doc.text.split("\n").length;
const words = doc.text.split(/\s+/).filter(Boolean).length;
docs.push({ accession: doc.accession, chars, lines, words });
totalChars += chars;
}
}
console.error(` ${docs.length} documents loaded from ${shards.length} shards\n`);
// ── Basic stats ─────────────────────────────────────────────────────────
const charsSorted = docs.map((d) => d.chars).sort((a, b) => a - b);
const wordsSorted = docs.map((d) => d.words).sort((a, b) => a - b);
function percentile(arr: number[], p: number): number {
const idx = Math.ceil((p / 100) * arr.length) - 1;
return arr[Math.max(0, idx)]!;
}
function mean(arr: number[]): number {
return arr.reduce((a, b) => a + b, 0) / arr.length;
}
const totalTokens = Math.round(totalChars / CHARS_PER_TOKEN);
console.log("═══ DAPT CORPUS ANALYTICS ═══\n");
console.log("─── Overview ───");
console.log(` Documents: ${docs.length.toLocaleString()}`);
console.log(` Shards: ${shards.length}`);
console.log(` Total chars: ${(totalChars / 1e9).toFixed(3)}B`);
console.log(` Total tokens (est): ${(totalTokens / 1e6).toFixed(1)}M (@ ${CHARS_PER_TOKEN} chars/token)`);
console.log("\n─── Document Length Distribution (chars) ───");
console.log(` Min: ${percentile(charsSorted, 0).toLocaleString()}`);
console.log(` P5: ${percentile(charsSorted, 5).toLocaleString()}`);
console.log(` P10: ${percentile(charsSorted, 10).toLocaleString()}`);
console.log(` P25: ${percentile(charsSorted, 25).toLocaleString()}`);
console.log(` Median: ${percentile(charsSorted, 50).toLocaleString()}`);
console.log(` Mean: ${Math.round(mean(charsSorted)).toLocaleString()}`);
console.log(` P75: ${percentile(charsSorted, 75).toLocaleString()}`);
console.log(` P90: ${percentile(charsSorted, 90).toLocaleString()}`);
console.log(` P95: ${percentile(charsSorted, 95).toLocaleString()}`);
console.log(` Max: ${percentile(charsSorted, 100).toLocaleString()}`);
console.log("\n─── Document Length Distribution (words) ───");
console.log(` Min: ${percentile(wordsSorted, 0).toLocaleString()}`);
console.log(` P5: ${percentile(wordsSorted, 5).toLocaleString()}`);
console.log(` Median: ${percentile(wordsSorted, 50).toLocaleString()}`);
console.log(` Mean: ${Math.round(mean(wordsSorted)).toLocaleString()}`);
console.log(` P95: ${percentile(wordsSorted, 95).toLocaleString()}`);
console.log(` Max: ${percentile(wordsSorted, 100).toLocaleString()}`);
// ── Token length distribution ───────────────────────────────────────────
const tokensSorted = docs.map((d) => Math.round(d.chars / CHARS_PER_TOKEN)).sort((a, b) => a - b);
console.log("\n─── Token Length Distribution (estimated) ───");
console.log(` Min: ${percentile(tokensSorted, 0).toLocaleString()}`);
console.log(` P5: ${percentile(tokensSorted, 5).toLocaleString()}`);
console.log(` P10: ${percentile(tokensSorted, 10).toLocaleString()}`);
console.log(` P25: ${percentile(tokensSorted, 25).toLocaleString()}`);
console.log(` Median: ${percentile(tokensSorted, 50).toLocaleString()}`);
console.log(` Mean: ${Math.round(mean(tokensSorted)).toLocaleString()}`);
console.log(` P75: ${percentile(tokensSorted, 75).toLocaleString()}`);
console.log(` P90: ${percentile(tokensSorted, 90).toLocaleString()}`);
console.log(` P95: ${percentile(tokensSorted, 95).toLocaleString()}`);
console.log(` Max: ${percentile(tokensSorted, 100).toLocaleString()}`);
// ── Sequence count at different max_seq_length ──────────────────────────
console.log("\n─── Training Sequences by max_seq_length ───");
for (const seqLen of [512, 1024, 2048, 4096, 8192]) {
let totalSeqs = 0;
for (const d of docs) {
const tokens = Math.round(d.chars / CHARS_PER_TOKEN);
totalSeqs += Math.ceil(tokens / seqLen);
}
const docsExceeding = docs.filter((d) => Math.round(d.chars / CHARS_PER_TOKEN) > seqLen).length;
console.log(
` ${String(seqLen).padStart(5)}: ${totalSeqs.toLocaleString().padStart(10)} sequences` +
` (${docsExceeding.toLocaleString()} docs exceed, ${((docsExceeding / docs.length) * 100).toFixed(1)}%)`,
);
}
// ── Filter candidates ───────────────────────────────────────────────────
const tiny = docs.filter((d) => d.chars < 10_000);
const small = docs.filter((d) => d.chars < 50_000);
const empty = docs.filter((d) => d.chars < 100);
const huge = docs.filter((d) => d.chars > 5_000_000);
console.log("\n─── Filter Candidates ───");
console.log(` <100 chars (empty): ${empty.length}`);
console.log(` <10K chars (cover pages): ${tiny.length} (${(tiny.reduce((s, d) => s + d.chars, 0) / totalChars * 100).toFixed(3)}% of corpus)`);
console.log(` <50K chars (small): ${small.length} (${(small.reduce((s, d) => s + d.chars, 0) / totalChars * 100).toFixed(3)}% of corpus)`);
console.log(` >5M chars (huge): ${huge.length}`);
if (tiny.length > 0 && tiny.length <= 20) {
console.log("\n Tiny documents (<10K chars):");
for (const d of tiny.sort((a, b) => a.chars - b.chars)) {
console.log(` ${d.accession}: ${d.chars.toLocaleString()} chars, ${d.words.toLocaleString()} words`);
}
}
// ── Content quality spot checks ─────────────────────────────────────────
console.log("\n─── Content Quality Checks ───");
// Check for residual HTML tags
let docsWithHtml = 0;
let docsWithXbrl = 0;
let docsWithPageNums = 0;
let docsWithUrls = 0;
let singleBlockDocs = 0;
for (const shard of shards) {
const path = `${CORPUS_DIR}/${shard}`;
for (const line of readFileSync(path, "utf-8").split("\n")) {
if (!line.trim()) continue;
const doc = JSON.parse(line) as Doc;
if (/<[a-z][^>]*>/i.test(doc.text)) docsWithHtml++;
if (/ix:|xbrl|xmlns/i.test(doc.text)) docsWithXbrl++;
if (/\n\s*(?:\d{1,3}|[- ]\d{1,3}[- ]|F-\d+)\s*\n/.test(doc.text)) docsWithPageNums++;
if (/https?:\/\//.test(doc.text)) docsWithUrls++;
if (doc.text.split("\n\n").length < 3) singleBlockDocs++;
}
}
console.log(` Residual HTML tags: ${docsWithHtml} docs (${((docsWithHtml / docs.length) * 100).toFixed(1)}%)`);
console.log(` XBRL/xmlns traces: ${docsWithXbrl} docs (${((docsWithXbrl / docs.length) * 100).toFixed(1)}%)`);
console.log(` Page number traces: ${docsWithPageNums} docs (${((docsWithPageNums / docs.length) * 100).toFixed(1)}%)`);
console.log(` URLs present: ${docsWithUrls} docs (${((docsWithUrls / docs.length) * 100).toFixed(1)}%)`);
console.log(` Single-block (<3¶): ${singleBlockDocs} docs`);
// ── Shard distribution ──────────────────────────────────────────────────
console.log("\n─── Shard Distribution ───");
for (const shard of shards) {
  const path = `${CORPUS_DIR}/${shard}`;
  const content = readFileSync(path, "utf-8");
  const lines = content.split("\n").filter((l) => l.trim()).length;
  const sizeBytes = Buffer.byteLength(content);
  console.log(
    ` ${shard}: ${lines.toLocaleString().padStart(6)} docs, ${(sizeBytes / 1e6).toFixed(0).padStart(4)} MB`,
  );
}
// ── Post-filter stats ───────────────────────────────────────────────────
const filtered = docs.filter((d) => d.chars >= 10_000);
const filteredChars = filtered.reduce((s, d) => s + d.chars, 0);
const filteredTokens = Math.round(filteredChars / CHARS_PER_TOKEN);
console.log("\n─── After Filtering <10K chars ───");
console.log(` Documents: ${filtered.length.toLocaleString()} (removed ${docs.length - filtered.length})`);
console.log(` Total chars: ${(filteredChars / 1e9).toFixed(3)}B`);
console.log(` Total tokens (est): ${(filteredTokens / 1e6).toFixed(1)}M`);
console.log(` Token loss: ${((1 - filteredTokens / totalTokens) * 100).toFixed(3)}%`);
// ── Training time estimates ─────────────────────────────────────────────
console.log("\n─── Training Time Estimates (RTX 3090, bf16, grad_checkpoint) ───");
for (const { seqLen, batchSize, gradAccum, secPerStepRange } of [
{ seqLen: 2048, batchSize: 4, gradAccum: 8, secPerStepRange: [1.0, 1.5, 2.0] },
{ seqLen: 8192, batchSize: 1, gradAccum: 32, secPerStepRange: [3.0, 5.0, 7.0] },
]) {
const totalSeqs = filtered.reduce((s, d) => s + Math.ceil(Math.round(d.chars / CHARS_PER_TOKEN) / seqLen), 0);
const effectiveBatch = batchSize * gradAccum;
const stepsPerEpoch = Math.ceil(totalSeqs / effectiveBatch);
console.log(`\n seq_len=${seqLen}, batch=${batchSize}, grad_accum=${gradAccum} (eff=${effectiveBatch})`);
console.log(` Sequences: ${totalSeqs.toLocaleString()}, Steps/epoch: ${stepsPerEpoch.toLocaleString()}`);
for (const secPerStep of secPerStepRange) {
const hoursPerEpoch = (stepsPerEpoch * secPerStep) / 3600;
console.log(` @ ${secPerStep}s/step: ${hoursPerEpoch.toFixed(1)}h/epoch`);
}
}
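The sequence-count and steps-per-epoch estimates above reduce to two ceiling divisions: tokens per document (chars divided by the empirical chars-per-token ratio) chunked into fixed-length sequences, then sequences divided by the effective batch size. A worked Python example with made-up document sizes:

```python
import math

CHARS_PER_TOKEN = 4.72  # the script's empirical chars-per-token estimate

def sequences_needed(doc_chars: int, seq_len: int) -> int:
    # Each document is chunked independently; a partial final chunk
    # still costs one full training sequence.
    tokens = round(doc_chars / CHARS_PER_TOKEN)
    return math.ceil(tokens / seq_len)

# Hypothetical corpus of three documents (char counts are made up)
docs = [120_000, 47_000, 9_500]
total_seqs = sum(sequences_needed(c, 2048) for c in docs)
assert total_seqs == 13 + 5 + 1

# Steps per epoch at batch_size=4, grad_accum=8, i.e. effective batch of 32
steps_per_epoch = math.ceil(total_seqs / (4 * 8))
assert steps_per_epoch == 1
```

The per-document `ceil` is why shrinking `max_seq_length` inflates the sequence count faster than linearly: every document pays the rounding cost once per chunk boundary.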


@ -40,15 +40,20 @@ function cleanForDapt(raw: string): string {
     if (trimmed.length === 0) { cleaned.push(""); continue; }
+    // Page numbers: bare digits, "Page N", F-N financial page markers
     if (/^\d{1,3}$/.test(trimmed)) continue;
-    if (/^(page\s+\d+|[-–—]\s*\d+\s*[-–—])$/i.test(trimmed)) continue;
+    if (/^(page\s+\d+)$/i.test(trimmed)) continue;
+    if (/^F-\d{1,3}$/.test(trimmed)) continue;
     if (/^table\s+of\s+contents?\s*$/i.test(trimmed)) continue;
-    // XBRL metadata
+    // XBRL metadata lines
     if (/^(0000\d{6}\s|xbrli:|iso4217:|http:\/\/fasb\.org|http:\/\/xbrl\.)/.test(trimmed)) continue;
     if (/^[a-z]{1,5}-\d{8}\s/.test(trimmed)) continue;
     if (/http:\/\/fasb\.org\/us-gaap/.test(trimmed) && trimmed.length > 100) continue;
     if (/^(FY|CY)\d{4,}/.test(trimmed) && /http:/.test(trimmed)) continue;
+    // XBRL exhibit listing lines (101.CAL, 101.DEF, cover page XBRL, etc.)
+    if (/xbrl/i.test(trimmed) && !/cyber|secur|risk|board|manage|disclos/i.test(trimmed)) continue;
+    // Lines that are majority XBRL namespace tokens
     if (trimmed.length > 20) {
       const tokens = trimmed.split(/\s+/);
       const xbrlCount = tokens.filter(t =>
@ -57,13 +62,16 @@ function cleanForDapt(raw: string): string {
       if (tokens.length > 3 && xbrlCount / tokens.length > 0.5) continue;
     }
+    // URLs — strip inline URLs (company sites, SEC, investor relations)
+    if (/^https?:\/\/\S+$/.test(trimmed)) continue; // standalone URL lines
     // SEC boilerplate / filenames
     if (/^(10-K|10-Q|8-K)\s*$/i.test(trimmed)) continue;
     if (/generated by sec publisher/i.test(trimmed)) continue;
     if (/^\S+\.(htm|html)\s*$/i.test(trimmed)) continue;
     if (/^\S+\.(htm|html)\s+-\s+Generated/i.test(trimmed)) continue;
-    // Repeated headers
+    // Repeated headers (running headers/footers)
     if (trimmed.length > 5 && trimmed.length < 80) {
       if ((shortLineCounts.get(trimmed) ?? 0) >= 5) continue;
     }
@ -72,7 +80,8 @@ function cleanForDapt(raw: string): string {
     if (/^\(?\s*back\s+to\s+(index|top|toc)\s*\)?$/i.test(trimmed)) continue;
     if (/^index$/i.test(trimmed)) continue;
-    cleaned.push(line);
+    // Strip inline URLs from prose (replace with empty string)
+    cleaned.push(line.replace(/https?:\/\/\S+/g, ""));
   }
   return cleaned.join("\n").replace(/\n{3,}/g, "\n\n").trim();
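For orientation, here is a runnable sketch of what this precleaning pass does to a fragment of filing text. The sample text is invented, and only the page-number and URL rules from the hunk above are re-implemented; the real `cleanForDapt` applies many more filters:

```typescript
// Sketch of two precleaning rules: drop bare page-number lines and
// standalone URL lines, and strip inline URLs out of remaining prose.
function sketchClean(raw: string): string {
  const cleaned: string[] = [];
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (/^\d{1,3}$/.test(trimmed)) continue;        // bare page numbers
    if (/^(page\s+\d+)$/i.test(trimmed)) continue;  // "Page N" markers
    if (/^https?:\/\/\S+$/.test(trimmed)) continue; // standalone URL lines
    cleaned.push(line.replace(/https?:\/\/\S+/g, "")); // inline URLs
  }
  return cleaned.join("\n").replace(/\n{3,}/g, "\n\n").trim();
}

const sample = [
  "Our board oversees cybersecurity risk.",
  "42",
  "Page 43",
  "https://investor.example.com",
  "See https://www.sec.gov for details.",
].join("\n");

console.log(sketchClean(sample));
```

Prose survives; navigation and page artifacts are dropped rather than passed to the DAPT corpus.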


@ -0,0 +1,210 @@
/**
* Diff original vs re-run annotations for orphan-word paragraphs.
*
* Compares stage1.jsonl (original) against stage1-orphan-rerun.jsonl (patched text)
* to measure label changes, bias correction, and conflict resolution.
*
* Usage: bun ts/scripts/diff-orphan-annotations.ts
*/
import { readFileSync } from "node:fs";
const DATA_DIR = new URL("../../data", import.meta.url).pathname;
const ORIG_PATH = `${DATA_DIR}/annotations/stage1.jsonl`;
const RERUN_PATH = `${DATA_DIR}/annotations/stage1-orphan-rerun.jsonl`;
const PATCHES_PATH = `${DATA_DIR}/paragraphs/patches/orphan-word-patches.jsonl`;
interface Annotation {
paragraphId: string;
label: {
content_category: string;
specificity_level: number;
category_confidence: string;
specificity_confidence: string;
};
provenance: {
modelId: string;
};
}
function loadAnnotations(path: string): Map<string, Annotation[]> {
const map = new Map<string, Annotation[]>();
for (const line of readFileSync(path, "utf-8").split("\n")) {
if (!line.trim()) continue;
const ann = JSON.parse(line) as Annotation;
const key = ann.paragraphId;
if (!map.has(key)) map.set(key, []);
map.get(key)!.push(ann);
}
return map;
}
function majorityVote(annotations: Annotation[], field: "content_category" | "specificity_level"): { value: string | number; unanimous: boolean; count: number } {
const counts = new Map<string | number, number>();
for (const ann of annotations) {
const v = ann.label[field];
counts.set(v, (counts.get(v) ?? 0) + 1);
}
let best: string | number = "";
let bestCount = 0;
for (const [v, c] of counts) {
if (c > bestCount) { best = v; bestCount = c; }
}
return { value: best, unanimous: bestCount === annotations.length, count: bestCount };
}
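The consensus rule can be restated in isolation (a sketch, not the script's own helper: `majorityVote` above operates on full annotation records, while this strips it to bare values):

```typescript
// Pick the most frequent value; with three annotators, "no majority"
// (count < 2, the conflict condition used later) happens only when all
// three disagree. Ties resolve to the first value seen, as above.
function vote<T>(values: T[]): { value: T; count: number } {
  const counts = new Map<T, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  let best = values[0]!;
  let bestCount = 0;
  for (const [v, c] of counts) {
    if (c > bestCount) { best = v; bestCount = c; }
  }
  return { value: best, count: bestCount };
}

console.log(vote(["Board Governance", "Board Governance", "Management Role"]));
// → { value: "Board Governance", count: 2 }
console.log(vote([1, 2, 3]).count); // → 1 (all disagree: a conflict)
```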
// ── Main ────────────────────────────────────────────────────────────────
const patchIds = new Set<string>();
for (const line of readFileSync(PATCHES_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
patchIds.add((JSON.parse(line) as { id: string }).id);
}
const origAll = loadAnnotations(ORIG_PATH);
const rerunAll = loadAnnotations(RERUN_PATH);
// Filter original annotations to only orphan-word paragraphs
const origFiltered = new Map<string, Annotation[]>();
for (const [pid, anns] of origAll) {
if (patchIds.has(pid)) origFiltered.set(pid, anns);
}
console.error(`Orphan-word paragraphs: ${patchIds.size}`);
console.error(`Original annotations found: ${origFiltered.size} paragraphs`);
console.error(`Re-run annotations found: ${rerunAll.size} paragraphs`);
// Compare paragraphs that have BOTH original and re-run annotations
const comparable = [...rerunAll.keys()].filter((pid) => origFiltered.has(pid));
console.error(`Comparable paragraphs: ${comparable.length}\n`);
// Track changes
let catChanged = 0;
let specChanged = 0;
let eitherChanged = 0;
// Per-model changes
const perModelCatChanges = new Map<string, number>();
const perModelSpecChanges = new Map<string, number>();
// Category transition matrix
const catTransitions = new Map<string, Map<string, number>>();
// Consensus changes
let origConflicts = 0;
let rerunConflicts = 0;
let conflictsResolved = 0;
let consensusBroken = 0;
// Category distribution
const origCatDist = new Map<string, number>();
const rerunCatDist = new Map<string, number>();
// Specificity distribution
const origSpecDist = new Map<number, number>();
const rerunSpecDist = new Map<number, number>();
for (const pid of comparable) {
const origAnns = origFiltered.get(pid)!;
const rerunAnns = rerunAll.get(pid)!;
// Per-model comparison
for (const rerunAnn of rerunAnns) {
const modelId = rerunAnn.provenance.modelId;
const origAnn = origAnns.find((a) => a.provenance.modelId === modelId);
if (!origAnn) continue;
if (origAnn.label.content_category !== rerunAnn.label.content_category) {
perModelCatChanges.set(modelId, (perModelCatChanges.get(modelId) ?? 0) + 1);
// Track transition
const from = origAnn.label.content_category;
const to = rerunAnn.label.content_category;
if (!catTransitions.has(from)) catTransitions.set(from, new Map());
catTransitions.get(from)!.set(to, (catTransitions.get(from)!.get(to) ?? 0) + 1);
}
if (origAnn.label.specificity_level !== rerunAnn.label.specificity_level) {
perModelSpecChanges.set(modelId, (perModelSpecChanges.get(modelId) ?? 0) + 1);
}
}
// Consensus comparison (majority vote)
const origCatVote = majorityVote(origAnns, "content_category");
const rerunCatVote = majorityVote(rerunAnns, "content_category");
const origSpecVote = majorityVote(origAnns, "specificity_level");
const rerunSpecVote = majorityVote(rerunAnns, "specificity_level");
origCatDist.set(origCatVote.value as string, (origCatDist.get(origCatVote.value as string) ?? 0) + 1);
rerunCatDist.set(rerunCatVote.value as string, (rerunCatDist.get(rerunCatVote.value as string) ?? 0) + 1);
origSpecDist.set(origSpecVote.value as number, (origSpecDist.get(origSpecVote.value as number) ?? 0) + 1);
rerunSpecDist.set(rerunSpecVote.value as number, (rerunSpecDist.get(rerunSpecVote.value as number) ?? 0) + 1);
if (origCatVote.value !== rerunCatVote.value) catChanged++;
if (origSpecVote.value !== rerunSpecVote.value) specChanged++;
if (origCatVote.value !== rerunCatVote.value || origSpecVote.value !== rerunSpecVote.value) eitherChanged++;
// Conflict tracking (no majority = conflict)
const origHasConflict = origCatVote.count < 2 || origSpecVote.count < 2;
const rerunHasConflict = rerunCatVote.count < 2 || rerunSpecVote.count < 2;
if (origHasConflict) origConflicts++;
if (rerunHasConflict) rerunConflicts++;
if (origHasConflict && !rerunHasConflict) conflictsResolved++;
if (!origHasConflict && rerunHasConflict) consensusBroken++;
}
// ── Report ──────────────────────────────────────────────────────────────
console.log("═══ ORPHAN WORD RE-ANNOTATION DIFF REPORT ═══\n");
console.log(`Paragraphs compared: ${comparable.length}`);
console.log(` Category consensus changed: ${catChanged} (${((catChanged / comparable.length) * 100).toFixed(1)}%)`);
console.log(` Specificity consensus changed: ${specChanged} (${((specChanged / comparable.length) * 100).toFixed(1)}%)`);
console.log(` Either dimension changed: ${eitherChanged} (${((eitherChanged / comparable.length) * 100).toFixed(1)}%)`);
console.log(`\n─── Per-Model Category Changes ───`);
for (const [model, count] of [...perModelCatChanges.entries()].sort((a, b) => b[1] - a[1])) {
const short = model.split("/")[1] ?? model;
console.log(` ${short}: ${count} (${((count / comparable.length) * 100).toFixed(1)}%)`);
}
console.log(`\n─── Per-Model Specificity Changes ───`);
for (const [model, count] of [...perModelSpecChanges.entries()].sort((a, b) => b[1] - a[1])) {
const short = model.split("/")[1] ?? model;
console.log(` ${short}: ${count} (${((count / comparable.length) * 100).toFixed(1)}%)`);
}
console.log(`\n─── Conflict Resolution ───`);
console.log(` Original conflicts: ${origConflicts}`);
console.log(` Re-run conflicts: ${rerunConflicts}`);
console.log(` Conflicts resolved (orig conflict → rerun consensus): ${conflictsResolved}`);
console.log(` Consensus broken (orig consensus → rerun conflict): ${consensusBroken}`);
console.log(` Net conflict change: ${conflictsResolved - consensusBroken > 0 ? "-" : "+"}${Math.abs(conflictsResolved - consensusBroken)}`);
console.log(`\n─── Category Distribution (Consensus) ───`);
console.log(` ${"Category".padEnd(30)} ${"Original".padStart(8)} ${"Re-run".padStart(8)} ${"Delta".padStart(8)}`);
const allCats = new Set([...origCatDist.keys(), ...rerunCatDist.keys()]);
for (const cat of [...allCats].sort()) {
const orig = origCatDist.get(cat) ?? 0;
const rerun = rerunCatDist.get(cat) ?? 0;
const delta = rerun - orig;
const sign = delta > 0 ? "+" : "";
console.log(` ${cat.padEnd(30)} ${String(orig).padStart(8)} ${String(rerun).padStart(8)} ${(sign + delta).padStart(8)}`);
}
console.log(`\n─── Specificity Distribution (Consensus) ───`);
console.log(` ${"Level".padEnd(10)} ${"Original".padStart(8)} ${"Re-run".padStart(8)} ${"Delta".padStart(8)}`);
for (const level of [1, 2, 3, 4]) {
const orig = origSpecDist.get(level) ?? 0;
const rerun = rerunSpecDist.get(level) ?? 0;
const delta = rerun - orig;
const sign = delta > 0 ? "+" : "";
console.log(` ${String(level).padEnd(10)} ${String(orig).padStart(8)} ${String(rerun).padStart(8)} ${(sign + delta).padStart(8)}`);
}
console.log(`\n─── Top Category Transitions ───`);
const transitions: [string, string, number][] = [];
for (const [from, tos] of catTransitions) {
for (const [to, count] of tos) {
transitions.push([from, to, count]);
}
}
transitions.sort((a, b) => b[2] - a[2]);
for (const [from, to, count] of transitions.slice(0, 15)) {
console.log(` ${from}${to}: ${count}`);
}


@ -0,0 +1,190 @@
/**
* Extract styled headings (bold, underline, h-tags) from SEC filing HTML.
* Produces a per-filing heading cache for paragraph heading detection.
*
* Usage: bun run ts/scripts/extract-html-headings.ts
*
* Input: data/raw/html/*.html + data/paragraphs/quality/ambiguous-filings.txt
* Output: data/paragraphs/quality/filing-headings.jsonl
* Each line: {"accession": "...", "headings": ["heading1", "heading2", ...]}
*/
import { readFileSync, writeFileSync, existsSync } from "node:fs";
import { cpus } from "node:os";
const HTML_DIR = "data/raw/html";
const FILING_LIST = "data/paragraphs/quality/ambiguous-filings.txt";
const OUTPUT = "data/paragraphs/quality/filing-headings.jsonl";
/**
* Extract styled text (bold, underline, h-tags) from HTML within Item 1C.
* Returns an array of heading strings found.
*/
function extractStyledHeadings(html: string): string[] {
// Find Item 1C region (rough — look for "Item 1C" and take the next ~200KB)
const item1cMatch = html.match(/item\s*1c/i);
if (!item1cMatch || item1cMatch.index === undefined) return [];
const startIdx = item1cMatch.index;
// Look for next Item boundary or end of filing
const nextItemMatch = html.slice(startIdx + 20).match(/item\s+(?:2|1[a-bd-z]|[3-9])/i);
  const endIdx = nextItemMatch?.index !== undefined // check explicitly: index 0 is falsy but valid
    ? startIdx + 20 + nextItemMatch.index
    : Math.min(startIdx + 200000, html.length);
const section = html.slice(startIdx, endIdx);
const headings: string[] = [];
// Pattern 1: <b> or <strong> tags
const boldRegex = /<(?:b|strong)[^>]*>([\s\S]*?)<\/(?:b|strong)>/gi;
for (const m of section.matchAll(boldRegex)) {
const text = stripTags(m[1]!).trim();
if (isHeadingCandidate(text)) headings.push(text);
}
// Pattern 2: font-weight: bold or font-weight: 700 in inline styles
const boldStyleRegex = /<[^>]+font-weight\s*:\s*(?:bold|[6-9]00)[^>]*>([\s\S]*?)<\/[^>]+>/gi;
for (const m of section.matchAll(boldStyleRegex)) {
const text = stripTags(m[1]!).trim();
if (isHeadingCandidate(text)) headings.push(text);
}
// Pattern 3: text-decoration: underline
const underlineRegex = /<[^>]+text-decoration\s*:\s*underline[^>]*>([\s\S]*?)<\/[^>]+>/gi;
for (const m of section.matchAll(underlineRegex)) {
const text = stripTags(m[1]!).trim();
if (isHeadingCandidate(text)) headings.push(text);
}
// Pattern 4: h1-h6 tags
const hRegex = /<h[1-6][^>]*>([\s\S]*?)<\/h[1-6]>/gi;
for (const m of section.matchAll(hRegex)) {
const text = stripTags(m[1]!).trim();
if (isHeadingCandidate(text)) headings.push(text);
}
// Deduplicate and normalize
const seen = new Set<string>();
const unique: string[] = [];
for (const h of headings) {
const normalized = h.replace(/\s+/g, " ").trim();
if (normalized.length < 3) continue;
const key = normalized.toLowerCase();
if (!seen.has(key)) {
seen.add(key);
unique.push(normalized);
}
}
return unique;
}
/** Strip HTML tags from a string. */
function stripTags(html: string): string {
return html
.replace(/<[^>]+>/g, " ")
.replace(/&nbsp;|&#160;/gi, " ")
.replace(/&amp;/g, "&")
.replace(/&lt;/g, "<")
.replace(/&gt;/g, ">")
.replace(/&quot;/g, '"')
.replace(/&#39;|&apos;/g, "'")
.replace(/&mdash;|&#8212;/g, "—")
    .replace(/&ndash;|&#8211;/g, "–")
.replace(/\s+/g, " ")
.trim();
}
/** Check if extracted styled text looks like a heading (not body text). */
function isHeadingCandidate(text: string): boolean {
if (text.length < 3 || text.length > 150) return false;
const words = text.split(/\s+/);
if (words.length > 15) return false;
// Must contain at least one heading-like keyword
if (!/(?:risk|management|strategy|cybersecurity|cyber|governance|oversight|board|directors?|incident|response|recovery|planning|detection|program|process|third[- ]party|security|threats?|assessment|compliance|safeguards?|awareness|training|education|monitoring|integration|framework|practices|personnel|role|controls|policies|procedures|reporting|identification|disclosure|material|enterprise|technology|overview|impact|effects?|vulnerabilit)/i.test(text)) {
return false;
}
return true;
}
// ─── Worker mode ───
const args = process.argv.slice(2);
if (args[0] === "--worker") {
const startIdx = parseInt(args[1]!);
const endIdx = parseInt(args[2]!);
const outFile = args[3]!;
const filings = readFileSync(FILING_LIST, "utf-8").trim().split("\n").slice(startIdx, endIdx);
const results: string[] = [];
for (const acc of filings) {
const htmlPath = `${HTML_DIR}/${acc}.html`;
if (!existsSync(htmlPath)) continue;
const html = readFileSync(htmlPath, "utf-8");
const headings = extractStyledHeadings(html);
results.push(JSON.stringify({ accession: acc, headings }));
}
writeFileSync(outFile, results.join("\n") + (results.length > 0 ? "\n" : ""));
process.exit(0);
}
// ─── Main mode ───
const start = Date.now();
const filings = readFileSync(FILING_LIST, "utf-8").trim().split("\n");
const nproc = cpus().length;
const chunkSize = Math.ceil(filings.length / nproc);
process.stderr.write(` ${filings.length} filings, ${nproc} workers\n`);
const tmpFiles: string[] = [];
const workers: ReturnType<typeof Bun.spawn>[] = [];
for (let i = 0; i < nproc; i++) {
const s = i * chunkSize;
const e = Math.min(s + chunkSize, filings.length);
if (s >= filings.length) break;
const tmpFile = `${OUTPUT}.tmp-${i}`;
tmpFiles.push(tmpFile);
workers.push(
Bun.spawn(
["bun", "run", import.meta.filename, "--worker", String(s), String(e), tmpFile],
{ stderr: "inherit" },
)
);
}
for (const w of workers) await w.exited;
// Merge
const allResults: string[] = [];
for (const tmp of tmpFiles) {
if (existsSync(tmp)) {
const content = readFileSync(tmp, "utf-8").trim();
if (content) allResults.push(content);
try { require("node:fs").unlinkSync(tmp); } catch {}
}
}
writeFileSync(OUTPUT, allResults.join("\n") + "\n");
const elapsed = ((Date.now() - start) / 1000).toFixed(1);
// Count stats
let totalHeadings = 0;
let filingsWithHeadings = 0;
for (const line of allResults.join("\n").split("\n")) {
if (!line.trim()) continue;
const d = JSON.parse(line);
if (d.headings.length > 0) {
filingsWithHeadings++;
totalHeadings += d.headings.length;
}
}
process.stderr.write(
`\n Done in ${elapsed}s\n` +
` ${filings.length} filings processed\n` +
` ${filingsWithHeadings} filings with styled headings\n` +
` ${totalHeadings} total heading instances\n` +
` Output: ${OUTPUT}\n`,
);


@ -0,0 +1,73 @@
/**
* Merge original Stage 1 annotations with orphan-word re-run annotations.
*
* For paragraphs that were re-annotated, replaces original annotations with
* re-run annotations. For all other paragraphs, keeps original annotations.
* Original stage1.jsonl is NOT modified.
*
* Usage: bun ts/scripts/merge-annotations.ts
*
* Output: data/annotations/stage1.patched.jsonl
*/
import { readFileSync, writeFileSync } from "node:fs";
const DATA_DIR = new URL("../../data", import.meta.url).pathname;
const ORIG_PATH = `${DATA_DIR}/annotations/stage1.jsonl`;
const RERUN_PATH = `${DATA_DIR}/annotations/stage1-orphan-rerun.jsonl`;
const OUTPUT_PATH = `${DATA_DIR}/annotations/stage1.patched.jsonl`;
interface Annotation {
paragraphId: string;
provenance: { modelId: string };
[key: string]: unknown;
}
// Load re-run annotations, keyed by paragraphId|modelId
const rerunMap = new Map<string, string>(); // key -> raw JSON line
const rerunPids = new Set<string>();
for (const line of readFileSync(RERUN_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const ann = JSON.parse(line) as Annotation;
const key = `${ann.paragraphId}|${ann.provenance.modelId}`;
rerunMap.set(key, line);
rerunPids.add(ann.paragraphId);
}
console.error(`Re-run annotations: ${rerunMap.size} (${rerunPids.size} paragraphs)`);
// Stream through original, replacing where re-run exists
let kept = 0;
let replaced = 0;
const output: string[] = [];
for (const line of readFileSync(ORIG_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const ann = JSON.parse(line) as Annotation;
const key = `${ann.paragraphId}|${ann.provenance.modelId}`;
if (rerunMap.has(key)) {
output.push(rerunMap.get(key)!);
rerunMap.delete(key); // mark as used
replaced++;
} else {
output.push(line);
kept++;
}
}
// Any re-run annotations not matched to originals (shouldn't happen, but be safe)
let added = 0;
for (const [, line] of rerunMap) {
output.push(line);
added++;
}
writeFileSync(OUTPUT_PATH, output.join("\n") + "\n");
console.error(
`\nMerge complete:` +
`\n ${kept} original annotations kept` +
`\n ${replaced} annotations replaced with re-run` +
`\n ${added} new annotations added` +
`\n ${output.length} total annotations` +
`\n Output: ${OUTPUT_PATH}`,
);
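The replacement rule is easy to state in miniature (hypothetical records, with a `v` field standing in for the full annotation payload):

```typescript
// A re-run annotation replaces the original that shares its
// paragraphId|modelId key; unmatched originals pass through unchanged.
type Rec = { paragraphId: string; provenance: { modelId: string }; v: number };
const keyOf = (r: Rec) => `${r.paragraphId}|${r.provenance.modelId}`;

const rerun: Rec[] = [{ paragraphId: "p1", provenance: { modelId: "m1" }, v: 2 }];
const orig: Rec[] = [
  { paragraphId: "p1", provenance: { modelId: "m1" }, v: 1 },
  { paragraphId: "p2", provenance: { modelId: "m1" }, v: 1 },
];

const rerunMap = new Map(rerun.map((r) => [keyOf(r), r]));
const merged = orig.map((r) => rerunMap.get(keyOf(r)) ?? r);

console.log(merged.map((r) => `${r.paragraphId}:v${r.v}`).join(","));
// → p1:v2,p2:v1
```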


@ -0,0 +1,174 @@
/**
* Expanded orphan word patch: recover dropped leading words for all
* paragraphs that start with lowercase (non-list patterns).
*
* For each candidate paragraph:
* 1. Read the source HTML for the filing
* 2. Strip HTML to plain text
* 3. Find the paragraph text in the stripped output
* 4. Look backwards to find the orphaned word on its own line
* 5. Validate: orphaned word must be short (1-3 words), start with uppercase
* 6. Output patch record
*
* Usage: bun run ts/scripts/patch-orphan-words.ts
* Input: data/paragraphs/paragraphs-clean.jsonl
* Output: data/paragraphs/patches/orphan-word-patches.jsonl
*/
import { readFileSync, writeFileSync, mkdirSync, existsSync } from "node:fs";
import { stripHtml } from "../src/extract/html-cleaner.ts";
const PARAGRAPHS_PATH = "data/paragraphs/paragraphs-clean.jsonl";
const HTML_DIR = "data/raw/html";
const OUTPUT_PATH = "data/paragraphs/patches/orphan-word-patches.jsonl";
// List patterns to exclude (legitimate lowercase starts)
const LIST_PATTERNS = /^(and |or |including |such as |as well as |along with |that |which |where |whether |as described |for example|for more |pursuant to |in addition )/i;
interface Paragraph {
id: string;
text: string;
textHash: string;
wordCount: number;
paragraphIndex: number;
filing: {
accessionNumber: string;
companyName: string;
[key: string]: unknown;
};
}
interface PatchRecord {
id: string;
accession: string;
paragraphIndex: number;
orphanWord: string;
originalStart: string;
patchedStart: string;
method: string;
}
// Cache stripped HTML per filing
const strippedCache = new Map<string, string>();
function getStrippedHtml(accession: string): string | null {
if (strippedCache.has(accession)) return strippedCache.get(accession)!;
const htmlPath = `${HTML_DIR}/${accession}.html`;
if (!existsSync(htmlPath)) return null;
const html = readFileSync(htmlPath, "utf-8");
const stripped = stripHtml(html);
strippedCache.set(accession, stripped);
return stripped;
}
function findOrphanWord(stripped: string, paragraphText: string): string | null {
// Use first 80 chars to search — avoids paragraph-end differences
const searchText = paragraphText.substring(0, Math.min(80, paragraphText.length));
const idx = stripped.indexOf(searchText);
if (idx === -1) return null;
// Look backwards to find the orphaned word
const before = stripped.substring(Math.max(0, idx - 200), idx);
const lines = before.split("\n");
const candidates = lines.filter((l) => l.trim().length > 0);
if (candidates.length === 0) return null;
const lastLine = candidates[candidates.length - 1]!.trim();
// Validate: short (1-3 words), starts with uppercase
const words = lastLine.split(/\s+/);
if (words.length > 3 || words.length === 0) return null;
if (!/^[A-Z]/.test(words[0]!)) return null;
// Reject all-caps headings (>15 chars)
if (lastLine === lastLine.toUpperCase() && lastLine.length > 15) return null;
// Reject section/item references and page artifacts
if (/^(item|part|section)\s/i.test(lastLine)) return null;
if (/^\d+[\.\)]/.test(lastLine)) return null;
if (/^table of contents$/i.test(lastLine)) return null;
return lastLine;
}
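Steps 3 and 4 of the recovery can be traced on a toy fragment (invented text; the real `findOrphanWord` additionally applies the word-count, capitalization, and heading rejections shown above):

```typescript
// A paragraph whose leading word ("Our") was split onto its own line in
// the stripped HTML; the lookback re-attaches it to the paragraph text.
const stripped =
  "Item 1C. Cybersecurity\nOur\nboard of directors oversees cybersecurity risk through its audit committee.";
const paragraphText =
  "board of directors oversees cybersecurity risk through its audit committee.";

// Locate the paragraph in the stripped output, then scan backwards.
const idx = stripped.indexOf(paragraphText.substring(0, 80));
const before = stripped.substring(Math.max(0, idx - 200), idx);
const lines = before.split("\n").filter((l) => l.trim().length > 0);
const orphan = lines[lines.length - 1]!.trim();

console.log(orphan + " " + paragraphText);
// → Our board of directors oversees cybersecurity risk through its audit committee.
```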
// ─── Main ───
const start = Date.now();
mkdirSync("data/paragraphs/patches", { recursive: true });
process.stderr.write(" Loading paragraphs...\n");
const paragraphs: Paragraph[] = [];
for (const line of readFileSync(PARAGRAPHS_PATH, "utf-8").split("\n")) {
if (line.trim()) paragraphs.push(JSON.parse(line));
}
process.stderr.write(` ${paragraphs.length} paragraphs loaded\n`);
// Find candidates
const candidateParas = paragraphs.filter((p) => {
if (!p.text || p.text.length === 0) return false;
if (!/^[a-z]/.test(p.text)) return false;
if (LIST_PATTERNS.test(p.text)) return false;
return true;
});
process.stderr.write(` ${candidateParas.length} orphan word candidates\n\n`);
// Process
const patches: PatchRecord[] = [];
let notFound = 0;
let noOrphan = 0;
let lastAcc = "";
for (let i = 0; i < candidateParas.length; i++) {
const p = candidateParas[i]!;
const acc = p.filing.accessionNumber;
if (acc !== lastAcc) {
if (strippedCache.size > 20) strippedCache.clear();
lastAcc = acc;
}
const stripped = getStrippedHtml(acc);
if (!stripped) { notFound++; continue; }
const orphan = findOrphanWord(stripped, p.text);
if (!orphan) { noOrphan++; continue; }
patches.push({
id: p.id,
accession: acc,
paragraphIndex: p.paragraphIndex,
orphanWord: orphan,
originalStart: p.text.substring(0, 60),
patchedStart: orphan + " " + p.text.substring(0, 60),
method: "html-lookback",
});
if ((i + 1) % 200 === 0) {
process.stderr.write(
`\x1b[2K\r ${i + 1}/${candidateParas.length} | ${patches.length} patched | ${noOrphan} no orphan | ${notFound} no HTML`,
);
}
}
writeFileSync(OUTPUT_PATH, patches.map((p) => JSON.stringify(p)).join("\n") + "\n");
const elapsed = ((Date.now() - start) / 1000).toFixed(1);
process.stderr.write(
`\n\n Done in ${elapsed}s\n` +
` ${candidateParas.length} candidates → ${patches.length} patches found\n` +
` ${noOrphan} candidates: no orphan word found in HTML\n` +
` ${notFound} candidates: HTML file not found\n` +
` Output: ${OUTPUT_PATH}\n`,
);
// Word frequency summary
const wordCounts = new Map<string, number>();
for (const p of patches) {
wordCounts.set(p.orphanWord, (wordCounts.get(p.orphanWord) ?? 0) + 1);
}
const sorted = [...wordCounts.entries()].sort((a, b) => b[1] - a[1]);
process.stderr.write("\n Top orphan words:\n");
for (const [word, count] of sorted.slice(0, 15)) {
process.stderr.write(` ${word}: ${count}\n`);
}


@ -0,0 +1,175 @@
/**
* Re-run Stage 1 annotations on orphan-word-patched paragraphs.
*
* Loads paragraphs that had orphan words restored, runs all 3 Stage 1 models
* on the PATCHED text, and saves to a separate annotation file.
* Original annotations in stage1.jsonl are NOT modified.
*
* Usage:
* bun ts/scripts/rerun-orphan-stage1.ts [--concurrency 60]
*
* Input:
* data/paragraphs/training.patched.jsonl patched paragraph text
* data/paragraphs/patches/orphan-word-patches.jsonl patch records (for ID filtering)
*
* Output:
* data/annotations/stage1-orphan-rerun.jsonl new annotations (separate file)
*/
import { readJsonl, readJsonlRaw, appendJsonl } from "../src/lib/jsonl.ts";
import { Paragraph } from "@sec-cybert/schemas/paragraph.ts";
import { STAGE1_MODELS } from "../src/lib/openrouter.ts";
import { annotateParagraph, type AnnotateOpts } from "../src/label/annotate.ts";
import { PROMPT_VERSION } from "../src/label/prompts.ts";
import { v4 as uuidv4 } from "uuid";
import { mkdir } from "node:fs/promises";
import { existsSync, readFileSync } from "node:fs";
import pLimit from "p-limit";
// ── Args ────────────────────────────────────────────────────────────────
const args = process.argv.slice(2);
function flag(name: string): string | undefined {
const idx = args.indexOf(`--${name}`);
return idx === -1 ? undefined : args[idx + 1];
}
const CONCURRENCY = parseInt(flag("concurrency") ?? "60", 10);
const DATA_DIR = new URL("../../data", import.meta.url).pathname;
const TRAINING_PATH = `${DATA_DIR}/paragraphs/training.patched.jsonl`;
const PATCHES_PATH = `${DATA_DIR}/paragraphs/patches/orphan-word-patches.jsonl`;
const OUTPUT_DIR = `${DATA_DIR}/annotations`;
const OUTPUT_PATH = `${OUTPUT_DIR}/stage1-orphan-rerun.jsonl`;
// ── Main ────────────────────────────────────────────────────────────────
async function main() {
if (!existsSync(OUTPUT_DIR)) await mkdir(OUTPUT_DIR, { recursive: true });
// Load orphan-word patch IDs
console.error("Loading orphan-word patch IDs...");
const patchIds = new Set<string>();
for (const line of readFileSync(PATCHES_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const rec = JSON.parse(line) as { id: string };
patchIds.add(rec.id);
}
console.error(` ${patchIds.size} patched paragraph IDs`);
// Load patched training data, filter to orphan-word paragraphs only
console.error(`Loading patched paragraphs from ${TRAINING_PATH}...`);
const { records: allParagraphs, skipped } = await readJsonl(TRAINING_PATH, Paragraph);
if (skipped > 0) console.error(` ⚠ Skipped ${skipped} invalid lines`);
const paragraphs = allParagraphs.filter((p) => patchIds.has(p.id));
console.error(` ${paragraphs.length} orphan-word paragraphs in training set`);
console.error(` Models: ${STAGE1_MODELS.join(", ")}`);
console.error(` Prompt: ${PROMPT_VERSION}`);
console.error(` Concurrency: ${CONCURRENCY}`);
const totalJobs = paragraphs.length * STAGE1_MODELS.length;
console.error(` Total annotations needed: ${totalJobs.toLocaleString()}`);
// Load existing results for resume
const doneKeys = new Set<string>();
let resumedCost = 0;
if (existsSync(OUTPUT_PATH)) {
const { records: existing } = await readJsonlRaw(OUTPUT_PATH);
for (const rec of existing) {
const r = rec as { paragraphId?: string; provenance?: { modelId?: string; costUsd?: number } };
if (r.paragraphId && r.provenance?.modelId) {
doneKeys.add(`${r.paragraphId}|${r.provenance.modelId}`);
resumedCost += r.provenance.costUsd ?? 0;
}
}
if (doneKeys.size > 0) {
console.error(` Resuming: ${doneKeys.size} already done ($${resumedCost.toFixed(2)}), ${totalJobs - doneKeys.size} remaining`);
}
}
if (doneKeys.size >= totalJobs) {
console.error(" All annotations already complete!");
return;
}
// Build job list
type Job = { paragraph: Paragraph; modelId: string };
const jobs: Job[] = [];
for (const paragraph of paragraphs) {
for (const modelId of STAGE1_MODELS) {
if (!doneKeys.has(`${paragraph.id}|${modelId}`)) {
jobs.push({ paragraph, modelId });
}
}
}
console.error(` Jobs to run: ${jobs.length.toLocaleString()}\n`);
// Run with concurrency limiter
const runId = uuidv4();
const limit = pLimit(CONCURRENCY);
let completed = doneKeys.size;
let failed = 0;
let sessionCost = 0;
const startTime = Date.now();
// Progress logging
const logInterval = setInterval(() => {
const elapsed = (Date.now() - startTime) / 1000;
const done = completed - doneKeys.size;
const rate = done / elapsed;
const remaining = totalJobs - completed;
const eta = rate > 0 ? remaining / rate : Infinity;
const etaMin = Math.floor(eta / 60);
const etaSec = Math.round(eta % 60);
process.stderr.write(
`\x1b[2K\r ${completed.toLocaleString()}/${totalJobs.toLocaleString()} (${((completed / totalJobs) * 100).toFixed(1)}%)` +
` $${(resumedCost + sessionCost).toFixed(4)}` +
` ${rate.toFixed(1)}/s` +
` ETA ${etaMin}m${etaSec.toString().padStart(2, "0")}s` +
` ${failed} failed`,
);
}, 2000);
const tasks = jobs.map((job) =>
limit(async () => {
const opts: AnnotateOpts = {
modelId: job.modelId,
stage: "stage1",
runId,
promptVersion: PROMPT_VERSION,
reasoningEffort: "low",
};
try {
const ann = await annotateParagraph(job.paragraph, opts);
await appendJsonl(OUTPUT_PATH, ann);
sessionCost += ann.provenance.costUsd;
completed++;
} catch (error) {
failed++;
const msg = error instanceof Error ? error.message : String(error);
console.error(`\n ✖ ${job.modelId} × ${job.paragraph.id}: ${msg}`);
}
}),
);
await Promise.all(tasks);
clearInterval(logInterval);
const elapsed = ((Date.now() - startTime) / 1000).toFixed(0);
console.error(
`\n\n ═══ ORPHAN WORD RE-ANNOTATION COMPLETE ═══` +
`\n Annotations: ${completed.toLocaleString()}/${totalJobs.toLocaleString()}` +
`\n Failed: ${failed}` +
`\n Session cost: $${sessionCost.toFixed(4)}` +
`\n Total cost: $${(resumedCost + sessionCost).toFixed(4)}` +
`\n Wall time: ${elapsed}s` +
`\n Output: ${OUTPUT_PATH}`,
);
if (failed > 0) {
console.error(`\n ⚠ ${failed} failures — re-run this script to retry them.`);
}
}
main().catch((err) => {
console.error(err);
process.exit(1);
});


@ -0,0 +1,393 @@
/**
* Tag every SEC filing HTML with its generator tool.
*
* Usage: bun run ts/scripts/tag-generators.ts
*
* Reads first 20KB of each HTML file in data/raw/html/, detects the
* generator using heuristics ported from scripts/detect_generators.py,
* and writes a mapping JSONL to data/paragraphs/quality/generator-tags.jsonl.
*
* Uses Bun.spawn worker parallelism (same pattern as dapt-corpus-prep.ts).
*/
import {
readdirSync,
readFileSync,
writeFileSync,
mkdirSync,
unlinkSync,
createReadStream,
} from "node:fs";
import { createInterface } from "node:readline";
import { cpus } from "node:os";
import { basename } from "node:path";
const HTML_DIR = "data/raw/html";
const OUTPUT_DIR = "data/paragraphs/quality";
const OUTPUT_FILE = `${OUTPUT_DIR}/generator-tags.jsonl`;
const READ_BYTES = 20_000;
// Known SEC filing agent CIKs (accession number prefixes)
const FILING_AGENT_CIKS: Record<string, string> = {
"0000950170": "Donnelley Financial Solutions",
"0001193125": "Donnelley Financial Solutions",
"0001558370": "Toppan Merrill",
"0001654954": "Toppan Merrill",
"0001104659": "Toppan Merrill",
};
// ─── Generator normalization ───
function normalizeGenerator(raw: string): string {
const r = raw.trim().toLowerCase();
if (r.includes("workiva") || r.includes("wdesk")) return "Workiva";
if (r.includes("donnelley") || r.includes("dfin") || r.includes("rrdonnelley"))
return "Donnelley Financial Solutions";
if (r.includes("toppan") || (r.includes("merrill") && r.includes("bridge")))
return "Toppan Merrill";
if (r.includes("word") && r.includes("microsoft")) return "Microsoft Word";
if (r.includes("excel") && r.includes("microsoft")) return "Microsoft Excel";
if (r.includes("thunderdome")) return "ThunderDome";
if (r.includes("goxbrl")) return "GoXBRL";
if (r.includes("compsci")) return "CompSci Transform";
if (r.includes("certent")) return "Certent";
if (r.includes("iris carbon")) return "IRIS Carbon";
if (r.includes("broadridge") || r.includes("profile")) return "Broadridge PROfile";
if (r.includes("sec publisher")) return "SEC Publisher";
return raw.trim();
}
// ─── Generator detection (ported from detect_generators.py) ───
function detectGenerator(filepath: string): string {
// Use sync read for worker perf; only the first READ_BYTES are inspected
const raw = readFileSync(filepath);
const text = raw.subarray(0, READ_BYTES).toString("utf-8");
const textLower = text.toLowerCase();
// --- Explicit generator metadata ---
// 1. <meta name="generator" content="...">
let m: RegExpMatchArray | null;
m =
text.match(
/<meta\s+name\s*=\s*["']generator["']\s+content\s*=\s*["']([^"']+)["']/i,
) ??
text.match(
/<meta\s+content\s*=\s*["']([^"']+)["']\s+name\s*=\s*["']generator["']/i,
);
if (m) return normalizeGenerator(m[1]!);
// 2. <meta name="Creator" content="...">
m = text.match(
/<meta\s+name\s*=\s*["']Creator["']\s+content\s*=\s*["']([^"']+)["']/i,
);
if (m) return normalizeGenerator(m[1]!);
// 3. <meta name="Producer" content="...">
m = text.match(
/<meta\s+name\s*=\s*["']Producer["']\s+content\s*=\s*["']([^"']+)["']/i,
);
if (m) return normalizeGenerator(m[1]!);
// 4. ProgId meta tag
m = text.match(
/<meta\s+name\s*=\s*["']ProgId["']\s+content\s*=\s*["']([^"']+)["']/i,
);
if (m) {
const progid = m[1]!;
if (/word/i.test(progid)) return "Microsoft Word";
if (/excel/i.test(progid)) return "Microsoft Excel";
return normalizeGenerator(progid);
}
// --- HTML comment signatures ---
// Workiva / Wdesk
if (/<!--.*Created with the Workiva Platform.*-->/i.test(text)) return "Workiva";
if (/<!--.*Copyright\s+\d{4}\s+Workiva.*-->/i.test(text)) return "Workiva";
if (/<!--.*Document created using Wdesk.*-->/i.test(text)) return "Workiva";
// Toppan Merrill / Bridge
if (/<!--.*(?:Toppan\s*Merrill|iXBRL document created with.*Toppan).*-->/i.test(text))
return "Toppan Merrill";
if (/<!--.*Merrill\s*Bridge.*-->/i.test(text)) return "Toppan Merrill";
// Donnelley Financial Solutions / RR Donnelley
if (/<!--.*Donnelley Financial Solutions.*-->/i.test(text))
return "Donnelley Financial Solutions";
if (/<!--.*RR\s*Donnelley.*-->/i.test(text)) return "Donnelley Financial Solutions";
// Broadridge PROfile
if (/<!--.*Broadridge\s+PROfile.*-->/i.test(text)) return "Broadridge PROfile";
if (textLower.includes("broadridge")) return "Broadridge PROfile";
// SEC Publisher
const titleMatch = text.match(/<title[^>]*>([^<]+)<\/title>/i);
const titleText = titleMatch ? titleMatch[1]!.trim() : "";
if (textLower.includes("sec publisher") || titleText.toLowerCase().includes("sec publisher"))
return "SEC Publisher";
// IRIS Carbon
if (/<!--.*Powered by IRIS Carbon.*-->/i.test(text)) return "IRIS Carbon";
// Certent
if (/<!--.*Certent\s+Disclosure\s+Management.*-->/i.test(text)) return "Certent";
if (textLower.includes("certent")) return "Certent";
// CompSci Resources
if (/<!--.*CompSci Resources.*-->/i.test(text)) return "CompSci Transform";
// RDG Portal
if (/<!--.*RDG Portal.*-->/i.test(text)) return "RDG Portal";
// PDF to EDGAR
if (titleText.toLowerCase() === "pdf to edgar" || textLower.slice(0, 2000).includes("pdf to edgar"))
return "PDF to EDGAR";
// Generic generated/created by comments
m = text.match(/<!--\s*Generated\s+by\s+([^-]+?)-->/i);
if (m) {
const val = m[1]!.trim();
if (!/^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}/.test(val)) return normalizeGenerator(val);
}
m = text.match(/<!--\s*Created\s+(?:by|with)\s+([^-]+?)-->/i);
if (m) {
const val = m[1]!.trim();
if (!/^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}/.test(val)) return normalizeGenerator(val);
}
// --- Keyword signatures ---
if (/\bwdesk\b/.test(textLower)) return "Workiva";
if (/\bworkiva\b/.test(textLower)) return "Workiva";
if (/\brrdonnelley\b/.test(textLower)) return "Donnelley Financial Solutions";
if (/\bedgar-online\b/.test(textLower)) return "Donnelley Financial Solutions";
if (/\btoppan\b/.test(textLower)) return "Toppan Merrill";
if (/\bmerrill\b/.test(textLower) && /\b(?:bridge|ixbrl|xbrl)\b/.test(textLower))
return "Toppan Merrill";
if (/\bbowne\b/.test(textLower)) return "Toppan Merrill";
if (/\bcompsci\b/.test(textLower)) return "CompSci Transform";
if (/\bthunderdome\b/.test(textLower)) return "ThunderDome";
if (/\bgoxbrl\b/.test(textLower)) return "GoXBRL";
// CSS class naming patterns
if (/class\s*=\s*["'][^"']*\bwk_\w+/.test(textLower)) return "Workiva";
// --- SGML document wrapper detection ---
const hasSgml = /<DOCUMENT>\s*\n?\s*<TYPE>/i.test(text);
if (hasSgml) {
const fnMatch = text.match(/<FILENAME>\s*([\w\-\.]+)/i);
if (fnMatch) {
const filename = fnMatch[1]!.toLowerCase();
if (/^d\d+/.test(filename)) return "Donnelley Financial Solutions";
if (/^tm\d+/.test(filename)) return "Toppan Merrill";
if (/^ea\d+/.test(filename)) return "EFiling/EDGAR Agent";
}
if (
textLower.includes("<!-- field: rule-page") ||
textLower.slice(0, 5000).includes("rule-page")
)
return "Broadridge PROfile";
if (textLower.includes("field: set; name: xdx")) return "EFiling XDX";
if (textLower.slice(0, 5000).includes("<!-- field:")) return "EFiling/EDGAR Agent";
if (/<Center><DIV STYLE="width:8\.5in"/.test(text))
return "Donnelley Financial Solutions";
// Check accession prefix
const bn = basename(filepath);
const accessionPrefix = bn.split("-")[0]!;
if (accessionPrefix in FILING_AGENT_CIKS)
return FILING_AGENT_CIKS[accessionPrefix]!;
// Legacy font-based
const fontCount = (textLower.match(/<font/g) ?? []).length;
if (fontCount > 5) return "SGML-wrapped (legacy/font-based)";
return "SGML-wrapped (unknown)";
}
// --- Inline XBRL detection ---
const hasIxNs =
textLower.includes("xmlns:ix=") || textLower.includes("<ix:header");
// Structural: Donnelley uppercase P STYLE + Center DIV 8.5in
if (
/<P STYLE="[^"]*font-family:Times New Roman"/.test(text) &&
/<Center><DIV STYLE="width:8\.5in"/.test(text)
)
return "Donnelley Financial Solutions";
// Title tag tool names
if (titleText) {
const tl = titleText.toLowerCase();
if (tl.includes("workiva") || tl.includes("wdesk")) return "Workiva";
}
if (hasIxNs) {
if (textLower.includes("field: set; name: xdx")) return "EFiling XDX";
if (textLower.includes("<!-- field: rule")) return "Broadridge PROfile";
if (textLower.slice(0, 5000).includes("<!-- field:")) return "EFiling/EDGAR Agent";
// Filing agent CIK-based
const bn = basename(filepath);
const accessionPrefix = bn.split("-")[0]!;
if (accessionPrefix in FILING_AGENT_CIKS)
return FILING_AGENT_CIKS[accessionPrefix]!;
// XML declaration encoding
if (textLower.slice(0, 200).includes('<?xml version="1.0" encoding="utf-8"'))
return "Inline XBRL (utf-8 toolchain)";
if (textLower.slice(0, 200).includes("<?xml version='1.0' encoding='ascii'?>"))
return "Inline XBRL (SEC/EDGAR standard)";
return "Inline XBRL (tool unresolved)";
}
// --- Structural fallbacks ---
const fontCount = (textLower.match(/<font/g) ?? []).length;
const tdCount = (textLower.match(/<td/g) ?? []).length;
const spanCount = (textLower.match(/<span/g) ?? []).length;
if (fontCount > 20) return "Legacy generator (font-based)";
if (tdCount > 50 && spanCount < 10) return "Table-based generator";
const dataAttrCount = (textLower.match(/\bdata-\w+/g) ?? []).length;
if (dataAttrCount > 10) return "Modern web tooling";
return "Unknown";
}
// ─── Worker mode ───
const args = process.argv.slice(2);
if (args[0] === "--worker") {
const startIdx = parseInt(args[1]!);
const endIdx = parseInt(args[2]!);
const outFile = args[3]!;
const htmlFiles = readdirSync(HTML_DIR)
.filter((f: string) => f.endsWith(".html"))
.sort()
.slice(startIdx, endIdx);
const records: string[] = [];
for (const file of htmlFiles) {
const accession = file.replace(".html", "");
const generator = detectGenerator(`${HTML_DIR}/${file}`);
records.push(JSON.stringify({ accession, generator }));
}
writeFileSync(outFile, records.join("\n") + (records.length > 0 ? "\n" : ""));
process.exit(0);
}
// ─── Main mode: orchestrate workers ───
const start = Date.now();
mkdirSync(OUTPUT_DIR, { recursive: true });
const htmlFiles = readdirSync(HTML_DIR)
.filter((f: string) => f.endsWith(".html"))
.sort();
const nproc = cpus().length;
const chunkSize = Math.ceil(htmlFiles.length / nproc);
process.stderr.write(
` Tagging generators for ${htmlFiles.length} HTML files with ${nproc} workers...\n\n`,
);
const tmpFiles: string[] = [];
const workers: ReturnType<typeof Bun.spawn>[] = [];
for (let i = 0; i < nproc; i++) {
const startIdx = i * chunkSize;
const endIdx = Math.min(startIdx + chunkSize, htmlFiles.length);
if (startIdx >= htmlFiles.length) break;
const tmpFile = `${OUTPUT_DIR}/.tmp-gen-${i}.jsonl`;
tmpFiles.push(tmpFile);
workers.push(
Bun.spawn(
[
"bun",
"run",
import.meta.filename,
"--worker",
String(startIdx),
String(endIdx),
tmpFile,
],
{ stderr: "inherit" },
),
);
}
for (const worker of workers) {
await worker.exited;
}
process.stderr.write(` Workers done, merging results...\n`);
// Merge and sort
type TagRecord = { accession: string; generator: string };
const allRecords: TagRecord[] = [];
for (const tmpFile of tmpFiles) {
const rl = createInterface({ input: createReadStream(tmpFile) });
for await (const line of rl) {
if (line.trim()) allRecords.push(JSON.parse(line));
}
}
allRecords.sort((a, b) => a.accession.localeCompare(b.accession));
// Write final output
writeFileSync(
OUTPUT_FILE,
allRecords.map((r) => JSON.stringify(r)).join("\n") + "\n",
);
// Cleanup
for (const tmpFile of tmpFiles) {
try {
unlinkSync(tmpFile);
} catch {}
}
// Print summary
const counts = new Map<string, number>();
for (const r of allRecords) {
counts.set(r.generator, (counts.get(r.generator) ?? 0) + 1);
}
const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
const elapsed = ((Date.now() - start) / 1000).toFixed(1);
const total = allRecords.length;
console.log(`\n${"=".repeat(70)}`);
console.log(`Generator Tags Summary (${total} files, ${elapsed}s)`);
console.log(`${"=".repeat(70)}`);
console.log(`${"Generator".padEnd(45)} ${"Count".padStart(7)} ${" %".padStart(7)}`);
console.log("-".repeat(70));
for (const [gen, count] of sorted) {
const pct = ((count / total) * 100).toFixed(1);
console.log(`${gen.padEnd(45)} ${String(count).padStart(7)} ${(pct + "%").padStart(7)}`);
}
console.log("-".repeat(70));
console.log(`${"TOTAL".padEnd(45)} ${String(total).padStart(7)} ${"100.0%".padStart(7)}`);
console.log(`\nOutput: ${OUTPUT_FILE}`);
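The normalization cascade at the top of this script is an ordered keyword match over the raw generator string. A condensed sketch of that idea (`normalizeGeneratorSketch` covers only a three-vendor subset of the real table):

```typescript
// Ordered keyword matching: earlier, more specific vendor checks win;
// unrecognized generator strings pass through trimmed but otherwise verbatim.
function normalizeGeneratorSketch(raw: string): string {
  const r = raw.trim().toLowerCase();
  if (r.includes("workiva") || r.includes("wdesk")) return "Workiva";
  if (r.includes("donnelley") || r.includes("dfin")) return "Donnelley Financial Solutions";
  if (r.includes("word") && r.includes("microsoft")) return "Microsoft Word";
  return raw.trim(); // unknown tools surface as-is in the summary table
}

console.log(normalizeGeneratorSketch("Wdesk 5.1"));         // → Workiva
console.log(normalizeGeneratorSketch("Microsoft Word 15")); // → Microsoft Word
console.log(normalizeGeneratorSketch("SomeTool 2.0"));      // → SomeTool 2.0
```

Passing unknown strings through unchanged is deliberate: the final frequency summary then reveals any new vendor strings worth adding as explicit rules.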


@@ -4,59 +4,13 @@
 */
 import { readdirSync, readFileSync, writeFileSync } from "node:fs";
 import { segmentParagraphs } from "./segment.ts";
-import type { FilingMeta, Paragraph } from "@sec-cybert/schemas/paragraph.ts";
+import { stripHtml } from "./html-cleaner.ts";
+import type { FilingMeta } from "@sec-cybert/schemas/paragraph.ts";
 const HTML_CACHE_DIR = "../data/raw/html";
 const OUTPUT_PATH = "../data/paragraphs/paragraphs.jsonl";
 const ACCESSION_META_PATH = "../data/bulk/accession-meta.json";
-// ─── Fast HTML→text (regex, no DOM) ───
-function stripHtml(html: string): string {
-return html
-.replace(/<script[\s\S]*?<\/script>/gi, "")
-.replace(/<style[\s\S]*?<\/style>/gi, "")
-.replace(/<noscript[\s\S]*?<\/noscript>/gi, "")
-// Collapse adjacent inline element boundaries to prevent word splitting
-.replace(/<\/(span|a|b|i|u|em|strong|font)>(\s*)<(?:span|a|b|i|u|em|strong|font)[^>]*>/gi, (_m, _tag, ws) => ws.length > 0 ? " " : "")
-.replace(/<\/ix:[a-z]+>(\s*)<ix:[a-z]+[^>]*>/gi, (_m, ws) => ws.length > 0 ? " " : "")
-.replace(/<\/(p|div|tr|li|h[1-6]|td|th)>/gi, "\n")
-.replace(/<(br|hr)\s*\/?>/gi, "\n")
-.replace(/<[^>]+>/g, " ")
-.replace(/&nbsp;|&#160;|&#xa0;/gi, " ")
-.replace(/&amp;/g, "&")
-.replace(/&lt;/g, "<")
-.replace(/&gt;/g, ">")
-.replace(/&quot;|&ldquo;|&rdquo;|&#8220;|&#8221;|&#147;|&#148;/g, '"')
-.replace(/&#39;|&apos;|&rsquo;|&lsquo;|&#8216;|&#8217;|&#146;/g, "'")
-.replace(/&mdash;|&#8212;|&#151;/g, "—")
-.replace(/&ndash;|&#8211;|&#150;/g, "–")
-.replace(/&bull;|&#8226;|&#149;/g, "•")
-.replace(/&minus;|&#8722;/g, "-")
-.replace(/&sect;|&#167;/g, "§")
-.replace(/&#153;/g, "™")
-.replace(/&#x([0-9a-fA-F]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
-.replace(/&#\d+;/g, " ")
-.replace(/&\w+;/g, " ")
-.replace(/[^\S\n]+/g, " ")
-.replace(/([a-z])\.([A-Z])/g, "$1. $2")
-.replace(/([a-z]),([A-Z])/g, "$1, $2")
-.replace(/([a-z]);([A-Z])/g, "$1; $2")
-.replace(/•([A-Za-z])/g, "• $1")
-.replace(/\b([a-z])\.([A-Z])/g, "$1. $2")
-// Greek question mark (U+037E) → semicolon
-.replace(/\u037e/g, ";")
-// Fix inline element joins that created camelCase with common English words
-.replace(/([a-z])(The|Our|We|This|These|That|Its|His|Her|In|As|For|And|Or|If|An|It|To|By|On|At|No|Of|All|Any|Has|Was|Is|Are|Not|May|Can|Will|Such|Also|But|Each|New|So|Up|With|From)\b/g, "$1 $2")
-// Fix colon-joins: word:Word → word: Word (exclude URLs)
-.replace(/([a-z]):([A-Z])/g, "$1: $2")
-// Fix ISO standard joins: ISO/IEC27001 → ISO/IEC 27001, ISO27001 → ISO 27001
-.replace(/\b(ISO(?:\/IEC)?)(\d)/g, "$1 $2")
-.replace(/(Standardization)(\d)/g, "$1 $2")
-// Fix PDF extraction artifact: space before punctuation ("Director ," → "Director,")
-.replace(/ ([,;:.!?)])/g, "$1");
-}
 // ─── Item 1C extraction (regex on stripped text) ───
 const ITEM_1C = /^\s*(\u2022\s*)?item\s*1c[\.\s\u00a0—:-]/i;


@@ -0,0 +1,50 @@
/**
* HTML plain text cleaning for SEC filings.
* Used by both paragraph extraction (fast-reparse) and DAPT corpus preparation.
*/
/** Strip HTML tags, decode entities, fix word-boundary artifacts from SEC EDGAR HTML. */
export function stripHtml(html: string): string {
return html
.replace(/<script[\s\S]*?<\/script>/gi, "")
.replace(/<style[\s\S]*?<\/style>/gi, "")
.replace(/<noscript[\s\S]*?<\/noscript>/gi, "")
// Collapse adjacent inline element boundaries to prevent word splitting
.replace(/<\/(span|a|b|i|u|em|strong|font)>(\s*)<(?:span|a|b|i|u|em|strong|font)[^>]*>/gi, (_m, _tag, ws) => ws.length > 0 ? " " : "")
.replace(/<\/ix:[a-z]+>(\s*)<ix:[a-z]+[^>]*>/gi, (_m, ws) => ws.length > 0 ? " " : "")
.replace(/<\/(p|div|tr|li|h[1-6]|td|th)>/gi, "\n")
.replace(/<(br|hr)\s*\/?>/gi, "\n")
.replace(/<[^>]+>/g, " ")
.replace(/&nbsp;|&#160;|&#xa0;/gi, " ")
.replace(/&amp;/g, "&")
.replace(/&lt;/g, "<")
.replace(/&gt;/g, ">")
.replace(/&quot;|&ldquo;|&rdquo;|&#8220;|&#8221;|&#147;|&#148;/g, '"')
.replace(/&#39;|&apos;|&rsquo;|&lsquo;|&#8216;|&#8217;|&#146;/g, "'")
.replace(/&mdash;|&#8212;|&#151;/g, "—")
.replace(/&ndash;|&#8211;|&#150;/g, "–")
.replace(/&bull;|&#8226;|&#149;/g, "•")
.replace(/&minus;|&#8722;/g, "-")
.replace(/&sect;|&#167;/g, "§")
.replace(/&#153;/g, "™")
.replace(/&#x([0-9a-fA-F]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
.replace(/&#\d+;/g, " ")
.replace(/&\w+;/g, " ")
.replace(/[^\S\n]+/g, " ")
.replace(/([a-z])\.([A-Z])/g, "$1. $2")
.replace(/([a-z]),([A-Z])/g, "$1, $2")
.replace(/([a-z]);([A-Z])/g, "$1; $2")
.replace(/•([A-Za-z])/g, "• $1")
.replace(/\b([a-z])\.([A-Z])/g, "$1. $2")
// Greek question mark (U+037E) → semicolon
.replace(/\u037e/g, ";")
// Fix inline element joins that created camelCase with common English words
.replace(/([a-z])(The|Our|We|This|These|That|Its|His|Her|In|As|For|And|Or|If|An|It|To|By|On|At|No|Of|All|Any|Has|Was|Is|Are|Not|May|Can|Will|Such|Also|But|Each|New|So|Up|With|From)\b/g, "$1 $2")
// Fix colon-joins: word:Word → word: Word (exclude URLs)
.replace(/([a-z]):([A-Z])/g, "$1: $2")
// Fix ISO standard joins: ISO/IEC27001 → ISO/IEC 27001, ISO27001 → ISO 27001
.replace(/\b(ISO(?:\/IEC)?)(\d)/g, "$1 $2")
.replace(/(Standardization)(\d)/g, "$1 $2")
// Fix PDF extraction artifact: space before punctuation ("Director ," → "Director,")
.replace(/ ([,;:.!?)])/g, "$1");
}
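The cleaner above handles two main fix classes: entity decoding and inline-element joins that glue words together. A trimmed sketch showing both in isolation (`stripHtmlMini` keeps only five of the rules; the real function applies many more):

```typescript
// Five-rule subset of the SEC HTML cleaner: collapse inline-tag boundaries,
// strip tags, decode one entity, normalize spaces, re-split glued sentences.
function stripHtmlMini(html: string): string {
  return html
    // inline-element boundary: keep a space only if whitespace separated them
    .replace(/<\/(span|b)>(\s*)<(?:span|b)[^>]*>/gi, (_m, _tag, ws) => (ws.length > 0 ? " " : ""))
    // drop remaining tags
    .replace(/<[^>]+>/g, " ")
    // decode the one entity this sketch handles
    .replace(/&amp;/g, "&")
    // collapse runs of spaces
    .replace(/[^\S\n]+/g, " ")
    // re-split sentences glued by tag removal: "plan.The" → "plan. The"
    .replace(/([a-z])\.([A-Z])/g, "$1. $2")
    .trim();
}

const sample =
  '<span>Risk</span> <span>&amp;</span> <span>Compliance.</span><span>The Board oversees cyber risk.</span>';
console.log(stripHtmlMini(sample));
// → "Risk & Compliance. The Board oversees cyber risk."
```

The whitespace-sensitive boundary rule is the subtle part: `</span> <span>` becomes one space, but `</span><span>` mid-word becomes nothing, which is what prevents both joined and split words.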


@@ -177,12 +177,34 @@ export function segmentParagraphs(
 }
 }
+// Buffer for orphan first-words: SEC HTML sometimes splits the first word of a
+// sentence onto its own line within a <span> tag. These single-word blocks are
+// below MIN_WORDS and would be dropped. Instead, buffer them and prepend to the
+// next block so the sentence stays intact.
+let orphanBuffer = "";
 for (const block of blocks) {
-const stripped = block.replace(LEADING_PUNCT, "");
+let stripped = block.replace(LEADING_PUNCT, "");
 if (stripped.length === 0) continue;
+// Prepend any buffered orphan word, but only if this block starts lowercase
+// (confirming it's a sentence continuation, not a new heading)
+if (orphanBuffer) {
+if (STARTS_LOWERCASE.test(stripped)) {
+stripped = orphanBuffer + " " + stripped;
+}
+// Either way, clear the buffer — don't carry it across multiple blocks
+orphanBuffer = "";
+}
 const wc = wordCount(stripped);
+// Single-word orphan: buffer for prepending to the next block
+if (wc === 1 && /^[A-Za-z]/.test(stripped) && !TERMINAL_PUNCT.test(stripped)) {
+orphanBuffer = stripped;
+continue;
+}
 // Short blocks: append to previous paragraph instead of dropping,
 // but only if it completes a sentence or previous was already broken
 if (wc < MIN_WORDS && paragraphs.length > 0) {
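The orphan-buffer pass in this hunk can be exercised standalone. A condensed re-implementation (`mergeOrphans` is hypothetical; the real code's MIN_WORDS merging and named regex constants are simplified away):

```typescript
// Hold single-word blocks (SEC HTML sometimes isolates a sentence's first
// word in its own <span>) and prepend them to the next block when that
// block starts lowercase, i.e. reads as a sentence continuation.
function mergeOrphans(blocks: string[]): string[] {
  const out: string[] = [];
  let orphan = "";
  for (let block of blocks) {
    if (orphan) {
      if (/^[a-z]/.test(block)) block = orphan + " " + block;
      orphan = ""; // never carry an orphan across more than one block
    }
    const isSingleWord = block.split(/\s+/).length === 1;
    if (isSingleWord && /^[A-Za-z]/.test(block) && !/[.!?]$/.test(block)) {
      orphan = block;
      continue;
    }
    out.push(block);
  }
  return out;
}

console.log(mergeOrphans(["The", "company maintains an incident response plan."]));
// → ["The company maintains an incident response plan."]
```

Note the asymmetry: a capitalized next block (a likely heading) discards the orphan rather than gluing it on, trading a lost word for never corrupting a heading.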


@@ -117,8 +117,9 @@ const CATEGORY_GUIDANCE: Record<string, string> = {
 The paragraph must be PRIMARILY about managing vendor/supplier cyber risk to qualify as Third-Party Risk.`,
 "None/Other|Strategy Integration": `NONE/OTHER vs STRATEGY INTEGRATION — ask: is there substantive cybersecurity disclosure?
-None/Other = NO substantive disclosure at all: section headers, disclaimers, generic IT-dependence language ("our IT systems are important to operations"), forward-looking boilerplate.
+None/Other = NO substantive disclosure at all: section headers, disclaimers, generic IT-dependence language ("our IT systems are important to operations"), forward-looking boilerplate, generic regulatory compliance language ("subject to various regulatory requirements... non-compliance could result in penalties").
 Strategy Integration = actual discussion of business/financial impact, cyber insurance, budget allocation, or materiality assessment.
+Generic regulatory risk language (acknowledging regulations exist, non-compliance would be bad) is None/Other → it makes no materiality assessment and describes no strategy. It only becomes Strategy Integration if it explicitly assesses whether regulatory risks have "materially affected" the business.
 If the paragraph only establishes that the company has IT systems and data without describing any program, process, or strategy → None/Other.`,
 "Board Governance|Management Role": `BOARD GOVERNANCE vs MANAGEMENT ROLE — ask: who is the grammatical subject?
@@ -133,7 +134,8 @@ const CATEGORY_GUIDANCE: Record<string, string> = {
 "None/Other|Risk Management Process": `NONE/OTHER vs RISK MANAGEMENT PROCESS — ask: does the paragraph describe actual cybersecurity activities?
 Describing actual processes (monitoring, assessment, vulnerability management, training programs) → RMP.
-Only stating the company has IT systems, collects data, or faces cyber risks without describing what it DOES about them → None/Other.`,
+Only stating the company has IT systems, collects data, or faces cyber risks without describing what it DOES about them → None/Other.
+Generic regulatory compliance language ("subject to various regulations... non-compliance could result in penalties") is None/Other → it describes no actual compliance activities. If a specific regulation is named (GDPR, HIPAA, PCI DSS) but no company-specific program is described → RMP at Specificity 2 (named standard).`,
 "Risk Management Process|Strategy Integration": `RISK MANAGEMENT PROCESS vs STRATEGY INTEGRATION — ask: operational or strategic?
 Describing HOW risks are assessed, monitored, mitigated → Risk Management Process.