diff --git a/docs/CODEBOOK-RATIONALE.md b/docs/CODEBOOK-RATIONALE.md new file mode 100644 index 0000000..6f20068 --- /dev/null +++ b/docs/CODEBOOK-RATIONALE.md @@ -0,0 +1,87 @@ +# Codebook Rationale & Interpretive Guide + +Companion to `LABELING-CODEBOOK.md`. Covers the "why" behind design decisions and common interpretive pitfalls that aren't obvious from the codebook itself. + +--- + +## Category Design: Mapping to SEC Regulation S-K Item 106 + +The six substantive categories map directly to the structure of the SEC's cybersecurity disclosure rule (adopted July 2023): + +| Codebook Category | SEC Basis | What the SEC is asking | +|---|---|---| +| Board Governance | Item 106(c)(1) | How does the board oversee cyber risk? | +| Management Role | Item 106(c)(2) | Who in management is responsible, and what qualifies them? | +| Risk Management Process | Item 106(b) | What processes do you use to assess, identify, and manage cyber risk? | +| Third-Party Risk | Item 106(b) | How do you handle vendor/supply chain cyber risk? | +| Strategy Integration | Item 106(b)(2) | Has cyber risk materially affected your business or financials? | +| Incident Disclosure | 8-K Item 1.05 | What happened in an actual cybersecurity incident? | +| None/Other | N/A | Classifier catch-all for non-substantive content | + +### Editorial choice: Third-Party Risk as a separate category + +The SEC does not give Third-Party Risk its own subsection — vendor/supply chain oversight is part of 106(b) alongside general risk management. The codebook carves it out as a distinct class because it represents a sufficiently different disclosure pattern to be analytically useful. + +### "Risk Management" is broader than it sounds + +The SEC's 106(b) definition of risk management encompasses the full lifecycle: assessing, identifying, **and managing** cybersecurity risks. Under frameworks like NIST CSF (which the SEC references), "managing" includes Respond and Recover functions — not just preventive controls. 
+ +This means incident response **procedures** (escalation chains, playbooks, notification workflows, materiality determination processes) are Risk Management Process, not Incident Disclosure. The test: + +| What the paragraph describes | Category | +|---|---| +| Pre-established process for handling incidents (playbooks, escalation chains, "in the event of...") | **Risk Management Process** | +| An actual incident that occurred (dates, scope, remediation of a real event) | **Incident Disclosure** | + +Conditional language ("in the event of," "if necessary," "if and when") is a strong signal that the paragraph describes a process, not an event. + +### "Strategy Integration" is narrower than it sounds + +Strategy Integration does not mean "strategic approach to cybersecurity." It specifically covers the **business and financial consequences** of cyber risk — the SEC 106(b)(2) question of whether cyber risk hit the bottom line or changed business strategy. + +What qualifies: +- Materiality assessments ("have not materially affected our business strategy, results of operations, or financial condition") +- Cybersecurity spending and investment (budgets, dollar amounts, year-over-year changes) +- Insurance coverage (carriers, limits, deductibles) +- Financial impact of incidents (costs, revenue loss, insurance claims) + +What does not qualify: +- Describing a sophisticated incident response process (that's Risk Management Process even though it's "strategic" in the colloquial sense) +- Describing a materiality **determination process** (the process for deciding if something is material is Risk Management Process; the actual materiality **conclusion** is Strategy Integration) + +--- + +## Specificity Scale: Design Rationale + +### The four levels measure disclosure quality progression + +| Level | What it tells you | +|---|---| +| 1 — Generic Boilerplate | Company said nothing substantive. Could paste into any filing unchanged. 
| +| 2 — Sector-Adapted | Company name-dropped a recognized standard (NIST, ISO 27001, SOC 2, etc.) but nothing unique to their organization. | +| 3 — Firm-Specific | Company disclosed at least one fact unique to their organization. | +| 4 — Quantified-Verifiable | Company disclosed two or more independently verifiable hard facts. | + +### "Sector-Adapted" refers to the cybersecurity sector, not the company's industry + +The name is misleading. "Sector-Adapted" does not mean "the company adapted its disclosure to its industry" (e.g., a bank discussing financial-sector cyber risks). It means the company referenced a recognized **cybersecurity** standard or framework — NIST CSF, ISO 27001, SOC 2, PCI DSS, HIPAA, etc. The "sector" is cybersecurity itself. A utility company mentioning NERC CIP and a retailer mentioning PCI DSS both qualify for Level 2 the same way — they named a standard. The company's own industry is irrelevant to the specificity score. + +### Level 2 is intentionally narrow + +Level 2 requires naming a recognized standard but having zero firm-specific facts. In practice this is uncommon — most filings either say nothing specific (Level 1) or name a framework alongside a CISO or named committee in the same paragraph (Level 3). + +This is a feature, not a bug. The analytically interesting distinction is between Level 1 (boilerplate box-checking) and Level 3/4 (substantive disclosure). Level 2 is a real but thin middle ground. A mushier middle would make the classifier's job harder without adding research value. + +### The research contribution is the specificity dimension itself + +The SEC requires cybersecurity disclosure but does not grade its quality. The 1-4 specificity scale measures something the SEC doesn't: how much substance is actually in the disclosure versus boilerplate. The core research question is whether companies are genuinely disclosing or just filling the regulatory box. 
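The Level 2 signal described above — a paragraph that names a recognized cybersecurity standard — is regular enough to sketch as a heuristic. This is an illustrative sketch only: `names_recognized_framework` is a hypothetical helper, not part of the labeling pipeline, and the pattern list is not exhaustive (it covers only the frameworks named in this guide).

```python
import re

# Recognized cybersecurity standards/frameworks named in this guide.
# Naming one of these (with zero firm-specific facts) is the Level 2 signal.
FRAMEWORK_PATTERN = re.compile(
    r"\b(NIST(?:\s+CSF)?|ISO\s*27001|SOC\s*2|PCI[\s-]?DSS|HIPAA|NERC\s+CIP)\b",
    re.IGNORECASE,
)

def names_recognized_framework(paragraph: str) -> bool:
    """True if the paragraph name-drops a recognized cybersecurity standard."""
    return bool(FRAMEWORK_PATTERN.search(paragraph))
```

Note this only screens for Level 2 candidacy; distinguishing Level 2 from Level 3 still requires checking that no firm-specific fact appears alongside the framework mention.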
+ +### Common specificity pitfalls + +**Generic practices are not specific.** Penetration testing, vulnerability scanning, tabletop exercises, phishing simulations, security awareness training, encryption, logging and monitoring — all Level 1. These are standard activities that appear in nearly every filing. + +**Long paragraphs can still be Level 1.** A paragraph can list ten generic security practices and still be boilerplate. Length and detail are not the same as specificity. + +**Cross-references and section titles don't add specificity.** Quoting a long Risk Factors section title with specific-sounding language ("collaborators, contract research organizations, third-party logistics providers") is just metadata, not disclosure substance. + +**The materiality boilerplate is Level 1.** The phrase "have not materially affected, and are not reasonably likely to materially affect, our business strategy, results of operations, or financial condition" appears nearly verbatim in thousands of filings. It is Strategy Integration (it makes a materiality assessment) but Specificity 1 (the assessment is template language). 
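As a worked illustration of the last pitfall, the template materiality phrase is uniform enough across filings to match mechanically. A hedged sketch, not the project's actual tooling: `is_materiality_boilerplate` is a hypothetical helper, and the pattern tolerates the small wording variations ("affected" vs. "affect", optional intervening clauses) seen in practice.

```python
import re

# The near-verbatim Item 106(b)(2) template phrase. The lazy gap absorbs
# interjected clauses like ", and are not reasonably likely to materially affect, our".
MATERIALITY_BOILERPLATE = re.compile(
    r"not\s+materially\s+affect(?:ed)?\b.{0,120}?"
    r"business\s+strategy,\s+results\s+of\s+operations,?\s+or\s+financial\s+condition",
    re.IGNORECASE | re.DOTALL,
)

def is_materiality_boilerplate(paragraph: str) -> bool:
    """Flag the template materiality assessment: Strategy Integration, Specificity 1."""
    return bool(MATERIALITY_BOILERPLATE.search(paragraph))
```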
diff --git a/docs/DAPT-PROCEDURE.md b/docs/DAPT-PROCEDURE.md new file mode 100644 index 0000000..00ef8bd --- /dev/null +++ b/docs/DAPT-PROCEDURE.md @@ -0,0 +1,184 @@ +# DAPT/TAPT Training Procedure + +**Date:** 2026-03-29 +**Hardware:** NVIDIA RTX 3090 (24GB VRAM), CUDA driver 13.2, PyTorch 2.10.0+cu128 + +--- + +## Pre-flight Checklist + +| Check | Status | +|-------|--------| +| PyTorch 2.10.0+cu128, CUDA available | Verified | +| RTX 3090, 25.3 GB VRAM, bf16 supported | Verified | +| CUDA driver 13.2 / runtime 12.8 forward compatible | Verified (GPU matmul test passed) | +| ModernBERT-large loads: 396M params, max_position_embeddings=8192 | Verified | +| Corpus: 14,756 docs, ~1.06B tokens, 15 shards | Verified | +| After <10K filter: 14,568 docs, ~1.056B tokens (0.027% loss) | Verified | +| Tokenize+chunk pipeline: 10 docs -> 85 sequences of 8192 tokens | Verified | +| Config: seq_len=8192, batch=1, grad_accum=32, 1 epoch, lr=5e-5, mlm=0.30 | Set | + +## DAPT Corpus Summary + +- **14,568 documents** (after filtering 188 cover pages <10K chars) +- **~1.056 billion tokens** (ModernBERT tokenizer, 4.72 chars/token) +- **~136K training sequences** at seq_len=8192 +- **Median document: ~73K tokens** (347K chars) — 90.6% of docs exceed 8192 tokens +- Cleaned: XBRL data blobs stripped, exhibit listings stripped, URLs removed, F-N page numbers removed +- Source: 14,759 cached 10-K HTML filings, FY2023-FY2025, processed by `ts/scripts/dapt-corpus-prep.ts` + +## Training Configuration + +**Config file:** `python/configs/dapt/modernbert.yaml` + +| Parameter | Value | Rationale | +|-----------|-------|-----------| +| `max_seq_length` | 8192 | Match ModernBERT's pre-training context length | +| `per_device_train_batch_size` | 1 | Memory-limited at 8192 seq_len on 24GB | +| `gradient_accumulation_steps` | 32 | Effective batch size = 32 | +| `num_train_epochs` | 1 | Single pass per Gururangan et al. 
(2020) and Ponnock (2025) | +| `learning_rate` | 5e-5 | Standard for continued pre-training | +| `mlm_probability` | 0.30 | ModernBERT's pre-training masking rate | +| `warmup_ratio` | 0.05 | ~213 warmup steps | +| `gradient_checkpointing` | true | Required for 8192 seq_len on 24GB | +| `bf16` | true | Native RTX 3090 support | +| `save_steps` | 1000 | Checkpoint every ~1000 steps | +| `eval_steps` | 1000 | Evaluate every ~1000 steps | +| `save_total_limit` | 3 | Keep last 3 checkpoints | + +### Epoch Decision Justification + +We train for 1 epoch (single pass over the corpus), following the empirical consensus: + +- **Gururangan et al. (2020), "Don't Stop Pretraining" (ACL 2020):** Trained DAPT for "12.5K steps, which amounts to a single pass on each domain dataset" across corpora ranging from 2-8B tokens. A single pass was sufficient for consistent downstream gains across all four domains and eight tasks. + +- **Ponnock (2025), "The Data Efficiency Frontier of Financial Foundation Models" (arXiv:2512.12384):** Found that SEC-specific DAPT exhibits diminishing marginal returns beyond ~250M tokens within a single epoch: "Both models exhibit their largest improvements in the early stages of continued pretraining: loss drops noticeably between 50M and 200M tokens, after which the rate of improvement slows." Our ~1B token corpus is already well past the diminishing-returns threshold. + +Additional epochs risk overfitting to the domain corpus without proportional downstream benefit, while general-domain capability remains stable through a single pass. + +### Sequence Length Decision + +ModernBERT was pre-trained with 8192-token context. We match this during DAPT to ensure all positional embedding and attention weights receive gradient updates. At seq_len=2048, the weights for positions 2048-8191 would receive no updates during DAPT. 
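The concatenate-and-chunk step this implies can be sketched with the standard recipe below. This is a minimal sketch for clarity; the project's actual implementation lives in `python/src/data/corpus.py` and may differ in details (e.g., remainder handling).

```python
from itertools import chain

SEQ_LEN = 8192  # match ModernBERT's pre-training context length

def group_texts(examples: dict) -> dict:
    """Concatenate tokenized documents and split into fixed-length blocks.

    Drops the trailing remainder shorter than SEQ_LEN, so every training
    sequence exercises the full positional range 0..8191.
    """
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total = (len(concatenated["input_ids"]) // SEQ_LEN) * SEQ_LEN
    return {
        k: [v[i : i + SEQ_LEN] for i in range(0, total, SEQ_LEN)]
        for k, v in concatenated.items()
    }
```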
+ +The tradeoff is memory: batch_size drops from 4 (at 2048) to 1 (at 8192), compensated by gradient_accumulation=32 to maintain effective batch size of 32. Training time is comparable because 4x fewer steps offset the slower per-step time. + +For our downstream task (paragraph classification at ~50-400 tokens), the long-context benefit is modest — the primary DAPT benefit is vocabulary and domain language patterns, which transfer at any sequence length. But there is no cost to using 8192, so we preserve the model's full capability. + +## Step 1: DAPT + +### Command + +```bash +cd python +bun run py:train dapt --config configs/dapt/modernbert.yaml +``` + +Equivalent to: `uv run main.py dapt --config configs/dapt/modernbert.yaml` + +### What happens + +1. Loads ModernBERT-large from HuggingFace (cached after first download) +2. Loads 14,756 docs from `data/dapt-corpus/`, filters 188 < 10K chars +3. Tokenizes all text, concatenates, chunks into ~136K sequences of 8192 tokens +4. Splits 2% validation (~2,700 sequences), 98% train (~133K sequences) +5. Trains 1 epoch of MLM with 30% masking, bf16, gradient checkpointing +6. ~4,257 steps total, logging every 50, checkpoint+eval every 1,000 +7. Saves final model + tokenizer to `checkpoints/dapt/modernbert-large/final/` +8. Reports final eval loss and perplexity + +### Expected duration + +~4-8 hours on RTX 3090 (depends on actual seconds/step at 8192 with gradient checkpointing). + +### Resume if interrupted + +HuggingFace Trainer auto-saves checkpoints every 1,000 steps. Re-run the same command — it detects existing checkpoints and resumes automatically. + +### Output + +``` +checkpoints/dapt/modernbert-large/ + checkpoint-1000/ + checkpoint-2000/ + checkpoint-3000/ + final/ <- final model + tokenizer + config.json + model.safetensors + tokenizer.json + ... +``` + +## Step 2: TAPT + +After DAPT completes, continue MLM on the 72K Item 1C paragraphs specifically. 
+ +### Command + +```bash +bun run py:train dapt --config configs/dapt/modernbert.yaml \ + --model-path ../checkpoints/dapt/modernbert-large/final \ + --data-path ../data/paragraphs/paragraphs-clean.patched.jsonl \ + --output-dir ../checkpoints/tapt/modernbert-large \ + --stage tapt +``` + +### What happens + +1. Loads the DAPT checkpoint (not the base ModernBERT) +2. Loads 72,045 patched paragraphs from `paragraphs-clean.patched.jsonl` +3. Tokenizes, concatenates, chunks (much smaller corpus — ~10M tokens) +4. Trains MLM with same hyperparameters +5. Saves to `checkpoints/tapt/modernbert-large/final/` + +### Expected duration + +~2-3 hours (much smaller corpus). + +### Output + +``` +checkpoints/tapt/modernbert-large/ + final/ <- SEC-cyBERT-large (DAPT + TAPT) +``` + +## Step 3: Ablation Checkpoints + +The training pipeline produces clean ablation rows for the paper: + +| Model | Checkpoint | Description | +|-------|-----------|-------------| +| Base | `answerdotai/ModernBERT-large` | Off-the-shelf, no domain adaptation | +| +DAPT | `checkpoints/dapt/modernbert-large/final` | After domain pre-training on 14.5K filings | +| +DAPT+TAPT | `checkpoints/tapt/modernbert-large/final` | After task pre-training on 72K paragraphs | + +Each checkpoint can be independently fine-tuned with classification heads to isolate the contribution of each pre-training stage. 
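The fine-tuning setup for these ablation rows can be sketched with the standard `transformers` sequence-classification head. A sketch under stated assumptions: `load_for_classification` is a hypothetical helper, not the project's fine-tuning code, and `num_labels=7` assumes the codebook's six substantive categories plus None/Other.

```python
# Ablation checkpoints from the table above.
CHECKPOINTS = {
    "base": "answerdotai/ModernBERT-large",
    "+DAPT": "checkpoints/dapt/modernbert-large/final",
    "+DAPT+TAPT": "checkpoints/tapt/modernbert-large/final",
}

def load_for_classification(checkpoint: str, num_labels: int = 7):
    """Attach a fresh, randomly initialized classification head to a checkpoint.

    Using an identical head and hyperparameters across all three checkpoints
    is what isolates each pre-training stage's contribution.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return tokenizer, model
```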
+ +## Monitoring + +During training, the Trainer logs to stderr every 50 steps: +- `loss` — training MLM loss (cross-entropy on masked tokens) +- `learning_rate` — current LR (ramps up during warmup, then decays) +- `epoch` — progress through the epoch + +Every 1,000 steps, it also reports: +- `eval_loss` — validation MLM loss +- Perplexity can be computed as `exp(eval_loss)` (the loss is natural-log cross-entropy, so the base is e, not 2) + +**What to watch for:** +- Training loss should decrease steadily from ~2.5-3.0 to ~1.5-2.0 +- Eval loss should track training loss (if eval loss diverges upward, the model is overfitting — but this is unlikely in 1 epoch) +- If loss spikes or goes to NaN, the learning rate may be too high + +## Artifacts + +| File | Purpose | +|------|---------| +| `python/configs/dapt/modernbert.yaml` | DAPT config | +| `python/configs/dapt/neobert.yaml` | NeoBERT config (if needed) | +| `python/main.py` | CLI entrypoint | +| `python/src/dapt/train.py` | Training loop | +| `python/src/data/corpus.py` | Corpus loading + tokenization | +| `python/src/common/config.py` | Typed YAML config | +| `ts/scripts/dapt-corpus-prep.ts` | Corpus preparation from HTML | +| `ts/scripts/dapt-corpus-analytics.ts` | Corpus analytics | +| `data/dapt-corpus/shard-*.jsonl` | Cleaned corpus (15 shards) | diff --git a/docs/DATA-QUALITY-AUDIT.md b/docs/DATA-QUALITY-AUDIT.md new file mode 100644 index 0000000..fc1a28b --- /dev/null +++ b/docs/DATA-QUALITY-AUDIT.md @@ -0,0 +1,421 @@ +# Data Quality Audit — SEC-cyBERT Corpus + +**Date:** 2026-03-29 +**Scope:** Full audit of DAPT corpus (14,756 docs) and paragraph data (72,045 paragraphs) +**Method:** 6 automated agents + manual investigation + +--- + +## 1. Executive Summary + +The data is in better shape than initially feared, but two significant issues were uncovered: + +1. **Inlined section headings affect ~22% of paragraphs** across all generators. These are section titles ("Risk Management and Strategy", "Board Oversight") prepended to paragraph body text with no separator. 
Consistent across generators = our extraction pipeline's heading detection, not a generator HTML quirk. + +2. **EFiling/EDGAR Agent (GoFiler/Novaworks XDX)** produces severely degraded extraction quality: 36.8% orphan word rate (8x corpus average), 5.9% fragment rate, lowest paragraphs-per-filing. This generator was hidden in a 45% "UNKNOWN" bucket until we identified it. It affects 1,014 filings and 5,779 paragraphs. + +**Decision:** Strip inlined headers from fine-tuning data. Expand orphan word patching to cover EFiling/XDX paragraphs. Tag all paragraphs with generator metadata for quality-aware training. + +--- + +## 2. Generator Landscape + +### Identification + +We identified **14 distinct filing generators** covering 99.99% of all 14,759 HTML files. Only 2 files remain unidentified (both 0-byte empty files). Detection used a combination of HTML meta tags, comments, namespace declarations, CSS class patterns, and CIK-based filing agent identification. + +Full reference: `docs/EDGAR-FILING-GENERATORS.md` + +### Generator Distribution + +| Generator | Files | % | Paragraphs | Quality Tier | +|-----------|-------|---|------------|-------------| +| Workiva | 3,592 | 24.3% | 22,407 | Clean | +| Inline XBRL (unattributed) | 2,417 | 16.4% | 15,233 | Clean | +| Donnelley Financial Solutions | 2,327 | 15.8% | 13,153 | Clean | +| EFiling/EDGAR Agent (XDX) | 1,997 | 13.5% | 5,779 | **Bad** | +| Toppan Merrill | 1,378 | 9.3% | 7,332 | OK | +| CompSci Transform | 879 | 6.0% | 3,287 | **Degraded** | +| SEC Publisher | 793 | 5.4% | — | — | +| ThunderDome | 732 | 5.0% | 3,581 | OK | +| Broadridge PROfile | 465 | 3.2% | 772 | OK | +| Certent | 86 | 0.6% | — | — | +| SGML-wrapped | 58 | 0.4% | — | — | +| IRIS Carbon | 20 | 0.1% | — | — | +| RDG Portal | 12 | 0.1% | — | — | +| PDF to EDGAR | 1 | <0.1% | — | — | + +Note: Not all HTML files produced paragraphs (some lack Item 1C, some are 8-Ks or amendments). 
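The first-match-wins tagging described above can be sketched as follows. Important caveat: the signature substrings below are illustrative placeholders only, not the real detection patterns — the actual logic combines meta tags, comments, namespace declarations, CSS class patterns, and CIK-based filing-agent lookup (see `ts/scripts/tag-generators.ts`).

```python
# Illustrative generator tagging: scan for a known signature substring.
# The markers below are PLACEHOLDERS for demonstration, not the production patterns.
GENERATOR_SIGNATURES = [
    ("wdesk", "Workiva"),            # hypothetical marker
    ("novaworks", "EFiling/XDX"),    # hypothetical marker
    ("toppan", "Toppan Merrill"),    # hypothetical marker
]

def tag_generator(html: str) -> str:
    """Return the first generator whose signature appears in the HTML."""
    lowered = html.lower()
    for marker, generator in GENERATOR_SIGNATURES:
        if marker in lowered:
            return generator
    return "UNKNOWN"
```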
+ +### Quality Metrics by Generator + +| Generator | Orphan% | Fragment% | Trunc% | InlHdr% | AvgWC | Paras/Filing | +|-----------|---------|-----------|--------|---------|-------|-------------| +| Workiva | 0.6% | 1.2% | 0.5% | 21.9% | 99.7 | 8.4 | +| Donnelley | 0.5% | 1.4% | 0.5% | 21.8% | 92.7 | 7.9 | +| Inline XBRL | 0.9% | 1.5% | 0.6% | 21.8% | 98.4 | 8.1 | +| Toppan Merrill | 3.2% | 3.0% | 1.4% | 23.1% | 84.7 | 8.1 | +| ThunderDome | 3.0% | 4.3% | 1.8% | 24.4% | 83.0 | 7.7 | +| Broadridge | 3.4% | 3.5% | 2.1% | 21.5% | 84.4 | 7.8 | +| **CompSci Transform** | **14.8%** | **5.8%** | 1.7% | 15.4% | 72.1 | 5.6 | +| **EFiling/XDX** | **36.8%** | **5.9%** | **2.1%** | 16.5% | 69.8 | 5.7 | +| *Corpus average* | *4.7%* | *2.3%* | *0.9%* | *21.5%* | *91.9* | *7.7* | + +**Bold** = >2x corpus average. + +Key observations: +- Inlined headers (~22%) are consistent across ALL generators → extraction pipeline issue, not generator-specific +- Orphan words are highly concentrated: EFiling/XDX (36.8%) and CompSci Transform (14.8%) account for the vast majority +- Workiva and Donnelley produce the cleanest output (>70% of paragraphs) +- EFiling/XDX also has the lowest paragraphs-per-filing (5.7 vs 7.7 avg), suggesting extraction misses content +- CompSci Transform was acquired by Broadridge in July 2024; newer filings may appear as Broadridge PROfile + +--- + +## 3. Issue Inventory + +### 3.1 Inlined Section Headings (~22% of paragraphs) + +**What:** Section headings like "Risk Management and Strategy", "Board Oversight", "Cybersecurity Governance" are prepended to paragraph body text with no separator. + +**Example:** +``` +Risk Management and Strategy We have designed our cybersecurity risk management program to identify, +assess, and manage risks from cybersecurity threats... +``` + +**Cause:** The `extractItem1C()` function in `fast-reparse.ts` extracts the full Item 1C text including sub-section headings, and the paragraph segmenter doesn't strip them. 
The headings become the first "sentence" of the paragraph. + +**Impact on classification:** +- The heading is a near-perfect predictor of `content_category` — creates shortcut learning risk +- The heading tells you nothing about `specificity_level` — model still has to read body text +- At inference time, heading presence will be inconsistent across filings +- **Decision: Strip from fine-tuning data.** Headings are consistent across generators, so a single detection heuristic works. + +**Detection heuristic:** +- Common Item 1C sub-headings: "Risk Management and Strategy", "Risk Management", "Board Oversight", "Governance", "Management('s) Role", "Cybersecurity Governance", "Incident Detection", "Incident Response", "Strategy", "Third Party", "Third-Party" +- Structural: 2-5 title-cased words at paragraph start, followed by sentence text starting with "We", "Our", "The", a pronoun, or an article + +### 3.2 Orphan Words (4.7% overall, concentrated in 2 generators) + +**What:** The first word of a paragraph is dropped during extraction, leaving a paragraph that starts with lowercase mid-sentence. + +**Example:** +``` +sole executive officer and director is responsible for assessing and managing cybersecurity risks... +``` +(should be: "Our sole executive officer...") + +**Cause:** HTML source wraps text at fixed column width. The element's opening tag consumes most of a line, so only the first word fits before a source newline. `stripHtml()` preserves that newline, and downstream processing drops the single-word fragment. + +**Scope by generator:** +- EFiling/XDX: 36.8% of its paragraphs (2,127 affected) +- CompSci Transform: 14.8% (487 affected) +- All others: <3.5% +- Total: ~3,400 paragraphs corpus-wide + +**Already patched:** 215 paragraphs were surgically patched in `paragraphs-clean.patched.jsonl`. The remaining ~3,185 need the same treatment. + +**Impact on classification:** Meaning is preserved — annotators and models can infer the missing word from context. 
But systematically missing subjects ("We", "Our") could subtly bias specificity assessment. + +### 3.3 Orphaned Fragments (2.3% overall) + +**What:** List items split from their parent paragraph, creating very short standalone paragraphs. + +**Example:** +``` +the use of external service providers, where appropriate, to assess, test or otherwise assist with +aspects of our security controls; +``` + +**Cause:** Semicolon-terminated list items are treated as paragraph boundaries by the segmenter. + +**Scope:** 250 paragraphs identified in the narrower audit; ~1,660 total with <25 words. + +**Impact:** These are classifiable in isolation (the content is clear) but lack the framing context of the parent list. Likely annotated correctly but may have lower model confidence. + +### 3.4 Truncated Paragraphs (0.37%) + +**What:** Paragraphs ending mid-sentence without terminal punctuation. + +**Two patterns:** +1. Paragraph absorbed the start of the next section's heading (ends with "Governance", "Identify") +2. True truncation — cross-reference sentence cut off ("Risk Factors" in this) + +**Scope:** 264 paragraphs. + +**Impact:** Low — 0.37% and meaning is usually recoverable from context. + +### 3.5 Cross-Filing Boilerplate (53.6%) + +**What:** Paragraphs with identical text appearing in multiple filings. Driven by law firms and compliance consultants providing template language. + +**Scope:** 38,601 paragraphs share text with at least one other filing. 1,705 unique boilerplate texts appear in 3+ filings. The most-duplicated text appears in 138 filings across 84 companies. + +**Impact:** This IS the construct being measured. Boilerplate paragraphs should be classified as Specificity Level 1 (Generic Boilerplate). Not a quality issue — it's the signal. + +--- + +## 4. 
DAPT Corpus Audit + +### 4.1 Corpus Stats + +- **14,756 documents**, 15 shards +- **~1.06 billion tokens** (ModernBERT tokenizer; chars/4.72, not chars/4.0) +- **Median doc length:** 347K chars (~73K tokens) +- **90.8% of docs exceed 8,192 tokens** — chunking is mandatory (handled by training pipeline) + +### 4.2 Issues Found + +| Issue | Scope | Verdict | +|-------|-------|---------| +| 188 docs < 10K chars (cover pages) | 0.04% of tokens | Filter out | +| XBRL preambles (8% of docs) | 0.18% of chars | Negligible | +| Financial table fragments (~25% of lines) | Widespread | Acceptable — SEC domain includes numbers | +| URLs in 80% of docs (~4 per doc) | Low | Optional cleanup | +| 64 8-K filings mixed in | Tiny | Keep — domain-relevant | +| 1,470 amendments (median 94K chars) | Substantial content | Keep | +| 2 single-block docs (no paragraph breaks) | 2 docs | Filter out | +| 242 near-duplicate cross-year filings | 1.6% | Keep — different content | +| 0 garbled text, 0 HTML artifacts | | Clean | +| 0 sentence boundary violations | | Clean | + +### 4.3 Decision + +Filter <10K char docs and 2 structureless docs. Everything else is acceptable for unsupervised MLM. The model will learn SEC language including financial notation, legal boilerplate, and cybersecurity terminology. + +--- + +## 5. 
Patch History + +### Patch 1: Orphan Word Fix (2026-03-29) + +- **Scope:** 215 paragraphs, 77 filings +- **Method:** Detect orphan word in raw HTML, prepend to paragraph text +- **Validation:** All prefix additions, 0 boundary changes, 0 text shrinkages +- **Files:** `paragraphs-clean.patched.jsonl`, `training.patched.jsonl` +- **Annotation impact:** 142 annotated paragraphs affected (0.28%), meaning preserved + +### Patch 2: Expanded Orphan Word Fix (2026-03-29) + +- **Scope:** 2,233 paragraphs (includes Patch 1's 215; net 2,026 new) +- **Method:** HTML lookback — find paragraph text in stripped HTML, extract preceding word +- **Top orphan words:** We (632), Our (403), As (152), The (91), To (84), In (78), Cybersecurity (64) +- **Validation:** 0 false positives after filtering "Table of Contents" artifacts. 1,122 candidates rejected (legitimate list items starting with lowercase). +- **Annotation impact:** 1,400 annotated paragraphs affected. Label bias detected: Strategy Integration 1.55x over-represented, Management Role 0.49x under-represented in orphan-word paragraphs. **Recommended: re-run Stage 1 on patched text (~$15-20, may resolve conflicts).** +- **Script:** `ts/scripts/patch-orphan-words.ts` +- **Patch file:** `data/paragraphs/patches/orphan-word-patches.jsonl` + +### Patch 3: Heading Stripping (2026-03-29) + +- **Scope:** 7,514 paragraphs (10.4%) +- **Method:** Explicit pattern matching against known Item 1C sub-section headings (71 unique headings). Validated by confirming body text starts with sentence-starting word. +- **Top headings stripped:** Risk Management and Strategy (2,453), Cybersecurity Risk Management and Strategy (1,281), Cybersecurity Governance (1,208), Governance (301), Third-Party Risk Management (224) +- **Annotation impact:** 5,013 annotated paragraphs. Heading removal eliminates shortcut learning risk (heading was near-perfect predictor of content_category). 
+- **Script:** Inline Python (see audit process notes) +- **Patch file:** `data/paragraphs/patches/heading-strip-patches.jsonl` + +### Patch 4: Colon-Headed Paragraphs (2026-03-29) + +- **Scope:** 370 paragraphs +- **Method:** Regex match for "Heading Text: Sentence..." patterns. Only fires when colon is followed by known sentence-starting word. +- **Top headings stripped:** Education and Awareness (97), Safeguards (18), Management (15), Approach (13), Training (11) +- **Annotation impact:** 227 annotated paragraphs. +- **Patch file:** `data/paragraphs/patches/colon-heading-patches.jsonl` + +### Patch 5: Extended Separator Headings (2026-03-29) + +- **Scope:** 184 paragraphs +- **Method:** Detect headings with period, dash/em-dash, semicolon, or ALL-CAPS separators that Patches 3-4 missed. +- **Annotation impact:** 133 annotated paragraphs. +- **Patch file:** `data/paragraphs/patches/heading-strip-v2-patches.jsonl` + +### Patch 6: HTML-Confirmed Headings (2026-03-29) + +- **Scope:** 343 paragraphs +- **Method:** Extract bold/underline/h-tag styled text from source HTML (cached in `filing-headings.jsonl`), match against paragraph starts, validate with sentence-start check. Zero false positives — if the HTML says it's bold, it's a heading. +- **855 ambiguous cases rejected** where styled text was a sentence subject (e.g., bold "Cybersecurity" starting "Cybersecurity is a critical component...") +- **Annotation impact:** 270 annotated paragraphs. 
+- **Scripts:** `ts/scripts/extract-html-headings.ts` (1.7s for 6,341 filings with 32 workers) +- **Patch file:** `data/paragraphs/patches/heading-strip-html-patches.jsonl` +- **Cache:** `data/paragraphs/quality/filing-headings.jsonl` + +### Cumulative Heading Strip Summary + +| Pass | Method | Count | Cumulative | +|------|--------|-------|-----------| +| Patch 3 | Explicit heading patterns (space separator) | 7,514 | 7,514 | +| Patch 4 | Colon separator | 370 | 7,884 | +| Patch 5 | Period/dash/caps/semicolon | 184 | 8,068 | +| Patch 6 | HTML bold/underline confirmed | 343 | 8,411 | +| **Total** | | **8,411** | **11.7% of corpus** | + +--- + +## 6. Data Integrity Rules + +1. **`paragraphs-clean.jsonl` is FROZEN.** Never modify. It is the original extraction output and the source of truth for reproducibility. + +2. **All fixes go through `.patched.jsonl` files.** The patched file has the same schema and IDs as the original. Text may differ. TextHash is updated. + +3. **Annotations link by paragraph `id` (UUID).** This linkage is stable across patches — IDs never change. + +4. **Never re-run extraction from HTML.** Cascade effects from merge logic changes cause thousands of ripple-effect text changes (documented in `docs/SEC-HTML-CLEANING.md`). Surgical JSONL patching is the only safe approach. + +5. **Every patch is documented** with scope, method, validation, and annotation impact. + +6. **Quality metadata is separate from text data.** Per-paragraph quality scores live in a separate file, not embedded in the paragraph data. This keeps the data schema stable. + +--- + +## 7. 
Quality Tier System + +Each paragraph gets a quality tier based on detected issues: + +| Tier | Criteria | Count | % | Training Action | +|------|----------|-------|---|-----------------| +| **clean** | No detected issues | 58,165 | 80.7% | Full weight (1.0) | +| **headed** | Had inlined section heading (now stripped) | 7,402 | 10.3% | Full weight (1.0) — heading removed | +| **degraded** | Embedded bullets (1,941), invisible merges (222), fragments, truncations, no-cyber | 4,331 | 6.0% | Downweight (0.5) — content preserved but structure degraded | +| **minor** | Had orphan word (now fixed) | 2,147 | 3.0% | Full weight (1.0) — word restored | + +Note: Tiers reflect the most severe issue. A paragraph can have multiple issues. All "headed" and "minor" paragraphs have been patched — the tier records what WAS wrong, not what IS wrong. + +### Sample Weighting Strategy + +During fine-tuning, each training sample is weighted by quality tier to reduce the influence of structurally degraded paragraphs without discarding them entirely: + +- **clean + headed + minor (1.0 weight):** Content is correct and text is clean (after patching). These form the reliable training signal. +- **degraded (0.5 weight):** Content is present but structural issues (concatenated list items, fragments, truncations) may cause the text to misrepresent paragraph-level semantics. The labels are likely correct (models can infer meaning despite structural noise), but the text doesn't match what the model will see at inference time on clean filings. Downweighting reduces overfitting to degraded patterns without losing the content signal. + +Sample weighting is applied via the HuggingFace Trainer's `sample_weight` column or a custom loss function that multiplies cross-entropy by the tier weight. 
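The custom-loss route can be sketched as a small `Trainer` subclass. This is an illustrative sketch, not the project's training code, and it assumes the `sample_weight` column is kept in each batch (the stock Trainer does not consume sample weights on its own; you would need `remove_unused_columns=False` and a collator that passes the column through).

```python
import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Multiply per-example cross-entropy by a quality-tier weight.

    Weights follow the tier table above: 1.0 for clean/headed/minor, 0.5 for
    degraded, carried in a `sample_weight` column of the batch.
    """

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        weights = inputs.pop("sample_weight")  # shape: (batch,)
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        per_example = torch.nn.functional.cross_entropy(
            outputs.logits, labels, reduction="none"
        )
        loss = (per_example * weights).mean()
        return (loss, outputs) if return_outputs else loss
```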
+ +### Additional Findings (from anomaly detection) + +| Finding | Count | Concern | +|---------|-------|---------| +| Embedded bullet points mid-text | 1,941 (flagged degraded) | MEDIUM — semicolon-separated list items without bullet markers | +| Invisible merges (no separators) | 222 (flagged degraded) | MEDIUM — list items concatenated with no trace of structure (e.g., Bancorp 34) | +| No cybersecurity keywords at all | 528 (348 annotated) | LOW — investigated, keyword filter was too narrow, labels correct | +| Cross-references to other SEC items | 5,750 | LOW — mostly legitimate "see Item 1A" refs | +| Dollar amounts in text | 46 | LOW — mostly legitimate incident costs | +| Paragraphs >400 words | 149 | LOW — possible failed splits | +| Repeated sentences within paragraph | 9 | LOW — copy-paste artifacts | + +--- + +## 8. Annotation Impact (Quantified) + +Of 49,795 annotated paragraphs: + +### Annotated set by generator + +| Generator | Annotated Paras | % of Annotated Set | +|-----------|----------------|-------------------| +| Inline XBRL | ~10,500 | 21.1% | +| Workiva | ~15,300 | 30.7% | +| Donnelley | ~9,000 | 18.1% | +| Toppan Merrill | ~5,900 | 11.8% | +| EFiling/XDX | 3,562 | 7.2% | +| ThunderDome | ~2,500 | 5.0% | +| CompSci Transform | 2,288 | 4.6% | +| Others | ~700 | 1.4% | + +### Orphan words in annotated set + +**2,178 annotated paragraphs (4.37%)** start with lowercase (non-list) — orphan word candidates. + +| Generator | Orphan Paras | % of Generator's Annotated | % of All Orphans | +|-----------|-------------|---------------------------|-----------------| +| EFiling/XDX | 1,389 | 39.0% | 63.8% | +| CompSci Transform | 401 | 17.5% | 18.4% | +| All others | 388 | <5% each | 17.8% | + +EFiling/XDX alone accounts for 63.8% of all orphan-word paragraphs in the annotated set. 
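The lowercase-start screen used for these counts can be sketched as below. This is an illustrative reimplementation of the candidate filter only; the production logic, including the HTML lookback that recovers the dropped word, lives in `ts/scripts/patch-orphan-words.ts`, and the semicolon check is a stand-in for its fuller list-item detection.

```python
def is_orphan_candidate(paragraph: str) -> bool:
    """Flag paragraphs that start lowercase and don't look like list items."""
    text = paragraph.strip()
    if not text or not text[0].isalpha() or not text[0].islower():
        return False
    # Semicolon-terminated fragments are usually legitimate list items
    # (cf. section 3.3), not orphan-word paragraphs.
    if text.endswith(";"):
        return False
    return True
```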
+ +### Label bias in orphan-word paragraphs + +- **Strategy Integration** is over-represented at 1.55x base rate (16.1% of orphan paras vs 10.4% overall) +- **Board Governance** and **Management Role** are under-represented (0.60x and 0.49x) — likely because governance headings/lead-in sentences get split off, leaving the orphan fragment lacking governance context + +This suggests orphan words may cause subtle category misclassification, not just missing text. + +### Inlined headers in annotated set + +**4,513 annotated paragraphs (9.06%)** have section headings merged into text. Relatively uniform across generators (~9-10%), but notably lower for EFiling/XDX (5.3%) and CompSci Transform (5.6%) — these generators split at headers rather than merging them. + +### Combined impact + +**6,691 annotated paragraphs (13.44%)** have either orphan-word OR inlined-header issues. + +Per generator: +- EFiling/XDX: 1,577 of 3,562 (44.3%) affected +- CompSci Transform: ~600 of 2,288 (~26%) affected +- All others: <15% affected + +--- + +## 9. Summary of Changes to Annotated Data + +| Change | Annotated Paragraphs Affected | Semantic Impact | +|--------|------------------------------|----------------| +| Orphan word restored | 1,400 | Label bias detected (Strategy 1.55x, Management 0.49x) | +| Heading stripped (all passes) | ~5,643 | Removes shortcut learning signal | +| No-cyber flagged as degraded | 348 | May want to exclude from training | +| **Total modified** | **~7,100 of 49,795 (14.3%)** | | + +## 10. Remaining Questions / Next Steps + +- **Re-run Stage 1 on orphan-word paragraphs** (~$15-20 for 1,400 paragraphs). Label bias suggests some misclassification. May resolve conflicts and save Stage 2 judge costs. +- **Heading-stripped paragraphs:** Existing labels are likely still valid — annotators classified the body text, not the heading. But could re-run if budget allows. 
+- **Exclude 348 no-cyber-keyword annotated paragraphs?** If labeled "None/Other" they're fine; if other categories, they're noise from section bleed. +- **855 ambiguous HTML heading cases** — bold/underline text at paragraph start but also a valid sentence subject. Would need manual review to resolve. +- **Run DAPT** — filter <10K char docs from DAPT corpus, then start training. + +--- + +## 11. Artifacts Produced + +### Data Files + +``` +data/paragraphs/ +├── paragraphs-clean.jsonl ← FROZEN original (72,045 paragraphs) +├── paragraphs-clean.patched.jsonl ← All 6 patches applied (orphan + heading) +├── training.patched.jsonl ← Training subset, all patches applied (49,795) +├── patches/ +│ ├── orphan-word-patches.jsonl ← 2,233 orphan word recovery records +│ ├── heading-strip-patches.jsonl ← 7,514 heading strip records (space sep) +│ ├── colon-heading-patches.jsonl ← 370 colon-heading strip records +│ ├── heading-strip-v2-patches.jsonl ← 184 period/dash/caps/semicolon headings +│ └── heading-strip-html-patches.jsonl← 343 HTML bold/underline confirmed headings +└── quality/ + ├── generator-tags.jsonl ← 14,759 accession → generator mappings + ├── quality-scores.jsonl ← 72,045 per-paragraph quality metadata + ├── filing-headings.jsonl ← Cached styled headings from HTML (3,459 filings) + └── ambiguous-filings.txt ← Filing list used for HTML heading extraction +``` + +### Scripts + +| Script | Purpose | +|--------|---------| +| `ts/scripts/patch-orphan-words.ts` | Detect and recover orphan words from HTML source | +| `ts/scripts/tag-generators.ts` | Identify filing generator from HTML signatures | +| `ts/scripts/extract-html-headings.ts` | Extract bold/underline headings from HTML (32-worker parallel, 1.7s) | +| `ts/scripts/dapt-corpus-prep.ts` | DAPT corpus preparation (HTML → clean JSONL, 32-worker parallel) | +| `scripts/detect_generators.py` | Python generator detection (initial analysis) | +| `scripts/generator_quality_analysis.py` | Generator × quality metrics 
cross-reference | +| `scripts/analyze_generator_quality.py` | Annotation impact analysis by generator | +| `scripts/find_heading_candidates.py` | Creative heading pattern hunt (7 approaches) | +| `scripts/data_quality_audit.py` | Statistical anomaly detection (content, structure, outliers) | +| `scripts/audit_corpus.py` | Text corruption checks | +| `scripts/audit_paragraphs.py` | Boundary audit (per-filing stats, coherence, duplicates) | + +### Documentation + +| Doc | Content | +|-----|---------| +| `docs/DATA-QUALITY-AUDIT.md` | This document — full audit findings, patch history, quality tiers | +| `docs/EDGAR-FILING-GENERATORS.md` | Generator reference — 14 vendors, signatures, market share, quality issues | +| `docs/SEC-HTML-CLEANING.md` | HTML cleaning lessons and pitfalls | diff --git a/docs/EDGAR-FILING-GENERATORS.md b/docs/EDGAR-FILING-GENERATORS.md new file mode 100644 index 0000000..5c2247e --- /dev/null +++ b/docs/EDGAR-FILING-GENERATORS.md @@ -0,0 +1,490 @@ +# SEC EDGAR Filing Generator Reference + +Reference for identifying which software generated a given SEC 10-K HTML filing. +Built from direct inspection of EDGAR filings and market research (March 2026). + +--- + +## 1. 
Major Vendors and HTML Signatures

### Workiva (Wdesk) -- Market Leader for 10-K/10-Q

**Filing agent CIK:** `0001628280`

**HTML comment signature (lines 1-3):**
```html
<!-- XBRL Document Created with the Workiva Platform -->
<!-- Copyright {year} Workiva -->
<!-- r:{uuid},g:{uuid},d:{uuid} -->
```

**Detection heuristics:**
- HTML comment: `XBRL Document Created with the Workiva Platform`
- HTML comment: `Copyright \d{4} Workiva`
- Third comment line contains `r:`, `g:`, `d:` UUIDs (document/generation tracking)
- `xml:lang="en-US"` attribute on `<html>` tag
- Body uses inline styles exclusively (no CSS classes on content elements)
- Heavy use of `<span>` with inline styles containing `background-color`, `font-family`, `font-size`, `font-weight`, `line-height` in every span
- Div IDs follow pattern: `i{hex32}_{number}` (e.g., `id="i56b78781f7c84a038f6ae0f6244f7dd8_1"`)
- Tables use `display:inline-table` and `vertical-align:text-bottom`
- iXBRL fact IDs follow pattern: `F_{uuid}` (e.g., `id="F_d8dc1eb1-109d-445d-a55a-3dde1a81ca63"`)
- No `` tag
- No CSS classes on body content (purely inline styles)

**Structural patterns:**
- Span-heavy: nearly every text fragment wrapped in `<span>`
- Font specified as `font-family:'Times New Roman',sans-serif` (note: sans-serif fallback, unusual)
- Line-height specified on every span (e.g., `line-height:120%`)
- Background color explicitly set: `background-color:#ffffff`

**Known quality issues:**
- Extremely verbose HTML; simple paragraphs become deeply nested span trees
- Text extraction is clean because span boundaries align with word boundaries
- Large file sizes due to inline style repetition

---

### DFIN / Donnelley Financial Solutions (ActiveDisclosure)

DFIN operates under **two distinct CIKs** with **two different HTML output formats**.
#### DFIN "New" ActiveDisclosure (primary)

**Filing agent CIK:** `0000950170` (also `0000950130`)

**HTML comment signature:**
```html
<!-- DFIN New ActiveDisclosure -->
<!-- http://www.dfinsolutions.com/ -->
<!-- Copyright (c) {year} Donnelley Financial Solutions -->
<!-- Creation Date : {ISO timestamp} -->
```

**Detection heuristics:**
- HTML comment: `DFIN New ActiveDisclosure`
- HTML comment: `http://www.dfinsolutions.com/`
- HTML comment: `Copyright (c) \d{4} Donnelley Financial Solutions`
- HTML comment: `Creation Date :` with ISO timestamp
- Body style: `padding:8px;margin:auto!important;`
- Inline styles use `font-kerning:none;min-width:fit-content;` on most spans
- Extensive use of `white-space:pre-wrap` on spans
- CSS class `item-list-element-wrapper` and `page-border-spacing` present
- iXBRL fact IDs follow pattern: `F_{uuid}`

**Structural patterns:**
- Every text span carries `min-width:fit-content` (distinctive)
- Uses `&nbsp;` for spacing extensively
- Uses `<p>` tags with inline margins for all paragraphs
- Tables use explicit `padding-top:0in;vertical-align:top;padding-bottom:0in` cell styles

#### DFIN Legacy (RR Donnelley heritage)

**Filing agent CIK:** `0001193125`

**HTML signature:**
```html
<html>
<head>
<title>10-K</title>
</head>
<body>
<a href="#toc">Table of Contents</a>
```

**Detection heuristics:**
- No identifying HTML comments (no generator/copyright comment)
- Accession number prefix `0001193125` is definitive
- Immediately starts with `<a>` Table of Contents link
- Uses deprecated namespace aliases: `xmlns:xl`, `xmlns:xbrll`, `xmlns:deprecated`
- iXBRL fact IDs follow pattern: `Fact_{large_number}` (e.g., `id="Fact_129727210"`)
- Uses `<font>` tags (HTML 3.2 style) in some documents
- Uppercase HTML tags in older filings (`<TABLE>`, `<FONT>`, `<DIV>`)

**Structural patterns:**
- Cleaner HTML than ActiveDisclosure New
- Uses semantic `<a>` anchors for table of contents
- Inline styles are simpler and more standard
- File description filenames follow pattern: `d{number}d10k.htm`

---

### Toppan Merrill (Bridge)

**Filing agent CIKs:** `0001104659` (primary), `0001558370` (secondary)

**HTML comment signature:**
```html
<!-- iXBRL document created with: Toppan Merrill Bridge iXBRL {version} -->
<!-- iXBRL Library version: {version} -->
<!-- iXBRL Service Job ID: {id} -->
```

**Detection heuristics:**
- HTML comment: `iXBRL document created with: Toppan Merrill Bridge iXBRL`
- HTML comment: `iXBRL Library version:`
- HTML comment: `iXBRL Service Job ID:`
- Includes version number in comment (e.g., `10.9.0.3`)
- `<title>` tag contains company name + period end date (e.g., `Sunstone Hotel Investors, Inc._December 31, 2024`)
- Uses `xmlns:xs` alongside `xmlns:xsi` (both XML Schema namespaces)
- Body starts with `<div style="margin-top:30pt;"></div>` (distinctive)
- iXBRL hidden div uses `display:none;` (no additional styles on the div)

**Structural patterns:**
- Context IDs use descriptive names with GUIDs: `As_Of_12_31_2024_{base64-like}`, `From_01_01_2024_to_12_31_2024_{guid}`
- Hidden fact IDs follow pattern: `Hidden_{base64-like}`
- Unit ref IDs follow pattern: `Unit_Standard_USD_{base64-like}`
- No CSS classes used on content elements
- Relatively clean HTML structure

---

### RDG Filings (ThunderDome Portal)

**Filing agent CIK:** `0001437749`

**HTML signature:**
```html
<?xml version='1.0' encoding='ASCII'?>
<html xmlns:thunderdome="http://www.RDGFilings.com" ...>
  <head>
    <title>avpt20241231_10k.htm</title>
    <!-- Generated by ThunderDome Portal -->
  </head>
```

**Detection heuristics:**
- XML namespace: `xmlns:thunderdome="http://www.RDGFilings.com"`
- HTML comment: `Generated by ThunderDome Portal`
- `<title>` contains the filing filename
- Body style includes `cursor: auto; padding: 0in 0.1in`
- iXBRL fact IDs prefixed with `thunderdome-` (e.g., `id="thunderdome-EntityCentralIndexKey"`)
- Context ref IDs use simple date ranges: `d_2024-01-01_2024-12-31`
- Other fact IDs follow `ixv-{number}` or `c{number}` pattern

**Market presence:** ~14,000
filings/year, rank #9 among filing agents. About 5% of annual filings. + +--- + +### Broadridge Financial Solutions (PROfile) + +**Filing agent CIKs:** `0001140361` (primary), `0001133228` (secondary) + +**HTML comment signature:** +```html +<!-- Licensed to: Broadridge + Document created using Broadridge PROfile 25.1.1.5279 + Copyright 1995 - 2025 Broadridge --> +``` + +**Detection heuristics:** +- HTML comment: `Licensed to: Broadridge` +- HTML comment: `Document created using Broadridge PROfile` with version number +- HTML comment: `Copyright 1995 - \d{4} Broadridge` +- CSS classes with `BRPF` prefix: `BRPFPageBreak`, `BRPFPageBreakArea`, `BRPFPageFooter`, `BRPFPageHeader`, `BRPFPageNumberArea` +- CSS class: `DSPFListTable` +- CSS class: `cfttable` +- CSS class: `Apple-interchange-newline` (suggests Mac/WebKit origin) +- Context ref IDs use XBRL-standard descriptive format: `c20240101to20241231_AxisName_MemberName` + +**Note:** Broadridge acquired CompSci Resources LLC in July 2024 and is integrating CompSci's Transform platform. Filings may transition to Broadridge branding over time. + +--- + +### CompSci / Novaworks (Transform and GoFiler) + +CompSci Resources produces two tools that leave distinct signatures. 
#### CompSci Transform (now Broadridge)

**Filed via:** EdgarAgents LLC (`0001213900`) or other agents

**HTML comment signature:**
```html
<?xml version='1.0' encoding='ASCII'?>
<!-- Generated by CompSci Transform (tm) - http://www.compsciresources.com -->
<!-- Created: Mon Mar 17 19:46:10 UTC 2025 -->
```

**Detection heuristics:**
- HTML comment: `Generated by CompSci Transform`
- HTML comment: `http://www.compsciresources.com`
- XML namespace: `xmlns:compsci="http://compsciresources.com"`
- Body wrapped in: `<div style="font: 10pt Times New Roman, Times, Serif">`
- Uses `<!-- Field: Rule-Page -->` and `<!-- Field: /Rule-Page -->` HTML comments as structural markers
- Empty `<div>` tags used as spacers between paragraphs
- iXBRL context refs use simple sequential IDs: `c0`, `c1`, `c2`, ...
- iXBRL fact IDs follow `ixv-{number}` pattern
- Uses shorthand CSS: `font: 10pt Times New Roman, Times, Serif` (combined property)
- Margin shorthand: `margin: 0pt 0`

**Known quality issues:**
- Words can be broken across `<span>` tags mid-word
- Heavy use of `&nbsp;` for spacing
- Empty divs between every paragraph create parsing noise
- `<!-- Field: ... 
-->` comments interspersed throughout document body

#### Novaworks GoFiler (XDX format)

**Filed via:** SECUREX Filings (`0001214659`) or self-filed

**HTML signature:**
```html
<head>
  <title></title>
</head>
<!-- Field: Set; Name: xdx; ... -->
<!-- Field: Set; Name: xdx; ... -->
<body style="font: 10pt Times New Roman, Times, Serif">
```

**Detection heuristics:**
- HTML comments with pattern: `Field: Set; Name: xdx...`
- XDX comments appear between `</head>` and `<body>` (unusual placement)
- Body style: `font: 10pt Times New Roman, Times, Serif` (same shorthand as CompSci)
- Empty `<title>` tag
- iXBRL fact IDs use `xdx2ixbrl{number}` pattern (e.g., `id="xdx2ixbrl0102"`)
- Standard fact IDs use `Fact{number:06d}` pattern (e.g., `id="Fact000003"`)
- Context refs use `From{date}to{date}` or `AsOf{date}` format (no separators within date)

**XDX explained:** XDX (XBRL Data Exchange) is GoFiler's proprietary format that uses HTML tag ID attributes ("engrams") to embed XBRL metadata. The `xdx_` comments carry taxonomy, entity, period, and unit definitions that GoFiler uses to generate the final iXBRL.

---

### Discount EDGAR / NTDAS (XBRLMaster / EDGARMaster)

**Filing agent CIK:** `0001477932`

**HTML signature:**
```html
<head>
<title>crona_10k.htm</title>
<!-- Document Created by XBRLMaster -->
</head>
```

**Detection heuristics:**
- HTML comment: `Document Created by XBRLMaster`
- Body style: `text-align:justify;font:10pt times new roman`
- Hidden iXBRL div has `id="XBRLDIV"`
- Additional body styles include `margin-left:7%;margin-right:7%`
- Uses lowercase `times new roman` (no capitalization)
- iXBRL fact IDs use `ixv-{number}` pattern

---

### EdgarAgents LLC

**Filing agent CIK:** `0001213900`

EdgarAgents is a filing agent service, not a document creation tool. The HTML they submit is typically generated by CompSci Transform, GoFiler, or other tools. Check the HTML comments to identify the actual generator.

---

### DFIN Legacy (pre-iXBRL / SGML-era)

**Filing agent CIK:** `0001193125`

Older filings (pre-2019) from this CIK may appear in `<DOCUMENT>` SGML wrapper format:
```html
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d913213d10k.htm
<DESCRIPTION>10-K
<TEXT>
<HTML>
<HEAD>
<TITLE>10-K</TITLE>
</HEAD>
<BODY BGCOLOR="WHITE">
```

**Detection heuristics:**
- Uppercase HTML tags: `<HTML>`, `<HEAD>`, `<TITLE>`, `<BODY>`, `<TABLE>`
- `BGCOLOR="WHITE"` attribute (deprecated HTML)
- `<Center>` tag with capital C
- `<FONT>` tags for styling
- Filename pattern: `d{number}d10k.htm`

---

## 2. Filing Agent Market Share

Based on [secfilingdata.com](https://www.secfilingdata.com/top-filing-agents/) total filings across all form types:

| Rank | Filing Agent | CIK | 2025 Filings | Total (All Time) |
|------|-------------|-----|-------------|-----------------|
| 1 | Donnelley Financial (DFIN) | 0001193125 | 65,180 | 1,872,890 |
| 2 | EdgarAgents LLC | 0001213900 | 48,021 | 367,211 |
| 3 | Quality Edgar (QES) | 0001839882 | 38,017 | 151,031 |
| 4 | Toppan Merrill | 0001104659 | 48,260 | 988,715 |
| 5 | WallStreetDocs Ltd | 0001918704 | 22,387 | 56,431 |
| 6 | Workiva (Wdesk) | 0001628280 | 21,606 | 141,795 |
| 7 | M2 Compliance LLC | 0001493152 | 13,810 | 164,603 |
| 8 | Davis Polk & Wardwell LLP | 0000950103 | 16,231 | 326,359 |
| 9 | RDG Filings (ThunderDome) | 0001437749 | 14,209 | 187,270 |
| 10 | Morgan Stanley | 0001950047 | 12,822 | 56,468 |
| 11 | Broadridge | 0001140361 | -- | 597,664 |
| 14 | SECUREX Filings | 0001214659 | -- | 115,218 |
| 19 | Blueprint | 0001654954 | -- | 62,250 |
| 20 | FilePoint | 0001398344 | -- | 76,218 |
| 38 | Discount EDGAR | 0001477932 | -- | 37,422 |

**For 10-K/10-Q specifically (estimated from biotech IPO data and market research):**
- DFIN: ~40-50% of annual/quarterly filings
- Workiva: ~25-35% (has been gaining share from DFIN since ~2010)
- Toppan Merrill: ~10-15%
- RDG Filings: ~5%
- Broadridge/CompSci: ~5%
- Others (law firms, self-filed, smaller agents): ~5-10%

---

## 3. XBRL/iXBRL Tool Signatures

The iXBRL tagging tool is often the same as the filing generator, but not always.
Key distinguishing patterns in the iXBRL layer: + +| Tool | Context Ref Pattern | Fact ID Pattern | Unit Ref Pattern | +|------|-------------------|----------------|-----------------| +| Workiva | `C_{uuid}` | `F_{uuid}` | `U_{uuid}` | +| DFIN New | `C_{uuid}` | `F_{uuid}` | Standard names | +| DFIN Legacy | `Fact_{large_int}` | `Fact_{large_int}` | Standard names | +| Toppan Merrill | `As_Of_{date}_{guid}` / `From_{date}_to_{date}_{guid}` | `Hidden_{guid}` | `Unit_Standard_USD_{guid}` | +| ThunderDome | `d_{date_range}` / `i_{date}` | `thunderdome-{name}` or `ixv-{n}` or `c{n}` | Standard names | +| CompSci Transform | `c0`, `c1`, `c2` ... | `ixv-{number}` | Standard names | +| GoFiler (XDX) | `From{date}to{date}` / `AsOf{date}` | `xdx2ixbrl{number}` | Standard names | +| XBRLMaster | `From{date}to{date}` | `ixv-{number}` | Standard names | +| Broadridge PROfile | `c{date}to{date}_{axis}_{member}` | Descriptive | Standard names | + +--- + +## 4. Detection Priority (Recommended Heuristic Order) + +For maximum reliability, check signatures in this order: + +1. **HTML comments** (first 10 lines) -- most generators embed identifying comments + - `Workiva Platform` --> Workiva + - `DFIN New ActiveDisclosure` --> DFIN New + - `Toppan Merrill Bridge` --> Toppan Merrill + - `ThunderDome Portal` --> RDG Filings + - `CompSci Transform` --> CompSci/Broadridge + - `Broadridge PROfile` --> Broadridge + - `XBRLMaster` --> Discount EDGAR / NTDAS +2. **XML namespaces** on `` tag + - `xmlns:thunderdome="http://www.RDGFilings.com"` --> RDG + - `xmlns:compsci="http://compsciresources.com"` --> CompSci +3. **XDX comments** between head and body --> GoFiler/Novaworks +4. **Accession number prefix** (first 10 digits) --> identifies filing agent CIK +5. **Body style patterns** as fallback +6. **iXBRL fact ID patterns** as secondary confirmation + +--- + +## 5. 
Known Quality Issues by Generator

### CompSci Transform
- **Words broken across spans**: Text is split at arbitrary character boundaries, not word boundaries. A single word like "cybersecurity" may be split across 2-3 `<span>` tags. This breaks naive text extraction that operates per-element.
- **Empty div spacers**: `<div></div>` between every paragraph adds noise.
- **Field comments in body**: `<!-- Field: ... -->` markers interspersed with content.

### Workiva
- **Extreme span nesting**: Every text run gets its own `<span>` with full inline style. A simple bold sentence may have 5+ spans.
- **Large file sizes**: Inline style repetition causes 10-K files to be 2-5x larger than equivalent DFIN filings.
- **Clean word boundaries**: Despite heavy span usage, spans align with word/phrase boundaries, making text extraction reliable.

### DFIN New ActiveDisclosure
- **`min-width:fit-content` everywhere**: Unusual CSS property on every span; may cause rendering inconsistencies in older browsers.
- **`font-kerning:none`**: Explicit kerning disable on all text spans.
- **Generally clean**: Text extraction works well; word boundaries respected.

### DFIN Legacy
- **Uppercase HTML tags**: Older filings use `<TABLE>`, `<FONT>`, `<B>` -- need case-insensitive parsing.
- **Mixed HTML versions**: Some documents mix HTML 3.2 and 4.0 constructs.
- **SGML wrappers**: Some filings wrapped in `<DOCUMENT>` SGML envelope.

### GoFiler / Novaworks
- **XDX comment noise**: Multiple `xdx_` comments that must be stripped.
- **Generally clean HTML**: Body content is straightforward.

### Toppan Merrill Bridge
- **Clean output**: Among the cleanest generators. Minimal inline style bloat.
- **GUID-heavy IDs**: Context and unit refs use base64-like GUIDs that are less human-readable.

---

## 6. Self-Filed / In-House Filings

Some large filers submit directly using their own CIK as the accession number prefix. These filings have **no generator comment** and variable HTML quality.

**Detection:** Accession number prefix matches the filer's own CIK (e.g., Halliburton CIK `0000045012` files with accession `0000045012-25-000010`).

**However:** Even self-filed companies typically use a commercial tool. Halliburton's self-filed 10-K contains the Workiva comment signature, indicating they use Workiva but submit directly rather than through a filing agent.

**Truly in-house HTML** (no commercial tool) is rare among 10-K filers. When it occurs:
- No identifying comments
- No consistent structural patterns
- May use Word-to-HTML conversion (look for `mso-` CSS prefixes from Microsoft Office)
- May have minimal or no iXBRL tagging

---

## 7. Law Firm Filings

Several large law firms act as filing agents:
- Davis Polk & Wardwell (`0000950103`) -- 326K total filings
- Paul Weiss (`0000950142`) -- 56K total filings
- Foley & Lardner (`0000897069`) -- 30K total filings
- Sidley Austin (`0000905148`) -- 39K total filings
- Seward & Kissel (`0000919574`) -- 107K total filings

Law firms typically file transactional documents (S-1, proxy, 8-K) rather than periodic 10-K filings. The HTML in law-firm-filed documents often comes from Word conversion and lacks commercial generator signatures.
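The priority order from section 4 reduces to a first-match scan over the signature patterns. A sketch (patterns taken from the tables in this document; `detect_generator` is an illustrative helper, not the `ts/scripts/tag-generators.ts` implementation):

```python
import re

# (pattern, generator) pairs in priority order: identifying comments first,
# then namespaces and XDX comments, then style patterns as fallback.
GENERATOR_PATTERNS = [
    (r"Workiva Platform", "Workiva"),
    (r"DFIN New ActiveDisclosure", "DFIN (New)"),
    (r"Toppan Merrill Bridge", "Toppan Merrill"),
    (r"ThunderDome Portal", "RDG Filings"),
    (r"CompSci Transform", "CompSci/Broadridge"),
    (r"Broadridge PROfile", "Broadridge"),
    (r"XBRLMaster", "Discount EDGAR"),
    (r'xmlns:thunderdome="http://www\.RDGFilings\.com"', "RDG Filings"),
    (r'xmlns:compsci="http://compsciresources\.com"', "CompSci/Broadridge"),
    (r"Field: Set; Name: xdx", "GoFiler/Novaworks"),
    (r"min-width:fit-content", "DFIN (New)"),
    (r"BRPFPage", "Broadridge"),
    (r'id="XBRLDIV"', "Discount EDGAR"),
]

def detect_generator(html: str) -> str:
    head = html[:50_000]  # signatures live near the top of the file
    for pattern, generator in GENERATOR_PATTERNS:
        if re.search(pattern, head):
            return generator
    return "UNKNOWN"
```

Accession-prefix lookup (step 4 of the priority list) would slot in before the style fallbacks when the HTML itself is silent.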
+ +--- + +## 8. Summary: Quick Detection Regex Table + +``` +Pattern | Generator +-----------------------------------------------------|------------------ +/Workiva Platform/ | Workiva +/DFIN New ActiveDisclosure/ | DFIN (New) +/Donnelley Financial Solutions/ | DFIN (New) +/Toppan Merrill Bridge/ | Toppan Merrill +/ThunderDome Portal/ | RDG Filings +/CompSci Transform/ | CompSci/Broadridge +/Broadridge PROfile/ | Broadridge +/XBRLMaster/ | Discount EDGAR +/xmlns:thunderdome="http:\/\/www\.RDGFilings\.com"/ | RDG Filings +/xmlns:compsci="http:\/\/compsciresources\.com"/ | CompSci +/Field: Set; Name: xdx/ | GoFiler/Novaworks +/dfinsolutions\.com/ | DFIN +/min-width:fit-content/ | DFIN (New) +/BRPFPage/ | Broadridge PROfile +/id="XBRLDIV"/ | XBRLMaster +``` + +--- + +## Sources + +- Direct inspection of SEC EDGAR filings (March 2026) +- [secfilingdata.com/top-filing-agents](https://www.secfilingdata.com/top-filing-agents/) -- filing agent rankings +- [newstreetir.com -- Top SEC Filing Agents for Biotech IPOs](https://newstreetir.com/2025/05/14/who-are-the-top-sec-filing-agents-for-biotech-ipos/) -- biotech IPO market share +- [houseblend.io -- SEC Filing Software Platforms](https://www.houseblend.io/articles/sec-filing-software-platforms-pricing-compliance) -- vendor comparison +- [novaworkssoftware.com/inlinexbrl](https://www.novaworkssoftware.com/inlinexbrl.php) -- XDX format documentation +- [rdgfilings.com/thunderdome](https://rdgfilings.com/thunderdome-client-portal/) -- ThunderDome Portal +- [toppanmerrill.com/bridge](https://www.toppanmerrill.com/bridge/) -- Toppan Merrill Bridge +- [edgarmaster.com](https://edgarmaster.com/) -- EDGARMaster / XBRLMaster by NTDAS +- [pernasresearch.com -- DFIN analysis](https://pernasresearch.com/research-vault/donnelley-financial-initiation/) -- market share dynamics diff --git a/docs/LABELING-CODEBOOK.md b/docs/LABELING-CODEBOOK.md index 622404b..ef063cc 100644 --- a/docs/LABELING-CODEBOOK.md +++ b/docs/LABELING-CODEBOOK.md @@ 
-271,6 +271,16 @@ No materiality assessment. Pure cross-reference. → **None/Other, Specificity 1 Despite touching RMP (no program), Board Governance (board is responsible), and Strategy Integration (no incidents), the paragraph contains no substantive disclosure. The company explicitly has no program, and the board mention is perfunctory ("generally responsible... if any"). The absence of a program is not a program description. → **None/Other, Specificity 1.** +### Case 9: Generic regulatory compliance language +> *"Regulatory Compliance: The Company is subject to various regulatory requirements related to cybersecurity, data protection, and privacy. Non-compliance with these regulations could result in financial penalties, legal liabilities, and reputational damage."* + +This acknowledges that regulations exist and non-compliance would be bad — a truism for every public company. It does not describe any process, program, or framework the company uses to comply. It does not make a materiality assessment. It names no specific regulation. 
→ **None/Other, Specificity 1.**

The key distinctions:
- If the paragraph names a specific regulation (GDPR, HIPAA, PCI DSS, CCPA) but still describes no company-specific program → **Risk Management Process, Specificity 2** (named standard triggers Sector-Adapted)
- If the paragraph assesses whether regulatory non-compliance has "materially affected" the business → **Strategy Integration** (materiality assessment per Rule 6)
- If the paragraph describes what the company *does* to comply (audits, controls, certifications) → **Risk Management Process** at appropriate specificity

---

## Dimension 2: Specificity Level

diff --git a/docs/NARRATIVE.md b/docs/NARRATIVE.md
index 0690e0b..9dcc3c5 100644
--- a/docs/NARRATIVE.md
+++ b/docs/NARRATIVE.md
@@ -65,6 +65,7 @@ After extracting clean section text, splitting into paragraphs had its own chall
- **Bullet list merging.** Disclosures frequently use bullet lists ("Our program includes: • risk assessment • vulnerability scanning"). Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless.
- **Continuation line detection.** Sentences split across HTML block elements need rejoining. Heuristic: if the previous block lacks terminal punctuation and the next starts lowercase or with a continuation phrase (`and`, `or`, `including`, `such as`), merge.
- **Length boundaries.** Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries to keep annotation units manageable.
- **Table-based bullet lists and the cascade failure.** Some generators (notably EFiling/XDX) render bullet lists as HTML tables with one `<tr>` per bullet item, and use `·` (middle dot in Symbol font) instead of the standard `•` bullet character. Since `stripHtml()` doesn't decode `·` as a bullet marker, the bullet-aware merge logic never fires. 
Each bullet item starts lowercase ("establishing...", "maintaining..."), so the segmenter treats them as continuation fragments and merges them with the preceding block. This cascades: a Bancorp 34 filing had three separate elements — two bullet items about risk management processes and a standalone paragraph disclosing a $25,000 cybersecurity incident — concatenated into a single 114-word run-on sentence. The HTML structure was completely unambiguous (separate `<tr>` and `<p>
` elements with spacers), but the information was lost during text extraction. The data quality audit found 2,210 paragraphs with embedded bullet points across the corpus — most from this class of failure. These paragraphs are still classifiable (the models unanimously labeled this example as Incident Disclosure / Specificity 4), but the text quality is degraded. ### 8-K Extraction @@ -549,6 +550,174 @@ This gives us clean ablation rows: base → +DAPT → +TAPT → +SCL, isolating --- +## Phase 10: Data Quality Audit and Corpus Remediation + +### The Discovery + +While preparing the DAPT corpus, we discovered that the paragraph data was less clean than we assumed. The extraction pipeline had been built to handle the worst HTML artifacts (word splits, XBRL tags, page breaks), but two systematic issues had been silently corrupting the training data: + +1. **Orphan words.** HTML source wraps text at fixed column width. When a `` tag consumes most of a line, only the first word fits before the source newline. `stripHtml()` preserved that newline, and the paragraph segmenter dropped the single-word fragment. Result: paragraphs like "sole executive officer and director is responsible for..." instead of "Our sole executive officer..." — 4.7% of all paragraphs. + +2. **Inlined section headings.** The paragraph segmenter didn't strip sub-section headings ("Risk Management and Strategy", "Board Oversight") from paragraph body text. These headings became the first "sentence" of the paragraph. Result: 22% of paragraphs had section titles prepended to body text — a near-perfect predictor of `content_category` that creates shortcut learning risk. + +### The Generator Investigation + +Initial quality metrics showed 45% of filings in an "UNKNOWN" generator bucket. This felt wrong — SEC HTML comes from identifiable tools. 
We investigated and identified **14 distinct filing generators** covering 99.99% of 14,759 HTML files using meta tags, comments, namespace declarations, CSS patterns, and CIK-based filing agent lookup. + +The investigation revealed that the worst-quality generator, **EFiling/EDGAR Agent (GoFiler/Novaworks XDX)**, had been hidden in the UNKNOWN bucket. It accounts for 13.5% of all filings but produces 36.8% orphan word rate (8x corpus average), the lowest paragraphs-per-filing (5.7 vs 7.7 avg), and 5.9% fragment rate. The second worst, **CompSci Transform** (6% of filings), had a 14.8% orphan word rate. + +By contrast, the clean generators — Workiva (24.3%), Donnelley (15.8%), and Inline XBRL (16.4%) — all had <1% orphan word rates. Over 70% of paragraphs came from clean generators. The problem was concentrated, not uniform. + +Full generator reference: `docs/EDGAR-FILING-GENERATORS.md`. Full audit findings: `docs/DATA-QUALITY-AUDIT.md`. + +### Six Surgical Patches + +All fixes follow the same principle: `paragraphs-clean.jsonl` is **frozen** — never modified. All fixes go through separate `.patched.jsonl` files. Annotations link by paragraph UUID, which never changes. Every patch is documented with scope, method, and validation. + +| Patch | Method | Paragraphs | Annotated | +|-------|--------|-----------|-----------| +| 1-2. Orphan word restoration | HTML lookback: find paragraph text in stripped HTML, extract preceding word | 2,233 | 1,537 | +| 3. Heading strip (space separator) | Pattern match against 71 known Item 1C sub-headings | 7,514 | 5,013 | +| 4. Heading strip (colon separator) | "Heading Text: Sentence..." patterns | 370 | 227 | +| 5. Heading strip (period/dash/caps) | Extended separator detection | 184 | 133 | +| 6. 
HTML-confirmed headings | Bold/underline/h-tag extraction from source HTML, validated against paragraph starts | 343 | 270 | +| **Total** | | **8,411 headings + 2,233 orphans** | **~7,100 of 49,795 (14.3%)** | + +The heading detection required five progressive passes because no single heuristic caught all separator styles. The HTML-confirmed pass (Patch 6) used a 32-worker parallel extraction script to scan 6,341 filings in 1.7 seconds, caching styled headings per filing for reuse. + +### Orphan Word Re-Annotation + +The orphan word patches weren't just cosmetic. Analysis revealed **label bias** in orphan-word paragraphs: +- Strategy Integration 1.55x over-represented (16.1% vs 10.4% baseline) +- Management Role 0.49x under-represented +- Board Governance 0.60x under-represented + +Missing subject words like "Our", "We", "The" strip governance context that models rely on for classification. This suggested the original annotations on these paragraphs might be systematically wrong. + +**Decision: re-run Stage 1 on patched text.** Cost: $3.30 for 4,611 annotations (1,537 paragraphs × 3 models), completed in ~9 minutes at 60 concurrency with zero failures. + +**Results:** +- **119 paragraphs (7.7%)** changed consensus category — confirming the bias was real +- **37 paragraphs (2.4%)** changed consensus specificity +- **152 total (9.9%)** changed on at least one dimension +- mimo-v2-flash was most sensitive (14.6% category changes); gemini least affected (6.0%) +- 18 original conflicts resolved, 22 new conflicts introduced — roughly a wash on Stage 2 savings +- Top transitions: Management Role ↔ Risk Management Process (55/51 each direction), Strategy Integration → None/Other (46), Third-Party Risk → Risk Management Process (34) + +The re-run annotations are stored separately in `data/annotations/stage1-orphan-rerun.jsonl` — the original `stage1.jsonl` is untouched. For training, the re-run annotations replace the originals for the affected 1,537 paragraphs. 
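The replacement is a last-writer-wins merge keyed on the paragraph UUID. A sketch (the `paragraph_id` key and field names are assumptions, not verified against the actual JSONL schema):

```python
import json

def load_stage1_with_rerun(stage1_path, rerun_path):
    """Stage 1 annotations, with orphan re-run records overriding by paragraph UUID."""
    by_id = {}
    with open(stage1_path) as f:
        for line in f:
            rec = json.loads(line)
            by_id[rec["paragraph_id"]] = rec
    with open(rerun_path) as f:
        for line in f:
            rec = json.loads(line)
            by_id[rec["paragraph_id"]] = rec  # re-run record replaces the original
    return list(by_id.values())
```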
### No-Cyber-Keyword Paragraphs: A False Alarm

The quality audit flagged 528 paragraphs (348 annotated) with no cybersecurity keywords at all — suspicious for Item 1C content. Initial expectation: these are section bleed from adjacent filing sections, probably labeled None/Other.

**Actual finding:** 65.2% (227 paragraphs) were labeled as real categories — mostly Risk Management Process (44.8%) and Management Role (10.6%). And the labels were **correct.** The paragraphs discuss security topics using synonymous terms: "risk assessment", "access to systems", "theft of intellectual property", "safeguards", "internal notifications" — all legitimate cybersecurity content that doesn't use the literal word "cybersecurity." The keyword filter was too narrow, not the paragraphs. All 348 are kept.

### Heading-Stripped Paragraphs: Labels Still Valid

For the ~5,643 annotated paragraphs where headings were stripped, existing labels are retained without re-annotation. The heading was a shortcut learning signal (a near-perfect predictor of category), but annotators classified the body text, not the heading. Stripping the heading from training data removes a leaky feature without invalidating the label.

### Embedded Bullet Lists: The Cascade Failure

A spot-check of a Bancorp 34, Inc. paragraph revealed a class of structural corruption we hadn't detected. The paragraph read as a 114-word run-on:

> establishing and maintaining a comprehensive program to oversee and manager external connections and third-party relationships with access to the institution's technology assets maintaining an incident response program intended to enable us to mitigate the impact of, and recover from, any cyberattacks, and facilitate communication to internal and external experienced a single cybersecurity event in June of 2023...

The source HTML (filed via EFiling/XDX) had three clearly separate elements: two table-row bullet items about risk management processes, and a standalone row disclosing a $25,000 cybersecurity incident. The HTML structure was unambiguous — separate table rows with spacers between them.

**Root cause: a three-part cascade failure in the extraction pipeline.**

1. **Bullet character not recognized.** The HTML used `·` (middle dot in Symbol font) instead of `•` (standard bullet). `stripHtml()` doesn't decode it, so the bullet-aware merge logic in the segmenter never fires.
2. **Lowercase continuation merge.** Each bullet starts lowercase ("establishing...", "maintaining..."), so the segmenter treats them as continuation fragments of the previous block.
3. **Short-block append.** Individual bullets fall below the 20-word minimum, so they get appended to the previous paragraph.

The result: two process-description bullet items and an incident disclosure fused into one incoherent paragraph. Despite this, all three Stage 1 models unanimously labeled it Incident Disclosure / Specificity 4 — the $25K incident detail dominated the merged text.

We identified two classes of this failure:

1. **Semicolon-separated merges (1,941 paragraphs):** The semicolons from the original list survived, but the bullet characters were stripped. Detectable by heuristic (3+ semicolons, lowercase after each, no bullet markers).
2. **Invisible merges (222 paragraphs):** Even the semicolons were stripped, leaving text that simply runs together with no trace of the original list structure. The Bancorp 34 example falls in this category — "to internal and external experienced a single cybersecurity event" is an impossible English sentence that a regex cannot distinguish from legitimate prose. These were detected by a secondary heuristic (lowercase start, not orphan-patched, 60+ words), but this is an undercount — some invisible merges start with uppercase text.

All 2,163 were reclassified to the "degraded" tier.
These aren't worth patching — splitting merged bullets would require per-paragraph HTML structure analysis and re-annotation of every resulting fragment. Instead, they'll be downweighted (0.5x) during fine-tuning to reduce overfitting to degraded text patterns while preserving their content signal.

### Sample Weighting for Fine-Tuning

The quality tier system maps directly to training sample weights:

| Tier | Weight | Rationale |
|------|--------|-----------|
| clean | 1.0 | No issues |
| headed | 1.0 | Heading removed, body text intact |
| minor | 1.0 | Orphan word restored |
| degraded | 0.5 | Labels likely correct, but text structure doesn't match clean inference-time inputs |

This is implemented via a `sample_weight` column in the training dataset. The HuggingFace Trainer has no built-in per-sample loss weighting, so the weight is applied when computing the loss (via a `compute_loss` override) — each sample's cross-entropy loss is multiplied by its tier weight before backpropagation. Degraded paragraphs still contribute to learning, but their influence is halved relative to clean data.

### Data Integrity Framework

The audit produced a formal data integrity framework:

1. `paragraphs-clean.jsonl` is frozen — the reproducibility anchor
2. All fixes go through `.patched.jsonl` — same schema, same IDs, updated text and hash
3. Annotations link by UUID — stable across patches
4. Never re-run extraction from HTML — cascade effects from merge logic cause thousands of ripple-effect changes
5. Every patch is documented with scope, method, validation, and annotation impact
6.
Quality metadata is separate from text data — per-paragraph quality scores in a separate file + +### Quality Tier System + +Each paragraph gets a quality tier based on detected issues: + +| Tier | Criteria | Count | % | +|------|----------|-------|---| +| clean | No detected issues | 58,165 | 80.7% | +| headed | Had inlined heading (now stripped) | 7,402 | 10.3% | +| degraded | Embedded bullets, invisible merges, fragments, truncations | 4,331 | 6.0% | +| minor | Had orphan word (now fixed) | 2,147 | 3.0% | + +All "headed" and "minor" paragraphs have been patched — the tier records what *was* wrong for traceability. "Degraded" paragraphs are downweighted (0.5x) during fine-tuning. + +--- + +## Phase 11: DAPT Corpus Preparation + +### Corpus Cleaning + +The DAPT corpus is built from 14,759 cached 10-K HTML filings processed through `stripHtml()` + `cleanForDapt()`. Three rounds of cleaning were required: + +**Round 1** revealed XBRL data blobs (8.7% of docs, up to 33% of document text), page number artifacts, and exhibit listing boilerplate. Added targeted stripping for `iso4217:`, `xbrli:`, CIK-number sequences, and `F-N` page markers. + +**Round 2** removed URLs (39% of docs → 0.3%) and XBRL exhibit listing lines ("Inline XBRL Taxonomy Extension Calculation Linkbase Document" — present in 85% of filings). Initial investigation claimed these were "legitimate prose mentions of XBRL." Spot-checking showed every single remaining match was exhibit index boilerplate. Stripped any line containing "XBRL" unless it also contained cybersecurity/risk/governance terms. + +**Round 3** was a verification pass confirming the remaining 7.4% of docs with "XBRL" traces are legitimate prose co-occurrences with security terms. + +The page number regex initially had a branch matching `[- ]\d{1,3}[- ]` that produced 100% false positives — it was matching negative financial figures (`-1%`) in sensitivity analysis tables. Only the `F-\d+` pattern was genuine. 
The false-positive branch was removed. + +### Corpus Statistics (Final) + +| Metric | Value | +|--------|-------| +| Documents | 14,756 (14,568 after <10K filter) | +| Total tokens | ~1.056 billion (ModernBERT tokenizer) | +| Median document | ~73K tokens (347K chars) | +| Training sequences (seq_len=8192) | ~136K | +| Steps per epoch (eff. batch=32) | ~4,257 | +| Estimated training time | ~4-8 hours per epoch (RTX 3090) | + +### Sequence Length Decision + +ModernBERT was pre-trained at 8192 tokens. We match this during DAPT to ensure all positional embedding and attention weights receive gradient updates. At seq_len=2048, positions 2048-8191 would get no updates. The tradeoff — batch_size drops from 4 to 1, compensated by gradient_accumulation=32 — results in comparable training time because 4x fewer steps offset slower per-step throughput. + +### Epoch Decision + +We train for 1 epoch (single pass), following the empirical consensus: + +- **Gururangan et al. (2020), "Don't Stop Pretraining" (ACL):** Used a single pass over 2-8B token domain corpora. Sufficient for consistent downstream gains across all four domains tested. +- **Ponnock (2025), arXiv:2512.12384:** Found SEC-specific DAPT shows "diminishing marginal returns beyond roughly 250M tokens" within a single epoch. Our 1B token corpus is well past the diminishing-returns threshold. + +Full procedure documented in `docs/DAPT-PROCEDURE.md`. + +--- + ## Cost and Time Ledger ### Tooling @@ -565,7 +734,8 @@ All code was written collaboratively with **Claude Code** (Anthropic's agentic c | Stage 1 run #1 (with nano) | $112.42 | 150,009 | Full production run with gpt-5.4-nano. Completed, but nano's quality was unacceptable (0 reasoning tokens 64% of the time). Gemini+grok annotations ($91.18) preserved in `stage1-gemini-grok.jsonl`; only nano's annotations ($21.24) were discarded. Full original in `stage1.jsonl.bak`. | | Stage 1 run #2 (mimo only) | $24.69 | 50,003 | Ran only mimo to replace nano. 
Merged with preserved gemini+grok annotations to form final `stage1.jsonl` ($115.88 total value, $24.69 new spend). | | Judge model bench (8 candidates) | $5.97 | 505 | GLM-5 (4 configs), gpt-5.4-mini, gpt-5.4, sonnet-4.6, gemini-3-flash, grok-4.20, mimo-v2-pro, kimi-k2.5 | -| **Total API spend** | **$156** | **~213K unique** | Nano waste: $21.24 | +| Orphan word re-annotation | $3.30 | 4,611 | Re-ran Stage 1 on 1,537 patched paragraphs × 3 models. 7.7% changed consensus category. | +| **Total API spend** | **$159** | **~218K unique** | Nano waste: $21.24 | Only nano's portion ($21.24) of the first run was wasted — the gemini and grok annotations were preserved and merged with the new mimo annotations. Still, $21.24 thrown away on a model that wasn't thinking. The lesson: benchmark model candidates rigorously *before* committing to a production run. The 40-sample pilots showed nano was the weakest link but were misleadingly optimistic about the magnitude of the problem. @@ -578,9 +748,10 @@ Only nano's portion ($21.24) of the first run was wasted — the gemini and grok | Stage 1 annotation run #2 (mimo) | ~1h | Only needed mimo annotations at higher concurrency (gemini+grok reused). 
| | Prompt iteration + model benchmarking | ~4h | 12+ prompt versions, 6 model candidates, pilot analysis | | Post-Stage 1 analysis + Stage 2 planning | ~5h | Distributional analysis, model bias discovery, codebook v3.0 rulings, judge benchmarking, strategy revision | +| Data quality audit + remediation | ~4h | Generator investigation, 6 patches, orphan re-annotation, quality tier system, docs | | Documentation + narrative | ~2h | Codebook updates, narrative writing, technical guide updates | | Labelapp build + infrastructure | ~8h | Monorepo restructure, Next.js app, quiz/warmup/labeling flows, BIBD assignment, sampling, Docker deployment, timer + migration infrastructure | -| **Total to date** | **~31h** | | +| **Total to date** | **~35h** | | ### Remaining Work (estimated) @@ -589,7 +760,7 @@ Only nano's portion ($21.24) of the first run was wasted — the gemini and grok | Human labeling (1,200 paragraphs, 6 annotators) | ~6-8h | $0 (team labor) | | Stage 2 judge production run (~3-5K paragraphs) | ~1h | ~$20-40 | | Training data assembly | ~2h | $0 | -| DAPT pre-training | ~48-72h GPU | $0 (own 3090) | +| DAPT pre-training (1 epoch) | ~4-8h GPU | $0 (own 3090) | | TAPT pre-training | ~2-3h GPU | $0 | | Fine-tuning + ablations (7 experiments) | ~12-20h GPU | $0 | | Full GenAI benchmark on 1,200 holdout (9 models) | ~1h | ~$30-50 | @@ -702,6 +873,11 @@ Three models from three providers — minimizes correlated errors. 
| Gold adjudications | `data/bench/judges/gold-adjudicated.json` | 11 detailed adjudication decisions with reasoning | | Stage 1 prompt | `ts/src/label/prompts.ts` | SYSTEM_PROMPT (v2.5) + buildJudgePrompt() | | Annotation runner | `ts/scripts/stage1-run.ts` | Resume-safe, configurable concurrency | +| Orphan re-annotation | `ts/scripts/rerun-orphan-stage1.ts` | Re-ran 1,537 patched paragraphs, $3.30 | +| Re-annotation diff | `ts/scripts/diff-orphan-annotations.ts` | Category/specificity change analysis | +| No-cyber analysis | `ts/scripts/analyze-no-cyber.ts` | Label distribution on 348 flagged paragraphs | +| Data quality audit | `docs/DATA-QUALITY-AUDIT.md` | Full audit: generators, patches, quality tiers | +| Generator reference | `docs/EDGAR-FILING-GENERATORS.md` | 14 vendors with signatures and quality profiles | | Analysis scripts | `ts/scripts/stage1-analyze.ts`, `segment-analysis.ts`, `model-bias-analysis.ts`, `dispute-crosstab.ts`, `sample-disputes.ts` | Deep analytics on annotation data | | Judge benchmarking | `ts/scripts/judge-bench.ts` | Supports structured/tool modes, gold label comparison | | Judge diagnostics | `ts/scripts/judge-diag.ts`, `judge-diag-batch.ts` | GLM-5 failure investigation | @@ -732,3 +908,6 @@ Three models from three providers — minimizes correlated errors. - Systematic model biases are quantifiable and predictable. Use them as signal, not noise. - Codebook ambiguity causes more disagreement than model limitations. Three codebook rulings resolved more disputes than any prompt change. - Not all labels need the same treatment. Confidence-stratified assembly beats uniform labeling. +- **Freeze originals, patch separately.** The single best data integrity decision was never modifying `paragraphs-clean.jsonl`. All fixes go through `.patched.jsonl` with the same UUIDs. This makes every change auditable, reversible, and safe to apply incrementally. Without this, the 6-patch iteration would have been terrifying. 
+- **Tag everything you can.** Generator metadata, quality tiers, and anomaly flags cost almost nothing to compute but make targeted remediation possible. Without generator tags, the 36.8% orphan rate in EFiling/XDX would have been invisible — diluted into a 4.7% corpus average. +- **Re-annotation is cheap and validating.** Re-running Stage 1 on 1,537 patched paragraphs cost $3.30 and took 9 minutes. It confirmed that 7.7% of consensus labels were wrong due to the data issue — an empirical validation that the patch was necessary, not just cosmetic. diff --git a/docs/SEC-HTML-CLEANING.md b/docs/SEC-HTML-CLEANING.md new file mode 100644 index 0000000..cb0a471 --- /dev/null +++ b/docs/SEC-HTML-CLEANING.md @@ -0,0 +1,184 @@ +# SEC Filing HTML Cleaning — Lessons & Pitfalls + +Everything we've learned about cleaning SEC EDGAR HTML for text extraction, specifically for Item 1C (Cybersecurity) from 10-K filings. These lessons likely apply to any SEC filing text extraction pipeline. + +## The HTML landscape + +SEC filings come from thousands of different filers using dozens of different tools (Workiva/Toppan Merrill, Donnelley Financial, various legal/accounting software). There is no standard HTML structure. The same semantic content — a paragraph of body text — can appear as: + +- `
<p style="margin:0;text-indent:24pt">Text here</p>`
- `<div><span style="font-size:10pt">Text here</span></div>`
- Nested XBRL inline tags: `<ix:nonNumeric contextRef="c-1" name="cyd:..."><span>Text</span></ix:nonNumeric>`
- Table-based layouts: `<table><tr><td>Text</td></tr></table>`
- Deeply nested `<div>` structures with inline styles

The only constant: it will be ugly.

## Inline element newlines (the orphan word problem)

**The bug:** Many filing generators produce HTML where the first word of a paragraph is on its own line within a `<span>` tag:

```html
<p style="margin:0"><span style="font-size:10pt">Our
sole executive officer and director is responsible for assessing and
managing cybersecurity risks...</span></p>
```

When this is stripped to plain text, `Our` ends up on its own line. If downstream processing splits on newlines and filters short lines (< 20 words), `Our` is silently dropped. The paragraph becomes `sole executive officer and director is responsible...` — missing its subject.

**Prevalence:** ~1.4% of filings (156/11,299) have this pattern in their Item 1C section. It produces ~2,500 affected paragraphs across the corpus.

**Common orphaned words:** `We` (73), `Our` (37), `The` (5), `To` (17), `As` (15), `In` (13), `Cybersecurity` (10), `Management` (6), `Following` (6). Basically any sentence-starting word.

**Why it happens:** The filing generator wraps text at a fixed column width in the HTML source. If the `<span>` opening tag + attributes eat most of a line, only the first word fits before the line break. The browser renders this identically (HTML treats source newlines as whitespace), but text extraction that preserves newlines from inline elements breaks.

**Detection (for patching existing data):** Match the pattern `Word\nlowercase continuation...` directly in the raw HTML. Three validation layers are needed:

1. **Same-tag check:** The orphan word and continuation must be within the same inline element (`<span>`, `<font>`, `<a>`, etc.). This distinguishes orphan first-words from section headings above paragraphs. Critically, exclude `<ix:nonNumeric>` XBRL tags — these are structural, not inline, and their first text is often a section title.

2. **Bold/underline filter:** Skip matches inside `<b>`, `<u>`, or `text-decoration: underline` spans. These are section headings that happen to have a line break mid-heading (e.g., `Risk\nManagement and Strategy`). Without this filter, headings get inlined into body text.

3. **Stripped-text validation:** After finding an orphan word in the raw HTML, confirm it exists as a standalone word in the `stripHtml()` output. This catches mid-word splits across adjacent spans (see below).
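The base pattern can be sketched in a few lines (hypothetical names; the same-tag, bold/underline, and stripped-text validation layers above are deliberately omitted here):

```python
import re

# Orphan candidate: a capitalized word alone at the end of a source line,
# immediately after a tag, followed by a lowercase continuation line.
# Sketch only — the real patcher layers on the three validations above.
ORPHAN_CANDIDATE = re.compile(r">\s*([A-Z][A-Za-z]*)\n([a-z][^\n<]*)")

raw = '<span style="font-size:10pt">Our\nsole executive officer and director is responsible...'
m = ORPHAN_CANDIDATE.search(raw)
```

Without the validation layers this pattern also matches headings and span-split word fragments, which is exactly why the three checks exist.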
**Case-sensitivity matters:** If using a regex with the `i` (case-insensitive) flag for tag name matching, the `[a-z]` check on the continuation text becomes meaningless — it will match uppercase too, letting headings through. Either drop the `i` flag (and match tags as `[Ss][Pp][Aa][Nn]` etc.) or validate continuation case separately.

**Prevention (for future extractions):** In the paragraph segmenter, buffer single-word blocks that would otherwise be dropped (below minimum word count) and prepend them to the next block when it starts lowercase. This must happen at the segmentation stage, not in the extraction merge logic — changes to merge behavior cascade through downstream paragraph boundary decisions.

## Mid-word splits across adjacent spans

**The bug:** Some filing generators split a single word across multiple `<span>` tags, sometimes with empty formatting spans between them:

```html
<span style="font-size:10pt">B</span><span style="font-size:9pt"></span><span style="font-size:10pt">lackrock
maintains a comprehensive cybersecurity risk management program...</span>
```

The HTML cleaner's adjacent-inline-boundary collapse correctly joins `B` + `lackrock` into `Blackrock` in the stripped text. But if a patching script operates on raw HTML (to find orphan patterns), it sees `lackrock\nmaintains...` and incorrectly treats `lackrock` as an orphan word, prepending it to produce `lackrock maintains...` instead of the correct `Blackrock maintains...`.

**Detection:** After finding a candidate orphan word in raw HTML, verify it exists as a standalone word (surrounded by whitespace or at line boundaries) in the stripped text. If `stripHtml()` produces `Blackrock` (not `lackrock`), the candidate is a word fragment, not an orphan.

**Root cause:** The filing generator uses separate spans for styling changes (font-size) that happen to fall at character boundaries within words. The empty `<span>` is a zero-width formatting artifact.
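The standalone-word check used by both this detection and the stripped-text validation layer reduces to a whole-word test. A sketch, with a hypothetical helper name:

```python
import re

def appears_standalone(word: str, stripped_text: str) -> bool:
    """Stripped-text validation: the candidate must occur as a whole word
    in the stripHtml() output. 'lackrock' fails against 'Blackrock...',
    so span-split word fragments are rejected before patching."""
    pattern = rf"(?<![A-Za-z]){re.escape(word)}(?![A-Za-z])"
    return re.search(pattern, stripped_text) is not None

# True orphan: 'Our' stands alone in the stripped text
assert appears_standalone("Our", "Our\nsole executive officer is responsible...")
# Fragment: 'lackrock' only occurs inside 'Blackrock'
assert not appears_standalone("lackrock", "Blackrock maintains a comprehensive program...")
```

The letter-boundary lookarounds (rather than `\b`) keep punctuation-adjacent matches like `Our,` valid while still rejecting in-word fragments.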
## Adjacent inline element boundaries

**The bug:** Different formatting applied to adjacent text creates word-joining when tags are stripped:

```html
<span style="font-weight:400">word</span><span style="font-weight:700">The next word</span>
```

Naively stripping tags produces `wordThe next word`. The words at the span boundary merge.

**Fix:** Before stripping tags, collapse adjacent inline element boundaries to spaces:
```js
.replace(/<\/(span|a|b|i|u|em|strong|font)>(\s*)<(?:span|a|b|i|u|em|strong|font)[^>]*>/gi,
  (_m, _tag, ws) => ws.length > 0 ? " " : "")
```

This replaces `</span><span ...>` (and similar) with a space, preventing word joins. The whitespace check (`ws.length > 0`) handles cases where whitespace already exists between tags.

Same treatment needed for XBRL inline tags (`<ix:nonNumeric>`, `<ix:continuation>`).

## Source newlines vs block-element breaks

**The issue:** HTML source files contain newlines in two semantically different roles:
1. **Block-element breaks:** `</p>`, `</div>`, `<br>` — these are paragraph boundaries
2. **Source line wrapping:** Newlines within inline elements from the filing generator's line-length limit — these are meaningless whitespace

Both become `\n` in the stripped text. The extraction pipeline relies on newlines to separate paragraphs, so collapsing all newlines breaks paragraph detection. But preserving all newlines creates the orphan word problem.

**The tradeoff:** We chose to preserve newlines (they're needed for paragraph boundary detection in the extraction pass). The orphan word problem is handled downstream in the segmenter. An alternative (sentinel-based) approach — using `\x00` for block breaks, collapsing source newlines to spaces, then restoring sentinels — was tested but caused too many changes to paragraph segmentation across the corpus (18,589 paragraphs changed text in regression testing).

## XBRL inline tags (iXBRL / `ix:` namespace)

**What they are:** Starting in 2024, SEC filings use Inline XBRL to tag structured data directly in HTML. The `cyd:` taxonomy covers cybersecurity disclosures. Tags like `<ix:nonNumeric name="cyd:...">` wrap entire sections.

**Pitfalls:**

- **Not inline formatting:** Despite being inline XML elements, `ix:` tags are structural — they wrap paragraphs, sections, even entire Items. Treating them like `<span>` for orphan detection will match section headings.
- **XBRL metadata leaks into text:** CIK numbers (`0000123456`), namespace URIs (`xbrli:`, `fasb.org`), ticker-date identifiers (`ae-20231231`) can appear in the text stream. Filter lines where >50% of tokens look like XBRL metadata.
- **`continuedAt` chains:** Long sections are split across multiple `ix:continuation` blocks. These can interrupt the visual flow of text.
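The >50% metadata-token rule can be sketched as follows (the token patterns and threshold here are illustrative, not the pipeline's exact rules):

```python
import re

# Token shapes that look like XBRL metadata: 10-digit CIKs,
# namespace-prefixed identifiers, taxonomy URIs, ticker-date ids.
XBRL_TOKEN = re.compile(r"^(?:\d{10}|[a-z]+:[\w.-]*|.*fasb\.org.*|[a-z]{1,5}-\d{8})$")

def is_xbrl_metadata_line(line: str, threshold: float = 0.5) -> bool:
    """Drop lines where more than half the tokens look like XBRL metadata."""
    tokens = line.split()
    if not tokens:
        return False
    hits = sum(1 for t in tokens if XBRL_TOKEN.match(t))
    return hits / len(tokens) > threshold

assert is_xbrl_metadata_line("0000123456 xbrli:shares ae-20231231")
assert not is_xbrl_metadata_line("Our board oversees cybersecurity risk.")
```

The token-ratio framing is what keeps legitimate prose that merely mentions an identifier ("our CIK is 0000123456") out of the kill list.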
## Running headers/footers and page artifacts

SEC HTML often retains print-formatting artifacts:

| Pattern | Example | Detection |
|---------|---------|-----------|
| Page numbers | `17`, `- 17 -`, `Page 17` | Regex: `/^[-–—\s]*[A-Za-z]?[-–—]?\s*\d+[-–—\s]*$/` |
| Running headers | `ACME CORP FORM 10-K` | Short line + company name + form type |
| Table of contents markers | `Table of Contents` | Exact match, strip trailing content |
| Back-to-top links | `(Back to Index)` | Regex: `/back\s+to\s+(index|top|toc)/i` |
| Part headings | `PART II` | Short line, roman numerals |

These appear mid-text because they're print-layout remnants. Filter them in the extraction pass, before paragraph segmentation.

## Subsidiary headers in combined filings

Holding companies file combined 10-Ks covering multiple subsidiaries. Each subsidiary section repeats a header:

```
ENTERGY ARKANSAS, LLC AND SUBSIDIARIES
```

These are ALL-CAPS, contain entity suffixes (LLC, INC, CORP, L.P.), and include "AND SUBSIDIARIES". Filter with:
```js
/^[A-Z][A-Z\s,.'&-]{5,}(?:LLC|INC|CORP|COMPANY|L\.?P\.?)\b.*\bAND\s+SUBSIDIARIES\b/
```

## PDF extraction artifacts

Some filings are PDF-converted-to-HTML, producing:

- **Missing spaces:** `word.Next` → fix with `/([a-z])\.([A-Z])/g`
- **CamelCase joins:** `wordThe next` → fix common English words: `/([a-z])(The|Our|We|This|...)\b/g`
- **Orphaned punctuation:** `Director ,` → fix with `/ ([,;:.!?)])/g`
- **Colon joins:** `word:Word` → fix with `/([a-z]):([A-Z])/g`

## Entity decoding

SEC HTML uses a mix of named entities, decimal entities, and hex entities. Common ones to handle:

```
&nbsp; &#160; &#xa0;     → space
&amp;                    → &
&mdash; &#8212; &#x2014; → —
&ndash; &#8211; &#x2013; → –
&rsquo; &#8217; &#x2019; → ' (right single quote, used as apostrophe)
&ldquo; &rdquo;          → " (curly quotes)
&bull; &#8226; &#x2022;  → •
&trade;                  → ™
```

Some filings use the Greek question mark (U+037E) instead of a semicolon — looks identical but breaks regex.
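In Python, most of this table reduces to `html.unescape` plus explicit normalization of the lookalike characters that survive decoding — a minimal sketch:

```python
import html

def decode_entities(text: str) -> str:
    """Decode named/decimal/hex entities, then normalize the lookalikes
    that html.unescape leaves behind."""
    text = html.unescape(text)           # &amp; &#8212; &#xa0; etc.
    text = text.replace("\u00a0", " ")   # non-breaking space -> plain space
    text = text.replace("\u037e", ";")   # Greek question mark -> semicolon
    return text

assert decode_entities("Risk &amp; Compliance&nbsp;Committee") == "Risk & Compliance Committee"
assert decode_entities("systems\u037e networks") == "systems; networks"
```

Whether to further flatten curly quotes and dashes to ASCII is a corpus decision — for tokenizer-facing text it usually matters less than removing the invisible lookalikes above.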
+ +## Truncation detection + +The extraction pipeline caps output at 50 blocks / 15,000 words. Filings that hit this cap may be truncated. Detection: check if the last paragraph of each filing ends with terminal punctuation (`[.!?;")]\s*$`). If not, the filing was likely cut mid-sentence — remove all its paragraphs from the training corpus. + +**Limitation:** This only catches truncation at sentence boundaries. If the cap happens to fall at a sentence end, the filing appears complete even though content was lost. No fix for this without comparing against the full filing length. + +## Merge logic and cascade effects + +The extraction pipeline merges short/broken lines in multiple passes. **Any change to merge logic cascades:** merging two lines changes the resulting line's length, which affects whether subsequent lines trigger length-based merge thresholds, which changes the next merge decision, etc. + +In regression testing, a single-word forward-merge change in the extraction pass caused 1,812 ripple-effect text changes across the corpus. Moving the fix to the segmentation stage (after all extraction merges complete) reduced ripples but still affected ~800 paragraphs. + +**Lesson:** For retroactive data fixes, prefer surgical data patching (find-and-prepend on the JSONL) over re-running extraction. For future extraction, place fixes as late in the pipeline as possible to minimize cascade. + +## Testing extraction changes + +When modifying the HTML cleaner, extraction, or segmentation code, regression test against the full corpus: + +1. Re-extract all cached HTML files with the modified code +2. Compare against existing paragraphs by `(accessionNumber, paragraphIndex)` +3. 
Classify changes: + - **Clean prefix** (new text ends with old text) — orphan word recovered + - **Clean suffix** (new text starts with old text) — fragment absorbed + - **Re-merge** (text differs in other ways) — cascade/ripple effect + - **Paragraph count change** — boundary shift, highest-risk regression +4. Investigate any paragraph count decreases and text shrinkages — these are the most likely regressions + +For the orphan word fix, acceptable results were: 215 clean prefix fixes, 0 paragraph count changes, 0 text shrinkages. diff --git a/python/audit_corpus.py b/python/audit_corpus.py new file mode 100644 index 0000000..2629a76 --- /dev/null +++ b/python/audit_corpus.py @@ -0,0 +1,248 @@ +""" +Quality audit of the SEC-cyBERT DAPT training corpus. +Reads sharded JSONL files and performs qualitative checks on document content. +READ-ONLY — does not modify any files. +""" + +import json +import os +import random +import re +import sys +from pathlib import Path + +CORPUS_DIR = Path(__file__).resolve().parent.parent / "data" / "dapt-corpus" +SHARDS = sorted(CORPUS_DIR.glob("shard-*.jsonl")) + +random.seed(42) + + +def load_all_docs() -> list[dict]: + """Load all documents from all shards.""" + docs = [] + for shard in SHARDS: + with open(shard) as f: + for line in f: + line = line.strip() + if line: + docs.append(json.loads(line)) + return docs + + +def separator(title: str) -> None: + print("\n" + "=" * 80) + print(f" {title}") + print("=" * 80 + "\n") + + +def audit_smallest(docs: list[dict]) -> None: + separator("1. SMALLEST 20 DOCUMENTS (by chars)") + sorted_docs = sorted(docs, key=lambda d: d["chars"]) + for i, doc in enumerate(sorted_docs[:20], 1): + text = doc["text"] + print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} | words={doc['words']} ---") + # Show full text for tiny docs, cap at 2000 chars + display = text if len(text) <= 2000 else text[:2000] + "\n... 
[TRUNCATED]" + print(display) + print() + + +def audit_largest(docs: list[dict]) -> None: + separator("2. LARGEST 5 DOCUMENTS (first/last 500 chars)") + sorted_docs = sorted(docs, key=lambda d: d["chars"], reverse=True) + for i, doc in enumerate(sorted_docs[:5], 1): + text = doc["text"] + print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} | words={doc['words']} ---") + print("FIRST 500 CHARS:") + print(text[:500]) + print("\n... [GAP] ...\n") + print("LAST 500 CHARS:") + print(text[-500:]) + print() + + +def audit_mid_samples(docs: list[dict]) -> None: + separator("3. RANDOM MID-DOCUMENT SAMPLES (10 docs, 500 chars from 50% point)") + sample = random.sample(docs, 10) + for i, doc in enumerate(sample, 1): + text = doc["text"] + mid = len(text) // 2 + start = max(0, mid - 250) + end = min(len(text), mid + 250) + print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} ---") + print(text[start:end]) + print() + + +def audit_xbrl_contamination(docs: list[dict]) -> None: + separator("4. 
XBRL-CONTAMINATED STARTS (first 200 chars with XBRL patterns)") + xbrl_pattern = re.compile( + r"(0000\d{6}|xbrli:|fasb\.org|us-gaap:|dei:|srt:|^\d{4}-\d{2}-\d{2}\s*$)", + re.MULTILINE, + ) + found = [] + for doc in docs: + first200 = doc["text"][:200] + if xbrl_pattern.search(first200): + found.append(doc) + if len(found) >= 10: + break + if not found: + print("No XBRL-contaminated documents found in initial scan.") + print("Trying broader pattern...") + # Try a broader search + broad_pattern = re.compile(r"(xmlns|xbrl|0001\d{6})", re.IGNORECASE) + for doc in docs: + first200 = doc["text"][:200] + if broad_pattern.search(first200): + found.append(doc) + if len(found) >= 10: + break + for i, doc in enumerate(found[:10], 1): + text = doc["text"] + print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} ---") + print("FIRST 500 CHARS:") + print(text[:500]) + # Find where XBRL junk ends and real text begins + # Look for "UNITED STATES" or "FORM 10-K" as transition marker + for marker in ["UNITED STATES", "FORM 10-K", "FORM 10-k", "ANNUAL REPORT"]: + idx = text.find(marker) + if idx > 0 and idx < 5000: + print(f"\n >> Transition to real text at char {idx} (marker: '{marker}')") + break + print() + + +def audit_short_lines(docs: list[dict]) -> None: + separator("5. 
DOCS WITH MOST SHORT LINES (<10 chars, excluding empty)") + scored = [] + for doc in docs: + lines = doc["text"].split("\n") + non_empty = [l for l in lines if l.strip()] + short = [l for l in non_empty if 0 < len(l.strip()) < 10] + if non_empty: + ratio = len(short) / len(non_empty) + scored.append((ratio, len(short), len(non_empty), doc, short)) + scored.sort(key=lambda x: x[0], reverse=True) + for i, (ratio, n_short, n_total, doc, short_lines) in enumerate(scored[:10], 1): + print( + f"--- #{i} | accession={doc['accession']} | ratio={ratio:.2%} " + f"| {n_short}/{n_total} short lines ---" + ) + # Show 20 short lines with surrounding context + text = doc["text"] + lines = text.split("\n") + shown = 0 + for j, line in enumerate(lines): + stripped = line.strip() + if 0 < len(stripped) < 10 and shown < 20: + # Show line with 1 line of context on each side + ctx_start = max(0, j - 1) + ctx_end = min(len(lines), j + 2) + for k in range(ctx_start, ctx_end): + prefix = ">>>" if k == j else " " + print(f" {prefix} L{k+1}: {lines[k][:100]}") + print() + shown += 1 + print() + + +def audit_transitions(docs: list[dict]) -> None: + separator("6. 
TRANSITION ZONES (SEC cover page -> company content)") + # Find docs that have the SEC header + candidates = [d for d in docs if "SECURITIES AND EXCHANGE COMMISSION" in d["text"][:2000]] + sample = random.sample(candidates, min(5, len(candidates))) + for i, doc in enumerate(sample, 1): + text = doc["text"] + idx = text.find("SECURITIES AND EXCHANGE COMMISSION") + if idx < 0: + continue + # Find end of cover page area — look for company-specific content markers + # like "Item 1" or "PART I" or "Table of Contents" + transition_markers = ["Item 1", "ITEM 1", "PART I", "TABLE OF CONTENTS", "Table of Contents"] + transition_idx = -1 + for marker in transition_markers: + t = text.find(marker, idx + 100) + if t > 0 and (transition_idx < 0 or t < transition_idx): + transition_idx = t + if transition_idx > 0: + start = max(0, transition_idx - 250) + end = min(len(text), transition_idx + 250) + print(f"--- #{i} | accession={doc['accession']} ---") + print(f"Cover page at char {idx}, transition at char {transition_idx}") + print(f"SHOWING chars {start}-{end}:") + print(text[start:end]) + else: + # Just show around the SEC header + start = max(0, idx - 50) + end = min(len(text), idx + 450) + print(f"--- #{i} | accession={doc['accession']} ---") + print(f"Cover page at char {idx}, no clear transition marker found") + print(text[start:end]) + print() + + +def audit_financial_tables(docs: list[dict]) -> None: + separator("7. 
FINANCIAL TABLE QUALITY (>30% lines with $ or mostly numeric)") + scored = [] + dollar_or_numeric = re.compile(r"(\$|^\s*[\d,.\-()]+\s*$)") + for doc in docs: + lines = doc["text"].split("\n") + non_empty = [l for l in lines if l.strip()] + if not non_empty: + continue + matching = sum(1 for l in non_empty if dollar_or_numeric.search(l)) + ratio = matching / len(non_empty) + if ratio > 0.30: + scored.append((ratio, doc)) + scored.sort(key=lambda x: x[0], reverse=True) + for i, (ratio, doc) in enumerate(scored[:5], 1): + text = doc["text"] + print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} | numeric ratio={ratio:.1%} ---") + # Find a dense numeric section + lines = text.split("\n") + # Find a window of 20 lines with the most dollar/numeric content + best_start = 0 + best_count = 0 + window = 20 + for j in range(len(lines) - window): + count = sum(1 for l in lines[j : j + window] if dollar_or_numeric.search(l)) + if count > best_count: + best_count = count + best_start = j + print(f"DENSEST 20-LINE WINDOW (starting at line {best_start + 1}, {best_count}/{window} numeric):") + for l in lines[best_start : best_start + window]: + print(f" | {l[:120]}") + print() + + +def audit_endings(docs: list[dict]) -> None: + separator("8. 
END-OF-DOCUMENT QUALITY (last 300 chars of 15 random docs)") + sample = random.sample(docs, 15) + for i, doc in enumerate(sample, 1): + text = doc["text"] + print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} ---") + print(text[-300:]) + print() + + +def main() -> None: + print("Loading all documents from corpus...") + docs = load_all_docs() + print(f"Loaded {len(docs)} documents from {len(SHARDS)} shards.\n") + + audit_smallest(docs) + audit_largest(docs) + audit_mid_samples(docs) + audit_xbrl_contamination(docs) + audit_short_lines(docs) + audit_transitions(docs) + audit_financial_tables(docs) + audit_endings(docs) + + separator("AUDIT COMPLETE") + print(f"Total documents audited: {len(docs)}") + + +if __name__ == "__main__": + main() diff --git a/python/configs/dapt/modernbert.yaml b/python/configs/dapt/modernbert.yaml index b577e51..7f72c0f 100644 --- a/python/configs/dapt/modernbert.yaml +++ b/python/configs/dapt/modernbert.yaml @@ -7,7 +7,7 @@ model: data: corpus_path: ../data/dapt-corpus text_field: text - max_seq_length: 2048 + max_seq_length: 8192 validation_split: 0.02 training: @@ -15,8 +15,8 @@ training: learning_rate: 5.0e-5 mlm_probability: 0.30 num_train_epochs: 1 - per_device_train_batch_size: 4 - gradient_accumulation_steps: 8 # effective batch = 32 + per_device_train_batch_size: 1 + gradient_accumulation_steps: 32 # effective batch = 32 warmup_ratio: 0.05 weight_decay: 0.01 bf16: true diff --git a/python/src/dapt/train.py b/python/src/dapt/train.py index 61f6126..a11bcbb 100644 --- a/python/src/dapt/train.py +++ b/python/src/dapt/train.py @@ -47,6 +47,14 @@ def train(config: DAPTConfig) -> None: dataset = load_corpus(config.data.corpus_path, config.data.text_field) print(f" Raw documents: {len(dataset):,}") + # Filter tiny documents (cover pages, empty filings) + min_chars = 10_000 + before = len(dataset) + dataset = dataset.filter(lambda x: len(x[config.data.text_field]) >= min_chars) + filtered = before - len(dataset) + if 
filtered > 0: + print(f" Filtered {filtered} docs < {min_chars:,} chars → {len(dataset):,} remaining") + print(f" Tokenizing and chunking to {config.data.max_seq_length} tokens...") chunked = tokenize_and_chunk( dataset, diff --git a/scripts/analyze_generator_quality.py b/scripts/analyze_generator_quality.py new file mode 100644 index 0000000..a4954d0 --- /dev/null +++ b/scripts/analyze_generator_quality.py @@ -0,0 +1,334 @@ +#!/usr/bin/env python3 +""" +Quantify how EFiling/XDX generator quality issues affect the annotated paragraph set. +READ-ONLY analysis — does not modify any files. +""" + +import json +import re +import sys +from collections import Counter, defaultdict +from pathlib import Path + +# Reuse detect_generator from the existing script +sys.path.insert(0, str(Path(__file__).parent)) +from detect_generators import detect_generator + +# Paths +HTML_DIR = Path("/home/joey/Documents/sec-cyBERT/data/raw/html") +PARAGRAPHS_PATH = Path("/home/joey/Documents/sec-cyBERT/data/paragraphs/paragraphs-clean.jsonl") +ANNOTATIONS_PATH = Path("/home/joey/Documents/sec-cyBERT/data/annotations/stage1.jsonl") + +SEP = "=" * 100 + + +def load_paragraphs(): + """Load paragraphs, return dict: id -> paragraph dict.""" + paragraphs = {} + with open(PARAGRAPHS_PATH) as f: + for line in f: + p = json.loads(line) + paragraphs[p["id"]] = p + return paragraphs + + +def load_annotations(): + """Load annotations, return dict: paragraphId -> annotation dict.""" + annotations = {} + with open(ANNOTATIONS_PATH) as f: + for line in f: + a = json.loads(line) + pid = a["paragraphId"] + # Keep the first annotation per paragraph (or overwrite — doesn't matter for counts) + annotations[pid] = a + return annotations + + +def detect_all_generators(): + """Detect generators for all HTML files. 
Return dict: accession -> generator.""" + accession_to_gen = {} + files = sorted(HTML_DIR.glob("*.html")) + total = len(files) + for i, fp in enumerate(files): + accession = fp.stem + gen, _evidence = detect_generator(str(fp)) + accession_to_gen[accession] = gen + if (i + 1) % 3000 == 0: + print(f" Scanned {i + 1}/{total} HTML files...", file=sys.stderr) + print(f" Scanned {total}/{total} HTML files.", file=sys.stderr) + return accession_to_gen + + +def starts_lowercase(text: str) -> bool: + """True if text starts with a lowercase letter (orphan word candidate).""" + if not text: + return False + return text[0].islower() + + +def is_list_item(text: str) -> bool: + """True if text looks like a list item (starts with bullet, dash, number+period, etc.).""" + stripped = text.strip() + if not stripped: + return False + # Common list patterns: "- ", "• ", "* ", "1. ", "a) ", "(a) ", "(i) " + if re.match(r'^[-•*▪◦]\s', stripped): + return True + if re.match(r'^\d+[.)]\s', stripped): + return True + if re.match(r'^\([a-z0-9ivx]+\)\s', stripped, re.I): + return True + if re.match(r'^[a-z][.)]\s', stripped): + return True + return False + + +def looks_like_inlined_header(text: str) -> bool: + """ + True if text starts with a section heading run into body text, e.g.: + "Risk Management and Strategy We recognize the importance..." + "Cybersecurity Governance Our Board of Directors oversees..." + + Key distinction from normal sentences: the heading portion is a noun phrase + (not a full sentence subject like "Our Board" or "The Company"), and is + immediately followed by a new sentence that starts a different thought. + + We look for known SEC cybersecurity section heading patterns followed by + body text starting with a capital letter (new sentence) with no punctuation + separating them (no period, colon, or newline — just a space). 
+ """ + # Known heading patterns for SEC Item 1C disclosures + heading_patterns = [ + r'(?:Cybersecurity\s+)?Risk\s+Management(?:\s+and\s+Strategy)?', + r'(?:Cybersecurity\s+)?Governance(?:\s+and\s+Risk\s+Management)?', + r'Cybersecurity\s+Governance', + r'Cybersecurity\s+Risk\s+Management\s+and\s+Strategy', + r'Board\s+Oversight(?:\s+of\s+(?:Risks?\s+from\s+)?Cybersecurity(?:\s+(?:Threats?|Risks?))?)?', + r'Management(?:\'s)?\s+Role\s+in\s+(?:Managing\s+)?Cybersecurity', + r'Governance\s+(?:Related\s+to|Oversight\s+of)\s+Cybersecurity(?:\s+Risks?)?', + r'Impact\s+of\s+Cybersecurity\s+(?:Risks?|Threats?)', + r'Cybersecurity\s+(?:Strategy|Overview|Program)', + r'(?:Management\s+and|Management|Governance)\s+(?:Strategy|Overview)', + r'Risk\s+Factors?', + r'Oversight\s+of\s+Cybersecurity\s+Risk\s+Management', + ] + + for pat in heading_patterns: + # Heading immediately followed by body text (capital letter starting new sentence) + m = re.match(rf'^({pat})\s+([A-Z])', text) + if m: + return True + # Also catch heading followed by lowercase (rarer but possible) + m = re.match(rf'^({pat})\s+([a-z])', text) + if m: + return True + + return False + + +def main(): + print("Loading data...") + paragraphs = load_paragraphs() + annotations = load_annotations() + print(f" Paragraphs: {len(paragraphs):,}") + print(f" Annotations: {len(annotations):,}") + + # Unique annotated paragraph IDs + annotated_ids = set(annotations.keys()) & set(paragraphs.keys()) + print(f" Annotated paragraphs with matching paragraph data: {len(annotated_ids):,}") + + print("\nDetecting generators for all HTML files...") + accession_to_gen = detect_all_generators() + print(f" HTML files scanned: {len(accession_to_gen):,}") + + # Map each paragraph to its generator + para_to_gen = {} + missing_accessions = set() + for pid, p in paragraphs.items(): + acc = p["filing"]["accessionNumber"] + gen = accession_to_gen.get(acc) + if gen is None: + missing_accessions.add(acc) + gen = "NO_HTML_FILE" + 
para_to_gen[pid] = gen + + if missing_accessions: + print(f"\n WARNING: {len(missing_accessions)} accession numbers in paragraphs have no HTML file") + + # ===================================================================== + # SECTION 1: Annotated paragraphs by generator + # ===================================================================== + print(f"\n{SEP}") + print("SECTION 1: Annotated paragraphs by generator") + print(SEP) + + ann_gen_counts = Counter() + for pid in annotated_ids: + ann_gen_counts[para_to_gen[pid]] += 1 + + total_ann = len(annotated_ids) + print(f"\n{'Generator':<50} {'Count':>7} {'%':>7}") + print("-" * 70) + for gen, count in ann_gen_counts.most_common(): + pct = count / total_ann * 100 + print(f"{gen:<50} {count:>7} {pct:>6.1f}%") + print("-" * 70) + print(f"{'TOTAL':<50} {total_ann:>7} {100.0:>6.1f}%") + + # ===================================================================== + # SECTION 2: Lowercase-start (orphan word) analysis for annotated set + # ===================================================================== + print(f"\n{SEP}") + print("SECTION 2: Lowercase-start paragraphs in annotated set") + print(SEP) + + # All annotated lowercase-start + ann_lc = {pid for pid in annotated_ids if starts_lowercase(paragraphs[pid]["text"])} + ann_lc_nonlist = {pid for pid in ann_lc if not is_list_item(paragraphs[pid]["text"])} + + print(f"\nAnnotated paragraphs starting with lowercase: {len(ann_lc):,} / {total_ann:,} ({len(ann_lc)/total_ann*100:.2f}%)") + print(f" Of those, excluding list items: {len(ann_lc_nonlist):,} ({len(ann_lc_nonlist)/total_ann*100:.2f}%)") + + # Breakdown by generator for lowercase-start non-list + lc_by_gen = Counter() + for pid in ann_lc_nonlist: + lc_by_gen[para_to_gen[pid]] += 1 + + print(f"\n{'Generator':<50} {'LC-start':>9} {'Total ann':>10} {'% of gen':>9}") + print("-" * 85) + for gen, _ in ann_gen_counts.most_common(): + lc_count = lc_by_gen.get(gen, 0) + gen_total = ann_gen_counts[gen] + pct = lc_count 
/ gen_total * 100 if gen_total else 0 + if lc_count > 0: + print(f"{gen:<50} {lc_count:>9} {gen_total:>10} {pct:>8.1f}%") + + # Specific callouts + efiling_gens = {"EFiling/EDGAR Agent", "EFiling XDX"} + efiling_ann = {pid for pid in annotated_ids if para_to_gen[pid] in efiling_gens} + efiling_lc = {pid for pid in ann_lc_nonlist if para_to_gen[pid] in efiling_gens} + + compsci_ann = {pid for pid in annotated_ids if para_to_gen[pid] == "CompSci Transform"} + compsci_lc = {pid for pid in ann_lc_nonlist if para_to_gen[pid] == "CompSci Transform"} + + print(f"\n--- Specific callouts ---") + print(f"EFiling/XDX annotated paragraphs starting lowercase (non-list): {len(efiling_lc):,} / {len(efiling_ann):,} ({len(efiling_lc)/len(efiling_ann)*100:.1f}% of EFiling/XDX)" if efiling_ann else "EFiling/XDX: 0 annotated paragraphs") + print(f"CompSci Transform annotated paragraphs starting lowercase (non-list): {len(compsci_lc):,} / {len(compsci_ann):,} ({len(compsci_lc)/len(compsci_ann)*100:.1f}% of CompSci)" if compsci_ann else "CompSci Transform: 0 annotated paragraphs") + print(f"\nTotal affected annotated paragraphs (LC non-list): {len(ann_lc_nonlist):,} / {total_ann:,} = {len(ann_lc_nonlist)/total_ann*100:.2f}%") + + # ===================================================================== + # SECTION 3: Orphan-word paragraphs detail + # ===================================================================== + print(f"\n{SEP}") + print("SECTION 3: Orphan-word paragraph details (LC-start, non-list, annotated)") + print(SEP) + + # Breakdown by generator + print(f"\nBreakdown by generator:") + print(f"{'Generator':<50} {'Count':>7} {'% of orphan':>12}") + print("-" * 75) + for gen, count in lc_by_gen.most_common(): + pct = count / len(ann_lc_nonlist) * 100 + print(f"{gen:<50} {count:>7} {pct:>11.1f}%") + + # 10 example texts with labels + print(f"\n10 example orphan-word annotated paragraphs:") + print("-" * 100) + examples = sorted(ann_lc_nonlist)[:10] + for pid in examples: + 
text = paragraphs[pid]["text"][:150] + ann = annotations[pid] + label = ann.get("label", {}) + cat = label.get("content_category", "?") + spec = label.get("specificity_level", "?") + gen = para_to_gen[pid] + print(f" [{gen}] cat={cat}, spec={spec}") + print(f" \"{text}...\"") + print() + + # Category distribution in orphan-word paragraphs vs overall + print(f"\nCategory distribution: orphan-word vs overall annotated set") + print("-" * 80) + + orphan_cats = Counter() + for pid in ann_lc_nonlist: + cat = annotations[pid].get("label", {}).get("content_category", "Unknown") + orphan_cats[cat] += 1 + + overall_cats = Counter() + for pid in annotated_ids: + cat = annotations[pid].get("label", {}).get("content_category", "Unknown") + overall_cats[cat] += 1 + + all_cats = sorted(set(orphan_cats.keys()) | set(overall_cats.keys())) + print(f"{'Category':<40} {'Orphan':>7} {'Orphan%':>8} {'Overall':>8} {'Overall%':>9} {'Over-rep':>9}") + print("-" * 85) + for cat in all_cats: + o_count = orphan_cats.get(cat, 0) + a_count = overall_cats.get(cat, 0) + o_pct = o_count / len(ann_lc_nonlist) * 100 if ann_lc_nonlist else 0 + a_pct = a_count / total_ann * 100 + ratio = (o_pct / a_pct) if a_pct > 0 else 0 + flag = " <<<" if ratio > 1.5 else "" + print(f"{cat:<40} {o_count:>7} {o_pct:>7.1f}% {a_count:>8} {a_pct:>8.1f}% {ratio:>8.2f}x{flag}") + + # ===================================================================== + # SECTION 4: Inlined headers analysis + # ===================================================================== + print(f"\n{SEP}") + print("SECTION 4: Inlined headers in annotated paragraphs") + print(SEP) + + ann_inlined = set() + for pid in annotated_ids: + text = paragraphs[pid]["text"] + if looks_like_inlined_header(text): + ann_inlined.add(pid) + + print(f"\nAnnotated paragraphs with inlined headers: {len(ann_inlined):,} / {total_ann:,} ({len(ann_inlined)/total_ann*100:.2f}%)") + + inlined_by_gen = Counter() + for pid in ann_inlined: + 
inlined_by_gen[para_to_gen[pid]] += 1 + + print(f"\n{'Generator':<50} {'Inlined':>8} {'Total ann':>10} {'% of gen':>9}") + print("-" * 85) + for gen, _ in ann_gen_counts.most_common(): + ih_count = inlined_by_gen.get(gen, 0) + gen_total = ann_gen_counts[gen] + pct = ih_count / gen_total * 100 if gen_total else 0 + if ih_count > 0: + print(f"{gen:<50} {ih_count:>8} {gen_total:>10} {pct:>8.1f}%") + + # Show some examples + print(f"\n10 example inlined-header paragraphs:") + print("-" * 100) + examples_ih = sorted(ann_inlined)[:10] + for pid in examples_ih: + text = paragraphs[pid]["text"][:150] + gen = para_to_gen[pid] + cat = annotations[pid].get("label", {}).get("content_category", "?") + print(f" [{gen}] cat={cat}") + print(f" \"{text}...\"") + print() + + # ===================================================================== + # SECTION 5: Combined impact summary + # ===================================================================== + print(f"\n{SEP}") + print("SECTION 5: Combined impact summary") + print(SEP) + + affected = ann_lc_nonlist | ann_inlined + print(f"\nOrphan-word (LC non-list): {len(ann_lc_nonlist):>6} ({len(ann_lc_nonlist)/total_ann*100:.2f}%)") + print(f"Inlined headers: {len(ann_inlined):>6} ({len(ann_inlined)/total_ann*100:.2f}%)") + print(f"Either issue (union): {len(affected):>6} ({len(affected)/total_ann*100:.2f}%)") + print(f"Total annotated set: {total_ann:>6}") + + # EFiling/XDX specifically + efiling_affected = {pid for pid in affected if para_to_gen[pid] in efiling_gens} + print(f"\nEFiling/XDX affected (either issue): {len(efiling_affected):,} / {len(efiling_ann):,}") + + +if __name__ == "__main__": + main() diff --git a/scripts/audit_corpus.py b/scripts/audit_corpus.py new file mode 100644 index 0000000..076c2ef --- /dev/null +++ b/scripts/audit_corpus.py @@ -0,0 +1,435 @@ +#!/usr/bin/env python3 +"""Audit sec-cyBERT paragraph corpus for text quality issues.""" + +import json +import re +import random +import os +from collections 
import Counter, defaultdict + from pathlib import Path + + DATA_FILE = Path("data/paragraphs/paragraphs-clean.jsonl") + HTML_DIR = Path("data/raw/html") + + # ── Load all paragraphs ────────────────────────────────────────────────────── + + print("Loading paragraphs...") + paragraphs = [] + with open(DATA_FILE) as f: + for line in f: + paragraphs.append(json.loads(line)) + print(f"Loaded {len(paragraphs):,} paragraphs.\n") + + + def show(text, limit=200): + """Truncate text for display.""" + if len(text) <= limit: + return text + return text[:limit] + "..." + + + def header(title): + print("\n" + "=" * 80) + print(f" {title}") + print("=" * 80 + "\n") + + + # ══════════════════════════════════════════════════════════════════════════════ + # CHECK 1: Inlined headers + # ══════════════════════════════════════════════════════════════════════════════ + header("CHECK 1: Inlined Headers") + + inlined_header_examples = [] + + # Detect heading+body merged into one paragraph. + # A heading is a short (2-10 word) title-case or ALL-CAPS phrase at the start, + # immediately followed (no colon/period separator) by a sentence starting with + # a common sentence-opener like We/Our/The/As/In/This/A/An/Each/Management/For/Since/During. + pat_merged_header = re.compile( + r"^([A-Z][A-Za-z\s,&/\-\']+?)" + r"\s(?=(?:We|Our|The|As|In|This|A|An|Each|Management|For|Since|During)\s)" + ) + + for p in paragraphs: + m = pat_merged_header.match(p["text"]) + if not m: + continue + heading_candidate = m.group(1).strip() + words = heading_candidate.split() + if len(words) < 2 or len(words) > 10: + continue + is_title = all(w[:1].isupper() or w.lower() in ("and", "of", "the", "in", "for", "to") for w in words) + is_allcaps = heading_candidate.isupper() and len(heading_candidate) > 5
+ + if is_title or is_allcaps: + kind = "ALLCAPS" if is_allcaps else "TITLECASE" + inlined_header_examples.append((kind, p, heading_candidate)) + + print(f"Found {len(inlined_header_examples):,} paragraphs with potential inlined headers.") + print(f" - ALLCAPS pattern: {sum(1 for t,_,_ in inlined_header_examples if t=='ALLCAPS'):,}") + print(f" - TITLECASE pattern: {sum(1 for t,_,_ in inlined_header_examples if t=='TITLECASE'):,}") + print() + + # Show 20 examples, mix of both types + random.seed(42) + sample = random.sample(inlined_header_examples, min(20, len(inlined_header_examples))) + + for i, (kind, p, hdr) in enumerate(sample, 1): + print(f" [{i}] ({kind}) Header: \"{hdr}\" [{p['filing']['companyName'][:30]}]") + print(f" {show(p['text'])}") + print() + + + # ══════════════════════════════════════════════════════════════════════════════ + # CHECK 2: Sentence boundary violations + # ══════════════════════════════════════════════════════════════════════════════ + header("CHECK 2: Sentence Boundary Violations") + + boundary_examples = [] + + # word.Next — period followed immediately by uppercase letter (not abbreviations) + pat_dotcap = re.compile(r"[a-z]\.([A-Z][a-z])") + # word,Next — comma followed immediately by uppercase letter + pat_commacap = re.compile(r"[a-z],([A-Z][a-z])") + # Two words jammed: lowercase then uppercase with no space/punct + pat_jammed = re.compile(r"[a-z]{2}[A-Z][a-z]{2}") + + # Common false positives for dot-cap: abbreviations, names + false_pos_dot = re.compile( + r"(?:Mr|Mrs|Ms|Dr|Jr|Sr|Inc|Corp|Ltd|Co|No|vs|St|Dept|Gen|Gov|Sec|Vol|Rev|etc|U\.S|U\.K)\." 
+) + +for p in paragraphs: + text = p["text"] + issues = [] + + for m in pat_dotcap.finditer(text): + start = max(0, m.start() - 10) + context = text[start : m.end() + 10] + # skip if it's a known abbreviation + if not false_pos_dot.search(text[max(0, m.start() - 5) : m.end()]): + issues.append(("dot-cap", context)) + + for m in pat_commacap.finditer(text): + start = max(0, m.start() - 10) + context = text[start : m.end() + 10] + issues.append(("comma-cap", context)) + + if issues: + boundary_examples.append((p, issues)) + +print(f"Found {len(boundary_examples):,} paragraphs with sentence boundary violations.") +print() + +random.seed(43) +sample = random.sample(boundary_examples, min(20, len(boundary_examples))) +for i, (p, issues) in enumerate(sample, 1): + print(f" [{i}] [{p['filing']['companyName'][:30]}]") + for kind, ctx in issues[:3]: + print(f" ({kind}) ...{ctx}...") + print(f" Full start: {show(p['text'], 150)}") + print() + + +# ══════════════════════════════════════════════════════════════════════════════ +# CHECK 3: Garbled / nonsensical text +# ══════════════════════════════════════════════════════════════════════════════ +header("CHECK 3: Garbled / Nonsensical Text") + +garbled_examples = [] + +# Spaced-out characters: single chars separated by spaces +pat_spaced = re.compile(r"(?:\b[a-zA-Z]\s){4,}") + +for p in paragraphs: + text = p["text"] + reason = None + + # Check spaced-out characters + if pat_spaced.search(text): + reason = "spaced-chars" + + # Check long non-ASCII runs + non_ascii = sum(1 for c in text if ord(c) > 127) + if non_ascii > len(text) * 0.15 and len(text) > 20: + reason = f"non-ASCII ({non_ascii}/{len(text)} chars)" + + # Check mostly numbers/symbols (>50% non-alpha) + alpha = sum(1 for c in text if c.isalpha()) + if len(text) > 20 and alpha < len(text) * 0.4: + reason = f"low-alpha ({alpha}/{len(text)} = {alpha/len(text):.0%})" + + if reason: + garbled_examples.append((reason, p)) + +print(f"Found {len(garbled_examples):,} 
potentially garbled paragraphs.") +reason_counts = Counter(r.split("(")[0].strip() for r, _ in garbled_examples) +for r, c in reason_counts.most_common(): + print(f" - {r}: {c}") +print() + +random.seed(44) +sample = random.sample(garbled_examples, min(10, len(garbled_examples))) +for i, (reason, p) in enumerate(sample, 1): + print(f" [{i}] ({reason}) [{p['filing']['companyName'][:30]}] wc={p['wordCount']}") + print(f" {show(p['text'], 250)}") + print() + + +# ══════════════════════════════════════════════════════════════════════════════ +# CHECK 4: HTML / markup artifacts +# ══════════════════════════════════════════════════════════════════════════════ +header("CHECK 4: HTML / Markup Artifacts") + +html_examples = [] + +pat_html_tag = re.compile(r"<[a-zA-Z/][^>]*>") +pat_html_entity = re.compile(r"&(?:amp|lt|gt|nbsp|quot|#\d+|#x[0-9a-fA-F]+);") +pat_xbrl = re.compile(r"\b(?:ix|us-gaap|dei|xbrli):") +pat_css = re.compile(r"(?:font-family|font-size|color:|margin:|padding:|text-align|line-height)", re.IGNORECASE) + +for p in paragraphs: + text = p["text"] + reasons = [] + + if pat_html_tag.search(text): + reasons.append("html-tag") + if pat_html_entity.search(text): + reasons.append("html-entity") + if pat_xbrl.search(text): + reasons.append("xbrl") + if pat_css.search(text): + reasons.append("css") + + if reasons: + html_examples.append((reasons, p)) + +print(f"Found {len(html_examples):,} paragraphs with HTML/markup artifacts.") +reason_counts = Counter() +for reasons, _ in html_examples: + for r in reasons: + reason_counts[r] += 1 +for r, c in reason_counts.most_common(): + print(f" - {r}: {c}") +print() + +random.seed(45) +sample = random.sample(html_examples, min(10, len(html_examples))) +for i, (reasons, p) in enumerate(sample, 1): + print(f" [{i}] ({', '.join(reasons)}) [{p['filing']['companyName'][:30]}]") + print(f" {show(p['text'], 250)}") + print() + + +# ══════════════════════════════════════════════════════════════════════════════ +# CHECK 5: Truncated 
paragraphs +# ══════════════════════════════════════════════════════════════════════════════ +header("CHECK 5: Truncated Paragraphs") + +truncated = [] + +# Common abbreviations that end sentences without terminal punct being an issue +abbrevs = {"inc", "corp", "ltd", "co", "mr", "mrs", "ms", "dr", "jr", "sr", + "etc", "al", "eg", "ie", "vs", "no", "approx", "dept", "gov"} + +for p in paragraphs: + text = p["text"].rstrip() + if not text: + continue + + # Check if ends with terminal punctuation + last_char = text[-1] + if last_char in ".!?:;)\"'""'": + continue + + # Check if it's a very short text (likely a heading) + if p["wordCount"] <= 5: + continue + + # Check if last word is a common abbreviation + last_word = text.split()[-1].lower().rstrip(".,;:!?") + if last_word in abbrevs: + continue + + truncated.append(p) + +print(f"Found {len(truncated):,} potentially truncated paragraphs (no terminal punctuation, >5 words).") +print() + +random.seed(46) +sample = random.sample(truncated, min(10, len(truncated))) +for i, p in enumerate(sample, 1): + text = p["text"] + print(f" [{i}] [{p['filing']['companyName'][:30]}] wc={p['wordCount']}") + # Show the END of the text + if len(text) > 200: + print(f" ...{text[-200:]}") + else: + print(f" {text}") + print() + + +# ══════════════════════════════════════════════════════════════════════════════ +# CHECK 6: Duplicate text across filings +# ══════════════════════════════════════════════════════════════════════════════ +header("CHECK 6: Cross-Filing Duplicate Text") + +# Group by textHash +hash_to_paras = defaultdict(list) +for p in paragraphs: + hash_to_paras[p["textHash"]].append(p) + +# Find hashes that appear in multiple different filings +cross_filing_dupes = {} +for h, ps in hash_to_paras.items(): + accessions = set(p["filing"]["accessionNumber"] for p in ps) + if len(accessions) > 1: + cross_filing_dupes[h] = ps + +total_dupe_paragraphs = sum(len(ps) for ps in cross_filing_dupes.values()) +print(f"Unique textHashes 
appearing in multiple filings: {len(cross_filing_dupes):,}") +print(f"Total paragraphs involved: {total_dupe_paragraphs:,}") +print() + +# Sort by number of filings (most duplicated first) +sorted_dupes = sorted(cross_filing_dupes.items(), key=lambda x: len(set(p["filing"]["accessionNumber"] for p in x[1])), reverse=True) + +print("Top 15 most duplicated paragraphs:") +for i, (h, ps) in enumerate(sorted_dupes[:15], 1): + accessions = set(p["filing"]["accessionNumber"] for p in ps) + companies = set(p["filing"]["companyName"] for p in ps) + print(f"\n [{i}] Hash={h}, in {len(accessions)} filings, {len(companies)} companies") + print(f" Companies: {', '.join(list(companies)[:5])}{'...' if len(companies) > 5 else ''}") + print(f" Text: {show(ps[0]['text'], 200)}") + +# Check for same-company cross-year dupes vs different-company dupes +same_company_dupes = 0 +diff_company_dupes = 0 +for h, ps in cross_filing_dupes.items(): + companies = set(p["filing"]["companyName"] for p in ps) + if len(companies) == 1: + same_company_dupes += 1 + else: + diff_company_dupes += 1 + +print(f"\n\nBreakdown:") +print(f" Same company, different filings (likely year-over-year boilerplate): {same_company_dupes:,}") +print(f" Different companies (likely industry boilerplate or extraction error): {diff_company_dupes:,}") + + +# ══════════════════════════════════════════════════════════════════════════════ +# CHECK 7: Ground truth spot-check +# ══════════════════════════════════════════════════════════════════════════════ +header("CHECK 7: Ground Truth Spot-Check (10 random paragraphs vs. 
source HTML)") + + + def normalize_html_to_plain(html_text): + """Convert raw HTML to normalized plain text for comparison.""" + plain = re.sub(r"<[^>]+>", " ", html_text) + # Decode common HTML entities + plain = re.sub(r"&nbsp;", " ", plain) + plain = re.sub(r"&amp;", "&", plain) + plain = re.sub(r"&lt;", "<", plain) + plain = re.sub(r"&gt;", ">", plain) + plain = re.sub(r"&rsquo;|&#8217;|&#x2019;", "\u2019", plain) + plain = re.sub(r"&lsquo;|&#8216;|&#x2018;", "\u2018", plain) + plain = re.sub(r"&rdquo;|&#8221;|&#x201D;", "\u201D", plain) + plain = re.sub(r"&ldquo;|&#8220;|&#x201C;", "\u201C", plain) + plain = re.sub(r"&mdash;|&#8212;", "\u2014", plain) + plain = re.sub(r"&ndash;|&#8211;", "\u2013", plain) + plain = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), plain) + plain = re.sub(r"&#x([0-9a-fA-F]+);", lambda m: chr(int(m.group(1), 16)), plain) + plain = re.sub(r"&\w+;", " ", plain) + plain = re.sub(r"\s+", " ", plain) + return plain + + + random.seed(99) + spot_check_sample = random.sample(paragraphs, 10) + match_count = 0 + partial_count = 0 + not_found_count = 0 + + for i, p in enumerate(spot_check_sample, 1): + acc = p["filing"]["accessionNumber"] + html_path = HTML_DIR / f"{acc}.html" + + print(f" [{i}] {p['filing']['companyName'][:40]} | {acc}") + print(f" Paragraph index: {p['paragraphIndex']}, word count: {p['wordCount']}") + + corpus_text = p["text"] + corpus_norm = re.sub(r"\s+", " ", corpus_text).strip() + + if not html_path.exists(): + print(f" *** HTML file not found: {html_path}") + print(f" Corpus text: {show(corpus_text, 150)}") + not_found_count += 1 + print() + continue + + with open(html_path, "r", errors="replace") as f: + html_content = f.read() + + plain_html = normalize_html_to_plain(html_content) + + # Check if the entire corpus text appears verbatim in the HTML plain text + if corpus_norm in plain_html: + print(f" VERBATIM MATCH: Corpus text found exactly in HTML source.") + match_count += 1 + else: + # Try to find a distinctive substring to locate the paragraph + # Use multiple probes from different positions + found = False + for start_frac in 
[0.3, 0.5, 0.1, 0.7]: + start_pos = int(len(corpus_norm) * start_frac) + probe = corpus_norm[start_pos:start_pos + 40] + if not probe: + continue + idx = plain_html.find(probe) + if idx >= 0: + found = True + # Show surrounding context from HTML + ctx_start = max(0, idx - 80) + ctx_end = min(len(plain_html), idx + len(corpus_norm) + 80) + html_ctx = plain_html[ctx_start:ctx_end].strip() + print(f" PARTIAL MATCH: Text found in HTML but paragraph boundaries differ.") + print(f" Corpus first 120: {corpus_norm[:120]}") + print(f" HTML context 120: {html_ctx[:120]}") + partial_count += 1 + break + + if not found: + print(f" NOT FOUND in HTML plain text!") + print(f" Corpus text: {show(corpus_text, 150)}") + not_found_count += 1 + + print() + +print(f"Spot-check results: {match_count} verbatim, {partial_count} partial, {not_found_count} not found") + + +# ══════════════════════════════════════════════════════════════════════════════ +# SUMMARY +# ══════════════════════════════════════════════════════════════════════════════ +header("SUMMARY") +print(f"Total paragraphs: {len(paragraphs):,}") +print(f" 1. Inlined headers: {len(inlined_header_examples):,}") +print(f" 2. Sentence boundary violations: {len(boundary_examples):,}") +print(f" 3. Garbled / nonsensical text: {len(garbled_examples):,}") +print(f" 4. HTML / markup artifacts: {len(html_examples):,}") +print(f" 5. Truncated paragraphs: {len(truncated):,}") +print(f" 6. Cross-filing duplicates: {len(cross_filing_dupes):,} unique texts in {total_dupe_paragraphs:,} paragraphs") +print() diff --git a/scripts/audit_paragraphs.py b/scripts/audit_paragraphs.py new file mode 100644 index 0000000..fee56af --- /dev/null +++ b/scripts/audit_paragraphs.py @@ -0,0 +1,405 @@ +""" +Audit SEC-cyBERT paragraph corpus for boundary errors. 
+Run from project root: python3 scripts/audit_paragraphs.py +""" + +import json +import random +import re +import sys +from collections import Counter, defaultdict +from pathlib import Path + +DATA_PATH = Path("data/paragraphs/paragraphs-clean.jsonl") + +def load_paragraphs(): + paragraphs = [] + with open(DATA_PATH) as f: + for line in f: + paragraphs.append(json.loads(line)) + return paragraphs + +def section_header(title): + bar = "=" * 80 + print(f"\n{bar}") + print(f" {title}") + print(bar) + +def truncate(text, n): + if len(text) <= n: + return text + return text[:n] + "..." + +# --------------------------------------------------------------------------- +# Load +# --------------------------------------------------------------------------- +print("Loading paragraphs...") +paragraphs = load_paragraphs() +print(f"Loaded {len(paragraphs):,} paragraphs") + +# Group by accessionNumber +by_filing = defaultdict(list) +for p in paragraphs: + acc = p["filing"]["accessionNumber"] + by_filing[acc].append(p) + +print(f"Unique filings: {len(by_filing):,}") + +# --------------------------------------------------------------------------- +# 1. Paragraphs-per-filing distribution +# --------------------------------------------------------------------------- +section_header("1. 
PARAGRAPHS-PER-FILING DISTRIBUTION") + +counts = sorted([len(ps) for ps in by_filing.values()]) +n = len(counts) + +import math + +mean = sum(counts) / n +variance = sum((c - mean) ** 2 for c in counts) / n +stdev = math.sqrt(variance) + +def percentile(sorted_list, pct): + idx = pct / 100 * (len(sorted_list) - 1) + lo = int(math.floor(idx)) + hi = int(math.ceil(idx)) + if lo == hi: + return sorted_list[lo] + frac = idx - lo + return sorted_list[lo] * (1 - frac) + sorted_list[hi] * frac + +print(f" Min: {counts[0]}") +print(f" P5: {percentile(counts, 5):.1f}") +print(f" P25: {percentile(counts, 25):.1f}") +print(f" Median: {percentile(counts, 50):.1f}") +print(f" P75: {percentile(counts, 75):.1f}") +print(f" P95: {percentile(counts, 95):.1f}") +print(f" Max: {counts[-1]}") +print(f" Stdev: {stdev:.2f}") +print(f" Mean: {mean:.2f}") + +# Histogram buckets +buckets = [1, 2, 3, 5, 10, 15, 20, 30, 50, 100, 200] +print("\n Histogram:") +prev = 0 +for b in buckets: + c = sum(1 for x in counts if prev < x <= b) + if c > 0: + print(f" ({prev+1}-{b}]: {c:>5} filings") + prev = b +c = sum(1 for x in counts if x > buckets[-1]) +if c > 0: + print(f" (>{buckets[-1]}): {c:>5} filings") + +# Fewest paragraphs +print("\n --- 10 filings with FEWEST paragraphs ---") +sorted_filings = sorted(by_filing.items(), key=lambda x: len(x[1])) +for acc, ps in sorted_filings[:10]: + company = ps[0]["filing"]["companyName"] + print(f"\n [{acc}] {company} — {len(ps)} paragraph(s):") + for p in sorted(ps, key=lambda x: x["paragraphIndex"]): + print(f" p{p['paragraphIndex']} ({p['wordCount']}w): {truncate(p['text'], 150)}") + +# Most paragraphs +print("\n --- 10 filings with MOST paragraphs ---") +for acc, ps in sorted_filings[-10:]: + company = ps[0]["filing"]["companyName"] + print(f"\n [{acc}] {company} — {len(ps)} paragraph(s):") + for p in sorted(ps, key=lambda x: x["paragraphIndex"])[:5]: + print(f" p{p['paragraphIndex']} ({p['wordCount']}w): {truncate(p['text'], 150)}") + if len(ps) > 5: + 
print(f" ... ({len(ps) - 5} more)") + +# --------------------------------------------------------------------------- +# 2. Suspiciously long paragraphs +# --------------------------------------------------------------------------- +section_header("2. SUSPICIOUSLY LONG PARAGRAPHS (top 20 by word count)") + +sorted_by_wc = sorted(paragraphs, key=lambda p: p["wordCount"], reverse=True) + +for i, p in enumerate(sorted_by_wc[:20]): + acc = p["filing"]["accessionNumber"] + company = p["filing"]["companyName"] + text = p["text"] + first200 = text[:200] + last200 = text[-200:] if len(text) > 400 else "" + print(f"\n #{i+1}: {p['wordCount']} words | p{p['paragraphIndex']} | {company}") + print(f" Acc: {acc}") + print(f" FIRST 200: {first200}") + if last200: + print(f" LAST 200: {last200}") + + # Check for signs of merged paragraphs + issues = [] + if p["wordCount"] > 300: + issues.append("VERY LONG (>300w)") + # Look for heading-like patterns mid-text (capitalized lines, bold markers) + lines = text.split("\n") + if len(lines) > 1: + issues.append(f"CONTAINS {len(lines)} LINES (possible merge)") + # Look for sentence-ending followed by topic shift + sentences = re.split(r'(?<=[.!?])\s+', text) + if len(sentences) > 8: + issues.append(f"{len(sentences)} sentences") + if issues: + print(f" FLAGS: {', '.join(issues)}") + +# --------------------------------------------------------------------------- +# 3. Suspiciously short paragraphs +# --------------------------------------------------------------------------- +section_header("3. 
SUSPICIOUSLY SHORT PARAGRAPHS (<25 words)") + +short = [p for p in paragraphs if p["wordCount"] < 25] +print(f"\n Total paragraphs <25 words: {len(short)} ({100*len(short)/len(paragraphs):.1f}%)") + +# Categorize +headings = [] +standalone = [] +fragments = [] +list_items = [] + +heading_patterns = re.compile( + r"^(risk management|cybersecurity|governance|strategy|board|" + r"oversight|incident|material|information security|" + r"risk factors|item 1c|risk management and strategy|" + r"risk management, strategy|governance, risk management)" + , re.IGNORECASE +) + +for p in short: + text = p["text"].strip() + lower = text.lower() + + # Heading detection: short, no period at end, title-case-ish + is_heading = False + if len(text.split()) <= 8 and not text.endswith("."): + is_heading = True + if heading_patterns.match(lower): + is_heading = True + if text.isupper() and len(text.split()) <= 10: + is_heading = True + + # List item: starts with bullet, dash, number, or letter + is_list = bool(re.match(r"^(\d+[.)]\s|[-•●◦▪]\s|[a-z][.)]\s|\([a-z]\)\s|\(\d+\)\s)", text)) + + # Fragment: doesn't end with period/question/exclamation and not a heading + is_fragment = not is_heading and not is_list and not re.search(r'[.!?"]$', text.rstrip()) + + if is_heading: + headings.append(p) + elif is_list: + list_items.append(p) + elif is_fragment: + fragments.append(p) + else: + standalone.append(p) + +print(f" Headings: {len(headings)}") +print(f" Standalone sentences:{len(standalone)}") +print(f" Fragments: {len(fragments)}") +print(f" List items: {len(list_items)}") + +def show_examples(label, items, count): + sample = items[:count] if len(items) <= count else random.sample(items, count) + print(f"\n --- {label} (showing {len(sample)} of {len(items)}) ---") + for p in sample: + acc = p["filing"]["accessionNumber"] + print(f" [{p['wordCount']}w] p{p['paragraphIndex']} | {truncate(p['text'], 120)}") + print(f" {p['filing']['companyName']} | {acc}") + +random.seed(42) 
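+# Sanity-check the short-paragraph heuristics on two synthetic strings
+# (invented examples, not corpus data), so a regex regression fails loudly
+# here instead of silently skewing the category counts above. The patterns
+# restate the first alternatives of heading_patterns and the list-item regex.
+_heading_demo = re.compile(r"^(risk management|cybersecurity|governance)", re.IGNORECASE)
+assert _heading_demo.match("Risk Management and Strategy") is not None
+assert re.match(r"^(\d+[.)]\s|[-•●◦▪]\s|[a-z][.)]\s|\([a-z]\)\s|\(\d+\)\s)",
+                "1) We engage third-party assessors.") is not None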
+show_examples("Headings", headings, 10) +show_examples("Standalone sentences", standalone, 8) +show_examples("Fragments", fragments, 8) +show_examples("List items", list_items, 4) + +# --------------------------------------------------------------------------- +# 4. Sequential paragraph coherence +# --------------------------------------------------------------------------- +section_header("4. SEQUENTIAL PARAGRAPH COHERENCE (20 random filings)") + +random.seed(123) +sample_accs = random.sample(list(by_filing.keys()), min(20, len(by_filing))) + +mid_sentence_breaks = [] +topic_shifts = [] + +for acc in sample_accs: + ps = sorted(by_filing[acc], key=lambda x: x["paragraphIndex"]) + for i in range(len(ps) - 1): + curr = ps[i] + nxt = ps[i + 1] + curr_text = curr["text"].strip() + nxt_text = nxt["text"].strip() + + # Check: does current paragraph end mid-sentence? + # Signs: ends with comma, semicolon, conjunction, lowercase word, no terminal punctuation + ends_mid = False + if curr_text and not re.search(r'[.!?:"\)]$', curr_text): + ends_mid = True + if curr_text and re.search(r'(,|;|\band\b|\bor\b|\bbut\b|\bthat\b|\bwhich\b)\s*$', curr_text): + ends_mid = True + + # Check: does next paragraph start with lowercase (continuation)? 
+ starts_lower = bool(nxt_text) and nxt_text[0].islower() + + if ends_mid or starts_lower: + mid_sentence_breaks.append({ + "acc": acc, + "company": curr["filing"]["companyName"], + "curr_idx": curr["paragraphIndex"], + "nxt_idx": nxt["paragraphIndex"], + "curr_end": curr_text[-150:] if len(curr_text) > 150 else curr_text, + "nxt_start": nxt_text[:150] if len(nxt_text) > 150 else nxt_text, + "ends_mid": ends_mid, + "starts_lower": starts_lower, + }) + +print(f"\n Checked {len(sample_accs)} filings") +print(f" Potential mid-sentence breaks found: {len(mid_sentence_breaks)}") + +print("\n --- Examples of mid-sentence / continuation breaks ---") +for ex in mid_sentence_breaks[:5]: + print(f"\n [{ex['acc']}] {ex['company']}") + print(f" p{ex['curr_idx']} ENDS: ...{ex['curr_end']}") + print(f" p{ex['nxt_idx']} STARTS: {ex['nxt_start']}...") + flags = [] + if ex["ends_mid"]: + flags.append("no terminal punctuation") + if ex["starts_lower"]: + flags.append("next starts lowercase") + print(f" FLAGS: {', '.join(flags)}") + +if len(mid_sentence_breaks) == 0: + print(" (none found)") + +# Also check for topic shifts within single paragraphs (long ones in sampled filings) +print("\n --- Checking for intra-paragraph topic shifts ---") +shift_examples = [] +for acc in sample_accs: + for p in by_filing[acc]: + if p["wordCount"] < 150: + continue + text = p["text"] + # Look for heading-like substrings mid-text + # e.g., "Risk Management" or "Governance" appearing after a sentence end + matches = list(re.finditer( + r'(?<=[.!?]\s)(Risk Management|Governance|Strategy|Cybersecurity|' + r'Board of Directors|Incident Response|Overview|Third.Party)', + text + )) + if matches: + shift_examples.append({ + "acc": acc, + "company": p["filing"]["companyName"], + "idx": p["paragraphIndex"], + "wordCount": p["wordCount"], + "match": matches[0].group(), + "context": text[max(0, matches[0].start()-80):matches[0].end()+80], + }) + +print(f" Paragraphs with possible embedded topic headers: 
{len(shift_examples)}") +for ex in shift_examples[:5]: + print(f"\n [{ex['acc']}] {ex['company']} p{ex['idx']} ({ex['wordCount']}w)") + print(f" Found '{ex['match']}' mid-paragraph:") + print(f" ...{ex['context']}...") + +# --------------------------------------------------------------------------- +# 5. Paragraph index gaps +# --------------------------------------------------------------------------- +section_header("5. PARAGRAPH INDEX GAPS & DUPLICATES") + +gap_filings = [] +dup_filings = [] + +for acc, ps in by_filing.items(): + indices = sorted(p["paragraphIndex"] for p in ps) + + # Check for duplicates + if len(indices) != len(set(indices)): + counter = Counter(indices) + dups = {k: v for k, v in counter.items() if v > 1} + dup_filings.append((acc, ps[0]["filing"]["companyName"], dups)) + + # Check for gaps (should be 0, 1, 2, ...) + expected = list(range(indices[0], indices[0] + len(indices))) + if indices != expected: + missing = set(expected) - set(indices) + extra = set(indices) - set(expected) + if missing or extra: + gap_filings.append((acc, ps[0]["filing"]["companyName"], sorted(missing), sorted(extra), indices)) + +print(f"\n Filings with duplicate paragraph indices: {len(dup_filings)}") +for acc, company, dups in dup_filings[:10]: + print(f" [{acc}] {company}: duplicates at indices {dups}") + +print(f"\n Filings with index gaps: {len(gap_filings)}") +for acc, company, missing, extra, indices in gap_filings[:10]: + print(f" [{acc}] {company}") + if missing: + print(f" Missing indices: {missing}") + if extra: + print(f" Unexpected indices: {extra}") + print(f" Actual indices: {indices}") + +# Check if all start at 0 +non_zero_start = [(acc, ps) for acc, ps in by_filing.items() + if min(p["paragraphIndex"] for p in ps) != 0] +print(f"\n Filings not starting at index 0: {len(non_zero_start)}") +for acc, ps in non_zero_start[:5]: + start = min(p["paragraphIndex"] for p in ps) + print(f" [{acc}] {ps[0]['filing']['companyName']}: starts at {start}") + +# 
--------------------------------------------------------------------------- +# 6. Cross-filing duplicate paragraphs +# --------------------------------------------------------------------------- +section_header("6. CROSS-FILING DUPLICATE PARAGRAPHS") + +# Group by textHash +by_hash = defaultdict(list) +for p in paragraphs: + by_hash[p["textHash"]].append(p) + +# Find hashes appearing in multiple filings +cross_filing_dupes = {} +for h, ps in by_hash.items(): + accs = set(p["filing"]["accessionNumber"] for p in ps) + if len(accs) > 1: + cross_filing_dupes[h] = ps + +total_dupe_paragraphs = sum(len(ps) for ps in cross_filing_dupes.values()) +unique_dupe_texts = len(cross_filing_dupes) + +print(f"\n Unique paragraph texts appearing in >1 filing: {unique_dupe_texts}") +print(f" Total paragraphs that are cross-filing duplicates: {total_dupe_paragraphs} ({100*total_dupe_paragraphs/len(paragraphs):.1f}%)") + +# Also count same-hash within same filing +within_filing_dupes = 0 +for h, ps in by_hash.items(): + accs = [p["filing"]["accessionNumber"] for p in ps] + if len(accs) != len(set(accs)): + within_filing_dupes += 1 +print(f" Hashes duplicated WITHIN a single filing: {within_filing_dupes}") + +# Top 20 most duplicated +sorted_dupes = sorted(cross_filing_dupes.items(), key=lambda x: len(x[1]), reverse=True) + +print("\n --- Top 20 most duplicated texts across filings ---") +for i, (h, ps) in enumerate(sorted_dupes[:20]): + n_filings = len(set(p["filing"]["accessionNumber"] for p in ps)) + text = ps[0]["text"] + print(f"\n #{i+1}: hash={h} | {n_filings} filings | {ps[0]['wordCount']}w") + print(f" TEXT: {truncate(text, 200)}") + +# Boilerplate analysis: texts appearing in 3+ filings +boilerplate_threshold = 3 +boilerplate_hashes = {h for h, ps in cross_filing_dupes.items() + if len(set(p["filing"]["accessionNumber"] for p in ps)) >= boilerplate_threshold} +boilerplate_paragraphs = sum(len(by_hash[h]) for h in boilerplate_hashes) +print(f"\n Boilerplate (text in 
{boilerplate_threshold}+ filings):") +print(f" Unique texts: {len(boilerplate_hashes)}") +print(f" Total paragraphs: {boilerplate_paragraphs} ({100*boilerplate_paragraphs/len(paragraphs):.1f}%)") + +print("\n" + "=" * 80) +print(" AUDIT COMPLETE") +print("=" * 80) diff --git a/scripts/data_quality_audit.py b/scripts/data_quality_audit.py new file mode 100644 index 0000000..4a2a7ee --- /dev/null +++ b/scripts/data_quality_audit.py @@ -0,0 +1,539 @@ +#!/usr/bin/env python3 +""" +Novel data quality audit for paragraphs-clean.jsonl. +READ-ONLY: prints findings to stdout, does not modify any files. +""" + +import json +import re +import sys +from collections import Counter, defaultdict +from pathlib import Path + +DATA_PATH = Path(__file__).resolve().parent.parent / "data" / "paragraphs" / "paragraphs-clean.jsonl" + +# ── Cybersecurity domain keywords (broad) ────────────────────────────── +CYBER_KEYWORDS = { + "cyber", "cybersecurity", "security", "breach", "incident", "threat", + "vulnerability", "malware", "ransomware", "phishing", "firewall", + "encryption", "intrusion", "unauthorized", "attack", "hacker", + "data protection", "information security", "network security", + "access control", "authentication", "risk management", "ciso", + "chief information security", "chief information officer", + "information technology", "it systems", "data privacy", "privacy", + "personally identifiable", "pii", "soc", "nist", "iso 27001", + "penetration test", "disaster recovery", "business continuity", + "third party", "vendor", "supply chain", "cloud", "endpoint", + "monitoring", "detection", "response", "remediation", "patch", + "compliance", "regulatory", "safeguard", "protect", "secure", + "confidential", "integrity", "availability", "resilience", + "governance", "oversight", "board of directors", "audit committee", + "risk factor", "material", "disclosure", "1c", "item 1c", +} + +# ── Non-cyber legal boilerplate patterns ──────────────────────────────── +BOILERPLATE_PATTERNS 
= [ + re.compile(r"forward[- ]looking\s+statements?", re.I), + re.compile(r"safe\s+harbor", re.I), + re.compile(r"private\s+securities\s+litigation\s+reform\s+act", re.I), + re.compile(r"cautionary\s+statement", re.I), + re.compile(r"except\s+as\s+required\s+by\s+law.*no\s+obligation\s+to\s+update", re.I), + re.compile(r"this\s+(annual\s+)?report\s+(on\s+form\s+10-k\s+)?contains?\s+forward", re.I), +] + +# ── SEC item cross-reference pattern ──────────────────────────────────── +SEC_ITEM_RE = re.compile(r"\bItem\s+(\d+[A-Z]?)\b", re.I) + +# ── Dollar amount pattern ────────────────────────────────────────────── +DOLLAR_RE = re.compile(r"\$[\d,]+(?:\.\d+)?\s*(?:thousand|million|billion|trillion)?", re.I) + +# ── Date patterns (unusual formats) ──────────────────────────────────── +DATE_PATTERNS = [ + # MM/DD/YYYY or MM-DD-YYYY + re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"), + # Month DD, YYYY + re.compile(r"\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},?\s+\d{4}\b", re.I), + # DD Month YYYY + re.compile(r"\b\d{1,2}\s+(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\b", re.I), + # YYYY-MM-DD (ISO) + re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), +] + +# ── Bullet point characters ──────────────────────────────────────────── +BULLET_RE = re.compile(r"[\u2022\u2023\u25E6\u2043\u2219\u25AA\u25AB\u25CF\u25CB\u25A0\u25A1]") + +# ── Helpers ───────────────────────────────────────────────────────────── +def truncate(text: str, max_len: int = 200) -> str: + if len(text) <= max_len: + return text + return text[:max_len] + "..." 
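+
+
+# Shape of one input record as this audit assumes it. Field names are taken
+# from the accesses below; the values in this sample are invented, not from
+# the corpus.
+_SAMPLE_RECORD = {
+    "id": "0000000000-24-000001:p0",
+    "text": "Our board of directors oversees cybersecurity risk.",
+    "wordCount": 7,
+    "textHash": "0" * 40,
+    "paragraphIndex": 0,
+    "filing": {
+        "accessionNumber": "0000000000-24-000001",
+        "companyName": "Example Corp",
+    },
+}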
+
+
+def print_section(title: str):
+    print(f"\n{'=' * 80}")
+    print(f" {title}")
+    print(f"{'=' * 80}")
+
+
+def print_finding(name: str, concern: str, count: int, total: int, examples: list[dict]):
+    pct = count / total * 100 if total else 0
+    print(f"\n--- {name} [{concern} CONCERN] ---")
+    print(f"  Count: {count:,} / {total:,} ({pct:.2f}%)")
+    for i, ex in enumerate(examples[:5]):
+        filing = ex.get("filing", {})
+        company = filing.get("companyName", "?")
+        print(f"  Example {i+1} [{company}]:")
+        print(f"    {truncate(ex['text'], 300)}")
+    if count > 5:
+        print(f"  ... and {count - 5:,} more")
+
+
+def has_cyber_relevance(text_lower: str) -> bool:
+    # Short keywords ("soc", "pii", "1c", "nist", "ciso") match as substrings
+    # of unrelated words ("social", "associates"), so require word boundaries
+    # for them; longer phrases can use plain substring tests.
+    for kw in CYBER_KEYWORDS:
+        if len(kw) <= 4:
+            if re.search(rf"\b{re.escape(kw)}\b", text_lower):
+                return True
+        elif kw in text_lower:
+            return True
+    return False
+
+
+# ── Load data ────────────────────────────────────────────────────────── +
+def load_data():
+    paragraphs = []
+    with open(DATA_PATH) as f:
+        for line in f:
+            paragraphs.append(json.loads(line))
+    return paragraphs
+
+
+def main():
+    print("Loading data...")
+    paragraphs = load_data()
+    total = len(paragraphs)
+    print(f"Loaded {total:,} paragraphs.\n")
+
+    # Pre-compute lowercase texts
+    texts_lower = [p["text"].lower() for p in paragraphs]
+
+    # ════════════════════════════════════════════════════════════════════
+    print_section("1. CHARACTER-LEVEL ANOMALIES")
+    # ════════════════════════════════════════════════════════════════════
+
+    # 1a. High uppercase ratio (>30%)
+    high_upper = []
+    for p in paragraphs:
+        t = p["text"]
+        alpha = sum(1 for c in t if c.isalpha())
+        if alpha < 10:
+            continue
+        upper = sum(1 for c in t if c.isupper())
+        ratio = upper / alpha
+        if ratio > 0.30:
+            high_upper.append({**p, "_ratio": ratio})
+    high_upper.sort(key=lambda x: x["_ratio"], reverse=True)
+    print_finding("High uppercase ratio (>30% of alpha chars)", "MEDIUM",
+                  len(high_upper), total, high_upper)
+
+    # 1b. 
Unusual punctuation density + high_punct = [] + for p in paragraphs: + t = p["text"] + if len(t) < 30: + continue + semis = t.count(";") + colons = t.count(":") + dashes = t.count("—") + t.count("–") + t.count("-") + punct_count = semis + colons + dashes + density = punct_count / len(t) + if density > 0.05: + high_punct.append({**p, "_density": density, "_semis": semis, "_colons": colons, "_dashes": dashes}) + high_punct.sort(key=lambda x: x["_density"], reverse=True) + print_finding("High punctuation density (semicolons/colons/dashes >5% of chars)", "LOW", + len(high_punct), total, high_punct) + + # 1c. Non-ASCII characters + non_ascii_paras = [] + non_ascii_chars_all = Counter() + for p in paragraphs: + t = p["text"] + non_ascii = [(c, hex(ord(c)), ord(c)) for c in t if ord(c) > 127] + if non_ascii: + chars_found = set((c, h) for c, h, _ in non_ascii) + for c, h, _ in non_ascii: + non_ascii_chars_all[f"{c} ({h})"] += 1 + non_ascii_paras.append({**p, "_chars": chars_found}) + print_finding("Paragraphs with non-ASCII characters", "MEDIUM", + len(non_ascii_paras), total, non_ascii_paras) + if non_ascii_chars_all: + print("\n Non-ASCII character frequency:") + for char_repr, cnt in non_ascii_chars_all.most_common(20): + print(f" {char_repr}: {cnt:,} occurrences") + + # 1d. Unusual whitespace (multiple spaces, tabs) + multi_space_re = re.compile(r" +") + tab_re = re.compile(r"\t") + whitespace_issues = [] + for p in paragraphs: + t = p["text"] + multi = len(multi_space_re.findall(t)) + tabs = len(tab_re.findall(t)) + if multi > 0 or tabs > 0: + whitespace_issues.append({**p, "_multi_spaces": multi, "_tabs": tabs}) + print_finding("Unusual whitespace (multiple spaces or tabs)", "MEDIUM", + len(whitespace_issues), total, whitespace_issues) + + # ════════════════════════════════════════════════════════════════════ + print_section("2. CONTENT ANOMALIES") + # ════════════════════════════════════════════════════════════════════ + + # 2a. 
Dollar amounts + dollar_paras = [] + for p in paragraphs: + matches = DOLLAR_RE.findall(p["text"]) + if matches: + dollar_paras.append({**p, "_amounts": matches}) + print_finding("Paragraphs with dollar amounts", "MEDIUM", + len(dollar_paras), total, dollar_paras) + if dollar_paras: + # Show distribution of dollar amounts + all_amounts = [] + for dp in dollar_paras: + all_amounts.extend(dp["_amounts"]) + print(f"\n Total dollar amount mentions: {len(all_amounts):,}") + amount_counter = Counter(all_amounts) + print(" Most common amounts:") + for amt, cnt in amount_counter.most_common(10): + print(f" {amt}: {cnt:,}") + + # 2b. Dates in text + date_paras = [] + for p in paragraphs: + t = p["text"] + found_dates = [] + for pat in DATE_PATTERNS: + found_dates.extend(pat.findall(t)) + if found_dates: + date_paras.append({**p, "_dates": found_dates}) + print_finding("Paragraphs containing dates", "LOW", + len(date_paras), total, date_paras) + if date_paras: + all_dates = [] + for dp in date_paras: + all_dates.extend(dp["_dates"]) + print(f"\n Total date mentions: {len(all_dates):,}") + + # 2c. Cross-references to other SEC items + cross_ref_paras = [] + for p in paragraphs: + matches = SEC_ITEM_RE.findall(p["text"]) + # Filter out Item 1C (that's expected) + other_items = [m for m in matches if m.upper() != "1C"] + if other_items: + cross_ref_paras.append({**p, "_items": other_items}) + # Count which items are referenced + item_counts = Counter() + for crp in cross_ref_paras: + for item in crp["_items"]: + item_counts[f"Item {item}"] += 1 + print_finding("Cross-references to non-1C SEC items", "HIGH", + len(cross_ref_paras), total, cross_ref_paras) + if item_counts: + print("\n Referenced items:") + for item, cnt in item_counts.most_common(): + print(f" {item}: {cnt:,}") + + # 2d. 
Non-cyber legal boilerplate + boilerplate_paras = [] + for p in paragraphs: + t = p["text"] + matched = [] + for pat in BOILERPLATE_PATTERNS: + if pat.search(t): + matched.append(pat.pattern[:60]) + if matched: + boilerplate_paras.append({**p, "_patterns": matched}) + print_finding("Non-cybersecurity legal boilerplate", "HIGH", + len(boilerplate_paras), total, boilerplate_paras) + + # ════════════════════════════════════════════════════════════════════ + print_section("3. STRUCTURAL ANOMALIES") + # ════════════════════════════════════════════════════════════════════ + + # 3a. Bullet points mid-text + bullet_paras = [] + for p in paragraphs: + t = p["text"] + if BULLET_RE.search(t): + bullet_paras.append(p) + elif re.search(r"(?:^|\n)\s*[-*]\s+\w", t): + bullet_paras.append(p) + print_finding("Paragraphs with bullet points mid-text", "MEDIUM", + len(bullet_paras), total, bullet_paras) + + # 3b. Embedded newlines + newline_paras = [] + for p in paragraphs: + t = p["text"] + nl_count = t.count("\n") + if nl_count > 0: + newline_paras.append({**p, "_newlines": nl_count}) + newline_paras.sort(key=lambda x: x["_newlines"], reverse=True) + print_finding("Paragraphs with embedded newlines", "MEDIUM", + len(newline_paras), total, newline_paras) + + # 3c. Mid-paragraph headings (ALL CAPS phrase of 3+ words followed by different content) + mid_heading_re = re.compile(r"(?<=\. )([A-Z][A-Z\s]{10,}[A-Z])(?=\.?\s+[A-Z][a-z])") + mid_heading_paras = [] + for p in paragraphs: + t = p["text"] + matches = mid_heading_re.findall(t) + if matches: + mid_heading_paras.append({**p, "_headings": matches}) + print_finding("Mid-paragraph headings (ALL CAPS phrase mid-sentence)", "MEDIUM", + len(mid_heading_paras), total, mid_heading_paras) + + # ════════════════════════════════════════════════════════════════════ + print_section("4. OUTLIER DETECTION") + # ════════════════════════════════════════════════════════════════════ + + # 4a. 
Extremely high word count (>400) + long_paras = [p for p in paragraphs if p["wordCount"] > 400] + long_paras.sort(key=lambda x: x["wordCount"], reverse=True) + print_finding("Extremely long paragraphs (>400 words)", "HIGH", + len(long_paras), total, long_paras) + if long_paras: + wc_values = [p["wordCount"] for p in long_paras] + print(f"\n Word count range: {min(wc_values)} - {max(wc_values)}") + print(f" Mean: {sum(wc_values)/len(wc_values):.0f}") + + # 4b. Low information density + # Common English stopwords + STOPWORDS = { + "the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", + "of", "with", "by", "from", "is", "are", "was", "were", "be", "been", + "being", "have", "has", "had", "do", "does", "did", "will", "would", + "could", "should", "may", "might", "shall", "can", "that", "which", + "who", "whom", "this", "these", "those", "it", "its", "we", "our", + "us", "they", "their", "them", "he", "she", "his", "her", "as", + "if", "not", "no", "nor", "so", "than", "too", "very", "such", + "also", "each", "any", "all", "both", "other", "some", "into", + "through", "during", "before", "after", "about", "between", "under", + "over", "above", "up", "down", "out", "off", "then", "once", + } + low_info_paras = [] + for p in paragraphs: + words = re.findall(r"[a-z]+", p["text"].lower()) + if len(words) < 20: + continue + stop_ratio = sum(1 for w in words if w in STOPWORDS) / len(words) + if stop_ratio > 0.65: + low_info_paras.append({**p, "_stop_ratio": stop_ratio}) + low_info_paras.sort(key=lambda x: x["_stop_ratio"], reverse=True) + print_finding("Low information density (>65% stopwords)", "LOW", + len(low_info_paras), total, low_info_paras) + + # 4c. 
Exact substring matches across filings + print("\n--- Exact substring matches across filings [HIGH CONCERN] ---") + print(" (Checking paragraphs that appear as substrings of others in different filings...)") + # Group by accession number for efficiency + by_accession = defaultdict(list) + for p in paragraphs: + acc = p["filing"]["accessionNumber"] + by_accession[acc].append(p) + + # For efficiency, only check paragraphs 50-200 chars (likely fragments/duplicates) + # Sort by length so shorter ones are checked as substrings of longer ones + candidates = [(p["text"], p["filing"]["accessionNumber"], p["filing"]["companyName"], p["id"]) + for p in paragraphs if 50 <= len(p["text"]) <= 200] + longer_texts = [(p["text"], p["filing"]["accessionNumber"], p["filing"]["companyName"]) + for p in paragraphs if len(p["text"]) > 200] + + substring_matches = [] + # Use a set for dedup + seen = set() + # Only check a sample for performance + check_limit = min(len(candidates), 3000) + for i in range(check_limit): + cand_text, cand_acc, cand_co, cand_id = candidates[i] + for long_text, long_acc, long_co in longer_texts[:5000]: + if cand_acc == long_acc: + continue # same filing, skip + if cand_text in long_text and cand_id not in seen: + seen.add(cand_id) + substring_matches.append({ + "text": cand_text, + "filing": {"companyName": cand_co, "accessionNumber": cand_acc}, + "_found_in": long_co, + }) + break + print(f" Count (sampled {check_limit:,} short paras against {min(len(longer_texts), 5000):,} long paras): {len(substring_matches):,}") + for i, ex in enumerate(substring_matches[:5]): + print(f" Example {i+1} [{ex['filing']['companyName']}] (also in {ex['_found_in']}):") + print(f" {truncate(ex['text'], 300)}") + if len(substring_matches) > 5: + print(f" ... and {len(substring_matches) - 5:,} more") + + # ════════════════════════════════════════════════════════════════════ + print_section("5. 
SEMANTIC COHERENCE")
+    # ════════════════════════════════════════════════════════════════════
+
+    # 5a. Company name mismatch — look for SPECIFIC named companies in text
+    # that differ from the filing company. Filter out generic refs like "the Company".
+    company_name_mismatches = []
+    # Pattern: proper noun(s) + legal suffix, NOT preceded by "the "
+    specific_company_re = re.compile(
+        r"(?<!the\s)\b([A-Z][\w&.-]+(?:\s+[A-Z][\w&.-]+)*),?\s+"
+        r"(?:Inc|Corp|Corporation|Company|LLC|Ltd|LP|PLC)\.?(?=[\s,.;:]|$)"
+    )
+    for p in paragraphs:
+        filing_name = p["filing"]["companyName"].lower()
+        hits = []
+        for m in specific_company_re.finditer(p["text"]):
+            name = m.group(1).strip()
+            # Ignore matches that are just the filer's own name
+            if name.lower() not in filing_name:
+                hits.append(name)
+        if hits:
+            company_name_mismatches.append({**p, "_names": hits})
+    print_finding("Company name mismatches (other specific companies named)", "HIGH",
+                  len(company_name_mismatches), total, company_name_mismatches)
+
+    # 5b. URLs in text
+    url_re = re.compile(r"https?://\S+|www\.\S+", re.I)
+    url_paras = [p for p in paragraphs if url_re.search(p["text"])]
+    print_finding("URLs in text", "MEDIUM", len(url_paras), total, url_paras)
+
+    # 5c. No cybersecurity keywords at all
+    no_cyber = [p for p, tl in zip(paragraphs, texts_lower)
+                if not has_cyber_relevance(tl)]
+    print_finding("No cybersecurity keywords", "HIGH",
+                  len(no_cyber), total, no_cyber)
+
+    # 5d. Footnote references, e.g. "(1)" or "[2]"
+    footnote_re = re.compile(r"\(\d{1,2}\)|\[\d{1,2}\]")
+    footnote_paras = [p for p in paragraphs if footnote_re.search(p["text"])]
+    print_finding("Footnote references", "LOW",
+                  len(footnote_paras), total, footnote_paras)
+
+    # 5e. Table / numeric data: digit-heavy text suggests a flattened table
+    table_paras = []
+    for p in paragraphs:
+        t = p["text"]
+        if len(t) >= 30:
+            digit_ratio = sum(1 for c in t if c.isdigit()) / len(t)
+            if digit_ratio > 0.15:
+                table_paras.append({**p, "_digit_ratio": digit_ratio})
+    print_finding("Table/numeric data (>15% digits)", "HIGH",
+                  len(table_paras), total, table_paras)
+
+    # 5f. Encoding artifacts (mojibake sequences, HTML entities, U+FFFD)
+    encoding_re = re.compile(r"â€|Ã.|&#\d+;|&[a-z]{2,6};|\ufffd")
+    encoding_paras = [p for p in paragraphs if encoding_re.search(p["text"])]
+    print_finding("Encoding artifacts", "HIGH",
+                  len(encoding_paras), total, encoding_paras)
+
+    # 5g. Repeated sentences within a single paragraph
+    repeated_sent_paras = []
+    for p in paragraphs:
+        sentences = re.split(r"(?<=[.!?])\s+", p["text"])
+        sent_counter = Counter(s.strip() for s in sentences
+                               if len(s.strip()) > 20)
+        dupes = {s: c for s, c in sent_counter.items() if c > 1}
+        if dupes:
+            repeated_sent_paras.append({**p, "_dupes": dupes})
+    print_finding("Paragraphs with repeated sentences", "HIGH",
+                  len(repeated_sent_paras), total, repeated_sent_paras)
+
+    # ════════════════════════════════════════════════════════════════════
+    print_section("SUMMARY")
+    # ════════════════════════════════════════════════════════════════════
+    print(f"\n  Total paragraphs analyzed: {total:,}")
+    print(f"\n  HIGH concern findings:")
+    print(f"   - Cross-references to non-1C items: {len(cross_ref_paras):,}")
+    print(f"   - Non-cyber legal boilerplate: {len(boilerplate_paras):,}")
+    print(f"   - Extremely long paragraphs (>400 words): {len(long_paras):,}")
+    print(f"   - Company name mismatches: {len(company_name_mismatches):,}")
+    print(f"   - No cybersecurity keywords: {len(no_cyber):,}")
+    print(f"   - Table/numeric data: {len(table_paras):,}")
+    print(f"   - Encoding artifacts: {len(encoding_paras):,}")
+    print(f"   - Repeated sentences: {len(repeated_sent_paras):,}")
+    print(f"   - Exact substring matches (sampled): {len(substring_matches):,}")
+    print(f"\n  MEDIUM concern findings:")
+    print(f"   - High uppercase ratio: {len(high_upper):,}")
+    print(f"   - Non-ASCII characters: {len(non_ascii_paras):,}")
+    print(f"   - Unusual whitespace: {len(whitespace_issues):,}")
+    print(f"   - Dollar amounts: {len(dollar_paras):,}")
+    print(f"   - Bullet points mid-text: {len(bullet_paras):,}")
+    print(f"   - Embedded newlines: {len(newline_paras):,}")
+    print(f"   - Mid-paragraph headings: {len(mid_heading_paras):,}")
+    print(f"   - URLs 
in text: {len(url_paras):,}") + print(f"\n LOW concern findings:") + print(f" - High punctuation density: {len(high_punct):,}") + print(f" - Date mentions: {len(date_paras):,}") + print(f" - Low information density: {len(low_info_paras):,}") + print(f" - Footnote references: {len(footnote_paras):,}") + + +if __name__ == "__main__": + main() diff --git a/scripts/detect_generators.py b/scripts/detect_generators.py new file mode 100644 index 0000000..fcdb059 --- /dev/null +++ b/scripts/detect_generators.py @@ -0,0 +1,537 @@ +#!/usr/bin/env python3 +""" +Detect HTML generators for all SEC filing HTML files. +Phase 1: Exhaustive signature detection +Phase 2: Cluster remaining unknowns +Phase 3: Summary statistics +""" + +import os +import re +import sys +from collections import defaultdict, Counter +from pathlib import Path + +HTML_DIR = Path("/home/joey/Documents/sec-cyBERT/data/raw/html") +READ_BYTES = 20_000 + +# Known SEC filing agent CIKs (accession number prefixes) +FILING_AGENT_CIKS = { + "0000950170": "Donnelley Financial Solutions", + "0001193125": "Donnelley Financial Solutions", + "0001558370": "Toppan Merrill", + "0001654954": "Toppan Merrill", +} + + +def detect_generator(filepath: str) -> tuple[str, str]: + """Read first 20KB of file and detect generator. Returns (generator, evidence).""" + with open(filepath, "rb") as f: + raw = f.read(READ_BYTES) + + text = raw.decode("utf-8", errors="replace") + text_lower = text.lower() + + # --- Explicit generator metadata --- + + # 1. 
<meta name="generator"> tags (both attribute orderings)
+    m = re.search(
+        r'<meta\s+name=["\']generator["\']\s+content=["\']([^"\']+)["\']',
+        text, re.I)
+    if not m:
+        m = re.search(
+            r'<meta\s+content=["\']([^"\']+)["\']\s+name=["\']generator["\']',
+            text, re.I)
+    if m:
+        val = m.group(1).strip()
+        return _normalize_generator(val), f"meta generator: {val}"
+
+    # --- HTML comment signatures ---
+
+    # Workiva / Wdesk
+    if re.search(r"<!--\s*Created with the Workiva Platform", text, re.I):
+        return "Workiva", "comment: Created with the Workiva Platform"
+    if re.search(r"<!--[^>]*Copyright[^>]*Workiva", text, re.I):
+        return "Workiva", "comment: Copyright Workiva"
+    if re.search(r"<!--\s*Document created using Wdesk", text, re.I):
+        return "Workiva", "comment: Document created using Wdesk"
+
+    # Toppan Merrill / Bridge
+    if re.search(r"<!--[^>]*Toppan Merrill", text, re.I):
+        return "Toppan Merrill", "comment: Toppan Merrill"
+    if re.search(r"<!--[^>]*Merrill Bridge", text, re.I):
+        return "Toppan Merrill", "comment: Merrill Bridge"
+
+    # Donnelley Financial Solutions / RR Donnelley
+    if re.search(r"<!--[^>]*Donnelley Financial Solutions", text, re.I):
+        return "Donnelley Financial Solutions", "comment: Donnelley Financial Solutions"
+    if re.search(r"<!--[^>]*RR Donnelley", text, re.I):
+        return "Donnelley Financial Solutions", "comment: RR Donnelley"
+
+    # Broadridge PROfile
+    if re.search(r"<!--[^>]*Broadridge PROfile", text, re.I):
+        return "Broadridge PROfile", "comment: Broadridge PROfile"
+    # Also match "Licensed to: ... Document created using Broadridge PROfile"
+    if "broadridge" in text_lower:
+        return "Broadridge PROfile", "keyword: broadridge"
+
+    # SEC Publisher (in title or comment)
+    m_title = re.search(r"<title[^>]*>([^<]+)</title>", text, re.I)
+    title_text = m_title.group(1).strip() if m_title else ""
+    if "sec publisher" in text_lower or "sec publisher" in title_text.lower():
+        return "SEC Publisher", "title/keyword: SEC Publisher"
+
+    # IRIS Carbon (various filing agents using IRIS Carbon platform)
+    m = re.search(r"<!--[^>]*Powered by IRIS Carbon", text, re.I)
+    if m:
+        # Extract the filing agent name before "Powered by IRIS Carbon"
+        m2 = re.search(r"<!--\s*([^>]*?)\s*[-|]?\s*Powered by IRIS Carbon", text, re.I)
+        agent = m2.group(1).strip() if m2 else ""
+        evidence = f"comment: {agent} Powered by IRIS Carbon" if agent else "comment: Powered by IRIS Carbon"
+        return "IRIS Carbon", evidence
+
+    # Certent Disclosure Management
+    if re.search(r"<!--[^>]*Certent Disclosure Management", text, re.I):
+        return "Certent", "comment: Certent Disclosure Management"
+    if "certent" in text_lower:
+        return "Certent", "keyword: certent"
+
+    # CompSci Resources, LLC
+    if re.search(r"<!--[^>]*CompSci Resources", text, re.I):
+        return "CompSci Transform", "comment: CompSci Resources"
+
+    # RDG Portal
+    if re.search(r"<!--[^>]*RDG Portal", text, re.I):
+        return "RDG Portal", "comment: RDG Portal"
+
+    # PDF to EDGAR
+    if title_text.lower() == "pdf to edgar" 
or "pdf to edgar" in text_lower[:2000]: + return "PDF to EDGAR", "title/keyword: PDF to EDGAR" + + # Generic generated/created by comments (but NOT bare dates) + m = re.search(r"", text, re.I) + if m: + val = m.group(1).strip() + if not re.match(r"^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}", val): + return _normalize_generator(val), f"comment: Generated by {val}" + m = re.search(r"", text, re.I) + if m: + val = m.group(1).strip() + if not re.match(r"^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}", val): + return _normalize_generator(val), f"comment: Created by/with {val}" + + # --- Keyword signatures in full text --- + + # 5. Workiva + if re.search(r"\bwdesk\b", text_lower): + return "Workiva", "keyword: wdesk" + if re.search(r"\bworkiva\b", text_lower): + return "Workiva", "keyword: workiva" + + # 6. Donnelley/DFIN + if re.search(r"\brrdonnelley\b", text_lower): + return "Donnelley Financial Solutions", "keyword: rrdonnelley" + if re.search(r"\bedgar-online\b", text_lower): + return "Donnelley Financial Solutions", "keyword: edgar-online" + + # 7. Toppan Merrill + if re.search(r"\btoppan\b", text_lower): + return "Toppan Merrill", "keyword: toppan" + if re.search(r"\bmerrill\b", text_lower) and re.search(r"\b(?:bridge|ixbrl|xbrl)\b", text_lower): + return "Toppan Merrill", "keyword: merrill + bridge/xbrl" + if re.search(r"\bbowne\b", text_lower): + return "Toppan Merrill", "keyword: bowne" + + # 8. CompSci Transform + if re.search(r"\bcompsci\b", text_lower): + return "CompSci Transform", "keyword: compsci" + + # 9. ThunderDome + if re.search(r"\bthunderdome\b", text_lower): + return "ThunderDome", "keyword: thunderdome" + + # 10. GoXBRL + if re.search(r"\bgoxbrl\b", text_lower): + return "GoXBRL", "keyword: goxbrl" + + # 16. 
CSS class naming patterns + if re.search(r'class\s*=\s*["\'][^"\']*\bwk_\w+', text_lower): + return "Workiva", "CSS class prefix: wk_" + + # --- SGML document wrapper detection --- + has_sgml = re.search(r"\s*\n?\s*", text, re.I) + if has_sgml: + m_fn = re.search(r"\s*([\w\-\.]+)", text, re.I) + if m_fn: + filename = m_fn.group(1).lower() + # d + digits = Donnelley Financial Solutions + if re.match(r"d\d+", filename): + return "Donnelley Financial Solutions", f"SGML filename: {m_fn.group(1)}" + # tm + digits = Toppan Merrill + if re.match(r"tm\d+", filename): + return "Toppan Merrill", f"SGML filename: {m_fn.group(1)}" + # ea + digits = EFiling/EDGAR Agent + if re.match(r"ea\d+", filename): + return "EFiling/EDGAR Agent", f"SGML filename: {m_fn.group(1)}" + + # SGML-wrapped but no known filename pattern — check for other signals inside + # Rule-Page comments = Broadridge/EFiling variant + if " or without xdx + if " + m = re.search(r'", text, re.I): + return "Workiva" + if re.search(r"", text, re.I): + return "Workiva" + if re.search(r"", text, re.I): + return "Workiva" + + if re.search(r"", text, re.I): + return "Toppan Merrill" + if re.search(r"", text, re.I): + return "Toppan Merrill" + + if re.search(r"", text, re.I): + return "Donnelley Financial Solutions" + if re.search(r"", text, re.I): + return "Donnelley Financial Solutions" + + if re.search(r"", text, re.I): + return "Broadridge PROfile" + if "broadridge" in text_lower: + return "Broadridge PROfile" + + m_title = re.search(r"]*>([^<]+)", text, re.I) + title_text = m_title.group(1).strip() if m_title else "" + if "sec publisher" in text_lower or "sec publisher" in title_text.lower(): + return "SEC Publisher" + + m = re.search(r"", text, re.I) + if m: + return "IRIS Carbon" + + if re.search(r"", text, re.I): + return "Certent" + if "certent" in text_lower: + return "Certent" + + if re.search(r"", text, re.I): + return "CompSci Transform" + + if re.search(r"", text, re.I): + return "RDG Portal" + + if 
title_text.lower() == "pdf to edgar" or "pdf to edgar" in text_lower[:2000]: + return "PDF to EDGAR" + + m = re.search(r"", text, re.I) + if m: + val = m.group(1).strip() + if not re.match(r"^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}", val): + return _normalize_generator(val) + m = re.search(r"", text, re.I) + if m: + val = m.group(1).strip() + if not re.match(r"^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}", val): + return _normalize_generator(val) + + # Keyword signatures + if re.search(r"\bwdesk\b", text_lower): + return "Workiva" + if re.search(r"\bworkiva\b", text_lower): + return "Workiva" + if re.search(r"\brrdonnelley\b", text_lower): + return "Donnelley Financial Solutions" + if re.search(r"\bedgar-online\b", text_lower): + return "Donnelley Financial Solutions" + if re.search(r"\btoppan\b", text_lower): + return "Toppan Merrill" + if re.search(r"\bmerrill\b", text_lower) and re.search(r"\b(?:bridge|ixbrl|xbrl)\b", text_lower): + return "Toppan Merrill" + if re.search(r"\bbowne\b", text_lower): + return "Toppan Merrill" + if re.search(r"\bcompsci\b", text_lower): + return "CompSci Transform" + if re.search(r"\bthunderdome\b", text_lower): + return "ThunderDome" + if re.search(r"\bgoxbrl\b", text_lower): + return "GoXBRL" + + if re.search(r'class\s*=\s*["\'][^"\']*\bwk_\w+', text_lower): + return "Workiva" + + # SGML document wrapper + has_sgml = re.search(r"\s*\n?\s*", text, re.I) + if has_sgml: + m_fn = re.search(r"\s*([\w\-\.]+)", text, re.I) + if m_fn: + filename = m_fn.group(1).lower() + if re.match(r"d\d+", filename): + return "Donnelley Financial Solutions" + if re.match(r"tm\d+", filename): + return "Toppan Merrill" + if re.match(r"ea\d+", filename): + return "EFiling/EDGAR Agent" + if "/i.test(text)) return "Workiva"; + if (//i.test(text)) return "Workiva"; + if (//i.test(text)) return "Workiva"; + + // Toppan Merrill / Bridge + if (//i.test(text)) + return "Toppan Merrill"; + if (//i.test(text)) return "Toppan Merrill"; + + // Donnelley Financial Solutions / RR 
Donnelley + if (//i.test(text)) + return "Donnelley Financial Solutions"; + if (//i.test(text)) return "Donnelley Financial Solutions"; + + // Broadridge PROfile + if (//i.test(text)) return "Broadridge PROfile"; + if (textLower.includes("broadridge")) return "Broadridge PROfile"; + + // SEC Publisher + const titleMatch = text.match(/]*>([^<]+)<\/title>/i); + const titleText = titleMatch ? titleMatch[1]!.trim() : ""; + if (textLower.includes("sec publisher") || titleText.toLowerCase().includes("sec publisher")) + return "SEC Publisher"; + + // IRIS Carbon + if (//i.test(text)) return "IRIS Carbon"; + + // Certent + if (//i.test(text)) return "Certent"; + if (textLower.includes("certent")) return "Certent"; + + // CompSci Resources + if (//i.test(text)) return "CompSci Transform"; + + // RDG Portal + if (//i.test(text)) return "RDG Portal"; + + // PDF to EDGAR + if (titleText.toLowerCase() === "pdf to edgar" || textLower.slice(0, 2000).includes("pdf to edgar")) + return "PDF to EDGAR"; + + // Generic generated/created by comments + m = text.match(//i); + if (m) { + const val = m[1]!.trim(); + if (!/^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}/.test(val)) return normalizeGenerator(val); + } + m = text.match(//i); + if (m) { + const val = m[1]!.trim(); + if (!/^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}/.test(val)) return normalizeGenerator(val); + } + + // --- Keyword signatures --- + + if (/\bwdesk\b/.test(textLower)) return "Workiva"; + if (/\bworkiva\b/.test(textLower)) return "Workiva"; + if (/\brrdonnelley\b/.test(textLower)) return "Donnelley Financial Solutions"; + if (/\bedgar-online\b/.test(textLower)) return "Donnelley Financial Solutions"; + if (/\btoppan\b/.test(textLower)) return "Toppan Merrill"; + if (/\bmerrill\b/.test(textLower) && /\b(?:bridge|ixbrl|xbrl)\b/.test(textLower)) + return "Toppan Merrill"; + if (/\bbowne\b/.test(textLower)) return "Toppan Merrill"; + if (/\bcompsci\b/.test(textLower)) return "CompSci Transform"; + if (/\bthunderdome\b/.test(textLower)) return 
"ThunderDome"; + if (/\bgoxbrl\b/.test(textLower)) return "GoXBRL"; + + // CSS class naming patterns + if (/class\s*=\s*["'][^"']*\bwk_\w+/.test(textLower)) return "Workiva"; + + // --- SGML document wrapper detection --- + const hasSgml = /\s*\n?\s*/i.test(text); + if (hasSgml) { + const fnMatch = text.match(/\s*([\w\-\.]+)/i); + if (fnMatch) { + const filename = fnMatch[1]!.toLowerCase(); + if (/^d\d+/.test(filename)) return "Donnelley Financial Solutions"; + if (/^tm\d+/.test(filename)) return "Toppan Merrill"; + if (/^ea\d+/.test(filename)) return "EFiling/EDGAR Agent"; + } + + if ( + textLower.includes("