DAPT and precleaning for DAPT

This commit is contained in:
Joey Eamigh 2026-03-29 20:33:39 -04:00
parent c4d7732c87
commit 9d41dd199f
No known key found for this signature in database
GPG Key ID: CE8C05DFFC53C9CB
31 changed files with 7350 additions and 61 deletions

# Codebook Rationale & Interpretive Guide
Companion to `LABELING-CODEBOOK.md`. Covers the "why" behind design decisions and common interpretive pitfalls that aren't obvious from the codebook itself.
---
## Category Design: Mapping to SEC Regulation S-K Item 106
The six substantive categories map directly to the structure of the SEC's cybersecurity disclosure rule (adopted July 2023):
| Codebook Category | SEC Basis | What the SEC is asking |
|---|---|---|
| Board Governance | Item 106(c)(1) | How does the board oversee cyber risk? |
| Management Role | Item 106(c)(2) | Who in management is responsible, and what qualifies them? |
| Risk Management Process | Item 106(b) | What processes do you use to assess, identify, and manage cyber risk? |
| Third-Party Risk | Item 106(b) | How do you handle vendor/supply chain cyber risk? |
| Strategy Integration | Item 106(b)(2) | Has cyber risk materially affected your business or financials? |
| Incident Disclosure | 8-K Item 1.05 | What happened in an actual cybersecurity incident? |
| None/Other | N/A | Classifier catch-all for non-substantive content |
### Editorial choice: Third-Party Risk as a separate category
The SEC does not give Third-Party Risk its own subsection — vendor/supply chain oversight is part of 106(b) alongside general risk management. The codebook carves it out as a distinct class because it represents a sufficiently different disclosure pattern to be analytically useful.
### "Risk Management" is broader than it sounds
The SEC's 106(b) definition of risk management encompasses the full lifecycle: assessing, identifying, **and managing** cybersecurity risks. Under frameworks like NIST CSF (which the SEC references), "managing" includes Respond and Recover functions — not just preventive controls.
This means incident response **procedures** (escalation chains, playbooks, notification workflows, materiality determination processes) are Risk Management Process, not Incident Disclosure. The test:
| What the paragraph describes | Category |
|---|---|
| Pre-established process for handling incidents (playbooks, escalation chains, "in the event of...") | **Risk Management Process** |
| An actual incident that occurred (dates, scope, remediation of a real event) | **Incident Disclosure** |
Conditional language ("in the event of," "if necessary," "if and when") is a strong signal that the paragraph describes a process, not an event.
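As a rough illustration, the conditional-language signal can be mechanized with a small pattern check. This is a sketch, not part of the labeling pipeline; the cue list below is only the phrases named above, and the function name is illustrative:

```python
import re

# Illustrative cue list: the phrases the guide names as strong
# process signals ("in the event of," "if necessary," "if and when").
CONDITIONAL_CUES = re.compile(
    r"\b(in the event of|if necessary|if and when)\b", re.IGNORECASE
)

def looks_like_process(paragraph: str) -> bool:
    """True when conditional language suggests a pre-established process
    (Risk Management Process) rather than an actual incident
    (Incident Disclosure)."""
    return CONDITIONAL_CUES.search(paragraph) is not None
```

A real annotator would treat this as one signal among several, not a decision rule on its own.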
### "Strategy Integration" is narrower than it sounds
Strategy Integration does not mean "strategic approach to cybersecurity." It specifically covers the **business and financial consequences** of cyber risk — the SEC 106(b)(2) question of whether cyber risk hit the bottom line or changed business strategy.
What qualifies:
- Materiality assessments ("have not materially affected our business strategy, results of operations, or financial condition")
- Cybersecurity spending and investment (budgets, dollar amounts, year-over-year changes)
- Insurance coverage (carriers, limits, deductibles)
- Financial impact of incidents (costs, revenue loss, insurance claims)
What does not qualify:
- Describing a sophisticated incident response process (that's Risk Management Process even though it's "strategic" in the colloquial sense)
- Describing a materiality **determination process** (the process for deciding if something is material is Risk Management Process; the actual materiality **conclusion** is Strategy Integration)
---
## Specificity Scale: Design Rationale
### The four levels measure disclosure quality progression
| Level | What it tells you |
|---|---|
| 1 — Generic Boilerplate | Company said nothing substantive. Could paste into any filing unchanged. |
| 2 — Sector-Adapted | Company name-dropped a recognized standard (NIST, ISO 27001, SOC 2, etc.) but nothing unique to their organization. |
| 3 — Firm-Specific | Company disclosed at least one fact unique to their organization. |
| 4 — Quantified-Verifiable | Company disclosed two or more independently verifiable hard facts. |
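The four levels reduce to a simple precedence rule. As a sketch (the boolean and count inputs are illustrative abstractions, not fields in the codebook):

```python
def specificity_level(named_standard: bool, firm_specific_facts: int,
                      verifiable_facts: int) -> int:
    """Map the four-level specificity scale to simple counts.
    Illustrative decision rule, not the annotation procedure itself."""
    if verifiable_facts >= 2:
        return 4  # Quantified-Verifiable: two or more hard facts
    if firm_specific_facts >= 1:
        return 3  # Firm-Specific: at least one unique fact
    if named_standard:
        return 2  # Sector-Adapted: named a standard, nothing firm-specific
    return 1      # Generic Boilerplate
```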
### "Sector-Adapted" refers to the cybersecurity sector, not the company's industry
The name is misleading. "Sector-Adapted" does not mean "the company adapted its disclosure to its industry" (e.g., a bank discussing financial-sector cyber risks). It means the company referenced a recognized **cybersecurity** standard or framework — NIST CSF, ISO 27001, SOC 2, PCI DSS, HIPAA, etc. The "sector" is cybersecurity itself. A utility company mentioning NERC CIP and a retailer mentioning PCI DSS both qualify for Level 2 the same way — they named a standard. The company's own industry is irrelevant to the specificity score.
### Level 2 is intentionally narrow
Level 2 requires naming a recognized standard but having zero firm-specific facts. In practice this is uncommon — most filings either say nothing specific (Level 1) or name a framework alongside a CISO or named committee in the same paragraph (Level 3).
This is a feature, not a bug. The analytically interesting distinction is between Level 1 (boilerplate box-checking) and Level 3/4 (substantive disclosure). Level 2 is a real but thin middle ground. A mushier middle would make the classifier's job harder without adding research value.
### The research contribution is the specificity dimension itself
The SEC requires cybersecurity disclosure but does not grade its quality. The 1-4 specificity scale measures something the SEC doesn't: how much substance is actually in the disclosure versus boilerplate. The core research question is whether companies are genuinely disclosing or just filling the regulatory box.
### Common specificity pitfalls
**Generic practices are not specific.** Penetration testing, vulnerability scanning, tabletop exercises, phishing simulations, security awareness training, encryption, logging and monitoring — all Level 1. These are standard activities that appear in nearly every filing.
**Long paragraphs can still be Level 1.** A paragraph can list ten generic security practices and still be boilerplate. Length and detail are not the same as specificity.
**Cross-references and section titles don't add specificity.** Quoting a long Risk Factors section title with specific-sounding language ("collaborators, contract research organizations, third-party logistics providers") is just metadata, not disclosure substance.
**The materiality boilerplate is Level 1.** The phrase "have not materially affected, and are not reasonably likely to materially affect, our business strategy, results of operations, or financial condition" appears nearly verbatim in thousands of filings. It is Strategy Integration (it makes a materiality assessment) but Specificity 1 (the assessment is template language).

docs/DAPT-PROCEDURE.md
# DAPT/TAPT Training Procedure
**Date:** 2026-03-29
**Hardware:** NVIDIA RTX 3090 (24GB VRAM), CUDA driver 13.2, PyTorch 2.10.0+cu128
---
## Pre-flight Checklist
| Check | Status |
|-------|--------|
| PyTorch 2.10.0+cu128, CUDA available | Verified |
| RTX 3090, 25.3 GB VRAM, bf16 supported | Verified |
| CUDA driver 13.2 / runtime 12.8 forward compatible | Verified (GPU matmul test passed) |
| ModernBERT-large loads: 396M params, max_position_embeddings=8192 | Verified |
| Corpus: 14,756 docs, ~1.06B tokens, 15 shards | Verified |
| After <10K filter: 14,568 docs, ~1.056B tokens (0.027% loss) | Verified |
| Tokenize+chunk pipeline: 10 docs -> 85 sequences of 8192 tokens | Verified |
| Config: seq_len=8192, batch=1, grad_accum=32, 1 epoch, lr=5e-5, mlm=0.30 | Set |
## DAPT Corpus Summary
- **14,568 documents** (after filtering 188 cover pages <10K chars)
- **~1.056 billion tokens** (ModernBERT tokenizer, 4.72 chars/token)
- **~136K training sequences** at seq_len=8192
- **Median document: ~73K tokens** (347K chars) — 90.6% of docs exceed 8192 tokens
- Cleaned: XBRL data blobs stripped, exhibit listings stripped, URLs removed, F-N page numbers removed
- Source: 14,759 cached 10-K HTML filings, FY2023-FY2025, processed by `ts/scripts/dapt-corpus-prep.ts`
## Training Configuration
**Config file:** `python/configs/dapt/modernbert.yaml`
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| `max_seq_length` | 8192 | Match ModernBERT's pre-training context length |
| `per_device_train_batch_size` | 1 | Memory-limited at 8192 seq_len on 24GB |
| `gradient_accumulation_steps` | 32 | Effective batch size = 32 |
| `num_train_epochs` | 1 | Single pass per Gururangan et al. (2020) and Ponnock (2025) |
| `learning_rate` | 5e-5 | Standard for continued pre-training |
| `mlm_probability` | 0.30 | ModernBERT's pre-training masking rate |
| `warmup_ratio` | 0.05 | ~213 warmup steps |
| `gradient_checkpointing` | true | Required for 8192 seq_len on 24GB |
| `bf16` | true | Native RTX 3090 support |
| `save_steps` | 1000 | Checkpoint every ~1000 steps |
| `eval_steps` | 1000 | Evaluate every ~1000 steps |
| `save_total_limit` | 3 | Keep last 3 checkpoints |
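The table above corresponds to a config along these lines. This is a sketch of `python/configs/dapt/modernbert.yaml`, not the file itself; key names follow HuggingFace `TrainingArguments` conventions and may differ from the actual schema:

```yaml
model_name: answerdotai/ModernBERT-large
max_seq_length: 8192
mlm_probability: 0.30
per_device_train_batch_size: 1
gradient_accumulation_steps: 32   # effective batch size = 32
num_train_epochs: 1
learning_rate: 5.0e-5
warmup_ratio: 0.05
gradient_checkpointing: true
bf16: true
save_steps: 1000
eval_steps: 1000
save_total_limit: 3
```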
### Epoch Decision Justification
We train for 1 epoch (single pass over the corpus), following the empirical consensus:
- **Gururangan et al. (2020), "Don't Stop Pretraining" (ACL 2020):** Trained DAPT for "12.5K steps, which amounts to a single pass on each domain dataset" across corpora ranging from 2-8B tokens. A single pass was sufficient for consistent downstream gains across all four domains and eight tasks.
- **Ponnock (2025), "The Data Efficiency Frontier of Financial Foundation Models" (arXiv:2512.12384):** Found that SEC-specific DAPT exhibits diminishing marginal returns beyond ~250M tokens within a single epoch: "Both models exhibit their largest improvements in the early stages of continued pretraining: loss drops noticeably between 50M and 200M tokens, after which the rate of improvement slows." Our ~1B token corpus is already well past the diminishing-returns threshold.
Additional epochs risk overfitting to the domain corpus without proportional downstream benefit, while general-domain capability remains stable through a single pass.
### Sequence Length Decision
ModernBERT was pre-trained with an 8192-token context. We match this during DAPT so that adaptation exercises the model's full context window. ModernBERT uses rotary position embeddings rather than learned absolute positions, so there are no per-position weights to go stale, but at seq_len=2048 the attention behavior over positions 2048-8191 would never be exercised (or receive gradients) during DAPT.
The tradeoff is memory: batch_size drops from 4 (at 2048) to 1 (at 8192), compensated by gradient_accumulation=32 to maintain effective batch size of 32. Training time is comparable because 4x fewer steps offset the slower per-step time.
For our downstream task (paragraph classification at ~50-400 tokens), the long-context benefit is modest — the primary DAPT benefit is vocabulary and domain language patterns, which transfer at any sequence length. But there is no cost to using 8192, so we preserve the model's full capability.
## Step 1: DAPT
### Command
```bash
cd python
bun run py:train dapt --config configs/dapt/modernbert.yaml
```
Equivalent to: `uv run main.py dapt --config configs/dapt/modernbert.yaml`
### What happens
1. Loads ModernBERT-large from HuggingFace (cached after first download)
2. Loads 14,756 docs from `data/dapt-corpus/`, filters 188 < 10K chars
3. Tokenizes all text, concatenates, chunks into ~136K sequences of 8192 tokens
4. Splits 2% validation (~2,700 sequences), 98% train (~133K sequences)
5. Trains 1 epoch of MLM with 30% masking, bf16, gradient checkpointing
6. ~4,257 steps total, logging every 50, checkpoint+eval every 1,000
7. Saves final model + tokenizer to `checkpoints/dapt/modernbert-large/final/`
8. Reports final eval loss and perplexity
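Step 3 is the standard concatenate-then-chunk MLM packing scheme. A minimal sketch with toy token ids (the actual implementation lives in `python/src/data/corpus.py`; the function name here is illustrative):

```python
def pack_sequences(docs: list[list[int]], seq_len: int) -> list[list[int]]:
    """Concatenate tokenized docs into one token stream, then slice it
    into fixed-length chunks; the trailing remainder is dropped."""
    stream = [tok for doc in docs for tok in doc]
    n_chunks = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_chunks)]
```

Packing across document boundaries is what lets 90%+ of docs longer than 8,192 tokens contribute every token, rather than being truncated per document.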
### Expected duration
~4-8 hours on RTX 3090 (depends on actual seconds/step at 8192 with gradient checkpointing).
### Resume if interrupted
HuggingFace Trainer auto-saves checkpoints every 1,000 steps. Re-run the same command — it detects existing checkpoints and resumes automatically.
### Output
```
checkpoints/dapt/modernbert-large/
checkpoint-1000/
checkpoint-2000/
checkpoint-3000/
final/ <- final model + tokenizer
config.json
model.safetensors
tokenizer.json
...
```
## Step 2: TAPT
After DAPT completes, continue MLM on the 72K Item 1C paragraphs specifically.
### Command
```bash
bun run py:train dapt --config configs/dapt/modernbert.yaml \
--model-path ../checkpoints/dapt/modernbert-large/final \
--data-path ../data/paragraphs/paragraphs-clean.patched.jsonl \
--output-dir ../checkpoints/tapt/modernbert-large \
--stage tapt
```
### What happens
1. Loads the DAPT checkpoint (not the base ModernBERT)
2. Loads 72,045 patched paragraphs from `paragraphs-clean.patched.jsonl`
3. Tokenizes, concatenates, chunks (much smaller corpus — ~10M tokens)
4. Trains MLM with same hyperparameters
5. Saves to `checkpoints/tapt/modernbert-large/final/`
### Expected duration
~2-3 hours (much smaller corpus).
### Output
```
checkpoints/tapt/modernbert-large/
final/ <- SEC-cyBERT-large (DAPT + TAPT)
```
## Step 3: Ablation Checkpoints
The training pipeline produces clean ablation rows for the paper:
| Model | Checkpoint | Description |
|-------|-----------|-------------|
| Base | `answerdotai/ModernBERT-large` | Off-the-shelf, no domain adaptation |
| +DAPT | `checkpoints/dapt/modernbert-large/final` | After domain pre-training on 14.5K filings |
| +DAPT+TAPT | `checkpoints/tapt/modernbert-large/final` | After task pre-training on 72K paragraphs |
Each checkpoint can be independently fine-tuned with classification heads to isolate the contribution of each pre-training stage.
## Monitoring
During training, the Trainer logs to stderr every 50 steps:
- `loss` — training MLM loss (cross-entropy on masked tokens)
- `learning_rate` — current LR (ramps up during warmup, then decays)
- `epoch` — progress through the epoch
Every 1,000 steps, it also reports:
- `eval_loss` — validation MLM loss
- Perplexity can be computed as `exp(eval_loss)` (the Trainer's MLM loss is natural-log cross-entropy)
**What to watch for:**
- Training loss should decrease steadily from ~2.5-3.0 to ~1.5-2.0
- Eval loss should track training loss (if eval loss diverges upward, the model is overfitting — but this is unlikely in 1 epoch)
- If loss spikes or goes to NaN, the learning rate may be too high
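Since the reported loss is natural-log cross-entropy, perplexity is just its exponential. A one-liner for post-processing the Trainer's log history (helper name illustrative):

```python
import math

def perplexity(eval_loss: float) -> float:
    """HuggingFace MLM losses are natural-log cross-entropy,
    so perplexity = e^loss."""
    return math.exp(eval_loss)
```

For example, an eval loss of 1.6 corresponds to a perplexity of about 4.95.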
## Artifacts
| File | Purpose |
|------|---------|
| `python/configs/dapt/modernbert.yaml` | DAPT config |
| `python/configs/dapt/neobert.yaml` | NeoBERT config (if needed) |
| `python/main.py` | CLI entrypoint |
| `python/src/dapt/train.py` | Training loop |
| `python/src/data/corpus.py` | Corpus loading + tokenization |
| `python/src/common/config.py` | Typed YAML config |
| `ts/scripts/dapt-corpus-prep.ts` | Corpus preparation from HTML |
| `ts/scripts/dapt-corpus-analytics.ts` | Corpus analytics |
| `data/dapt-corpus/shard-*.jsonl` | Cleaned corpus (15 shards) |

docs/DATA-QUALITY-AUDIT.md
# Data Quality Audit — SEC-cyBERT Corpus
**Date:** 2026-03-29
**Scope:** Full audit of DAPT corpus (14,756 docs) and paragraph data (72,045 paragraphs)
**Method:** 6 automated agents + manual investigation
---
## 1. Executive Summary
The data is in better shape than initially feared, but two significant issues were uncovered:
1. **Inlined section headings affect ~22% of paragraphs** across all generators. These are section titles ("Risk Management and Strategy", "Board Oversight") prepended to paragraph body text with no separator. Because the rate is consistent across generators, the cause is our extraction pipeline's heading detection, not a generator HTML quirk.
2. **EFiling/EDGAR Agent (GoFiler/Novaworks XDX)** produces severely degraded extraction quality: 36.8% orphan word rate (8x corpus average), 5.9% fragment rate, lowest paragraphs-per-filing. This generator was hidden in a 45% "UNKNOWN" bucket until we identified it. It affects 1,014 filings and 5,779 paragraphs.
**Decision:** Strip inlined headers from fine-tuning data. Expand orphan word patching to cover EFiling/XDX paragraphs. Tag all paragraphs with generator metadata for quality-aware training.
---
## 2. Generator Landscape
### Identification
We identified **14 distinct filing generators** covering 99.99% of all 14,759 HTML files. Only 2 files remain unidentified (both 0-byte empty files). Detection used a combination of HTML meta tags, comments, namespace declarations, CSS class patterns, and CIK-based filing agent identification.
Full reference: `docs/EDGAR-FILING-GENERATORS.md`
### Generator Distribution
| Generator | Files | % | Paragraphs | Quality Tier |
|-----------|-------|---|------------|-------------|
| Workiva | 3,592 | 24.3% | 22,407 | Clean |
| Inline XBRL (unattributed) | 2,417 | 16.4% | 15,233 | Clean |
| Donnelley Financial Solutions | 2,327 | 15.8% | 13,153 | Clean |
| EFiling/EDGAR Agent (XDX) | 1,997 | 13.5% | 5,779 | **Bad** |
| Toppan Merrill | 1,378 | 9.3% | 7,332 | OK |
| CompSci Transform | 879 | 6.0% | 3,287 | **Degraded** |
| SEC Publisher | 793 | 5.4% | — | — |
| ThunderDome | 732 | 5.0% | 3,581 | OK |
| Broadridge PROfile | 465 | 3.2% | 772 | OK |
| Certent | 86 | 0.6% | — | — |
| SGML-wrapped | 58 | 0.4% | — | — |
| IRIS Carbon | 20 | 0.1% | — | — |
| RDG Portal | 12 | 0.1% | — | — |
| PDF to EDGAR | 1 | <0.1% | — | — |
Note: Not all HTML files produced paragraphs (some lack Item 1C, some are 8-Ks or amendments).
### Quality Metrics by Generator
| Generator | Orphan% | Fragment% | Trunc% | InlHdr% | AvgWC | Paras/Filing |
|-----------|---------|-----------|--------|---------|-------|-------------|
| Workiva | 0.6% | 1.2% | 0.5% | 21.9% | 99.7 | 8.4 |
| Donnelley | 0.5% | 1.4% | 0.5% | 21.8% | 92.7 | 7.9 |
| Inline XBRL | 0.9% | 1.5% | 0.6% | 21.8% | 98.4 | 8.1 |
| Toppan Merrill | 3.2% | 3.0% | 1.4% | 23.1% | 84.7 | 8.1 |
| ThunderDome | 3.0% | 4.3% | 1.8% | 24.4% | 83.0 | 7.7 |
| Broadridge | 3.4% | 3.5% | 2.1% | 21.5% | 84.4 | 7.8 |
| **CompSci Transform** | **14.8%** | **5.8%** | 1.7% | 15.4% | 72.1 | 5.6 |
| **EFiling/XDX** | **36.8%** | **5.9%** | **2.1%** | 16.5% | 69.8 | 5.7 |
| *Corpus average* | *4.7%* | *2.3%* | *0.9%* | *21.5%* | *91.9* | *7.7* |
**Bold** = >2x corpus average.
Key observations:
- Inlined headers (~22%) are consistent across ALL generators → extraction pipeline issue, not generator-specific
- Orphan words are highly concentrated: EFiling/XDX (36.8%) and CompSci Transform (14.8%) account for the vast majority
- Workiva and Donnelley produce the cleanest output (>70% of paragraphs)
- EFiling/XDX also has the lowest paragraphs-per-filing (5.7 vs 7.7 avg), suggesting extraction misses content
- CompSci Transform was acquired by Broadridge in July 2024; newer filings may appear as Broadridge PROfile
---
## 3. Issue Inventory
### 3.1 Inlined Section Headings (~22% of paragraphs)
**What:** Section headings like "Risk Management and Strategy", "Board Oversight", "Cybersecurity Governance" are prepended to paragraph body text with no separator.
**Example:**
```
Risk Management and Strategy We have designed our cybersecurity risk management program to identify,
assess, and manage risks from cybersecurity threats...
```
**Cause:** The `extractItem1C()` function in `fast-reparse.ts` extracts the full Item 1C text including sub-section headings, and the paragraph segmenter doesn't strip them. The headings become the first "sentence" of the paragraph.
**Impact on classification:**
- The heading is a near-perfect predictor of `content_category` — creates shortcut learning risk
- The heading tells you nothing about `specificity_level` — model still has to read body text
- At inference time, heading presence will be inconsistent across filings
- **Decision: Strip from fine-tuning data.** Headings are consistent across generators, so a single detection heuristic works.
**Detection heuristic:**
- Common Item 1C sub-headings: "Risk Management and Strategy", "Risk Management", "Board Oversight", "Governance", "Management('s) Role", "Cybersecurity Governance", "Incident Detection", "Incident Response", "Strategy", "Third Party", "Third-Party"
- Structural: 2-5 title-cased words at paragraph start, followed by sentence text starting with "We", "Our", "The", a pronoun, or an article
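The structural half of this heuristic can be sketched as follows (word lists illustrative; the actual patches used explicit heading patterns plus this kind of sentence-start validation):

```python
SENTENCE_STARTERS = {"We", "Our", "The", "This", "A", "An", "It"}

def strip_inlined_heading(text: str) -> str:
    """Try heading lengths of 2-5 title-cased words (allowing 'and'/'of');
    strip the heading only if the remainder starts with a
    sentence-starting word."""
    words = text.split()
    for n in range(2, 6):
        if len(words) <= n:
            break
        head, rest = words[:n], words[n:]
        if all(w[0].isupper() or w in ("and", "of") for w in head) \
                and rest[0] in SENTENCE_STARTERS:
            return " ".join(rest)
    return text
```

The sentence-start check is what prevents false positives on paragraphs that legitimately open with several title-cased words (e.g., a company name).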
### 3.2 Orphan Words (4.7% overall, concentrated in 2 generators)
**What:** The first word of a paragraph is dropped during extraction, leaving a paragraph that starts with lowercase mid-sentence.
**Example:**
```
sole executive officer and director is responsible for assessing and managing cybersecurity risks...
```
(should be: "Our sole executive officer...")
**Cause:** HTML source wraps text at fixed column width. The `<span>` opening tag consumes most of a line, so only the first word fits before a source newline. `stripHtml()` preserves that newline, and downstream processing drops the single-word fragment.
**Scope by generator:**
- EFiling/XDX: 36.8% of its paragraphs (2,127 affected)
- CompSci Transform: 14.8% (487 affected)
- All others: <3.5%
- Total: ~3,400 paragraphs corpus-wide
**Already patched:** 215 paragraphs were surgically patched in `paragraphs-clean.patched.jsonl`. The remaining ~3,185 need the same treatment.
**Impact on classification:** Meaning is preserved — annotators and models can infer the missing word from context. But systematically missing subjects ("We", "Our") could subtly bias specificity assessment.
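The HTML-lookback repair later used for patching can be sketched like this (simplified; the real `patch-orphan-words.ts` also filters "Table of Contents" artifacts and legitimate lowercase list items):

```python
def restore_orphan_word(paragraph: str, stripped_html: str) -> str:
    """If a paragraph starts lowercase, locate it in the stripped HTML
    and prepend the single word that immediately precedes it."""
    if not paragraph or not paragraph[0].islower():
        return paragraph
    idx = stripped_html.find(paragraph)
    if idx <= 0:
        return paragraph
    preceding = stripped_html[:idx].split()
    if not preceding:
        return paragraph
    candidate = preceding[-1]
    # Only accept a capitalized word (e.g. "We", "Our") as the lost subject.
    if candidate[0].isupper():
        return f"{candidate} {paragraph}"
    return paragraph
```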
### 3.3 Orphaned Fragments (2.3% overall)
**What:** List items split from their parent paragraph, creating very short standalone paragraphs.
**Example:**
```
the use of external service providers, where appropriate, to assess, test or otherwise assist with
aspects of our security controls;
```
**Cause:** Semicolon-terminated list items are treated as paragraph boundaries by the segmenter.
**Scope:** 250 paragraphs identified in the narrower audit; ~1,660 total with <25 words.
**Impact:** These are classifiable in isolation (the content is clear) but lack the framing context of the parent list. Likely annotated correctly but may have lower model confidence.
### 3.4 Truncated Paragraphs (0.37%)
**What:** Paragraphs ending mid-sentence without terminal punctuation.
**Two patterns:**
1. Paragraph absorbed the start of the next section's heading (ends with "Governance", "Identify")
2. True truncation — a cross-reference sentence cut off mid-phrase (e.g., ending with `..."Risk Factors" in this`)
**Scope:** 264 paragraphs.
**Impact:** Low — 0.37% and meaning is usually recoverable from context.
### 3.5 Cross-Filing Boilerplate (53.6%)
**What:** Paragraphs with identical text appearing in multiple filings. Driven by law firms and compliance consultants providing template language.
**Scope:** 38,601 paragraphs share text with at least one other filing. 1,705 unique boilerplate texts appear in 3+ filings. The most-duplicated text appears in 138 filings across 84 companies.
**Impact:** This IS the construct being measured. Boilerplate paragraphs should be classified as Specificity Level 1 (Generic Boilerplate). Not a quality issue — it's the signal.
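Cross-filing duplication of this kind is cheap to measure. A sketch using exact-text matching (the audit may additionally normalize or hash text; the function name is illustrative):

```python
def boilerplate_stats(paragraphs: list[tuple[str, str]]) -> tuple[int, int]:
    """paragraphs: (filing_id, text) pairs.
    Returns (# paragraphs whose text appears in 2+ filings,
             # unique texts appearing in 3+ filings)."""
    filings_per_text: dict[str, set[str]] = {}
    for filing_id, text in paragraphs:
        filings_per_text.setdefault(text, set()).add(filing_id)
    shared = sum(1 for _, text in paragraphs
                 if len(filings_per_text[text]) >= 2)
    heavy = sum(1 for filings in filings_per_text.values()
                if len(filings) >= 3)
    return shared, heavy
```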
---
## 4. DAPT Corpus Audit
### 4.1 Corpus Stats
- **14,756 documents**, 15 shards
- **~1.06 billion tokens** (ModernBERT tokenizer; chars/4.72, not chars/4.0)
- **Median doc length:** 347K chars (~73K tokens)
- **90.8% of docs exceed 8,192 tokens** — chunking is mandatory (handled by training pipeline)
### 4.2 Issues Found
| Issue | Scope | Verdict |
|-------|-------|---------|
| 188 docs < 10K chars (cover pages) | 0.04% of tokens | Filter out |
| XBRL preambles (8% of docs) | 0.18% of chars | Negligible |
| Financial table fragments (~25% of lines) | Widespread | Acceptable — SEC domain includes numbers |
| URLs in 80% of docs (~4 per doc) | Low | Optional cleanup |
| 64 8-K filings mixed in | Tiny | Keep — domain-relevant |
| 1,470 amendments (median 94K chars) | Substantial content | Keep |
| 2 single-block docs (no paragraph breaks) | 2 docs | Filter out |
| 242 near-duplicate cross-year filings | 1.6% | Keep — different content |
| 0 garbled text, 0 HTML artifacts | | Clean |
| 0 sentence boundary violations | | Clean |
### 4.3 Decision
Filter <10K char docs and 2 structureless docs. Everything else is acceptable for unsupervised MLM. The model will learn SEC language including financial notation, legal boilerplate, and cybersecurity terminology.
---
## 5. Patch History
### Patch 1: Orphan Word Fix (2026-03-29)
- **Scope:** 215 paragraphs, 77 filings
- **Method:** Detect orphan word in raw HTML, prepend to paragraph text
- **Validation:** All prefix additions, 0 boundary changes, 0 text shrinkages
- **Files:** `paragraphs-clean.patched.jsonl`, `training.patched.jsonl`
- **Annotation impact:** 142 annotated paragraphs affected (0.28%), meaning preserved
### Patch 2: Expanded Orphan Word Fix (2026-03-29)
- **Scope:** 2,233 paragraphs (includes Patch 1's 215; net 2,026 new)
- **Method:** HTML lookback — find paragraph text in stripped HTML, extract preceding word
- **Top orphan words:** We (632), Our (403), As (152), The (91), To (84), In (78), Cybersecurity (64)
- **Validation:** 0 false positives after filtering "Table of Contents" artifacts. 1,122 candidates rejected (legitimate list items starting with lowercase).
- **Annotation impact:** 1,400 annotated paragraphs affected. Label bias detected: Strategy Integration 1.55x over-represented, Management Role 0.49x under-represented in orphan-word paragraphs. **Recommended: re-run Stage 1 on patched text (~$15-20, may resolve conflicts).**
- **Script:** `ts/scripts/patch-orphan-words.ts`
- **Patch file:** `data/paragraphs/patches/orphan-word-patches.jsonl`
### Patch 3: Heading Stripping (2026-03-29)
- **Scope:** 7,514 paragraphs (10.4%)
- **Method:** Explicit pattern matching against known Item 1C sub-section headings (71 unique headings). Validated by confirming body text starts with sentence-starting word.
- **Top headings stripped:** Risk Management and Strategy (2,453), Cybersecurity Risk Management and Strategy (1,281), Cybersecurity Governance (1,208), Governance (301), Third-Party Risk Management (224)
- **Annotation impact:** 5,013 annotated paragraphs. Heading removal eliminates shortcut learning risk (heading was near-perfect predictor of content_category).
- **Script:** Inline Python (see audit process notes)
- **Patch file:** `data/paragraphs/patches/heading-strip-patches.jsonl`
### Patch 4: Colon-Headed Paragraphs (2026-03-29)
- **Scope:** 370 paragraphs
- **Method:** Regex match for "Heading Text: Sentence..." patterns. Only fires when colon is followed by known sentence-starting word.
- **Top headings stripped:** Education and Awareness (97), Safeguards (18), Management (15), Approach (13), Training (11)
- **Annotation impact:** 227 annotated paragraphs.
- **Patch file:** `data/paragraphs/patches/colon-heading-patches.jsonl`
### Patch 5: Extended Separator Headings (2026-03-29)
- **Scope:** 184 paragraphs
- **Method:** Detect headings with period, dash/em-dash, semicolon, or ALL-CAPS separators that Patches 3-4 missed.
- **Annotation impact:** 133 annotated paragraphs.
- **Patch file:** `data/paragraphs/patches/heading-strip-v2-patches.jsonl`
### Patch 6: HTML-Confirmed Headings (2026-03-29)
- **Scope:** 343 paragraphs
- **Method:** Extract bold/underline/h-tag styled text from source HTML (cached in `filing-headings.jsonl`), match against paragraph starts, validate with sentence-start check. Zero false positives — if the HTML says it's bold, it's a heading.
- **855 ambiguous cases rejected** where styled text was a sentence subject (e.g., bold "Cybersecurity" starting "Cybersecurity is a critical component...")
- **Annotation impact:** 270 annotated paragraphs.
- **Scripts:** `ts/scripts/extract-html-headings.ts` (1.7s for 6,341 filings with 32 workers)
- **Patch file:** `data/paragraphs/patches/heading-strip-html-patches.jsonl`
- **Cache:** `data/paragraphs/quality/filing-headings.jsonl`
### Cumulative Heading Strip Summary
| Pass | Method | Count | Cumulative |
|------|--------|-------|-----------|
| Patch 3 | Explicit heading patterns (space separator) | 7,514 | 7,514 |
| Patch 4 | Colon separator | 370 | 7,884 |
| Patch 5 | Period/dash/caps/semicolon | 184 | 8,068 |
| Patch 6 | HTML bold/underline confirmed | 343 | 8,411 |
| **Total** | | **8,411** | **11.7% of corpus** |
---
## 6. Data Integrity Rules
1. **`paragraphs-clean.jsonl` is FROZEN.** Never modify. It is the original extraction output and the source of truth for reproducibility.
2. **All fixes go through `.patched.jsonl` files.** The patched file has the same schema and IDs as the original. Text may differ. TextHash is updated.
3. **Annotations link by paragraph `id` (UUID).** This linkage is stable across patches — IDs never change.
4. **Never re-run extraction from HTML.** Cascade effects from merge logic changes cause thousands of ripple-effect text changes (documented in `docs/SEC-HTML-CLEANING.md`). Surgical JSONL patching is the only safe approach.
5. **Every patch is documented** with scope, method, validation, and annotation impact.
6. **Quality metadata is separate from text data.** Per-paragraph quality scores live in a separate file, not embedded in the paragraph data. This keeps the data schema stable.
---
## 7. Quality Tier System
Each paragraph gets a quality tier based on detected issues:
| Tier | Criteria | Count | % | Training Action |
|------|----------|-------|---|-----------------|
| **clean** | No detected issues | 58,165 | 80.7% | Full weight (1.0) |
| **headed** | Had inlined section heading (now stripped) | 7,402 | 10.3% | Full weight (1.0) — heading removed |
| **degraded** | Embedded bullets (1,941), invisible merges (222), fragments, truncations, no-cyber | 4,331 | 6.0% | Downweight (0.5) — content preserved but structure degraded |
| **minor** | Had orphan word (now fixed) | 2,147 | 3.0% | Full weight (1.0) — word restored |
Note: Tiers reflect the most severe issue. A paragraph can have multiple issues. All "headed" and "minor" paragraphs have been patched — the tier records what WAS wrong, not what IS wrong.
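Because tiers record the most severe issue, assignment is a precedence cascade. One possible sketch (flag names illustrative; the relative ordering of the fully patched "headed" and "minor" tiers is immaterial to training since both carry full weight):

```python
TIER_WEIGHT = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def assign_tier(has_structural_issue: bool, had_heading: bool,
                had_orphan_word: bool) -> str:
    """Most severe issue wins: structural degradation outranks
    already-patched headings and orphan words."""
    if has_structural_issue:
        return "degraded"
    if had_heading:
        return "headed"
    if had_orphan_word:
        return "minor"
    return "clean"
```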
### Sample Weighting Strategy
During fine-tuning, each training sample is weighted by quality tier to reduce the influence of structurally degraded paragraphs without discarding them entirely:
- **clean + headed + minor (1.0 weight):** Content is correct and text is clean (after patching). These form the reliable training signal.
- **degraded (0.5 weight):** Content is present but structural issues (concatenated list items, fragments, truncations) may cause the text to misrepresent paragraph-level semantics. The labels are likely correct (models can infer meaning despite structural noise), but the text doesn't match what the model will see at inference time on clean filings. Downweighting reduces overfitting to degraded patterns without losing the content signal.
Sample weighting is applied by overriding the HuggingFace Trainer's `compute_loss` with a custom loss that multiplies the per-sample cross-entropy by the tier weight (the Trainer has no built-in per-sample weighting).
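The weighting arithmetic itself is framework-independent. A minimal sketch (function name illustrative):

```python
def weighted_mean_loss(per_sample_losses: list[float],
                       weights: list[float]) -> float:
    """Batch loss with per-sample quality weights: each sample's
    cross-entropy is scaled by its tier weight, then averaged
    over the total weight."""
    assert len(per_sample_losses) == len(weights)
    total = sum(l * w for l, w in zip(per_sample_losses, weights))
    return total / sum(weights)
```

For example, in a batch with two full-weight samples and one degraded (0.5) sample, the degraded sample contributes half as much to the gradient as either clean sample.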
### Additional Findings (from anomaly detection)
| Finding | Count | Concern |
|---------|-------|---------|
| Embedded bullet points mid-text | 1,941 (flagged degraded) | MEDIUM — semicolon-separated list items without bullet markers |
| Invisible merges (no separators) | 222 (flagged degraded) | MEDIUM — list items concatenated with no trace of structure (e.g., Bancorp 34) |
| No cybersecurity keywords at all | 528 (348 annotated) | LOW — investigated, keyword filter was too narrow, labels correct |
| Cross-references to other SEC items | 5,750 | LOW — mostly legitimate "see Item 1A" refs |
| Dollar amounts in text | 46 | LOW — mostly legitimate incident costs |
| Paragraphs >400 words | 149 | LOW — possible failed splits |
| Repeated sentences within paragraph | 9 | LOW — copy-paste artifacts |
---
## 8. Annotation Impact (Quantified)
Of 49,795 annotated paragraphs:
### Annotated set by generator
| Generator | Annotated Paras | % of Annotated Set |
|-----------|----------------|-------------------|
| Inline XBRL | ~10,500 | 21.1% |
| Workiva | ~15,300 | 30.7% |
| Donnelley | ~9,000 | 18.1% |
| Toppan Merrill | ~5,900 | 11.8% |
| EFiling/XDX | 3,562 | 7.2% |
| ThunderDome | ~2,500 | 5.0% |
| CompSci Transform | 2,288 | 4.6% |
| Others | ~700 | 1.4% |
### Orphan words in annotated set
**2,178 annotated paragraphs (4.37%)** start with lowercase (non-list) — orphan word candidates.
| Generator | Orphan Paras | % of Generator's Annotated | % of All Orphans |
|-----------|-------------|---------------------------|-----------------|
| EFiling/XDX | 1,389 | 39.0% | 63.8% |
| CompSci Transform | 401 | 17.5% | 18.4% |
| All others | 388 | <5% each | 17.8% |
EFiling/XDX alone accounts for 63.8% of all orphan-word paragraphs in the annotated set.
### Label bias in orphan-word paragraphs
- **Strategy Integration** is over-represented at 1.55x base rate (16.1% of orphan paras vs 10.4% overall)
- **Board Governance** and **Management Role** are under-represented (0.60x and 0.49x) — likely because governance headings/lead-in sentences get split off, leaving the orphan fragment lacking governance context
This suggests orphan words may cause subtle category misclassification, not just missing text.
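The over/under-representation figures reduce to a one-line lift ratio. A sketch (the orphan count for Strategy Integration is back-derived from the reported 16.1% of 2,178, so treat it as illustrative):

```python
def lift(count_in_subset, subset_size, base_rate):
    """Over/under-representation: category rate inside a subset vs the full set."""
    return (count_in_subset / subset_size) / base_rate

# Strategy Integration: 16.1% of the 2,178 orphan paragraphs vs 10.4% overall.
# The count 351 is inferred from the reported rate, not taken from the data files.
strategy_lift = lift(351, 2178, 0.104)
```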
### Inlined headers in annotated set
**4,513 annotated paragraphs (9.06%)** have section headings merged into text. Relatively uniform across generators (~9-10%), but notably lower for EFiling/XDX (5.3%) and CompSci Transform (5.6%) — these generators split at headers rather than merging them.
### Combined impact
**6,691 annotated paragraphs (13.44%)** have either orphan-word OR inlined-header issues.
Per generator:
- EFiling/XDX: 1,577 of 3,562 (44.3%) affected
- CompSci Transform: ~600 of 2,288 (~26%) affected
- All others: <15% affected
---
## 9. Summary of Changes to Annotated Data
| Change | Annotated Paragraphs Affected | Semantic Impact |
|--------|------------------------------|----------------|
| Orphan word restored | 1,400 | Label bias detected (Strategy 1.55x, Management 0.49x) |
| Heading stripped (all passes) | ~5,643 | Removes shortcut learning signal |
| No-cyber flagged as degraded | 348 | May want to exclude from training |
| **Total modified** | **~7,100 of 49,795 (14.3%)** | |
## 10. Remaining Questions / Next Steps
- **Re-run Stage 1 on orphan-word paragraphs** (~$15-20 for 1,400 paragraphs). Label bias suggests some misclassification. May resolve conflicts and save Stage 2 judge costs.
- **Heading-stripped paragraphs:** Existing labels are likely still valid — annotators classified the body text, not the heading. But could re-run if budget allows.
- **Exclude 348 no-cyber-keyword annotated paragraphs?** If labeled "None/Other" they're fine; if other categories, they're noise from section bleed.
- **855 ambiguous HTML heading cases** — bold/underline text at paragraph start but also a valid sentence subject. Would need manual review to resolve.
- **Run DAPT** — filter <10K char docs from DAPT corpus, then start training.
---
## 11. Artifacts Produced
### Data Files
```
data/paragraphs/
├── paragraphs-clean.jsonl ← FROZEN original (72,045 paragraphs)
├── paragraphs-clean.patched.jsonl ← All 6 patches applied (orphan + heading)
├── training.patched.jsonl ← Training subset, all patches applied (49,795)
├── patches/
│ ├── orphan-word-patches.jsonl ← 2,233 orphan word recovery records
│ ├── heading-strip-patches.jsonl ← 7,514 heading strip records (space sep)
│ ├── colon-heading-patches.jsonl ← 370 colon-heading strip records
│ ├── heading-strip-v2-patches.jsonl ← 184 period/dash/caps/semicolon headings
│ └── heading-strip-html-patches.jsonl← 343 HTML bold/underline confirmed headings
└── quality/
├── generator-tags.jsonl ← 14,759 accession → generator mappings
├── quality-scores.jsonl ← 72,045 per-paragraph quality metadata
├── filing-headings.jsonl ← Cached styled headings from HTML (3,459 filings)
└── ambiguous-filings.txt ← Filing list used for HTML heading extraction
```
### Scripts
| Script | Purpose |
|--------|---------|
| `ts/scripts/patch-orphan-words.ts` | Detect and recover orphan words from HTML source |
| `ts/scripts/tag-generators.ts` | Identify filing generator from HTML signatures |
| `ts/scripts/extract-html-headings.ts` | Extract bold/underline headings from HTML (32-worker parallel, 1.7s) |
| `ts/scripts/dapt-corpus-prep.ts` | DAPT corpus preparation (HTML → clean JSONL, 32-worker parallel) |
| `scripts/detect_generators.py` | Python generator detection (initial analysis) |
| `scripts/generator_quality_analysis.py` | Generator × quality metrics cross-reference |
| `scripts/analyze_generator_quality.py` | Annotation impact analysis by generator |
| `scripts/find_heading_candidates.py` | Creative heading pattern hunt (7 approaches) |
| `scripts/data_quality_audit.py` | Statistical anomaly detection (content, structure, outliers) |
| `scripts/audit_corpus.py` | Text corruption checks |
| `scripts/audit_paragraphs.py` | Boundary audit (per-filing stats, coherence, duplicates) |
### Documentation
| Doc | Content |
|-----|---------|
| `docs/DATA-QUALITY-AUDIT.md` | This document — full audit findings, patch history, quality tiers |
| `docs/EDGAR-FILING-GENERATORS.md` | Generator reference — 14 vendors, signatures, market share, quality issues |
| `docs/SEC-HTML-CLEANING.md` | HTML cleaning lessons and pitfalls |

# SEC EDGAR Filing Generator Reference
Reference for identifying which software generated a given SEC 10-K HTML filing.
Built from direct inspection of EDGAR filings and market research (March 2026).
---
## 1. Major Vendors and HTML Signatures
### Workiva (Wdesk) -- Market Leader for 10-K/10-Q
**Filing agent CIK:** `0001628280`
**HTML comment signature (lines 1-3):**
```html
<?xml version='1.0' encoding='ASCII'?>
<!--XBRL Document Created with the Workiva Platform-->
<!--Copyright 2025 Workiva-->
<!--r:{uuid},g:{uuid},d:{hex-id}-->
```
**Detection heuristics:**
- HTML comment: `XBRL Document Created with the Workiva Platform`
- HTML comment: `Copyright \d{4} Workiva`
- Third comment line contains `r:`, `g:`, `d:` UUIDs (document/generation tracking)
- `xml:lang="en-US"` attribute on `<html>` tag
- Body uses inline styles exclusively (no CSS classes on content elements)
- Heavy use of `<span>` with inline styles containing `background-color`, `font-family`, `font-size`, `font-weight`, `line-height` in every span
- Div IDs follow pattern: `i{hex32}_{number}` (e.g., `id="i56b78781f7c84a038f6ae0f6244f7dd8_1"`)
- Tables use `display:inline-table` and `vertical-align:text-bottom`
- iXBRL fact IDs follow pattern: `F_{uuid}` (e.g., `id="F_d8dc1eb1-109d-445d-a55a-3dde1a81ca63"`)
- No `<meta name="generator">` tag
- No CSS classes on body content (purely inline styles)
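The div-ID heuristic above can be checked mechanically. A sketch, with the pattern transcribed from the example ID in this section:

```python
import re

# Workiva div IDs: "i" + 32 hex chars + "_" + sequence number.
WORKIVA_DIV_ID = re.compile(r"i[0-9a-f]{32}_\d+")

def looks_like_workiva_div(div_id: str) -> bool:
    """True if a div id matches the Workiva i{hex32}_{number} pattern."""
    return WORKIVA_DIV_ID.fullmatch(div_id) is not None
```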
**Structural patterns:**
- Span-heavy: nearly every text fragment wrapped in `<span style="...">`
- Font specified as `font-family:'Times New Roman',sans-serif` (note: sans-serif fallback, unusual)
- Line-height specified on every span (e.g., `line-height:120%`)
- Background color explicitly set: `background-color:#ffffff`
**Known quality issues:**
- Extremely verbose HTML; simple paragraphs become deeply nested span trees
- Text extraction is clean because span boundaries align with word boundaries
- Large file sizes due to inline style repetition
---
### DFIN / Donnelley Financial Solutions (ActiveDisclosure)
DFIN operates under **two distinct CIKs** with **two different HTML output formats**.
#### DFIN "New" ActiveDisclosure (primary)
**Filing agent CIK:** `0000950170` (also `0000950130`)
**HTML comment signature:**
```html
<?xml version='1.0' encoding='ASCII'?>
<!-- DFIN New ActiveDisclosure (SM) Inline XBRL Document - http://www.dfinsolutions.com/ -->
<!-- Creation Date :2025-02-18T12:36:24.4008+00:00 -->
<!-- Copyright (c) 2025 Donnelley Financial Solutions, Inc. All Rights Reserved. -->
```
**Detection heuristics:**
- HTML comment: `DFIN New ActiveDisclosure`
- HTML comment: `http://www.dfinsolutions.com/`
- HTML comment: `Copyright (c) \d{4} Donnelley Financial Solutions`
- HTML comment: `Creation Date :` with ISO timestamp
- Body style: `padding:8px;margin:auto!important;`
- Inline styles use `font-kerning:none;min-width:fit-content;` on most spans
- Extensive use of `white-space:pre-wrap` on spans
- CSS class `item-list-element-wrapper` and `page-border-spacing` present
- iXBRL fact IDs follow pattern: `F_{uuid}`
**Structural patterns:**
- Every text span carries `min-width:fit-content` (distinctive)
- Uses `&#160;` for spacing extensively
- Uses `<p>` tags with inline margins for all paragraphs
- Tables use explicit `padding-top:0in;vertical-align:top;padding-bottom:0in` cell styles
#### DFIN Legacy (RR Donnelley heritage)
**Filing agent CIK:** `0001193125`
**HTML signature:**
```html
<?xml version='1.0' encoding='ASCII'?>
<html xmlns:link="..." xmlns:xbrldi="..." ...>
<head>
<title>10-K</title>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
</head>
<body style="line-height:normal;background-color:white;">
<h5 style="font-size:10pt;font-weight:bold"><a href="#toc">Table of Contents</a></h5>
```
**Detection heuristics:**
- No identifying HTML comments (no generator/copyright comment)
- Accession number prefix `0001193125` is definitive
- `<body style="line-height:normal;background-color:white;">`
- Immediately starts with `<h5>` Table of Contents link
- Uses deprecated namespace aliases: `xmlns:xl`, `xmlns:xbrll`, `xmlns:deprecated`
- iXBRL fact IDs follow pattern: `Fact_{large_number}` (e.g., `id="Fact_129727210"`)
- Uses `<FONT>` tags (HTML 3.2 style) in some documents
- Uppercase HTML tags in older filings (`<P>`, `<B>`, `<DIV>`)
**Structural patterns:**
- Cleaner HTML than ActiveDisclosure New
- Uses semantic `<h5>` for table of contents
- Inline styles are simpler and more standard
- File description filenames follow pattern: `d{number}d10k.htm`
---
### Toppan Merrill (Bridge)
**Filing agent CIKs:** `0001104659` (primary), `0001558370` (secondary)
**HTML comment signature:**
```html
<?xml version='1.0' encoding='ASCII'?>
<!-- iXBRL document created with: Toppan Merrill Bridge iXBRL 10.9.0.3 -->
<!-- Based on: iXBRL 1.1 -->
<!-- Created on: 2/21/2025 8:11:11 PM -->
<!-- iXBRL Library version: 1.0.9062.16423 -->
<!-- iXBRL Service Job ID: {uuid} -->
```
**Detection heuristics:**
- HTML comment: `iXBRL document created with: Toppan Merrill Bridge iXBRL`
- HTML comment: `iXBRL Library version:`
- HTML comment: `iXBRL Service Job ID:`
- Includes version number in comment (e.g., `10.9.0.3`)
- `<title>` tag contains company name + period end date (e.g., `Sunstone Hotel Investors,&#160;Inc._December 31, 2024`)
- Uses `xmlns:xs` alongside `xmlns:xsi` (both XML Schema namespaces)
- Body starts with `<div style="margin-top:30pt;"></div>` (distinctive)
- iXBRL hidden div uses `display:none;` (no additional styles on the div)
**Structural patterns:**
- Context IDs use descriptive names with GUIDs: `As_Of_12_31_2024_{base64-like}`, `From_01_01_2024_to_12_31_2024_{guid}`
- Hidden fact IDs follow pattern: `Hidden_{base64-like}`
- Unit ref IDs follow pattern: `Unit_Standard_USD_{base64-like}`
- No CSS classes used on content elements
- Relatively clean HTML structure
---
### RDG Filings (ThunderDome Portal)
**Filing agent CIK:** `0001437749`
**HTML signature:**
```html
<?xml version='1.0' encoding='ASCII'?>
<html xmlns:thunderdome="http://www.RDGFilings.com" ...>
<head>
<title>avpt20241231_10k.htm</title>
<!-- Generated by ThunderDome Portal - 2/27/2025 6:06:48 PM -->
<meta http-equiv="Content-Type" content="text/html"/>
</head>
<body style="cursor: auto; padding: 0in 0.1in; font-family: &quot;Times New Roman&quot;, Times, serif; font-size: 10pt;">
```
**Detection heuristics:**
- XML namespace: `xmlns:thunderdome="http://www.RDGFilings.com"`
- HTML comment: `Generated by ThunderDome Portal`
- `<title>` contains the filing filename
- Body style includes `cursor: auto; padding: 0in 0.1in`
- iXBRL fact IDs prefixed with `thunderdome-` (e.g., `id="thunderdome-EntityCentralIndexKey"`)
- Context ref IDs use simple date ranges: `d_2024-01-01_2024-12-31`
- Other fact IDs follow `ixv-{number}` or `c{number}` pattern
**Market presence:** ~14,000 filings/year, rank #9 among filing agents. About 5% of annual filings.
---
### Broadridge Financial Solutions (PROfile)
**Filing agent CIKs:** `0001140361` (primary), `0001133228` (secondary)
**HTML comment signature:**
```html
<!-- Licensed to: Broadridge
Document created using Broadridge PROfile 25.1.1.5279
Copyright 1995 - 2025 Broadridge -->
```
**Detection heuristics:**
- HTML comment: `Licensed to: Broadridge`
- HTML comment: `Document created using Broadridge PROfile` with version number
- HTML comment: `Copyright 1995 - \d{4} Broadridge`
- CSS classes with `BRPF` prefix: `BRPFPageBreak`, `BRPFPageBreakArea`, `BRPFPageFooter`, `BRPFPageHeader`, `BRPFPageNumberArea`
- CSS class: `DSPFListTable`
- CSS class: `cfttable`
- CSS class: `Apple-interchange-newline` (suggests Mac/WebKit origin)
- Context ref IDs use XBRL-standard descriptive format: `c20240101to20241231_AxisName_MemberName`
**Note:** Broadridge acquired CompSci Resources LLC in July 2024 and is integrating CompSci's Transform platform. Filings may transition to Broadridge branding over time.
---
### CompSci / Novaworks (Transform and GoFiler)
CompSci Resources produces two tools that leave distinct signatures.
#### CompSci Transform (now Broadridge)
**Filed via:** EdgarAgents LLC (`0001213900`) or other agents
**HTML comment signature:**
```html
<?xml version='1.0' encoding='ASCII'?>
<!-- Generated by CompSci Transform (tm) - http://www.compsciresources.com -->
<!-- Created: Mon Mar 17 19:46:10 UTC 2025 -->
```
**Detection heuristics:**
- HTML comment: `Generated by CompSci Transform`
- HTML comment: `http://www.compsciresources.com`
- XML namespace: `xmlns:compsci="http://compsciresources.com"`
- Body wrapped in: `<div style="font: 10pt Times New Roman, Times, Serif">`
- Uses `<!-- Field: Rule-Page -->` and `<!-- Field: /Rule-Page -->` HTML comments as structural markers
- Empty `<div>` tags used as spacers between paragraphs
- iXBRL context refs use simple sequential IDs: `c0`, `c1`, `c2`, ...
- iXBRL fact IDs follow `ixv-{number}` pattern
- Uses shorthand CSS: `font: 10pt Times New Roman, Times, Serif` (combined property)
- Margin shorthand: `margin: 0pt 0`
**Known quality issues:**
- Words can be broken across `<span>` tags mid-word
- Heavy use of `&#160;` for spacing
- Empty divs between every paragraph create parsing noise
- `<!-- Field: ... -->` comments interspersed throughout document body
#### Novaworks GoFiler (XDX format)
**Filed via:** SECUREX Filings (`0001214659`) or self-filed
**HTML signature:**
```html
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html"/>
</head>
<!-- Field: Set; Name: xdx; ID: xdx_021_US%2DGAAP%2D2024%2D... -->
<!-- Field: Set; Name: xdx; ID: xdx_03B_... -->
```
**Detection heuristics:**
- HTML comments with pattern: `<!-- Field: Set; Name: xdx; ID: xdx_{code}_{data} -->`
- XDX comments appear between `</head>` and `<body>` (unusual placement)
- Body style: `font: 10pt Times New Roman, Times, Serif` (same shorthand as CompSci)
- Empty `<title></title>` tag
- iXBRL fact IDs use `xdx2ixbrl{number}` pattern (e.g., `id="xdx2ixbrl0102"`)
- Standard fact IDs use `Fact{number:06d}` pattern (e.g., `id="Fact000003"`)
- Context refs use `From{date}to{date}` or `AsOf{date}` format (no separators within date)
**XDX explained:** XDX (XBRL Data Exchange) is GoFiler's proprietary format that uses HTML tag ID attributes ("engrams") to embed XBRL metadata. The `xdx_` comments carry taxonomy, entity, period, and unit definitions that GoFiler uses to generate the final iXBRL.
---
### Discount EDGAR / NTDAS (XBRLMaster / EDGARMaster)
**Filing agent CIK:** `0001477932`
**HTML signature:**
```html
<head>
<title>crona_10k.htm</title>
<!--Document Created by XBRLMaster-->
<meta http-equiv="Content-Type" content="text/html"/>
</head>
<body style="text-align:justify;font:10pt times new roman">
```
**Detection heuristics:**
- HTML comment: `Document Created by XBRLMaster`
- Body style: `text-align:justify;font:10pt times new roman`
- Hidden iXBRL div has `id="XBRLDIV"`
- Additional body styles include `margin-left:7%;margin-right:7%`
- Uses lowercase `times new roman` (no capitalization)
- iXBRL fact IDs use `ixv-{number}` pattern
---
### EdgarAgents LLC
**Filing agent CIK:** `0001213900`
EdgarAgents is a filing agent service, not a document creation tool. The HTML they submit is typically generated by CompSci Transform, GoFiler, or other tools. Check the HTML comments to identify the actual generator.
---
### DFIN Legacy (pre-iXBRL / SGML-era)
**Filing agent CIK:** `0001193125`
Older filings (pre-2019) from this CIK may appear in `<DOCUMENT>` SGML wrapper format:
```html
<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>d913213d10k.htm
<DESCRIPTION>10-K
<TEXT>
<HTML><HEAD>
<TITLE>10-K</TITLE>
</HEAD>
<BODY BGCOLOR="WHITE" STYLE="line-height:Normal">
<Center><DIV STYLE="width:8.5in" align="left">
```
**Detection heuristics:**
- Uppercase HTML tags: `<HTML>`, `<HEAD>`, `<BODY>`, `<P>`, `<B>`
- `BGCOLOR="WHITE"` attribute (deprecated HTML)
- `<Center>` tag with capital C
- `<DIV STYLE="width:8.5in"` (page-width container)
- `<FONT>` tags for styling
- Filename pattern: `d{number}d10k.htm`
---
## 2. Filing Agent Market Share
Based on [secfilingdata.com](https://www.secfilingdata.com/top-filing-agents/) total filings across all form types:
| Rank | Filing Agent | CIK | 2025 Filings | Total (All Time) |
|------|-------------|-----|-------------|-----------------|
| 1 | Donnelley Financial (DFIN) | 0001193125 | 65,180 | 1,872,890 |
| 2 | EdgarAgents LLC | 0001213900 | 48,021 | 367,211 |
| 3 | Quality Edgar (QES) | 0001839882 | 38,017 | 151,031 |
| 4 | Toppan Merrill | 0001104659 | 48,260 | 988,715 |
| 5 | WallStreetDocs Ltd | 0001918704 | 22,387 | 56,431 |
| 6 | Workiva (Wdesk) | 0001628280 | 21,606 | 141,795 |
| 7 | M2 Compliance LLC | 0001493152 | 13,810 | 164,603 |
| 8 | Davis Polk & Wardwell LLP | 0000950103 | 16,231 | 326,359 |
| 9 | RDG Filings (ThunderDome) | 0001437749 | 14,209 | 187,270 |
| 10 | Morgan Stanley | 0001950047 | 12,822 | 56,468 |
| 11 | Broadridge | 0001140361 | -- | 597,664 |
| 14 | SECUREX Filings | 0001214659 | -- | 115,218 |
| 19 | Blueprint | 0001654954 | -- | 62,250 |
| 20 | FilePoint | 0001398344 | -- | 76,218 |
| 38 | Discount EDGAR | 0001477932 | -- | 37,422 |
**For 10-K/10-Q specifically (estimated from biotech IPO data and market research):**
- DFIN: ~40-50% of annual/quarterly filings
- Workiva: ~25-35% (has been gaining share from DFIN since ~2010)
- Toppan Merrill: ~10-15%
- RDG Filings: ~5%
- Broadridge/CompSci: ~5%
- Others (law firms, self-filed, smaller agents): ~5-10%
---
## 3. XBRL/iXBRL Tool Signatures
The iXBRL tagging tool is often the same as the filing generator, but not always. Key distinguishing patterns in the iXBRL layer:
| Tool | Context Ref Pattern | Fact ID Pattern | Unit Ref Pattern |
|------|-------------------|----------------|-----------------|
| Workiva | `C_{uuid}` | `F_{uuid}` | `U_{uuid}` |
| DFIN New | `C_{uuid}` | `F_{uuid}` | Standard names |
| DFIN Legacy | `Fact_{large_int}` | `Fact_{large_int}` | Standard names |
| Toppan Merrill | `As_Of_{date}_{guid}` / `From_{date}_to_{date}_{guid}` | `Hidden_{guid}` | `Unit_Standard_USD_{guid}` |
| ThunderDome | `d_{date_range}` / `i_{date}` | `thunderdome-{name}` or `ixv-{n}` or `c{n}` | Standard names |
| CompSci Transform | `c0`, `c1`, `c2` ... | `ixv-{number}` | Standard names |
| GoFiler (XDX) | `From{date}to{date}` / `AsOf{date}` | `xdx2ixbrl{number}` | Standard names |
| XBRLMaster | `From{date}to{date}` | `ixv-{number}` | Standard names |
| Broadridge PROfile | `c{date}to{date}_{axis}_{member}` | Descriptive | Standard names |
---
## 4. Detection Priority (Recommended Heuristic Order)
For maximum reliability, check signatures in this order:
1. **HTML comments** (first 10 lines) -- most generators embed identifying comments
- `Workiva Platform` --> Workiva
- `DFIN New ActiveDisclosure` --> DFIN New
- `Toppan Merrill Bridge` --> Toppan Merrill
- `ThunderDome Portal` --> RDG Filings
- `CompSci Transform` --> CompSci/Broadridge
- `Broadridge PROfile` --> Broadridge
- `XBRLMaster` --> Discount EDGAR / NTDAS
2. **XML namespaces** on `<html>` tag
- `xmlns:thunderdome="http://www.RDGFilings.com"` --> RDG
- `xmlns:compsci="http://compsciresources.com"` --> CompSci
3. **XDX comments** between head and body --> GoFiler/Novaworks
4. **Accession number prefix** (first 10 digits) --> identifies filing agent CIK
5. **Body style patterns** as fallback
6. **iXBRL fact ID patterns** as secondary confirmation
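The comment and namespace checks (steps 1-3) can be sketched as an ordered signature scan. The patterns are condensed from this document's heuristics; real filings may need the fuller per-generator checks, with steps 4-6 as fallback:

```python
import re

# Ordered (pattern, generator) pairs mirroring the priority list above.
SIGNATURES = [
    (r"Workiva Platform", "Workiva"),
    (r"DFIN New ActiveDisclosure", "DFIN New"),
    (r"Toppan Merrill Bridge", "Toppan Merrill"),
    (r"ThunderDome Portal", "RDG Filings"),
    (r"CompSci Transform", "CompSci/Broadridge"),
    (r"Broadridge PROfile", "Broadridge"),
    (r"XBRLMaster", "Discount EDGAR"),
    (r'xmlns:thunderdome="http://www\.RDGFilings\.com"', "RDG Filings"),
    (r'xmlns:compsci="http://compsciresources\.com"', "CompSci/Broadridge"),
    (r"Field: Set; Name: xdx", "GoFiler/Novaworks"),
]

def detect_generator(html_head: str) -> str:
    """Return the first matching generator, or UNKNOWN to fall through to
    accession-prefix / body-style / fact-ID checks (steps 4-6)."""
    for pattern, name in SIGNATURES:
        if re.search(pattern, html_head):
            return name
    return "UNKNOWN"
```

Only the head of the file needs scanning, since every generator that identifies itself does so in the first few lines.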
---
## 5. Known Quality Issues by Generator
### CompSci Transform
- **Words broken across spans**: Text is split at arbitrary character boundaries, not word boundaries. A single word like "cybersecurity" may be split across 2-3 `<span>` tags. This breaks naive text extraction that operates per-element.
- **Empty div spacers**: `<div>\n\n</div>` between every paragraph adds noise.
- **Field comments in body**: `<!-- Field: Rule-Page -->` markers interspersed with content.
### Workiva
- **Extreme span nesting**: Every text run gets its own `<span>` with full inline style. A simple bold sentence may have 5+ spans.
- **Large file sizes**: Inline style repetition causes 10-K files to be 2-5x larger than equivalent DFIN filings.
- **Clean word boundaries**: Despite heavy span usage, spans align with word/phrase boundaries, making text extraction reliable.
### DFIN New ActiveDisclosure
- **`min-width:fit-content` everywhere**: Unusual CSS property on every span; may cause rendering inconsistencies in older browsers.
- **`font-kerning:none`**: Explicit kerning disable on all text spans.
- **Generally clean**: Text extraction works well; word boundaries respected.
### DFIN Legacy
- **Uppercase HTML tags**: Older filings use `<P>`, `<B>`, `<FONT>` -- need case-insensitive parsing.
- **Mixed HTML versions**: Some documents mix HTML 3.2 and 4.0 constructs.
- **SGML wrappers**: Some filings wrapped in `<DOCUMENT>` SGML envelope.
### GoFiler / Novaworks
- **XDX comment noise**: Multiple `<!-- Field: Set; ... -->` comments that must be stripped.
- **Generally clean HTML**: Body content is straightforward.
### Toppan Merrill Bridge
- **Clean output**: Among the cleanest generators. Minimal inline style bloat.
- **GUID-heavy IDs**: Context and unit refs use base64-like GUIDs that are less human-readable.
---
## 6. Self-Filed / In-House Filings
Some large filers submit directly using their own CIK as the accession number prefix. These filings have **no generator comment** and variable HTML quality.
**Detection:** Accession number prefix matches the filer's own CIK (e.g., Halliburton CIK `0000045012` files with accession `0000045012-25-000010`).
**However:** Even self-filed companies typically use a commercial tool. Halliburton's self-filed 10-K contains the Workiva comment signature, indicating they use Workiva but submit directly rather than through a filing agent.
**Truly in-house HTML** (no commercial tool) is rare among 10-K filers. When it occurs:
- No identifying comments
- No consistent structural patterns
- May use Word-to-HTML conversion (look for `mso-` CSS prefixes from Microsoft Office)
- May have minimal or no iXBRL tagging
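The self-filed check above is a prefix comparison. A sketch, assuming the usual 10-digit zero-padded CIK convention:

```python
def is_self_filed(accession: str, filer_cik: str) -> bool:
    """The accession prefix (first 10 digits) is the submitting agent's CIK;
    if it matches the filer's own CIK, the filing was submitted directly."""
    return accession.replace("-", "")[:10] == filer_cik.zfill(10)
```

For the Halliburton example above, `is_self_filed("0000045012-25-000010", "45012")` is true, even though the document itself carries a Workiva signature.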
---
## 7. Law Firm Filings
Several large law firms act as filing agents:
- Davis Polk & Wardwell (`0000950103`) -- 326K total filings
- Paul Weiss (`0000950142`) -- 56K total filings
- Foley & Lardner (`0000897069`) -- 30K total filings
- Sidley Austin (`0000905148`) -- 39K total filings
- Seward & Kissel (`0000919574`) -- 107K total filings
Law firms typically file transactional documents (S-1, proxy, 8-K) rather than periodic 10-K filings. The HTML in law-firm-filed documents often comes from Word conversion and lacks commercial generator signatures.
---
## 8. Summary: Quick Detection Regex Table
```
Pattern | Generator
-----------------------------------------------------|------------------
/Workiva Platform/ | Workiva
/DFIN New ActiveDisclosure/ | DFIN (New)
/Donnelley Financial Solutions/ | DFIN (New)
/Toppan Merrill Bridge/ | Toppan Merrill
/ThunderDome Portal/ | RDG Filings
/CompSci Transform/ | CompSci/Broadridge
/Broadridge PROfile/ | Broadridge
/XBRLMaster/ | Discount EDGAR
/xmlns:thunderdome="http:\/\/www\.RDGFilings\.com"/ | RDG Filings
/xmlns:compsci="http:\/\/compsciresources\.com"/ | CompSci
/Field: Set; Name: xdx/ | GoFiler/Novaworks
/dfinsolutions\.com/ | DFIN
/min-width:fit-content/ | DFIN (New)
/BRPFPage/ | Broadridge PROfile
/id="XBRLDIV"/ | XBRLMaster
```
---
## Sources
- Direct inspection of SEC EDGAR filings (March 2026)
- [secfilingdata.com/top-filing-agents](https://www.secfilingdata.com/top-filing-agents/) -- filing agent rankings
- [newstreetir.com -- Top SEC Filing Agents for Biotech IPOs](https://newstreetir.com/2025/05/14/who-are-the-top-sec-filing-agents-for-biotech-ipos/) -- biotech IPO market share
- [houseblend.io -- SEC Filing Software Platforms](https://www.houseblend.io/articles/sec-filing-software-platforms-pricing-compliance) -- vendor comparison
- [novaworkssoftware.com/inlinexbrl](https://www.novaworkssoftware.com/inlinexbrl.php) -- XDX format documentation
- [rdgfilings.com/thunderdome](https://rdgfilings.com/thunderdome-client-portal/) -- ThunderDome Portal
- [toppanmerrill.com/bridge](https://www.toppanmerrill.com/bridge/) -- Toppan Merrill Bridge
- [edgarmaster.com](https://edgarmaster.com/) -- EDGARMaster / XBRLMaster by NTDAS
- [pernasresearch.com -- DFIN analysis](https://pernasresearch.com/research-vault/donnelley-financial-initiation/) -- market share dynamics

Despite touching RMP (no program), Board Governance (board is responsible), and Strategy Integration (no incidents), the paragraph contains no substantive disclosure. The company explicitly has no program, and the board mention is perfunctory ("generally responsible... if any"). The absence of a program is not a program description. → **None/Other, Specificity 1.**
### Case 9: Generic regulatory compliance language
> *"Regulatory Compliance: The Company is subject to various regulatory requirements related to cybersecurity, data protection, and privacy. Non-compliance with these regulations could result in financial penalties, legal liabilities, and reputational damage."*
This acknowledges that regulations exist and non-compliance would be bad — a truism for every public company. It does not describe any process, program, or framework the company uses to comply. It does not make a materiality assessment. It names no specific regulation. → **None/Other, Specificity 1.**
The key distinctions:
- If the paragraph names a specific regulation (GDPR, HIPAA, PCI DSS, CCPA) but still describes no company-specific program → **Risk Management Process, Specificity 2** (named standard triggers Sector-Adapted)
- If the paragraph assesses whether regulatory non-compliance has "materially affected" the business → **Strategy Integration** (materiality assessment per Rule 6)
- If the paragraph describes what the company *does* to comply (audits, controls, certifications) → **Risk Management Process** at appropriate specificity
---
## Dimension 2: Specificity Level

- **Bullet list merging.** Disclosures frequently use bullet lists ("Our program includes: • risk assessment • vulnerability scanning"). Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless.
- **Continuation line detection.** Sentences split across HTML block elements need rejoining. Heuristic: if the previous block lacks terminal punctuation and the next starts lowercase or with a continuation phrase (`and`, `or`, `including`, `such as`), merge.
- **Length boundaries.** Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries to keep annotation units manageable.
- **Table-based bullet lists and the cascade failure.** Some generators (notably EFiling/XDX) render bullet lists as HTML tables with one `<td>` per bullet item, and use `&#183;` (middle dot in Symbol font) instead of the standard `&#8226;` bullet character. Since `stripHtml()` doesn't decode `&#183;` as a bullet marker, the bullet-aware merge logic never fires. Each bullet item starts lowercase ("establishing...", "maintaining..."), so the segmenter treats them as continuation fragments and merges them with the preceding block. This cascades: a Bancorp 34 filing had three separate elements — two bullet items about risk management processes and a standalone paragraph disclosing a $25,000 cybersecurity incident — concatenated into a single 114-word run-on sentence. The HTML structure was completely unambiguous (separate `<td>` and `<p>` elements with spacers), but the information was lost during text extraction. The data quality audit found 2,210 paragraphs with embedded bullet points across the corpus — most from this class of failure. These paragraphs are still classifiable (the models unanimously labeled this example as Incident Disclosure / Specificity 4), but the text quality is degraded.
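A sketch of the missing check, assuming entities are decoded with the stdlib `html.unescape` before the bullet-aware merge logic runs:

```python
import html

# Bullet characters to treat as list markers after entity decoding.
# \u2022 (&#8226;) is the standard bullet; \u00b7 (&#183;) is the Symbol-font
# middle dot that EFiling/XDX-style table bullets use.
BULLET_CHARS = "\u2022\u00b7"

def is_bullet_item(block: str) -> bool:
    """True if a text block starts with a bullet marker once entities decode."""
    text = html.unescape(block).lstrip()
    return bool(text) and text[0] in BULLET_CHARS
```

With this check in place, "&#183; establishing..." is recognized as a list item to merge with its intro sentence, instead of a lowercase fragment to glue onto the preceding block.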
### 8-K Extraction
--- ---
## Phase 10: Data Quality Audit and Corpus Remediation
### The Discovery
While preparing the DAPT corpus, we discovered that the paragraph data was less clean than we assumed. The extraction pipeline had been built to handle the worst HTML artifacts (word splits, XBRL tags, page breaks), but two systematic issues had been silently corrupting the training data:
1. **Orphan words.** HTML source wraps text at fixed column width. When a `<span>` tag consumes most of a line, only the first word fits before the source newline. `stripHtml()` preserved that newline, and the paragraph segmenter dropped the single-word fragment. Result: paragraphs like "sole executive officer and director is responsible for..." instead of "Our sole executive officer..." — 4.7% of all paragraphs.
2. **Inlined section headings.** The paragraph segmenter didn't strip sub-section headings ("Risk Management and Strategy", "Board Oversight") from paragraph body text. These headings became the first "sentence" of the paragraph. Result: 22% of paragraphs had section titles prepended to body text — a near-perfect predictor of `content_category` that creates shortcut learning risk.
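The heading-strip remediation described later boils down to a prefix match against known sub-headings. A sketch; the tuple here is a tiny illustrative subset, not the full heading list the patches use:

```python
# Illustrative subset of known Item 1C sub-headings.
KNOWN_HEADINGS = (
    "Risk Management and Strategy",
    "Board Oversight",
    "Governance",
)

def strip_inlined_heading(paragraph: str) -> str:
    """Remove a known sub-heading fused onto the start of a paragraph body."""
    for heading in KNOWN_HEADINGS:
        if paragraph.startswith(heading + " "):
            return paragraph[len(heading) + 1:]
    return paragraph
```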
### The Generator Investigation
Initial quality metrics showed 45% of filings in an "UNKNOWN" generator bucket. This felt wrong — SEC HTML comes from identifiable tools. We investigated and identified **14 distinct filing generators** covering 99.99% of 14,759 HTML files using meta tags, comments, namespace declarations, CSS patterns, and CIK-based filing agent lookup.
The investigation revealed that the worst-quality generator, **EFiling/EDGAR Agent (GoFiler/Novaworks XDX)**, had been hidden in the UNKNOWN bucket. It accounts for 13.5% of all filings but produces 36.8% orphan word rate (8x corpus average), the lowest paragraphs-per-filing (5.7 vs 7.7 avg), and 5.9% fragment rate. The second worst, **CompSci Transform** (6% of filings), had a 14.8% orphan word rate.
By contrast, the clean generators — Workiva (24.3%), Donnelley (15.8%), and Inline XBRL (16.4%) — all had <1% orphan word rates. Over 70% of paragraphs came from clean generators. The problem was concentrated, not uniform.
Full generator reference: `docs/EDGAR-FILING-GENERATORS.md`. Full audit findings: `docs/DATA-QUALITY-AUDIT.md`.
### Six Surgical Patches
All fixes follow the same principle: `paragraphs-clean.jsonl` is **frozen** — never modified. All fixes go through separate `.patched.jsonl` files. Annotations link by paragraph UUID, which never changes. Every patch is documented with scope, method, and validation.
| Patch | Method | Paragraphs | Annotated |
|-------|--------|-----------|-----------|
| 1-2. Orphan word restoration | HTML lookback: find paragraph text in stripped HTML, extract preceding word | 2,233 | 1,537 |
| 3. Heading strip (space separator) | Pattern match against 71 known Item 1C sub-headings | 7,514 | 5,013 |
| 4. Heading strip (colon separator) | "Heading Text: Sentence..." patterns | 370 | 227 |
| 5. Heading strip (period/dash/caps) | Extended separator detection | 184 | 133 |
| 6. HTML-confirmed headings | Bold/underline/h-tag extraction from source HTML, validated against paragraph starts | 343 | 270 |
| **Total** | | **8,411 headings + 2,233 orphans** | **~7,100 of 49,795 (14.3%)** |
The heading detection required five progressive passes because no single heuristic caught all separator styles. The HTML-confirmed pass (Patch 6) used a 32-worker parallel extraction script to scan 6,341 filings in 1.7 seconds, caching styled headings per filing for reuse.
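The separator-style passes can be sketched as a single pattern matcher. This is a simplified sketch: the heading list here is an illustrative subset, not the project's actual 71-entry list, and the real pipeline runs the separator styles as separate progressive passes.

```python
import re

# Illustrative subset; the production pipeline matches 71 known Item 1C sub-headings.
KNOWN_HEADINGS = [
    "Risk Management and Strategy",
    "Board Oversight",
    "Governance",
]

def strip_inlined_heading(text: str) -> str:
    """Remove a known sub-section heading prepended to paragraph body text.

    Covers the space, colon, and period/dash separator styles. Returns the
    text unchanged if no known heading matches at the start."""
    for heading in KNOWN_HEADINGS:
        # Heading, an optional separator, whitespace, then a capitalized sentence start.
        pattern = re.compile(
            r"^" + re.escape(heading) + r"\s*[:.\u2013\u2014-]?\s+(?=[A-Z\u201c\"])"
        )
        stripped = pattern.sub("", text, count=1)
        if stripped != text:
            return stripped
    return text
```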
### Orphan Word Re-Annotation
The orphan word patches weren't just cosmetic. Analysis revealed **label bias** in orphan-word paragraphs:
- Strategy Integration 1.55x over-represented (16.1% vs 10.4% baseline)
- Management Role 0.49x under-represented
- Board Governance 0.60x under-represented
Missing subject words like "Our", "We", "The" strip governance context that models rely on for classification. This suggested the original annotations on these paragraphs might be systematically wrong.
**Decision: re-run Stage 1 on patched text.** Cost: $3.30 for 4,611 annotations (1,537 paragraphs × 3 models), completed in ~9 minutes at 60 concurrency with zero failures.
**Results:**
- **119 paragraphs (7.7%)** changed consensus category — confirming the bias was real
- **37 paragraphs (2.4%)** changed consensus specificity
- **152 total (9.9%)** changed on at least one dimension
- mimo-v2-flash was most sensitive (14.6% category changes); gemini least affected (6.0%)
- 18 original conflicts resolved, 22 new conflicts introduced — roughly a wash on Stage 2 savings
- Top transitions: Management Role ↔ Risk Management Process (55/51 each direction), Strategy Integration → None/Other (46), Third-Party Risk → Risk Management Process (34)
The re-run annotations are stored separately in `data/annotations/stage1-orphan-rerun.jsonl` — the original `stage1.jsonl` is untouched. For training, the re-run annotations replace the originals for the affected 1,537 paragraphs.
### No-Cyber-Keyword Paragraphs: A False Alarm
The quality audit flagged 528 paragraphs (348 annotated) with no cybersecurity keywords at all — suspicious for Item 1C content. Initial expectation: these are section bleed from adjacent filing sections, probably labeled None/Other.
**Actual finding:** 65.2% (227 paragraphs) were labeled as real categories — mostly Risk Management Process (44.8%) and Management Role (10.6%). And the labels were **correct.** The paragraphs discuss security topics using synonymous terms: "risk assessment", "access to systems", "theft of intellectual property", "safeguards", "internal notifications" — all legitimate cybersecurity content that doesn't use the literal word "cybersecurity." The keyword filter was too narrow, not the paragraphs. All 348 are kept.
### Heading-Stripped Paragraphs: Labels Still Valid
For the ~5,643 annotated paragraphs where headings were stripped, existing labels are retained without re-annotation. The heading was a shortcut learning signal (near-perfect predictor of category), but annotators classified the body text, not the heading. Stripping the heading from training data removes a leaky feature without invalidating the label.
### Embedded Bullet Lists: The Cascade Failure
A spot-check of a Bancorp 34, Inc. paragraph revealed a class of structural corruption we hadn't detected. The paragraph read as a 114-word run-on:
> establishing and maintaining a comprehensive program to oversee and manager external connections and third-party relationships with access to the institution's technology assets maintaining an incident response program intended to enable us to mitigate the impact of, and recover from, any cyberattacks, and facilitate communication to internal and external experienced a single cybersecurity event in June of 2023...
The source HTML (filed via EFiling/XDX) had three clearly separate elements: two `<td>` bullet items about risk management processes, and a standalone `<p>` disclosing a $25,000 cybersecurity incident. The HTML structure was unambiguous — separate table rows with spacers between them.
**Root cause: a three-part cascade failure in the extraction pipeline.**
1. **Bullet character not recognized.** The HTML used `&#183;` (middle dot in Symbol font) instead of `&#8226;` (standard bullet). `stripHtml()` doesn't decode it, so the bullet-aware merge logic in the segmenter never fires.
2. **Lowercase continuation merge.** Each bullet starts lowercase ("establishing...", "maintaining..."), so the segmenter treats them as continuation fragments of the previous block.
3. **Short-block append.** Individual bullets fall below the 20-word minimum, so they get appended to the previous paragraph.
The result: two process-description bullet items and an incident disclosure fused into one incoherent paragraph. Despite this, all 3 Stage 1 models unanimously labeled it Incident Disclosure / Specificity 4 — the $25K incident detail dominated the merged text.
We identified two classes of this failure:
1. **Semicolon-separated merges (1,941 paragraphs):** The semicolons from the original list survived, but the bullet characters were stripped. Detectable by heuristic (3+ semicolons, lowercase after each, no bullet markers).
2. **Invisible merges (222 paragraphs):** Even the semicolons were stripped, leaving text that simply runs together with no trace of the original list structure. The Bancorp 34 example falls in this category — "to internal and external experienced a single cybersecurity event" is an impossible English sentence that a regex cannot distinguish from legitimate prose. These were detected by a secondary heuristic (lowercase-start, not orphan-patched, 60+ words), but this is an undercount — some invisible merges start with uppercase text.
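The class-1 heuristic can be sketched as follows (the bullet character set is illustrative; the production detector lives in the TypeScript pipeline):

```python
import re

BULLET_CHARS = ("\u2022", "\u00b7", "\u25cf", "\u25aa")  # •, ·, ●, ▪

def looks_like_merged_list(text: str) -> bool:
    """Heuristic for semicolon-separated bullet merges: 3+ semicolons, each
    followed by a lowercase continuation, and no surviving bullet characters
    (paragraphs with bullet markers are handled by the bullet-aware merge logic)."""
    if any(ch in text for ch in BULLET_CHARS):
        return False
    return len(re.findall(r";\s+[a-z]", text)) >= 3
```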
All 2,163 were reclassified to "degraded" tier. These aren't worth patching — splitting merged bullets requires per-paragraph HTML structure analysis and re-annotation of every resulting fragment. Instead, they'll be downweighted (0.5x) during fine-tuning to reduce overfitting to degraded text patterns while preserving their content signal.
### Sample Weighting for Fine-Tuning
The quality tier system maps directly to training sample weights:
| Tier | Weight | Rationale |
|------|--------|-----------|
| clean | 1.0 | No issues |
| headed | 1.0 | Heading removed, body text intact |
| minor | 1.0 | Orphan word restored |
| degraded | 0.5 | Labels likely correct, but text structure doesn't match clean inference-time inputs |
This is implemented via a `sample_weight` column in the training dataset. The HuggingFace Trainer supports per-sample loss weighting — each sample's cross-entropy loss is multiplied by its tier weight before backpropagation. Degraded paragraphs still contribute to learning, but their influence is halved relative to clean data.
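The weighting arithmetic, shown framework-free as a minimal sketch (the actual implementation multiplies each sample's loss inside the HuggingFace Trainer before reduction; this just makes the math concrete):

```python
import math

TIER_WEIGHTS = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def weighted_cross_entropy(probs: list[list[float]],
                           labels: list[int],
                           tiers: list[str]) -> float:
    """Mean cross-entropy with each sample's loss scaled by its quality-tier
    weight. probs[i] is the predicted class distribution for sample i."""
    total = 0.0
    for p, y, tier in zip(probs, labels, tiers):
        total += -math.log(p[y]) * TIER_WEIGHTS[tier]
    return total / len(labels)
```

A degraded sample thus contributes exactly half the gradient signal of a clean sample with the same prediction error.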
### Data Integrity Framework
The audit produced a formal data integrity framework:
1. `paragraphs-clean.jsonl` is frozen — the reproducibility anchor
2. All fixes go through `.patched.jsonl` — same schema, same IDs, updated text and hash
3. Annotations link by UUID — stable across patches
4. Never re-run extraction from HTML — cascade effects from merge logic cause thousands of ripple-effect changes
5. Every patch is documented with scope, method, validation, and annotation impact
6. Quality metadata is separate from text data — per-paragraph quality scores in a separate file
### Quality Tier System
Each paragraph gets a quality tier based on detected issues:
| Tier | Criteria | Count | % |
|------|----------|-------|---|
| clean | No detected issues | 58,165 | 80.7% |
| headed | Had inlined heading (now stripped) | 7,402 | 10.3% |
| degraded | Embedded bullets, invisible merges, fragments, truncations | 4,331 | 6.0% |
| minor | Had orphan word (now fixed) | 2,147 | 3.0% |
All "headed" and "minor" paragraphs have been patched — the tier records what *was* wrong for traceability. "Degraded" paragraphs are downweighted (0.5x) during fine-tuning.
---
## Phase 11: DAPT Corpus Preparation
### Corpus Cleaning
The DAPT corpus is built from 14,759 cached 10-K HTML filings processed through `stripHtml()` + `cleanForDapt()`. Three rounds of cleaning were required:
**Round 1** revealed XBRL data blobs (8.7% of docs, up to 33% of document text), page number artifacts, and exhibit listing boilerplate. Added targeted stripping for `iso4217:`, `xbrli:`, CIK-number sequences, and `F-N` page markers.
**Round 2** removed URLs (39% of docs → 0.3%) and XBRL exhibit listing lines ("Inline XBRL Taxonomy Extension Calculation Linkbase Document" — present in 85% of filings). Initial investigation claimed these were "legitimate prose mentions of XBRL." Spot-checking showed every single remaining match was exhibit index boilerplate. Stripped any line containing "XBRL" unless it also contained cybersecurity/risk/governance terms.
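The Round 2 rule can be sketched as a simple line filter. The keep-term list below is illustrative, not the production list:

```python
KEEP_TERMS = ("cybersecurity", "cyber", "security", "risk", "governance")

def filter_xbrl_lines(text: str) -> str:
    """Drop lines mentioning XBRL unless they also contain a cybersecurity/
    risk/governance term (a legitimate prose co-occurrence)."""
    kept = []
    for line in text.splitlines():
        lower = line.lower()
        if "xbrl" in lower and not any(t in lower for t in KEEP_TERMS):
            continue  # exhibit-index boilerplate
        kept.append(line)
    return "\n".join(kept)
```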
**Round 3** was a verification pass confirming the remaining 7.4% of docs with "XBRL" traces are legitimate prose co-occurrences with security terms.
The page number regex initially had a branch matching `[- ]\d{1,3}[- ]` that produced 100% false positives — it was matching negative financial figures (`-1%`) in sensitivity analysis tables. Only the `F-\d+` pattern was genuine. The false-positive branch was removed.
### Corpus Statistics (Final)
| Metric | Value |
|--------|-------|
| Documents | 14,756 (14,568 after <10K filter) |
| Total tokens | ~1.056 billion (ModernBERT tokenizer) |
| Median document | ~73K tokens (347K chars) |
| Training sequences (seq_len=8192) | ~136K |
| Steps per epoch (eff. batch=32) | ~4,257 |
| Estimated training time | ~4-8 hours per epoch (RTX 3090) |
### Sequence Length Decision
ModernBERT was pre-trained at 8192 tokens. We match this during DAPT to ensure all positional embedding and attention weights receive gradient updates. At seq_len=2048, positions 2048-8191 would get no updates. The tradeoff — batch_size drops from 4 to 1, compensated by gradient_accumulation=32 — results in comparable training time because 4x fewer steps offset slower per-step throughput.
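The batch arithmetic behind these numbers (corpus figures are approximate):

```python
import math

def steps_per_epoch(num_sequences: int, per_device_batch: int,
                    grad_accum: int) -> int:
    """Optimizer steps per epoch at a given effective batch size."""
    return math.ceil(num_sequences / (per_device_batch * grad_accum))

# seq_len=8192 fits batch_size=1 on the 3090; grad_accum=32 keeps the
# effective batch at 32. ~136K sequences / 32 gives roughly the ~4,257
# steps-per-epoch figure in the table above.
print(steps_per_epoch(136_000, 1, 32))  # 4250
```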
### Epoch Decision
We train for 1 epoch (single pass), following the empirical consensus:
- **Gururangan et al. (2020), "Don't Stop Pretraining" (ACL):** Used a single pass over 2-8B token domain corpora. Sufficient for consistent downstream gains across all four domains tested.
- **Ponnock (2025), arXiv:2512.12384:** Found SEC-specific DAPT shows "diminishing marginal returns beyond roughly 250M tokens" within a single epoch. Our 1B token corpus is well past the diminishing-returns threshold.
Full procedure documented in `docs/DAPT-PROCEDURE.md`.
---
## Cost and Time Ledger
### Tooling
| Stage 1 run #1 (with nano) | $112.42 | 150,009 | Full production run with gpt-5.4-nano. Completed, but nano's quality was unacceptable (0 reasoning tokens 64% of the time). Gemini+grok annotations ($91.18) preserved in `stage1-gemini-grok.jsonl`; only nano's annotations ($21.24) were discarded. Full original in `stage1.jsonl.bak`. |
| Stage 1 run #2 (mimo only) | $24.69 | 50,003 | Ran only mimo to replace nano. Merged with preserved gemini+grok annotations to form final `stage1.jsonl` ($115.88 total value, $24.69 new spend). |
| Judge model bench (8 candidates) | $5.97 | 505 | GLM-5 (4 configs), gpt-5.4-mini, gpt-5.4, sonnet-4.6, gemini-3-flash, grok-4.20, mimo-v2-pro, kimi-k2.5 |
| Orphan word re-annotation | $3.30 | 4,611 | Re-ran Stage 1 on 1,537 patched paragraphs × 3 models. 7.7% changed consensus category. |
| **Total API spend** | **$159** | **~218K unique** | Nano waste: $21.24 |
Only nano's portion ($21.24) of the first run was wasted — the gemini and grok annotations were preserved and merged with the new mimo annotations. Still, $21.24 thrown away on a model that wasn't thinking. The lesson: benchmark model candidates rigorously *before* committing to a production run. The 40-sample pilots showed nano was the weakest link but were misleadingly optimistic about the magnitude of the problem.
| Stage 1 annotation run #2 (mimo) | ~1h | Only needed mimo annotations at higher concurrency (gemini+grok reused). |
| Prompt iteration + model benchmarking | ~4h | 12+ prompt versions, 6 model candidates, pilot analysis |
| Post-Stage 1 analysis + Stage 2 planning | ~5h | Distributional analysis, model bias discovery, codebook v3.0 rulings, judge benchmarking, strategy revision |
| Data quality audit + remediation | ~4h | Generator investigation, 6 patches, orphan re-annotation, quality tier system, docs |
| Documentation + narrative | ~2h | Codebook updates, narrative writing, technical guide updates |
| Labelapp build + infrastructure | ~8h | Monorepo restructure, Next.js app, quiz/warmup/labeling flows, BIBD assignment, sampling, Docker deployment, timer + migration infrastructure |
| **Total to date** | **~35h** | |
### Remaining Work (estimated)
| Human labeling (1,200 paragraphs, 6 annotators) | ~6-8h | $0 (team labor) |
| Stage 2 judge production run (~3-5K paragraphs) | ~1h | ~$20-40 |
| Training data assembly | ~2h | $0 |
| DAPT pre-training (1 epoch) | ~4-8h GPU | $0 (own 3090) |
| TAPT pre-training | ~2-3h GPU | $0 |
| Fine-tuning + ablations (7 experiments) | ~12-20h GPU | $0 |
| Full GenAI benchmark on 1,200 holdout (9 models) | ~1h | ~$30-50 |
| Gold adjudications | `data/bench/judges/gold-adjudicated.json` | 11 detailed adjudication decisions with reasoning |
| Stage 1 prompt | `ts/src/label/prompts.ts` | SYSTEM_PROMPT (v2.5) + buildJudgePrompt() |
| Annotation runner | `ts/scripts/stage1-run.ts` | Resume-safe, configurable concurrency |
| Orphan re-annotation | `ts/scripts/rerun-orphan-stage1.ts` | Re-ran 1,537 patched paragraphs, $3.30 |
| Re-annotation diff | `ts/scripts/diff-orphan-annotations.ts` | Category/specificity change analysis |
| No-cyber analysis | `ts/scripts/analyze-no-cyber.ts` | Label distribution on 348 flagged paragraphs |
| Data quality audit | `docs/DATA-QUALITY-AUDIT.md` | Full audit: generators, patches, quality tiers |
| Generator reference | `docs/EDGAR-FILING-GENERATORS.md` | 14 vendors with signatures and quality profiles |
| Analysis scripts | `ts/scripts/stage1-analyze.ts`, `segment-analysis.ts`, `model-bias-analysis.ts`, `dispute-crosstab.ts`, `sample-disputes.ts` | Deep analytics on annotation data |
| Judge benchmarking | `ts/scripts/judge-bench.ts` | Supports structured/tool modes, gold label comparison |
| Judge diagnostics | `ts/scripts/judge-diag.ts`, `judge-diag-batch.ts` | GLM-5 failure investigation |
- Systematic model biases are quantifiable and predictable. Use them as signal, not noise.
- Codebook ambiguity causes more disagreement than model limitations. Three codebook rulings resolved more disputes than any prompt change.
- Not all labels need the same treatment. Confidence-stratified assembly beats uniform labeling.
- **Freeze originals, patch separately.** The single best data integrity decision was never modifying `paragraphs-clean.jsonl`. All fixes go through `.patched.jsonl` with the same UUIDs. This makes every change auditable, reversible, and safe to apply incrementally. Without this, the 6-patch iteration would have been terrifying.
- **Tag everything you can.** Generator metadata, quality tiers, and anomaly flags cost almost nothing to compute but make targeted remediation possible. Without generator tags, the 36.8% orphan rate in EFiling/XDX would have been invisible — diluted into a 4.7% corpus average.
- **Re-annotation is cheap and validating.** Re-running Stage 1 on 1,537 patched paragraphs cost $3.30 and took 9 minutes. It confirmed that 7.7% of consensus labels were wrong due to the data issue — an empirical validation that the patch was necessary, not just cosmetic.
`docs/SEC-HTML-CLEANING.md` (new file)
# SEC Filing HTML Cleaning — Lessons & Pitfalls
Everything we've learned about cleaning SEC EDGAR HTML for text extraction, specifically for Item 1C (Cybersecurity) from 10-K filings. These lessons likely apply to any SEC filing text extraction pipeline.
## The HTML landscape
SEC filings come from thousands of different filers using dozens of different tools (Workiva/Toppan Merrill, Donnelley Financial, various legal/accounting software). There is no standard HTML structure. The same semantic content — a paragraph of body text — can appear as:
- `<p><span style="...">Text here</span></p>`
- `<div><font face="..." size="...">Text here</font></div>`
- Nested XBRL inline tags: `<ix:nonNumeric><p><span>Text</span></p></ix:nonNumeric>`
- Table-based layouts: `<table><tr><td><span>Text</span></td></tr></table>`
- Deeply nested `<div>` structures with inline styles
The only constant: it will be ugly.
## Inline element newlines (the orphan word problem)
**The bug:** Many filing generators produce HTML where the first word of a paragraph is on its own line within a `<span>` tag:
```html
<p><span style="font-family: Times New Roman; font-size: 10pt">Our
sole executive officer and director is responsible for assessing and
managing cybersecurity risks...</span></p>
```
When this is stripped to plain text, `Our` ends up on its own line. If downstream processing splits on newlines and filters short lines (< 20 words), `Our` is silently dropped. The paragraph becomes `sole executive officer and director is responsible...` missing its subject.
**Prevalence:** ~1.4% of filings (156/11,299) have this pattern in their Item 1C section. It produces ~2,500 affected paragraphs across the corpus.
**Common orphaned words:** `We` (73), `Our` (37), `The` (5), `To` (17), `As` (15), `In` (13), `Cybersecurity` (10), `Management` (6), `Following` (6). Basically any sentence-starting word.
**Why it happens:** The filing generator wraps text at a fixed column width in the HTML source. If the `<span>` opening tag + attributes eat most of a line, only the first word fits before the line break. The browser renders this identically (HTML treats source newlines as whitespace), but text extraction that preserves newlines from inline elements breaks.
**Detection (for patching existing data):** Match the pattern `<span...>Word\nlowercase continuation...` directly in the raw HTML. Three validation layers are needed:
1. **Same-tag check:** The orphan word and continuation must be within the same inline element (`<span>`, `<a>`, `<font>`, etc.). This distinguishes orphan first-words from section headings above paragraphs. Critically, exclude `<ix:...>` XBRL tags — these are structural, not inline, and their first text is often a section title.
2. **Bold/underline filter:** Skip matches inside `<b>`, `<strong>`, or `text-decoration: underline`. These are section headings that happen to have a line break mid-heading (e.g., `<b>Risk\nManagement and Strategy</b>`). Without this filter, headings get inlined into body text.
3. **Stripped-text validation:** After finding an orphan word in the raw HTML, confirm it exists as a standalone word in the `stripHtml()` output. This catches mid-word splits across adjacent spans (see below).
**Case-sensitivity matters:** If using a regex with the `i` (case-insensitive) flag for tag name matching, the `[a-z]` check on the continuation text becomes meaningless — it will match uppercase too, letting headings through. Either drop the `i` flag (and match tags as `[Ss][Pp][Aa][Nn]` etc.) or validate continuation case separately.
**Prevention (for future extractions):** In the paragraph segmenter, buffer single-word blocks that would otherwise be dropped (below minimum word count) and prepend them to the next block when it starts lowercase. This must happen at the segmentation stage, not in the extraction merge logic — changes to merge behavior cascade through downstream paragraph boundary decisions.
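The prevention rule can be sketched as a pre-pass over segmenter blocks. This is a Python simplification of the idea (the real segmenter is TypeScript and also applies minimum-word filters and merge passes after this point):

```python
def merge_orphan_blocks(blocks: list[str]) -> list[str]:
    """Prepend single-word blocks to the next block when it starts lowercase,
    instead of silently dropping them (the orphan word problem above)."""
    out: list[str] = []
    pending = ""  # buffered single-word block awaiting its continuation
    for block in blocks:
        if len(block.split()) == 1:
            pending = (pending + " " + block).strip()
            continue
        if pending and block[:1].islower():
            block = pending + " " + block   # restore the orphaned sentence start
        elif pending:
            out.append(pending)             # uppercase follow-on: likely a heading, keep separate
        pending = ""
        out.append(block)
    if pending:
        out.append(pending)
    return out
```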
## Mid-word splits across adjacent spans
**The bug:** Some filing generators split a single word across multiple `<span>` tags, sometimes with empty formatting spans between them:
```html
<span style="font-size: 10pt">B</span>
<span style="font-size: 8pt"></span>
<span style="font-size: 10pt">lackrock
maintains a comprehensive cybersecurity risk management program...</span>
```
The HTML cleaner's adjacent-inline-boundary collapse correctly joins `B` + `lackrock` into `Blackrock` in the stripped text. But if a patching script operates on raw HTML (to find orphan patterns), it sees `<span>lackrock\nmaintains...` and incorrectly treats `lackrock` as an orphan word, prepending it to produce `lackrock maintains...` instead of the correct `Blackrock maintains...`.
**Detection:** After finding a candidate orphan word in raw HTML, verify it exists as a standalone word (surrounded by whitespace or at line boundaries) in the stripped text. If `stripHtml()` produces `Blackrock` (not `lackrock`), the candidate is a word fragment, not an orphan.
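The validation check, sketched in Python (the production pipeline is TypeScript):

```python
import re

def is_standalone_word(word: str, stripped_text: str) -> bool:
    """Validation layer 3: the candidate orphan must appear as a whole word
    in the stripHtml() output; otherwise it is a mid-word span fragment."""
    return re.search(r"(?<!\S)" + re.escape(word) + r"(?!\S)", stripped_text) is not None
```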
**Root cause:** The filing generator uses separate spans for styling changes (font-size) that happen to fall at character boundaries within words. The empty `<span style="font-size: 8pt"></span>` is a zero-width formatting artifact.
## Adjacent inline element boundaries
**The bug:** Different formatting applied to adjacent text creates word-joining when tags are stripped:
```html
<span style="color: black">word</span><span style="color: blue">The next word</span>
```
Naively stripping tags produces `wordThe next word`. The words at the span boundary merge.
**Fix:** Before stripping tags, collapse adjacent inline element boundaries to spaces:
```js
.replace(/<\/(span|a|b|i|u|em|strong|font)>(\s*)<(?:span|a|b|i|u|em|strong|font)[^>]*>/gi,
(_m, _tag, ws) => ws.length > 0 ? " " : "")
```
This replaces `</span><span>` (and similar) with a space, preventing word joins. The whitespace check (`ws.length > 0`) handles cases where whitespace already exists between tags.
Same treatment needed for XBRL inline tags (`</ix:nonNumeric><ix:nonNumeric>`).
## Source newlines vs block-element breaks
**The issue:** HTML source files contain newlines in two semantically different roles:
1. **Block-element breaks:** `</p>`, `</div>`, `<br>` — these are paragraph boundaries
2. **Source line wrapping:** Newlines within inline elements from the filing generator's line-length limit — these are meaningless whitespace
Both become `\n` in the stripped text. The extraction pipeline relies on newlines to separate paragraphs, so collapsing all newlines breaks paragraph detection. But preserving all newlines creates the orphan word problem.
**The tradeoff:** We chose to preserve newlines (they're needed for paragraph boundary detection in the extraction pass). The orphan word problem is handled downstream in the segmenter. An alternative (sentinel-based) approach — using `\x00` for block breaks, collapsing source newlines to spaces, then restoring sentinels — was tested but caused too many changes to paragraph segmentation across the corpus (18,589 paragraphs changed text in regression testing).
## XBRL inline tags (iXBRL / `ix:` namespace)
**What they are:** Starting in 2024, SEC filings use Inline XBRL to tag structured data directly in HTML. The `cyd:` taxonomy covers cybersecurity disclosures. Tags like `<ix:nonNumeric name="cyd:CybersecurityRiskManagementProcessesIntegratedTextBlock">` wrap entire sections.
**Pitfalls:**
- **Not inline formatting:** Despite being inline XML elements, `ix:` tags are structural — they wrap paragraphs, sections, even entire Items. Treating them like `<span>` for orphan detection will match section headings.
- **XBRL metadata leaks into text:** CIK numbers (`0000123456`), namespace URIs (`xbrli:`, `fasb.org`), ticker-date identifiers (`ae-20231231`) can appear in the text stream. Filter lines where >50% of tokens look like XBRL metadata.
- **`continuedAt` chains:** Long sections are split across multiple `ix:continuation` blocks. These can interrupt the visual flow of text.
## Running headers/footers and page artifacts
SEC HTML often retains print-formatting artifacts:
| Pattern | Example | Detection |
|---------|---------|-----------|
| Page numbers | `17`, `- 17 -`, `Page 17` | Regex: `/^[-–—\s]*[A-Za-z]?[-–—]?\s*\d+[-–—\s]*$/` |
| Running headers | `ACME CORP FORM 10-K` | Short line + company name + form type |
| Table of contents markers | `Table of Contents` | Exact match, strip trailing content |
| Back-to-top links | `(Back to Index)` | Regex: `/back\s+to\s+(index|top|toc)/i` |
| Part headings | `PART II` | Short line, roman numerals |
These appear mid-text because they're print-layout remnants. Filter them in the extraction pass, before paragraph segmentation.
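A sketch of the filter, using the table's patterns. Running-header detection is omitted here because it needs per-filing company-name context:

```python
import re

# Approximations of the detection rules in the table above.
ARTIFACT_PATTERNS = [
    re.compile(r"^[-\u2013\u2014\s]*[A-Za-z]?[-\u2013\u2014]?\s*\d+[-\u2013\u2014\s]*$"),  # page numbers
    re.compile(r"^table of contents$", re.IGNORECASE),
    re.compile(r"back\s+to\s+(index|top|toc)", re.IGNORECASE),
    re.compile(r"^part\s+[ivxlc]+$", re.IGNORECASE),  # PART II etc.
]

def is_page_artifact(line: str) -> bool:
    line = line.strip()
    return any(p.search(line) for p in ARTIFACT_PATTERNS)
```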
## Subsidiary headers in combined filings
Holding companies file combined 10-Ks covering multiple subsidiaries. Each subsidiary section repeats a header:
```
ENTERGY ARKANSAS, LLC AND SUBSIDIARIES
```
These are ALL-CAPS, contain entity suffixes (LLC, INC, CORP, L.P.), and include "AND SUBSIDIARIES". Filter with:
```js
/^[A-Z][A-Z\s,.'&-]{5,}(?:LLC|INC|CORP|COMPANY|L\.?P\.?)\b.*\bAND\s+SUBSIDIARIES\b/
```
## PDF extraction artifacts
Some filings are PDF-converted-to-HTML, producing:
- **Missing spaces:** `word.Next` → fix with `/([a-z])\.([A-Z])/g`
- **CamelCase joins:** `wordThe next` → fix common English words: `/([a-z])(The|Our|We|This|...)\b/g`
- **Orphaned punctuation:** `Director ,` → fix with `/ ([,;:.!?)])/g`
- **Colon joins:** `word:Word` → fix with `/([a-z]):([A-Z])/g`
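These four fixes chain as ordered substitutions. The starter-word list below is an illustrative subset of the "common English words" heuristic, not the full list:

```python
import re

COMMON_STARTERS = r"(The|Our|We|This|These|In|As|For)"  # illustrative subset

def fix_pdf_artifacts(text: str) -> str:
    """Apply the four repair patterns above in order."""
    text = re.sub(r"([a-z])\.([A-Z])", r"\1. \2", text)                  # missing spaces
    text = re.sub(r"([a-z])" + COMMON_STARTERS + r"\b", r"\1 \2", text)  # camelCase joins
    text = re.sub(r" ([,;:.!?)])", r"\1", text)                          # orphaned punctuation
    text = re.sub(r"([a-z]):([A-Z])", r"\1: \2", text)                   # colon joins
    return text
```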
## Entity decoding
SEC HTML uses a mix of named entities, decimal entities, and hex entities. Common ones to handle:
```
&nbsp; &#160; &#xa0; → space
&amp; → &
&mdash; &#8212; &#151; → —
&ndash; &#8211; &#150; → –
&rsquo; &#8217; &#146; → ' (right single quote, used as apostrophe)
&ldquo; &rdquo; → " (curly quotes)
&bull; &#8226; &#149; → •
&#153; → ™
```
Some filings use the Greek question mark (U+037E) instead of a semicolon — looks identical but breaks regex.
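In Python, `html.unescape` covers all three entity styles, including the Windows-1252 numeric range (e.g. `&#151;`), because it follows the HTML5 rules for invalid character references. The Greek question mark needs an explicit extra pass:

```python
import html

def decode_entities(text: str) -> str:
    """Decode named, decimal, and hex entity references per HTML5 rules,
    then normalize the Greek question mark (U+037E) to a real semicolon."""
    return html.unescape(text).replace("\u037e", ";")
```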
## Truncation detection
The extraction pipeline caps output at 50 blocks / 15,000 words. Filings that hit this cap may be truncated. Detection: check if the last paragraph of each filing ends with terminal punctuation (`[.!?;")]\s*$`). If not, the filing was likely cut mid-sentence — remove all its paragraphs from the training corpus.
**Limitation:** This only catches truncation at sentence boundaries. If the cap happens to fall at a sentence end, the filing appears complete even though content was lost. No fix for this without comparing against the full filing length.
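Sketch of the check, using the terminal-punctuation pattern above:

```python
import re

TERMINAL = re.compile(r'[.!?;")]\s*$')

def looks_truncated(last_paragraph: str) -> bool:
    """True if a filing's final paragraph lacks terminal punctuation,
    i.e. it was likely cut mid-sentence by the 50-block / 15,000-word cap."""
    return TERMINAL.search(last_paragraph.rstrip()) is None
```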
## Merge logic and cascade effects
The extraction pipeline merges short/broken lines in multiple passes. **Any change to merge logic cascades:** merging two lines changes the resulting line's length, which affects whether subsequent lines trigger length-based merge thresholds, which changes the next merge decision, etc.
In regression testing, a single-word forward-merge change in the extraction pass caused 1,812 ripple-effect text changes across the corpus. Moving the fix to the segmentation stage (after all extraction merges complete) reduced ripples but still affected ~800 paragraphs.
**Lesson:** For retroactive data fixes, prefer surgical data patching (find-and-prepend on the JSONL) over re-running extraction. For future extraction, place fixes as late in the pipeline as possible to minimize cascade.
## Testing extraction changes
When modifying the HTML cleaner, extraction, or segmentation code, regression test against the full corpus:
1. Re-extract all cached HTML files with the modified code
2. Compare against existing paragraphs by `(accessionNumber, paragraphIndex)`
3. Classify changes:
- **Clean prefix** (new text ends with old text) — orphan word recovered
- **Clean suffix** (new text starts with old text) — fragment absorbed
- **Re-merge** (text differs in other ways) — cascade/ripple effect
- **Paragraph count change** — boundary shift, highest-risk regression
4. Investigate any paragraph count decreases and text shrinkages — these are the most likely regressions
For the orphan word fix, acceptable results were: 215 clean prefix fixes, 0 paragraph count changes, 0 text shrinkages.
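The change taxonomy in step 3 reduces to simple string checks per `(accessionNumber, paragraphIndex)` pair (a sketch; paragraph count changes are detected separately by comparing counts per filing):

```python
def classify_change(old: str, new: str) -> str:
    """Classify a paragraph-text diff per the regression taxonomy above."""
    if new == old:
        return "unchanged"
    if new.endswith(old):
        return "clean prefix"   # orphan word recovered at the start
    if new.startswith(old):
        return "clean suffix"   # trailing fragment absorbed at the end
    return "re-merge"           # cascade/ripple effect; inspect manually
```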

## python/audit_corpus.py (new file, 248 lines)
"""
Quality audit of the SEC-cyBERT DAPT training corpus.
Reads sharded JSONL files and performs qualitative checks on document content.
READ-ONLY: does not modify any files.
"""
import json
import os
import random
import re
import sys
from pathlib import Path
CORPUS_DIR = Path(__file__).resolve().parent.parent / "data" / "dapt-corpus"
SHARDS = sorted(CORPUS_DIR.glob("shard-*.jsonl"))
random.seed(42)
def load_all_docs() -> list[dict]:
"""Load all documents from all shards."""
docs = []
for shard in SHARDS:
with open(shard) as f:
for line in f:
line = line.strip()
if line:
docs.append(json.loads(line))
return docs
def separator(title: str) -> None:
print("\n" + "=" * 80)
print(f" {title}")
print("=" * 80 + "\n")
def audit_smallest(docs: list[dict]) -> None:
separator("1. SMALLEST 20 DOCUMENTS (by chars)")
sorted_docs = sorted(docs, key=lambda d: d["chars"])
for i, doc in enumerate(sorted_docs[:20], 1):
text = doc["text"]
print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} | words={doc['words']} ---")
# Show full text for tiny docs, cap at 2000 chars
display = text if len(text) <= 2000 else text[:2000] + "\n... [TRUNCATED]"
print(display)
print()
def audit_largest(docs: list[dict]) -> None:
separator("2. LARGEST 5 DOCUMENTS (first/last 500 chars)")
sorted_docs = sorted(docs, key=lambda d: d["chars"], reverse=True)
for i, doc in enumerate(sorted_docs[:5], 1):
text = doc["text"]
print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} | words={doc['words']} ---")
print("FIRST 500 CHARS:")
print(text[:500])
print("\n... [GAP] ...\n")
print("LAST 500 CHARS:")
print(text[-500:])
print()
def audit_mid_samples(docs: list[dict]) -> None:
separator("3. RANDOM MID-DOCUMENT SAMPLES (10 docs, 500 chars from 50% point)")
sample = random.sample(docs, 10)
for i, doc in enumerate(sample, 1):
text = doc["text"]
mid = len(text) // 2
start = max(0, mid - 250)
end = min(len(text), mid + 250)
print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} ---")
print(text[start:end])
print()
def audit_xbrl_contamination(docs: list[dict]) -> None:
separator("4. XBRL-CONTAMINATED STARTS (first 200 chars with XBRL patterns)")
xbrl_pattern = re.compile(
r"(0000\d{6}|xbrli:|fasb\.org|us-gaap:|dei:|srt:|^\d{4}-\d{2}-\d{2}\s*$)",
re.MULTILINE,
)
found = []
for doc in docs:
first200 = doc["text"][:200]
if xbrl_pattern.search(first200):
found.append(doc)
if len(found) >= 10:
break
if not found:
print("No XBRL-contaminated documents found in initial scan.")
print("Trying broader pattern...")
# Try a broader search
broad_pattern = re.compile(r"(xmlns|xbrl|0001\d{6})", re.IGNORECASE)
for doc in docs:
first200 = doc["text"][:200]
if broad_pattern.search(first200):
found.append(doc)
if len(found) >= 10:
break
for i, doc in enumerate(found[:10], 1):
text = doc["text"]
print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} ---")
print("FIRST 500 CHARS:")
print(text[:500])
# Find where XBRL junk ends and real text begins
# Look for "UNITED STATES" or "FORM 10-K" as transition marker
for marker in ["UNITED STATES", "FORM 10-K", "FORM 10-k", "ANNUAL REPORT"]:
idx = text.find(marker)
if idx > 0 and idx < 5000:
print(f"\n >> Transition to real text at char {idx} (marker: '{marker}')")
break
print()
def audit_short_lines(docs: list[dict]) -> None:
separator("5. DOCS WITH MOST SHORT LINES (<10 chars, excluding empty)")
scored = []
for doc in docs:
lines = doc["text"].split("\n")
non_empty = [l for l in lines if l.strip()]
short = [l for l in non_empty if 0 < len(l.strip()) < 10]
if non_empty:
ratio = len(short) / len(non_empty)
scored.append((ratio, len(short), len(non_empty), doc, short))
scored.sort(key=lambda x: x[0], reverse=True)
for i, (ratio, n_short, n_total, doc, short_lines) in enumerate(scored[:10], 1):
print(
f"--- #{i} | accession={doc['accession']} | ratio={ratio:.2%} "
f"| {n_short}/{n_total} short lines ---"
)
# Show 20 short lines with surrounding context
text = doc["text"]
lines = text.split("\n")
shown = 0
for j, line in enumerate(lines):
stripped = line.strip()
if 0 < len(stripped) < 10 and shown < 20:
# Show line with 1 line of context on each side
ctx_start = max(0, j - 1)
ctx_end = min(len(lines), j + 2)
for k in range(ctx_start, ctx_end):
prefix = ">>>" if k == j else " "
print(f" {prefix} L{k+1}: {lines[k][:100]}")
print()
shown += 1
print()
def audit_transitions(docs: list[dict]) -> None:
separator("6. TRANSITION ZONES (SEC cover page -> company content)")
# Find docs that have the SEC header
candidates = [d for d in docs if "SECURITIES AND EXCHANGE COMMISSION" in d["text"][:2000]]
sample = random.sample(candidates, min(5, len(candidates)))
for i, doc in enumerate(sample, 1):
text = doc["text"]
idx = text.find("SECURITIES AND EXCHANGE COMMISSION")
if idx < 0:
continue
# Find end of cover page area — look for company-specific content markers
# like "Item 1" or "PART I" or "Table of Contents"
transition_markers = ["Item 1", "ITEM 1", "PART I", "TABLE OF CONTENTS", "Table of Contents"]
transition_idx = -1
for marker in transition_markers:
t = text.find(marker, idx + 100)
if t > 0 and (transition_idx < 0 or t < transition_idx):
transition_idx = t
if transition_idx > 0:
start = max(0, transition_idx - 250)
end = min(len(text), transition_idx + 250)
print(f"--- #{i} | accession={doc['accession']} ---")
print(f"Cover page at char {idx}, transition at char {transition_idx}")
print(f"SHOWING chars {start}-{end}:")
print(text[start:end])
else:
# Just show around the SEC header
start = max(0, idx - 50)
end = min(len(text), idx + 450)
print(f"--- #{i} | accession={doc['accession']} ---")
print(f"Cover page at char {idx}, no clear transition marker found")
print(text[start:end])
print()
def audit_financial_tables(docs: list[dict]) -> None:
separator("7. FINANCIAL TABLE QUALITY (>30% lines with $ or mostly numeric)")
scored = []
dollar_or_numeric = re.compile(r"(\$|^\s*[\d,.\-()]+\s*$)")
for doc in docs:
lines = doc["text"].split("\n")
non_empty = [l for l in lines if l.strip()]
if not non_empty:
continue
matching = sum(1 for l in non_empty if dollar_or_numeric.search(l))
ratio = matching / len(non_empty)
if ratio > 0.30:
scored.append((ratio, doc))
scored.sort(key=lambda x: x[0], reverse=True)
for i, (ratio, doc) in enumerate(scored[:5], 1):
text = doc["text"]
print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} | numeric ratio={ratio:.1%} ---")
# Find a dense numeric section
lines = text.split("\n")
# Find a window of 20 lines with the most dollar/numeric content
best_start = 0
best_count = 0
window = 20
for j in range(len(lines) - window):
count = sum(1 for l in lines[j : j + window] if dollar_or_numeric.search(l))
if count > best_count:
best_count = count
best_start = j
print(f"DENSEST 20-LINE WINDOW (starting at line {best_start + 1}, {best_count}/{window} numeric):")
for l in lines[best_start : best_start + window]:
print(f" | {l[:120]}")
print()
def audit_endings(docs: list[dict]) -> None:
separator("8. END-OF-DOCUMENT QUALITY (last 300 chars of 15 random docs)")
sample = random.sample(docs, 15)
for i, doc in enumerate(sample, 1):
text = doc["text"]
print(f"--- #{i} | accession={doc['accession']} | chars={doc['chars']} ---")
print(text[-300:])
print()
def main() -> None:
print("Loading all documents from corpus...")
docs = load_all_docs()
print(f"Loaded {len(docs)} documents from {len(SHARDS)} shards.\n")
audit_smallest(docs)
audit_largest(docs)
audit_mid_samples(docs)
audit_xbrl_contamination(docs)
audit_short_lines(docs)
audit_transitions(docs)
audit_financial_tables(docs)
audit_endings(docs)
separator("AUDIT COMPLETE")
print(f"Total documents audited: {len(docs)}")
if __name__ == "__main__":
main()

## DAPT training config (modified)

```diff
@@ -7,7 +7,7 @@ model:
 data:
   corpus_path: ../data/dapt-corpus
   text_field: text
-  max_seq_length: 2048
+  max_seq_length: 8192
   validation_split: 0.02

 training:
@@ -15,8 +15,8 @@ training:
   learning_rate: 5.0e-5
   mlm_probability: 0.30
   num_train_epochs: 1
-  per_device_train_batch_size: 4
-  gradient_accumulation_steps: 8   # effective batch = 32
+  per_device_train_batch_size: 1
+  gradient_accumulation_steps: 32  # effective batch = 32
   warmup_ratio: 0.05
   weight_decay: 0.01
   bf16: true
```

## Training script (modified)

```diff
@@ -47,6 +47,14 @@ def train(config: DAPTConfig) -> None:
     dataset = load_corpus(config.data.corpus_path, config.data.text_field)
     print(f"  Raw documents: {len(dataset):,}")

+    # Filter tiny documents (cover pages, empty filings)
+    min_chars = 10_000
+    before = len(dataset)
+    dataset = dataset.filter(lambda x: len(x[config.data.text_field]) >= min_chars)
+    filtered = before - len(dataset)
+    if filtered > 0:
+        print(f"  Filtered {filtered} docs < {min_chars:,} chars → {len(dataset):,} remaining")
+
     print(f"  Tokenizing and chunking to {config.data.max_seq_length} tokens...")
     chunked = tokenize_and_chunk(
         dataset,
```

## Generator impact analysis script (new file, 334 lines)
#!/usr/bin/env python3
"""
Quantify how EFiling/XDX generator quality issues affect the annotated paragraph set.
READ-ONLY analysis: does not modify any files.
"""
import json
import re
import sys
from collections import Counter, defaultdict
from pathlib import Path
# Reuse detect_generator from the existing script
sys.path.insert(0, str(Path(__file__).parent))
from detect_generators import detect_generator
# Paths
HTML_DIR = Path("/home/joey/Documents/sec-cyBERT/data/raw/html")
PARAGRAPHS_PATH = Path("/home/joey/Documents/sec-cyBERT/data/paragraphs/paragraphs-clean.jsonl")
ANNOTATIONS_PATH = Path("/home/joey/Documents/sec-cyBERT/data/annotations/stage1.jsonl")
SEP = "=" * 100
def load_paragraphs():
"""Load paragraphs, return dict: id -> paragraph dict."""
paragraphs = {}
with open(PARAGRAPHS_PATH) as f:
for line in f:
p = json.loads(line)
paragraphs[p["id"]] = p
return paragraphs
def load_annotations():
"""Load annotations, return dict: paragraphId -> annotation dict."""
annotations = {}
with open(ANNOTATIONS_PATH) as f:
for line in f:
a = json.loads(line)
pid = a["paragraphId"]
# Keep the first annotation per paragraph (or overwrite — doesn't matter for counts)
annotations[pid] = a
return annotations
def detect_all_generators():
"""Detect generators for all HTML files. Return dict: accession -> generator."""
accession_to_gen = {}
files = sorted(HTML_DIR.glob("*.html"))
total = len(files)
for i, fp in enumerate(files):
accession = fp.stem
gen, _evidence = detect_generator(str(fp))
accession_to_gen[accession] = gen
if (i + 1) % 3000 == 0:
print(f" Scanned {i + 1}/{total} HTML files...", file=sys.stderr)
print(f" Scanned {total}/{total} HTML files.", file=sys.stderr)
return accession_to_gen
def starts_lowercase(text: str) -> bool:
"""True if text starts with a lowercase letter (orphan word candidate)."""
if not text:
return False
return text[0].islower()
def is_list_item(text: str) -> bool:
"""True if text looks like a list item (starts with bullet, dash, number+period, etc.)."""
stripped = text.strip()
if not stripped:
return False
# Common list patterns: "- ", "• ", "* ", "1. ", "a) ", "(a) ", "(i) "
if re.match(r'^[-•*▪◦]\s', stripped):
return True
if re.match(r'^\d+[.)]\s', stripped):
return True
if re.match(r'^\([a-z0-9ivx]+\)\s', stripped, re.I):
return True
if re.match(r'^[a-z][.)]\s', stripped):
return True
return False
def looks_like_inlined_header(text: str) -> bool:
"""
True if text starts with a section heading run into body text, e.g.:
"Risk Management and Strategy We recognize the importance..."
"Cybersecurity Governance Our Board of Directors oversees..."
Key distinction from normal sentences: the heading portion is a noun phrase
(not a full sentence subject like "Our Board" or "The Company"), and is
immediately followed by a new sentence that starts a different thought.
We look for known SEC cybersecurity section heading patterns followed by
body text starting with a capital letter (new sentence) with no punctuation
separating them (no period, colon, or newline; just a space).
"""
# Known heading patterns for SEC Item 1C disclosures
heading_patterns = [
r'(?:Cybersecurity\s+)?Risk\s+Management(?:\s+and\s+Strategy)?',
r'(?:Cybersecurity\s+)?Governance(?:\s+and\s+Risk\s+Management)?',
r'Cybersecurity\s+Governance',
r'Cybersecurity\s+Risk\s+Management\s+and\s+Strategy',
r'Board\s+Oversight(?:\s+of\s+(?:Risks?\s+from\s+)?Cybersecurity(?:\s+(?:Threats?|Risks?))?)?',
r'Management(?:\'s)?\s+Role\s+in\s+(?:Managing\s+)?Cybersecurity',
r'Governance\s+(?:Related\s+to|Oversight\s+of)\s+Cybersecurity(?:\s+Risks?)?',
r'Impact\s+of\s+Cybersecurity\s+(?:Risks?|Threats?)',
r'Cybersecurity\s+(?:Strategy|Overview|Program)',
r'(?:Management\s+and|Management|Governance)\s+(?:Strategy|Overview)',
r'Risk\s+Factors?',
r'Oversight\s+of\s+Cybersecurity\s+Risk\s+Management',
]
for pat in heading_patterns:
# Heading immediately followed by body text (capital letter starting new sentence)
m = re.match(rf'^({pat})\s+([A-Z])', text)
if m:
return True
# Also catch heading followed by lowercase (rarer but possible)
m = re.match(rf'^({pat})\s+([a-z])', text)
if m:
return True
return False
def main():
print("Loading data...")
paragraphs = load_paragraphs()
annotations = load_annotations()
print(f" Paragraphs: {len(paragraphs):,}")
print(f" Annotations: {len(annotations):,}")
# Unique annotated paragraph IDs
annotated_ids = set(annotations.keys()) & set(paragraphs.keys())
print(f" Annotated paragraphs with matching paragraph data: {len(annotated_ids):,}")
print("\nDetecting generators for all HTML files...")
accession_to_gen = detect_all_generators()
print(f" HTML files scanned: {len(accession_to_gen):,}")
# Map each paragraph to its generator
para_to_gen = {}
missing_accessions = set()
for pid, p in paragraphs.items():
acc = p["filing"]["accessionNumber"]
gen = accession_to_gen.get(acc)
if gen is None:
missing_accessions.add(acc)
gen = "NO_HTML_FILE"
para_to_gen[pid] = gen
if missing_accessions:
print(f"\n WARNING: {len(missing_accessions)} accession numbers in paragraphs have no HTML file")
# =====================================================================
# SECTION 1: Annotated paragraphs by generator
# =====================================================================
print(f"\n{SEP}")
print("SECTION 1: Annotated paragraphs by generator")
print(SEP)
ann_gen_counts = Counter()
for pid in annotated_ids:
ann_gen_counts[para_to_gen[pid]] += 1
total_ann = len(annotated_ids)
print(f"\n{'Generator':<50} {'Count':>7} {'%':>7}")
print("-" * 70)
for gen, count in ann_gen_counts.most_common():
pct = count / total_ann * 100
print(f"{gen:<50} {count:>7} {pct:>6.1f}%")
print("-" * 70)
print(f"{'TOTAL':<50} {total_ann:>7} {100.0:>6.1f}%")
# =====================================================================
# SECTION 2: Lowercase-start (orphan word) analysis for annotated set
# =====================================================================
print(f"\n{SEP}")
print("SECTION 2: Lowercase-start paragraphs in annotated set")
print(SEP)
# All annotated lowercase-start
ann_lc = {pid for pid in annotated_ids if starts_lowercase(paragraphs[pid]["text"])}
ann_lc_nonlist = {pid for pid in ann_lc if not is_list_item(paragraphs[pid]["text"])}
print(f"\nAnnotated paragraphs starting with lowercase: {len(ann_lc):,} / {total_ann:,} ({len(ann_lc)/total_ann*100:.2f}%)")
print(f" Of those, excluding list items: {len(ann_lc_nonlist):,} ({len(ann_lc_nonlist)/total_ann*100:.2f}%)")
# Breakdown by generator for lowercase-start non-list
lc_by_gen = Counter()
for pid in ann_lc_nonlist:
lc_by_gen[para_to_gen[pid]] += 1
print(f"\n{'Generator':<50} {'LC-start':>9} {'Total ann':>10} {'% of gen':>9}")
print("-" * 85)
for gen, _ in ann_gen_counts.most_common():
lc_count = lc_by_gen.get(gen, 0)
gen_total = ann_gen_counts[gen]
pct = lc_count / gen_total * 100 if gen_total else 0
if lc_count > 0:
print(f"{gen:<50} {lc_count:>9} {gen_total:>10} {pct:>8.1f}%")
# Specific callouts
efiling_gens = {"EFiling/EDGAR Agent", "EFiling XDX"}
efiling_ann = {pid for pid in annotated_ids if para_to_gen[pid] in efiling_gens}
efiling_lc = {pid for pid in ann_lc_nonlist if para_to_gen[pid] in efiling_gens}
compsci_ann = {pid for pid in annotated_ids if para_to_gen[pid] == "CompSci Transform"}
compsci_lc = {pid for pid in ann_lc_nonlist if para_to_gen[pid] == "CompSci Transform"}
print(f"\n--- Specific callouts ---")
print(f"EFiling/XDX annotated paragraphs starting lowercase (non-list): {len(efiling_lc):,} / {len(efiling_ann):,} ({len(efiling_lc)/len(efiling_ann)*100:.1f}% of EFiling/XDX)" if efiling_ann else "EFiling/XDX: 0 annotated paragraphs")
print(f"CompSci Transform annotated paragraphs starting lowercase (non-list): {len(compsci_lc):,} / {len(compsci_ann):,} ({len(compsci_lc)/len(compsci_ann)*100:.1f}% of CompSci)" if compsci_ann else "CompSci Transform: 0 annotated paragraphs")
print(f"\nTotal affected annotated paragraphs (LC non-list): {len(ann_lc_nonlist):,} / {total_ann:,} = {len(ann_lc_nonlist)/total_ann*100:.2f}%")
# =====================================================================
# SECTION 3: Orphan-word paragraphs detail
# =====================================================================
print(f"\n{SEP}")
print("SECTION 3: Orphan-word paragraph details (LC-start, non-list, annotated)")
print(SEP)
# Breakdown by generator
print(f"\nBreakdown by generator:")
print(f"{'Generator':<50} {'Count':>7} {'% of orphan':>12}")
print("-" * 75)
for gen, count in lc_by_gen.most_common():
pct = count / len(ann_lc_nonlist) * 100
print(f"{gen:<50} {count:>7} {pct:>11.1f}%")
# 10 example texts with labels
print(f"\n10 example orphan-word annotated paragraphs:")
print("-" * 100)
examples = sorted(ann_lc_nonlist)[:10]
for pid in examples:
text = paragraphs[pid]["text"][:150]
ann = annotations[pid]
label = ann.get("label", {})
cat = label.get("content_category", "?")
spec = label.get("specificity_level", "?")
gen = para_to_gen[pid]
print(f" [{gen}] cat={cat}, spec={spec}")
print(f" \"{text}...\"")
print()
# Category distribution in orphan-word paragraphs vs overall
print(f"\nCategory distribution: orphan-word vs overall annotated set")
print("-" * 80)
orphan_cats = Counter()
for pid in ann_lc_nonlist:
cat = annotations[pid].get("label", {}).get("content_category", "Unknown")
orphan_cats[cat] += 1
overall_cats = Counter()
for pid in annotated_ids:
cat = annotations[pid].get("label", {}).get("content_category", "Unknown")
overall_cats[cat] += 1
all_cats = sorted(set(orphan_cats.keys()) | set(overall_cats.keys()))
print(f"{'Category':<40} {'Orphan':>7} {'Orphan%':>8} {'Overall':>8} {'Overall%':>9} {'Over-rep':>9}")
print("-" * 85)
for cat in all_cats:
o_count = orphan_cats.get(cat, 0)
a_count = overall_cats.get(cat, 0)
o_pct = o_count / len(ann_lc_nonlist) * 100 if ann_lc_nonlist else 0
a_pct = a_count / total_ann * 100
ratio = (o_pct / a_pct) if a_pct > 0 else 0
flag = " <<<" if ratio > 1.5 else ""
print(f"{cat:<40} {o_count:>7} {o_pct:>7.1f}% {a_count:>8} {a_pct:>8.1f}% {ratio:>8.2f}x{flag}")
# =====================================================================
# SECTION 4: Inlined headers analysis
# =====================================================================
print(f"\n{SEP}")
print("SECTION 4: Inlined headers in annotated paragraphs")
print(SEP)
ann_inlined = set()
for pid in annotated_ids:
text = paragraphs[pid]["text"]
if looks_like_inlined_header(text):
ann_inlined.add(pid)
print(f"\nAnnotated paragraphs with inlined headers: {len(ann_inlined):,} / {total_ann:,} ({len(ann_inlined)/total_ann*100:.2f}%)")
inlined_by_gen = Counter()
for pid in ann_inlined:
inlined_by_gen[para_to_gen[pid]] += 1
print(f"\n{'Generator':<50} {'Inlined':>8} {'Total ann':>10} {'% of gen':>9}")
print("-" * 85)
for gen, _ in ann_gen_counts.most_common():
ih_count = inlined_by_gen.get(gen, 0)
gen_total = ann_gen_counts[gen]
pct = ih_count / gen_total * 100 if gen_total else 0
if ih_count > 0:
print(f"{gen:<50} {ih_count:>8} {gen_total:>10} {pct:>8.1f}%")
# Show some examples
print(f"\n10 example inlined-header paragraphs:")
print("-" * 100)
examples_ih = sorted(ann_inlined)[:10]
for pid in examples_ih:
text = paragraphs[pid]["text"][:150]
gen = para_to_gen[pid]
cat = annotations[pid].get("label", {}).get("content_category", "?")
print(f" [{gen}] cat={cat}")
print(f" \"{text}...\"")
print()
# =====================================================================
# SECTION 5: Combined impact summary
# =====================================================================
print(f"\n{SEP}")
print("SECTION 5: Combined impact summary")
print(SEP)
affected = ann_lc_nonlist | ann_inlined
print(f"\nOrphan-word (LC non-list): {len(ann_lc_nonlist):>6} ({len(ann_lc_nonlist)/total_ann*100:.2f}%)")
print(f"Inlined headers: {len(ann_inlined):>6} ({len(ann_inlined)/total_ann*100:.2f}%)")
print(f"Either issue (union): {len(affected):>6} ({len(affected)/total_ann*100:.2f}%)")
print(f"Total annotated set: {total_ann:>6}")
# EFiling/XDX specifically
efiling_affected = {pid for pid in affected if para_to_gen[pid] in efiling_gens}
print(f"\nEFiling/XDX affected (either issue): {len(efiling_affected):,} / {len(efiling_ann):,}")
if __name__ == "__main__":
main()

## scripts/audit_corpus.py (new file, 435 lines)
#!/usr/bin/env python3
"""Audit sec-cyBERT paragraph corpus for text quality issues."""
import json
import re
import random
import os
from collections import Counter, defaultdict
from pathlib import Path
DATA_FILE = Path("data/paragraphs/paragraphs-clean.jsonl")
HTML_DIR = Path("data/raw/html")
# ── Load all paragraphs ──────────────────────────────────────────────────────
print("Loading paragraphs...")
paragraphs = []
with open(DATA_FILE) as f:
for line in f:
paragraphs.append(json.loads(line))
print(f"Loaded {len(paragraphs):,} paragraphs.\n")
def show(text, limit=200):
"""Truncate text for display."""
if len(text) <= limit:
return text
return text[:limit] + "..."
def header(title):
print("\n" + "=" * 80)
print(f" {title}")
print("=" * 80 + "\n")
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 1: Inlined headers
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 1: Inlined Headers")
inlined_header_examples = []
# Detect heading+body merged into one paragraph.
# A heading is a short (2-10 word) title-case or ALL-CAPS phrase at the start,
# immediately followed (no colon/period separator) by a sentence starting with
# a common sentence-opener like We/Our/The/As/In/This/A/An/Each/Management/For/Since/During.
pat_merged_header = re.compile(
r"^([A-Z][A-Za-z\s,&/\-\']+?)(?<![.;:!\?\)])\s+"
r"(We |Our |The |As |In |This |A |An |Each |To |Management |During |Since |For )"
)
STOP_WORDS = {"and", "of", "the", "for", "in", "to", "on", "with", "our",
"its", "an", "a", "or", "&"}
for p in paragraphs:
text = p["text"]
if len(text) < 50:
continue
m = pat_merged_header.match(text)
if not m:
continue
heading_candidate = m.group(1).strip()
words = heading_candidate.split()
if not (2 <= len(words) <= 10):
continue
# Must look like a heading: title case or all caps
is_title = all(
w[0].isupper() or w.lower() in STOP_WORDS
for w in words if w
)
is_allcaps = heading_candidate == heading_candidate.upper() and len(heading_candidate) > 5
if is_title or is_allcaps:
kind = "ALLCAPS" if is_allcaps else "TITLECASE"
inlined_header_examples.append((kind, p, heading_candidate))
print(f"Found {len(inlined_header_examples):,} paragraphs with potential inlined headers.")
print(f" - ALLCAPS pattern: {sum(1 for t,_,_ in inlined_header_examples if t=='ALLCAPS'):,}")
print(f" - TITLECASE pattern: {sum(1 for t,_,_ in inlined_header_examples if t=='TITLECASE'):,}")
print()
# Show 20 examples, mix of both types
random.seed(42)
sample = random.sample(inlined_header_examples, min(20, len(inlined_header_examples)))
for i, (kind, p, hdr) in enumerate(sample, 1):
print(f" [{i}] ({kind}) Header: \"{hdr}\" [{p['filing']['companyName'][:30]}]")
print(f" {show(p['text'])}")
print()
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 2: Sentence boundary violations
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 2: Sentence Boundary Violations")
boundary_examples = []
# word.Next — period followed immediately by uppercase letter (not abbreviations)
pat_dotcap = re.compile(r"[a-z]\.([A-Z][a-z])")
# word,Next — comma followed immediately by uppercase letter
pat_commacap = re.compile(r"[a-z],([A-Z][a-z])")
# Two words jammed: lowercase then uppercase with no space/punct
pat_jammed = re.compile(r"[a-z]{2}[A-Z][a-z]{2}")
# Common false positives for dot-cap: abbreviations, names
false_pos_dot = re.compile(
r"(?:Mr|Mrs|Ms|Dr|Jr|Sr|Inc|Corp|Ltd|Co|No|vs|St|Dept|Gen|Gov|Sec|Vol|Rev|etc|U\.S|U\.K)\."
)
for p in paragraphs:
text = p["text"]
issues = []
for m in pat_dotcap.finditer(text):
start = max(0, m.start() - 10)
context = text[start : m.end() + 10]
# skip if it's a known abbreviation
if not false_pos_dot.search(text[max(0, m.start() - 5) : m.end()]):
issues.append(("dot-cap", context))
for m in pat_commacap.finditer(text):
start = max(0, m.start() - 10)
context = text[start : m.end() + 10]
issues.append(("comma-cap", context))
if issues:
boundary_examples.append((p, issues))
print(f"Found {len(boundary_examples):,} paragraphs with sentence boundary violations.")
print()
random.seed(43)
sample = random.sample(boundary_examples, min(20, len(boundary_examples)))
for i, (p, issues) in enumerate(sample, 1):
print(f" [{i}] [{p['filing']['companyName'][:30]}]")
for kind, ctx in issues[:3]:
print(f" ({kind}) ...{ctx}...")
print(f" Full start: {show(p['text'], 150)}")
print()
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 3: Garbled / nonsensical text
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 3: Garbled / Nonsensical Text")
garbled_examples = []
# Spaced-out characters: single chars separated by spaces
pat_spaced = re.compile(r"(?:\b[a-zA-Z]\s){4,}")
for p in paragraphs:
text = p["text"]
reason = None
# Check spaced-out characters
if pat_spaced.search(text):
reason = "spaced-chars"
# Check long non-ASCII runs
non_ascii = sum(1 for c in text if ord(c) > 127)
if non_ascii > len(text) * 0.15 and len(text) > 20:
reason = f"non-ASCII ({non_ascii}/{len(text)} chars)"
# Check mostly numbers/symbols (>50% non-alpha)
alpha = sum(1 for c in text if c.isalpha())
if len(text) > 20 and alpha < len(text) * 0.4:
reason = f"low-alpha ({alpha}/{len(text)} = {alpha/len(text):.0%})"
if reason:
garbled_examples.append((reason, p))
print(f"Found {len(garbled_examples):,} potentially garbled paragraphs.")
reason_counts = Counter(r.split("(")[0].strip() for r, _ in garbled_examples)
for r, c in reason_counts.most_common():
print(f" - {r}: {c}")
print()
random.seed(44)
sample = random.sample(garbled_examples, min(10, len(garbled_examples)))
for i, (reason, p) in enumerate(sample, 1):
print(f" [{i}] ({reason}) [{p['filing']['companyName'][:30]}] wc={p['wordCount']}")
print(f" {show(p['text'], 250)}")
print()
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 4: HTML / markup artifacts
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 4: HTML / Markup Artifacts")
html_examples = []
pat_html_tag = re.compile(r"<[a-zA-Z/][^>]*>")
pat_html_entity = re.compile(r"&(?:amp|lt|gt|nbsp|quot|#\d+|#x[0-9a-fA-F]+);")
pat_xbrl = re.compile(r"\b(?:ix|us-gaap|dei|xbrli):")
pat_css = re.compile(r"(?:font-family|font-size|color:|margin:|padding:|text-align|line-height)", re.IGNORECASE)
for p in paragraphs:
text = p["text"]
reasons = []
if pat_html_tag.search(text):
reasons.append("html-tag")
if pat_html_entity.search(text):
reasons.append("html-entity")
if pat_xbrl.search(text):
reasons.append("xbrl")
if pat_css.search(text):
reasons.append("css")
if reasons:
html_examples.append((reasons, p))
print(f"Found {len(html_examples):,} paragraphs with HTML/markup artifacts.")
reason_counts = Counter()
for reasons, _ in html_examples:
for r in reasons:
reason_counts[r] += 1
for r, c in reason_counts.most_common():
print(f" - {r}: {c}")
print()
random.seed(45)
sample = random.sample(html_examples, min(10, len(html_examples)))
for i, (reasons, p) in enumerate(sample, 1):
print(f" [{i}] ({', '.join(reasons)}) [{p['filing']['companyName'][:30]}]")
print(f" {show(p['text'], 250)}")
print()
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 5: Truncated paragraphs
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 5: Truncated Paragraphs")
truncated = []
# Common abbreviations that end sentences without terminal punct being an issue
abbrevs = {"inc", "corp", "ltd", "co", "mr", "mrs", "ms", "dr", "jr", "sr",
"etc", "al", "eg", "ie", "vs", "no", "approx", "dept", "gov"}
for p in paragraphs:
text = p["text"].rstrip()
if not text:
continue
# Check if ends with terminal punctuation
last_char = text[-1]
if last_char in ".!?:;)\"'""'":
continue
# Check if it's a very short text (likely a heading)
if p["wordCount"] <= 5:
continue
# Check if last word is a common abbreviation
last_word = text.split()[-1].lower().rstrip(".,;:!?")
if last_word in abbrevs:
continue
truncated.append(p)
print(f"Found {len(truncated):,} potentially truncated paragraphs (no terminal punctuation, >5 words).")
print()
random.seed(46)
sample = random.sample(truncated, min(10, len(truncated)))
for i, p in enumerate(sample, 1):
text = p["text"]
print(f" [{i}] [{p['filing']['companyName'][:30]}] wc={p['wordCount']}")
# Show the END of the text
if len(text) > 200:
print(f" ...{text[-200:]}")
else:
print(f" {text}")
print()
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 6: Duplicate text across filings
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 6: Cross-Filing Duplicate Text")
# Group by textHash
hash_to_paras = defaultdict(list)
for p in paragraphs:
hash_to_paras[p["textHash"]].append(p)
# Find hashes that appear in multiple different filings
cross_filing_dupes = {}
for h, ps in hash_to_paras.items():
accessions = set(p["filing"]["accessionNumber"] for p in ps)
if len(accessions) > 1:
cross_filing_dupes[h] = ps
total_dupe_paragraphs = sum(len(ps) for ps in cross_filing_dupes.values())
print(f"Unique textHashes appearing in multiple filings: {len(cross_filing_dupes):,}")
print(f"Total paragraphs involved: {total_dupe_paragraphs:,}")
print()
# Sort by number of filings (most duplicated first)
sorted_dupes = sorted(cross_filing_dupes.items(), key=lambda x: len(set(p["filing"]["accessionNumber"] for p in x[1])), reverse=True)
print("Top 15 most duplicated paragraphs:")
for i, (h, ps) in enumerate(sorted_dupes[:15], 1):
accessions = set(p["filing"]["accessionNumber"] for p in ps)
companies = set(p["filing"]["companyName"] for p in ps)
print(f"\n [{i}] Hash={h}, in {len(accessions)} filings, {len(companies)} companies")
print(f" Companies: {', '.join(list(companies)[:5])}{'...' if len(companies) > 5 else ''}")
print(f" Text: {show(ps[0]['text'], 200)}")
# Check for same-company cross-year dupes vs different-company dupes
same_company_dupes = 0
diff_company_dupes = 0
for h, ps in cross_filing_dupes.items():
companies = set(p["filing"]["companyName"] for p in ps)
if len(companies) == 1:
same_company_dupes += 1
else:
diff_company_dupes += 1
print(f"\n\nBreakdown:")
print(f" Same company, different filings (likely year-over-year boilerplate): {same_company_dupes:,}")
print(f" Different companies (likely industry boilerplate or extraction error): {diff_company_dupes:,}")
# ══════════════════════════════════════════════════════════════════════════════
# CHECK 7: Ground truth spot-check
# ══════════════════════════════════════════════════════════════════════════════
header("CHECK 7: Ground Truth Spot-Check (10 random paragraphs vs. source HTML)")
def normalize_html_to_plain(html_text):
"""Convert raw HTML to normalized plain text for comparison."""
plain = re.sub(r"<[^>]+>", " ", html_text)
# Decode common HTML entities
plain = re.sub(r"&nbsp;?", " ", plain)
plain = re.sub(r"&amp;", "&", plain)
plain = re.sub(r"&lt;", "<", plain)
plain = re.sub(r"&gt;", ">", plain)
plain = re.sub(r"&rsquo;|&#8217;|&#x2019;", "\u2019", plain)
plain = re.sub(r"&lsquo;|&#8216;|&#x2018;", "\u2018", plain)
plain = re.sub(r"&rdquo;|&#8221;|&#x201D;", "\u201D", plain)
plain = re.sub(r"&ldquo;|&#8220;|&#x201C;", "\u201C", plain)
plain = re.sub(r"&mdash;|&#8212;", "\u2014", plain)
plain = re.sub(r"&ndash;|&#8211;", "\u2013", plain)
plain = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), plain)
plain = re.sub(r"&#x([0-9a-fA-F]+);", lambda m: chr(int(m.group(1), 16)), plain)
plain = re.sub(r"&\w+;", " ", plain)
plain = re.sub(r"\s+", " ", plain)
return plain
random.seed(99)
spot_check_sample = random.sample(paragraphs, 10)
match_count = 0
partial_count = 0
not_found_count = 0
for i, p in enumerate(spot_check_sample, 1):
acc = p["filing"]["accessionNumber"]
html_path = HTML_DIR / f"{acc}.html"
print(f" [{i}] {p['filing']['companyName'][:40]} | {acc}")
print(f" Paragraph index: {p['paragraphIndex']}, word count: {p['wordCount']}")
corpus_text = p["text"]
corpus_norm = re.sub(r"\s+", " ", corpus_text).strip()
if not html_path.exists():
print(f" *** HTML file not found: {html_path}")
print(f" Corpus text: {show(corpus_text, 150)}")
not_found_count += 1
print()
continue
with open(html_path, "r", errors="replace") as f:
html_content = f.read()
plain_html = normalize_html_to_plain(html_content)
# Check if the entire corpus text appears verbatim in the HTML plain text
if corpus_norm in plain_html:
print(f" VERBATIM MATCH: Corpus text found exactly in HTML source.")
match_count += 1
else:
# Try to find a distinctive substring to locate the paragraph
# Use multiple probes from different positions
found = False
for start_frac in [0.3, 0.5, 0.1, 0.7]:
start_pos = int(len(corpus_norm) * start_frac)
probe = corpus_norm[start_pos:start_pos + 40]
if not probe:
continue
idx = plain_html.find(probe)
if idx >= 0:
found = True
# Show surrounding context from HTML
ctx_start = max(0, idx - 80)
ctx_end = min(len(plain_html), idx + len(corpus_norm) + 80)
html_ctx = plain_html[ctx_start:ctx_end].strip()
print(f" PARTIAL MATCH: Text found in HTML but paragraph boundaries differ.")
print(f" Corpus first 120: {corpus_norm[:120]}")
print(f" HTML context 120: {html_ctx[:120]}")
partial_count += 1
break
if not found:
print(f" NOT FOUND in HTML plain text!")
print(f" Corpus text: {show(corpus_text, 150)}")
not_found_count += 1
print()
print(f"Spot-check results: {match_count} verbatim, {partial_count} partial, {not_found_count} not found")
# ══════════════════════════════════════════════════════════════════════════════
# SUMMARY
# ══════════════════════════════════════════════════════════════════════════════
header("SUMMARY")
print(f"Total paragraphs: {len(paragraphs):,}")
print(f" 1. Inlined headers: {len(inlined_header_examples):,}")
print(f" 2. Sentence boundary violations: {len(boundary_examples):,}")
print(f" 3. Garbled / nonsensical text: {len(garbled_examples):,}")
print(f" 4. HTML / markup artifacts: {len(html_examples):,}")
print(f" 5. Truncated paragraphs: {len(truncated):,}")
print(f" 6. Cross-filing duplicates: {len(cross_filing_dupes):,} unique texts in {total_dupe_paragraphs:,} paragraphs")
print(f" 7. Ground truth spot-check: {match_count} verbatim / {partial_count} partial / {not_found_count} not found")
print()
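An aside on the entity-decoding helper in Check 7: the chain of `re.sub` calls can be collapsed with the stdlib's `html.unescape`, which decodes every named and numeric entity in one pass. A minimal sketch, not part of this commit, and equivalent up to the handling of unknown `&word;` sequences:

```python
import html
import re

def normalize_html_to_plain(html_text: str) -> str:
    """Strip tags, decode entities, collapse whitespace."""
    plain = re.sub(r"<[^>]+>", " ", html_text)   # drop tags first, as in Check 7
    plain = html.unescape(plain)                 # all named + numeric entities at once
    # &nbsp; decodes to U+00A0, which \s matches, so the collapse still normalizes it.
    return re.sub(r"\s+", " ", plain).strip()
```

One behavioral difference worth noting: the hand-rolled version blanks unrecognized `&word;` sequences with a space, while `html.unescape` leaves them verbatim.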

scripts/audit_paragraphs.py

@ -0,0 +1,405 @@
"""
Audit SEC-cyBERT paragraph corpus for boundary errors.
Run from project root: python3 scripts/audit_paragraphs.py
"""
import json
import random
import re
import sys
from collections import Counter, defaultdict
from pathlib import Path
DATA_PATH = Path("data/paragraphs/paragraphs-clean.jsonl")
def load_paragraphs():
paragraphs = []
with open(DATA_PATH) as f:
for line in f:
paragraphs.append(json.loads(line))
return paragraphs
def section_header(title):
bar = "=" * 80
print(f"\n{bar}")
print(f" {title}")
print(bar)
def truncate(text, n):
if len(text) <= n:
return text
return text[:n] + "..."
# ---------------------------------------------------------------------------
# Load
# ---------------------------------------------------------------------------
print("Loading paragraphs...")
paragraphs = load_paragraphs()
print(f"Loaded {len(paragraphs):,} paragraphs")
# Group by accessionNumber
by_filing = defaultdict(list)
for p in paragraphs:
acc = p["filing"]["accessionNumber"]
by_filing[acc].append(p)
print(f"Unique filings: {len(by_filing):,}")
# ---------------------------------------------------------------------------
# 1. Paragraphs-per-filing distribution
# ---------------------------------------------------------------------------
section_header("1. PARAGRAPHS-PER-FILING DISTRIBUTION")
counts = sorted([len(ps) for ps in by_filing.values()])
n = len(counts)
import math
mean = sum(counts) / n
variance = sum((c - mean) ** 2 for c in counts) / n
stdev = math.sqrt(variance)
def percentile(sorted_list, pct):
idx = pct / 100 * (len(sorted_list) - 1)
lo = int(math.floor(idx))
hi = int(math.ceil(idx))
if lo == hi:
return sorted_list[lo]
frac = idx - lo
return sorted_list[lo] * (1 - frac) + sorted_list[hi] * frac
print(f" Min: {counts[0]}")
print(f" P5: {percentile(counts, 5):.1f}")
print(f" P25: {percentile(counts, 25):.1f}")
print(f" Median: {percentile(counts, 50):.1f}")
print(f" P75: {percentile(counts, 75):.1f}")
print(f" P95: {percentile(counts, 95):.1f}")
print(f" Max: {counts[-1]}")
print(f" Stdev: {stdev:.2f}")
print(f" Mean: {mean:.2f}")
# Histogram buckets
buckets = [1, 2, 3, 5, 10, 15, 20, 30, 50, 100, 200]
print("\n Histogram:")
prev = 0
for b in buckets:
c = sum(1 for x in counts if prev < x <= b)
if c > 0:
print(f" ({prev+1}-{b}]: {c:>5} filings")
prev = b
c = sum(1 for x in counts if x > buckets[-1])
if c > 0:
print(f" (>{buckets[-1]}): {c:>5} filings")
# Fewest paragraphs
print("\n --- 10 filings with FEWEST paragraphs ---")
sorted_filings = sorted(by_filing.items(), key=lambda x: len(x[1]))
for acc, ps in sorted_filings[:10]:
company = ps[0]["filing"]["companyName"]
print(f"\n [{acc}] {company} | {len(ps)} paragraph(s):")
for p in sorted(ps, key=lambda x: x["paragraphIndex"]):
print(f" p{p['paragraphIndex']} ({p['wordCount']}w): {truncate(p['text'], 150)}")
# Most paragraphs
print("\n --- 10 filings with MOST paragraphs ---")
for acc, ps in sorted_filings[-10:]:
company = ps[0]["filing"]["companyName"]
print(f"\n [{acc}] {company} | {len(ps)} paragraph(s):")
for p in sorted(ps, key=lambda x: x["paragraphIndex"])[:5]:
print(f" p{p['paragraphIndex']} ({p['wordCount']}w): {truncate(p['text'], 150)}")
if len(ps) > 5:
print(f" ... ({len(ps) - 5} more)")
# ---------------------------------------------------------------------------
# 2. Suspiciously long paragraphs
# ---------------------------------------------------------------------------
section_header("2. SUSPICIOUSLY LONG PARAGRAPHS (top 20 by word count)")
sorted_by_wc = sorted(paragraphs, key=lambda p: p["wordCount"], reverse=True)
for i, p in enumerate(sorted_by_wc[:20]):
acc = p["filing"]["accessionNumber"]
company = p["filing"]["companyName"]
text = p["text"]
first200 = text[:200]
last200 = text[-200:] if len(text) > 400 else ""
print(f"\n #{i+1}: {p['wordCount']} words | p{p['paragraphIndex']} | {company}")
print(f" Acc: {acc}")
print(f" FIRST 200: {first200}")
if last200:
print(f" LAST 200: {last200}")
# Check for signs of merged paragraphs
issues = []
if p["wordCount"] > 300:
issues.append("VERY LONG (>300w)")
# Look for heading-like patterns mid-text (capitalized lines, bold markers)
lines = text.split("\n")
if len(lines) > 1:
issues.append(f"CONTAINS {len(lines)} LINES (possible merge)")
# Look for sentence-ending followed by topic shift
sentences = re.split(r'(?<=[.!?])\s+', text)
if len(sentences) > 8:
issues.append(f"{len(sentences)} sentences")
if issues:
print(f" FLAGS: {', '.join(issues)}")
# ---------------------------------------------------------------------------
# 3. Suspiciously short paragraphs
# ---------------------------------------------------------------------------
section_header("3. SUSPICIOUSLY SHORT PARAGRAPHS (<25 words)")
short = [p for p in paragraphs if p["wordCount"] < 25]
print(f"\n Total paragraphs <25 words: {len(short)} ({100*len(short)/len(paragraphs):.1f}%)")
# Categorize
headings = []
standalone = []
fragments = []
list_items = []
heading_patterns = re.compile(
r"^(risk management|cybersecurity|governance|strategy|board|"
r"oversight|incident|material|information security|"
r"risk factors|item 1c|risk management and strategy|"
r"risk management, strategy|governance, risk management)"
, re.IGNORECASE
)
for p in short:
text = p["text"].strip()
lower = text.lower()
# Heading detection: short, no period at end, title-case-ish
is_heading = False
if len(text.split()) <= 8 and not text.endswith("."):
is_heading = True
if heading_patterns.match(lower):
is_heading = True
if text.isupper() and len(text.split()) <= 10:
is_heading = True
# List item: starts with bullet, dash, number, or letter
is_list = bool(re.match(r"^(\d+[.)]\s|[-•●◦▪]\s|[a-z][.)]\s|\([a-z]\)\s|\(\d+\)\s)", text))
# Fragment: doesn't end with period/question/exclamation and not a heading
is_fragment = not is_heading and not is_list and not re.search(r'[.!?"]$', text.rstrip())
if is_heading:
headings.append(p)
elif is_list:
list_items.append(p)
elif is_fragment:
fragments.append(p)
else:
standalone.append(p)
print(f" Headings: {len(headings)}")
print(f" Standalone sentences: {len(standalone)}")
print(f" Fragments: {len(fragments)}")
print(f" List items: {len(list_items)}")
def show_examples(label, items, count):
sample = items[:count] if len(items) <= count else random.sample(items, count)
print(f"\n --- {label} (showing {len(sample)} of {len(items)}) ---")
for p in sample:
acc = p["filing"]["accessionNumber"]
print(f" [{p['wordCount']}w] p{p['paragraphIndex']} | {truncate(p['text'], 120)}")
print(f" {p['filing']['companyName']} | {acc}")
random.seed(42)
show_examples("Headings", headings, 10)
show_examples("Standalone sentences", standalone, 8)
show_examples("Fragments", fragments, 8)
show_examples("List items", list_items, 4)
# ---------------------------------------------------------------------------
# 4. Sequential paragraph coherence
# ---------------------------------------------------------------------------
section_header("4. SEQUENTIAL PARAGRAPH COHERENCE (20 random filings)")
random.seed(123)
sample_accs = random.sample(list(by_filing.keys()), min(20, len(by_filing)))
mid_sentence_breaks = []
topic_shifts = []
for acc in sample_accs:
ps = sorted(by_filing[acc], key=lambda x: x["paragraphIndex"])
for i in range(len(ps) - 1):
curr = ps[i]
nxt = ps[i + 1]
curr_text = curr["text"].strip()
nxt_text = nxt["text"].strip()
# Check: does current paragraph end mid-sentence?
# Signs: ends with comma, semicolon, conjunction, lowercase word, no terminal punctuation
ends_mid = False
if curr_text and not re.search(r'[.!?:"\)]$', curr_text):
ends_mid = True
if curr_text and re.search(r'(,|;|\band\b|\bor\b|\bbut\b|\bthat\b|\bwhich\b)\s*$', curr_text):
ends_mid = True
# Check: does next paragraph start with lowercase (continuation)?
starts_lower = bool(nxt_text) and nxt_text[0].islower()
if ends_mid or starts_lower:
mid_sentence_breaks.append({
"acc": acc,
"company": curr["filing"]["companyName"],
"curr_idx": curr["paragraphIndex"],
"nxt_idx": nxt["paragraphIndex"],
"curr_end": curr_text[-150:] if len(curr_text) > 150 else curr_text,
"nxt_start": nxt_text[:150] if len(nxt_text) > 150 else nxt_text,
"ends_mid": ends_mid,
"starts_lower": starts_lower,
})
print(f"\n Checked {len(sample_accs)} filings")
print(f" Potential mid-sentence breaks found: {len(mid_sentence_breaks)}")
print("\n --- Examples of mid-sentence / continuation breaks ---")
for ex in mid_sentence_breaks[:5]:
print(f"\n [{ex['acc']}] {ex['company']}")
print(f" p{ex['curr_idx']} ENDS: ...{ex['curr_end']}")
print(f" p{ex['nxt_idx']} STARTS: {ex['nxt_start']}...")
flags = []
if ex["ends_mid"]:
flags.append("no terminal punctuation")
if ex["starts_lower"]:
flags.append("next starts lowercase")
print(f" FLAGS: {', '.join(flags)}")
if len(mid_sentence_breaks) == 0:
print(" (none found)")
# Also check for topic shifts within single paragraphs (long ones in sampled filings)
print("\n --- Checking for intra-paragraph topic shifts ---")
shift_examples = []
for acc in sample_accs:
for p in by_filing[acc]:
if p["wordCount"] < 150:
continue
text = p["text"]
# Look for heading-like substrings mid-text
# e.g., "Risk Management" or "Governance" appearing after a sentence end
matches = list(re.finditer(
r'(?<=[.!?]\s)(Risk Management|Governance|Strategy|Cybersecurity|'
r'Board of Directors|Incident Response|Overview|Third.Party)',
text
))
if matches:
shift_examples.append({
"acc": acc,
"company": p["filing"]["companyName"],
"idx": p["paragraphIndex"],
"wordCount": p["wordCount"],
"match": matches[0].group(),
"context": text[max(0, matches[0].start()-80):matches[0].end()+80],
})
print(f" Paragraphs with possible embedded topic headers: {len(shift_examples)}")
for ex in shift_examples[:5]:
print(f"\n [{ex['acc']}] {ex['company']} p{ex['idx']} ({ex['wordCount']}w)")
print(f" Found '{ex['match']}' mid-paragraph:")
print(f" ...{ex['context']}...")
# ---------------------------------------------------------------------------
# 5. Paragraph index gaps
# ---------------------------------------------------------------------------
section_header("5. PARAGRAPH INDEX GAPS & DUPLICATES")
gap_filings = []
dup_filings = []
for acc, ps in by_filing.items():
indices = sorted(p["paragraphIndex"] for p in ps)
# Check for duplicates
if len(indices) != len(set(indices)):
counter = Counter(indices)
dups = {k: v for k, v in counter.items() if v > 1}
dup_filings.append((acc, ps[0]["filing"]["companyName"], dups))
# Check for gaps (should be 0, 1, 2, ...)
expected = list(range(indices[0], indices[0] + len(indices)))
if indices != expected:
missing = set(expected) - set(indices)
extra = set(indices) - set(expected)
if missing or extra:
gap_filings.append((acc, ps[0]["filing"]["companyName"], sorted(missing), sorted(extra), indices))
print(f"\n Filings with duplicate paragraph indices: {len(dup_filings)}")
for acc, company, dups in dup_filings[:10]:
print(f" [{acc}] {company}: duplicates at indices {dups}")
print(f"\n Filings with index gaps: {len(gap_filings)}")
for acc, company, missing, extra, indices in gap_filings[:10]:
print(f" [{acc}] {company}")
if missing:
print(f" Missing indices: {missing}")
if extra:
print(f" Unexpected indices: {extra}")
print(f" Actual indices: {indices}")
# Check if all start at 0
non_zero_start = [(acc, ps) for acc, ps in by_filing.items()
if min(p["paragraphIndex"] for p in ps) != 0]
print(f"\n Filings not starting at index 0: {len(non_zero_start)}")
for acc, ps in non_zero_start[:5]:
start = min(p["paragraphIndex"] for p in ps)
print(f" [{acc}] {ps[0]['filing']['companyName']}: starts at {start}")
# ---------------------------------------------------------------------------
# 6. Cross-filing duplicate paragraphs
# ---------------------------------------------------------------------------
section_header("6. CROSS-FILING DUPLICATE PARAGRAPHS")
# Group by textHash
by_hash = defaultdict(list)
for p in paragraphs:
by_hash[p["textHash"]].append(p)
# Find hashes appearing in multiple filings
cross_filing_dupes = {}
for h, ps in by_hash.items():
accs = set(p["filing"]["accessionNumber"] for p in ps)
if len(accs) > 1:
cross_filing_dupes[h] = ps
total_dupe_paragraphs = sum(len(ps) for ps in cross_filing_dupes.values())
unique_dupe_texts = len(cross_filing_dupes)
print(f"\n Unique paragraph texts appearing in >1 filing: {unique_dupe_texts}")
print(f" Total paragraphs that are cross-filing duplicates: {total_dupe_paragraphs} ({100*total_dupe_paragraphs/len(paragraphs):.1f}%)")
# Also count same-hash within same filing
within_filing_dupes = 0
for h, ps in by_hash.items():
accs = [p["filing"]["accessionNumber"] for p in ps]
if len(accs) != len(set(accs)):
within_filing_dupes += 1
print(f" Hashes duplicated WITHIN a single filing: {within_filing_dupes}")
# Top 20 most duplicated
sorted_dupes = sorted(cross_filing_dupes.items(), key=lambda x: len(x[1]), reverse=True)
print("\n --- Top 20 most duplicated texts across filings ---")
for i, (h, ps) in enumerate(sorted_dupes[:20]):
n_filings = len(set(p["filing"]["accessionNumber"] for p in ps))
text = ps[0]["text"]
print(f"\n #{i+1}: hash={h} | {n_filings} filings | {ps[0]['wordCount']}w")
print(f" TEXT: {truncate(text, 200)}")
# Boilerplate analysis: texts appearing in 3+ filings
boilerplate_threshold = 3
boilerplate_hashes = {h for h, ps in cross_filing_dupes.items()
if len(set(p["filing"]["accessionNumber"] for p in ps)) >= boilerplate_threshold}
boilerplate_paragraphs = sum(len(by_hash[h]) for h in boilerplate_hashes)
print(f"\n Boilerplate (text in {boilerplate_threshold}+ filings):")
print(f" Unique texts: {len(boilerplate_hashes)}")
print(f" Total paragraphs: {boilerplate_paragraphs} ({100*boilerplate_paragraphs/len(paragraphs):.1f}%)")
print("\n" + "=" * 80)
print(" AUDIT COMPLETE")
print("=" * 80)
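A sanity check on the hand-rolled `percentile` helper above: it interpolates linearly between the two closest ranks, which is the same scheme as `statistics.quantiles` with `method="inclusive"` (and NumPy's default percentile method). A small sketch, with toy counts, confirming the two agree at the quartiles:

```python
import math
import statistics

def percentile(sorted_list, pct):
    # Linear interpolation between the two closest ranks.
    idx = pct / 100 * (len(sorted_list) - 1)
    lo, hi = math.floor(idx), math.ceil(idx)
    if lo == hi:
        return sorted_list[lo]
    frac = idx - lo
    return sorted_list[lo] * (1 - frac) + sorted_list[hi] * frac

counts = [1, 2, 2, 3, 5, 8, 13, 21]  # toy paragraphs-per-filing counts
q1, q2, q3 = statistics.quantiles(counts, n=4, method="inclusive")
assert percentile(counts, 25) == q1
assert percentile(counts, 50) == q2 == statistics.median(counts)
assert percentile(counts, 75) == q3
```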


@ -0,0 +1,539 @@
#!/usr/bin/env python3
"""
Novel data quality audit for paragraphs-clean.jsonl.
READ-ONLY: prints findings to stdout, does not modify any files.
"""
import json
import re
import sys
from collections import Counter, defaultdict
from pathlib import Path
DATA_PATH = Path(__file__).resolve().parent.parent / "data" / "paragraphs" / "paragraphs-clean.jsonl"
# ── Cybersecurity domain keywords (broad) ──────────────────────────────
CYBER_KEYWORDS = {
"cyber", "cybersecurity", "security", "breach", "incident", "threat",
"vulnerability", "malware", "ransomware", "phishing", "firewall",
"encryption", "intrusion", "unauthorized", "attack", "hacker",
"data protection", "information security", "network security",
"access control", "authentication", "risk management", "ciso",
"chief information security", "chief information officer",
"information technology", "it systems", "data privacy", "privacy",
"personally identifiable", "pii", "soc", "nist", "iso 27001",
"penetration test", "disaster recovery", "business continuity",
"third party", "vendor", "supply chain", "cloud", "endpoint",
"monitoring", "detection", "response", "remediation", "patch",
"compliance", "regulatory", "safeguard", "protect", "secure",
"confidential", "integrity", "availability", "resilience",
"governance", "oversight", "board of directors", "audit committee",
"risk factor", "material", "disclosure", "1c", "item 1c",
}
# ── Non-cyber legal boilerplate patterns ────────────────────────────────
BOILERPLATE_PATTERNS = [
re.compile(r"forward[- ]looking\s+statements?", re.I),
re.compile(r"safe\s+harbor", re.I),
re.compile(r"private\s+securities\s+litigation\s+reform\s+act", re.I),
re.compile(r"cautionary\s+statement", re.I),
re.compile(r"except\s+as\s+required\s+by\s+law.*no\s+obligation\s+to\s+update", re.I),
re.compile(r"this\s+(annual\s+)?report\s+(on\s+form\s+10-k\s+)?contains?\s+forward", re.I),
]
# ── SEC item cross-reference pattern ────────────────────────────────────
SEC_ITEM_RE = re.compile(r"\bItem\s+(\d+[A-Z]?)\b", re.I)
# ── Dollar amount pattern ──────────────────────────────────────────────
DOLLAR_RE = re.compile(r"\$[\d,]+(?:\.\d+)?\s*(?:thousand|million|billion|trillion)?", re.I)
# ── Date patterns (unusual formats) ────────────────────────────────────
DATE_PATTERNS = [
# MM/DD/YYYY or MM-DD-YYYY
re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
# Month DD, YYYY
re.compile(r"\b(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},?\s+\d{4}\b", re.I),
# DD Month YYYY
re.compile(r"\b\d{1,2}\s+(?:January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\b", re.I),
# YYYY-MM-DD (ISO)
re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
]
# ── Bullet point characters ────────────────────────────────────────────
BULLET_RE = re.compile(r"[\u2022\u2023\u25E6\u2043\u2219\u25AA\u25AB\u25CF\u25CB\u25A0\u25A1]")
# ── Helpers ─────────────────────────────────────────────────────────────
def truncate(text: str, max_len: int = 200) -> str:
if len(text) <= max_len:
return text
return text[:max_len] + "..."
def print_section(title: str):
print(f"\n{'=' * 80}")
print(f" {title}")
print(f"{'=' * 80}")
def print_finding(name: str, concern: str, count: int, total: int, examples: list[dict]):
pct = count / total * 100 if total else 0
print(f"\n--- {name} [{concern} CONCERN] ---")
print(f" Count: {count:,} / {total:,} ({pct:.2f}%)")
for i, ex in enumerate(examples[:5]):
filing = ex.get("filing", {})
company = filing.get("companyName", "?")
print(f" Example {i+1} [{company}]:")
print(f" {truncate(ex['text'], 300)}")
if count > 5:
print(f" ... and {count - 5:,} more")
def has_cyber_relevance(text_lower: str) -> bool:
for kw in CYBER_KEYWORDS:
if kw in text_lower:
return True
return False
# ── Load data ──────────────────────────────────────────────────────────
def load_data():
paragraphs = []
with open(DATA_PATH) as f:
for line in f:
paragraphs.append(json.loads(line))
return paragraphs
def main():
print("Loading data...")
paragraphs = load_data()
total = len(paragraphs)
print(f"Loaded {total:,} paragraphs.\n")
# Pre-compute lowercase texts
texts_lower = [p["text"].lower() for p in paragraphs]
# ════════════════════════════════════════════════════════════════════
print_section("1. CHARACTER-LEVEL ANOMALIES")
# ════════════════════════════════════════════════════════════════════
# 1a. High uppercase ratio (>30%)
high_upper = []
for p in paragraphs:
t = p["text"]
alpha = sum(1 for c in t if c.isalpha())
if alpha < 10:
continue
upper = sum(1 for c in t if c.isupper())
ratio = upper / alpha
if ratio > 0.30:
high_upper.append({**p, "_ratio": ratio})
high_upper.sort(key=lambda x: x["_ratio"], reverse=True)
print_finding("High uppercase ratio (>30% of alpha chars)", "MEDIUM",
len(high_upper), total, high_upper)
# 1b. Unusual punctuation density
high_punct = []
for p in paragraphs:
t = p["text"]
if len(t) < 30:
continue
semis = t.count(";")
colons = t.count(":")
dashes = t.count("\u2014") + t.count("\u2013") + t.count("-")  # em dash, en dash, hyphen
punct_count = semis + colons + dashes
density = punct_count / len(t)
if density > 0.05:
high_punct.append({**p, "_density": density, "_semis": semis, "_colons": colons, "_dashes": dashes})
high_punct.sort(key=lambda x: x["_density"], reverse=True)
print_finding("High punctuation density (semicolons/colons/dashes >5% of chars)", "LOW",
len(high_punct), total, high_punct)
# 1c. Non-ASCII characters
non_ascii_paras = []
non_ascii_chars_all = Counter()
for p in paragraphs:
t = p["text"]
non_ascii = [(c, hex(ord(c)), ord(c)) for c in t if ord(c) > 127]
if non_ascii:
chars_found = set((c, h) for c, h, _ in non_ascii)
for c, h, _ in non_ascii:
non_ascii_chars_all[f"{c} ({h})"] += 1
non_ascii_paras.append({**p, "_chars": chars_found})
print_finding("Paragraphs with non-ASCII characters", "MEDIUM",
len(non_ascii_paras), total, non_ascii_paras)
if non_ascii_chars_all:
print("\n Non-ASCII character frequency:")
for char_repr, cnt in non_ascii_chars_all.most_common(20):
print(f" {char_repr}: {cnt:,} occurrences")
# 1d. Unusual whitespace (multiple spaces, tabs)
multi_space_re = re.compile(r" +")
tab_re = re.compile(r"\t")
whitespace_issues = []
for p in paragraphs:
t = p["text"]
multi = len(multi_space_re.findall(t))
tabs = len(tab_re.findall(t))
if multi > 0 or tabs > 0:
whitespace_issues.append({**p, "_multi_spaces": multi, "_tabs": tabs})
print_finding("Unusual whitespace (multiple spaces or tabs)", "MEDIUM",
len(whitespace_issues), total, whitespace_issues)
# ════════════════════════════════════════════════════════════════════
print_section("2. CONTENT ANOMALIES")
# ════════════════════════════════════════════════════════════════════
# 2a. Dollar amounts
dollar_paras = []
for p in paragraphs:
matches = DOLLAR_RE.findall(p["text"])
if matches:
dollar_paras.append({**p, "_amounts": matches})
print_finding("Paragraphs with dollar amounts", "MEDIUM",
len(dollar_paras), total, dollar_paras)
if dollar_paras:
# Show distribution of dollar amounts
all_amounts = []
for dp in dollar_paras:
all_amounts.extend(dp["_amounts"])
print(f"\n Total dollar amount mentions: {len(all_amounts):,}")
amount_counter = Counter(all_amounts)
print(" Most common amounts:")
for amt, cnt in amount_counter.most_common(10):
print(f" {amt}: {cnt:,}")
# 2b. Dates in text
date_paras = []
for p in paragraphs:
t = p["text"]
found_dates = []
for pat in DATE_PATTERNS:
found_dates.extend(pat.findall(t))
if found_dates:
date_paras.append({**p, "_dates": found_dates})
print_finding("Paragraphs containing dates", "LOW",
len(date_paras), total, date_paras)
if date_paras:
all_dates = []
for dp in date_paras:
all_dates.extend(dp["_dates"])
print(f"\n Total date mentions: {len(all_dates):,}")
# 2c. Cross-references to other SEC items
cross_ref_paras = []
for p in paragraphs:
matches = SEC_ITEM_RE.findall(p["text"])
# Filter out Item 1C (that's expected)
other_items = [m for m in matches if m.upper() != "1C"]
if other_items:
cross_ref_paras.append({**p, "_items": other_items})
# Count which items are referenced
item_counts = Counter()
for crp in cross_ref_paras:
for item in crp["_items"]:
item_counts[f"Item {item}"] += 1
print_finding("Cross-references to non-1C SEC items", "HIGH",
len(cross_ref_paras), total, cross_ref_paras)
if item_counts:
print("\n Referenced items:")
for item, cnt in item_counts.most_common():
print(f" {item}: {cnt:,}")
# 2d. Non-cyber legal boilerplate
boilerplate_paras = []
for p in paragraphs:
t = p["text"]
matched = []
for pat in BOILERPLATE_PATTERNS:
if pat.search(t):
matched.append(pat.pattern[:60])
if matched:
boilerplate_paras.append({**p, "_patterns": matched})
print_finding("Non-cybersecurity legal boilerplate", "HIGH",
len(boilerplate_paras), total, boilerplate_paras)
# ════════════════════════════════════════════════════════════════════
print_section("3. STRUCTURAL ANOMALIES")
# ════════════════════════════════════════════════════════════════════
# 3a. Bullet points mid-text
bullet_paras = []
for p in paragraphs:
t = p["text"]
if BULLET_RE.search(t):
bullet_paras.append(p)
elif re.search(r"(?:^|\n)\s*[-*]\s+\w", t):
bullet_paras.append(p)
print_finding("Paragraphs with bullet points mid-text", "MEDIUM",
len(bullet_paras), total, bullet_paras)
# 3b. Embedded newlines
newline_paras = []
for p in paragraphs:
t = p["text"]
nl_count = t.count("\n")
if nl_count > 0:
newline_paras.append({**p, "_newlines": nl_count})
newline_paras.sort(key=lambda x: x["_newlines"], reverse=True)
print_finding("Paragraphs with embedded newlines", "MEDIUM",
len(newline_paras), total, newline_paras)
# 3c. Mid-paragraph headings (ALL CAPS phrase of 3+ words followed by different content)
mid_heading_re = re.compile(r"(?<=\. )([A-Z][A-Z\s]{10,}[A-Z])(?=\.?\s+[A-Z][a-z])")
mid_heading_paras = []
for p in paragraphs:
t = p["text"]
matches = mid_heading_re.findall(t)
if matches:
mid_heading_paras.append({**p, "_headings": matches})
print_finding("Mid-paragraph headings (ALL CAPS phrase mid-sentence)", "MEDIUM",
len(mid_heading_paras), total, mid_heading_paras)
# ════════════════════════════════════════════════════════════════════
print_section("4. OUTLIER DETECTION")
# ════════════════════════════════════════════════════════════════════
# 4a. Extremely high word count (>400)
long_paras = [p for p in paragraphs if p["wordCount"] > 400]
long_paras.sort(key=lambda x: x["wordCount"], reverse=True)
print_finding("Extremely long paragraphs (>400 words)", "HIGH",
len(long_paras), total, long_paras)
if long_paras:
wc_values = [p["wordCount"] for p in long_paras]
print(f"\n Word count range: {min(wc_values)} - {max(wc_values)}")
print(f" Mean: {sum(wc_values)/len(wc_values):.0f}")
# 4b. Low information density
# Common English stopwords
STOPWORDS = {
"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for",
"of", "with", "by", "from", "is", "are", "was", "were", "be", "been",
"being", "have", "has", "had", "do", "does", "did", "will", "would",
"could", "should", "may", "might", "shall", "can", "that", "which",
"who", "whom", "this", "these", "those", "it", "its", "we", "our",
"us", "they", "their", "them", "he", "she", "his", "her", "as",
"if", "not", "no", "nor", "so", "than", "too", "very", "such",
"also", "each", "any", "all", "both", "other", "some", "into",
"through", "during", "before", "after", "about", "between", "under",
"over", "above", "up", "down", "out", "off", "then", "once",
}
low_info_paras = []
for p in paragraphs:
words = re.findall(r"[a-z]+", p["text"].lower())
if len(words) < 20:
continue
stop_ratio = sum(1 for w in words if w in STOPWORDS) / len(words)
if stop_ratio > 0.65:
low_info_paras.append({**p, "_stop_ratio": stop_ratio})
low_info_paras.sort(key=lambda x: x["_stop_ratio"], reverse=True)
print_finding("Low information density (>65% stopwords)", "LOW",
len(low_info_paras), total, low_info_paras)
# 4c. Exact substring matches across filings
print("\n--- Exact substring matches across filings [HIGH CONCERN] ---")
print(" (Checking paragraphs that appear as substrings of others in different filings...)")
# Group by accession number for efficiency
by_accession = defaultdict(list)
for p in paragraphs:
acc = p["filing"]["accessionNumber"]
by_accession[acc].append(p)
# For efficiency, only check paragraphs 50-200 chars (likely fragments/duplicates)
# Sort by length so shorter ones are checked as substrings of longer ones
candidates = [(p["text"], p["filing"]["accessionNumber"], p["filing"]["companyName"], p["id"])
for p in paragraphs if 50 <= len(p["text"]) <= 200]
longer_texts = [(p["text"], p["filing"]["accessionNumber"], p["filing"]["companyName"])
for p in paragraphs if len(p["text"]) > 200]
substring_matches = []
# Use a set for dedup
seen = set()
# Only check a sample for performance
check_limit = min(len(candidates), 3000)
for i in range(check_limit):
cand_text, cand_acc, cand_co, cand_id = candidates[i]
for long_text, long_acc, long_co in longer_texts[:5000]:
if cand_acc == long_acc:
continue # same filing, skip
if cand_text in long_text and cand_id not in seen:
seen.add(cand_id)
substring_matches.append({
"text": cand_text,
"filing": {"companyName": cand_co, "accessionNumber": cand_acc},
"_found_in": long_co,
})
break
print(f" Count (sampled {check_limit:,} short paras against {min(len(longer_texts), 5000):,} long paras): {len(substring_matches):,}")
for i, ex in enumerate(substring_matches[:5]):
print(f" Example {i+1} [{ex['filing']['companyName']}] (also in {ex['_found_in']}):")
print(f" {truncate(ex['text'], 300)}")
if len(substring_matches) > 5:
print(f" ... and {len(substring_matches) - 5:,} more")
# ════════════════════════════════════════════════════════════════════
print_section("5. SEMANTIC COHERENCE")
# ════════════════════════════════════════════════════════════════════
# 5a. Company name mismatch — look for SPECIFIC named companies in text
# that differ from the filing company. Filter out generic refs like "the Company".
company_name_mismatches = []
# Pattern: proper noun(s) + legal suffix at end, NOT preceded by "the "
specific_company_re = re.compile(
r"(?<!\bthe )(?<!\bThe )(?<!\ba )(?<!\bA )"
r"\b([A-Z][A-Za-z&\.']+(?:\s+[A-Z][A-Za-z&\.']+){0,5})"
r",?\s+(Corp(?:oration)?|Inc(?:orporated)?|Company|LLC|Ltd|L\.P\.|Holdings|Partners)\b\.?"
)
# Generic phrases to ignore
GENERIC_COMPANY_REFS = {
"the company", "our company", "a company", "each company",
"any company", "this company", "such company", "parent company",
"holding company", "shell company", "blank check company",
"portfolio company", "operating company", "management company",
"insurance company", "affiliated company",
}
for p in paragraphs:
t = p["text"]
filing_company = p["filing"]["companyName"]
matches = specific_company_re.findall(t)
if not matches:
continue
filing_words = set(w.lower() for w in re.findall(r"[A-Za-z]{3,}", filing_company))
for name_part, suffix in matches:
full = f"{name_part} {suffix}".strip()
if full.lower() in GENERIC_COMPANY_REFS:
continue
mention_words = set(w.lower() for w in re.findall(r"[A-Za-z]{3,}", full))
generic = {"inc", "corp", "corporation", "incorporated", "company", "group",
"holdings", "the", "and", "llc", "ltd", "partners", "new"}
meaningful_filing = filing_words - generic
meaningful_mention = mention_words - generic
if meaningful_mention and not (meaningful_mention & meaningful_filing):
company_name_mismatches.append({
**p,
"_mentioned": full,
"_filing_company": filing_company,
})
break
print_finding("Company name in text doesn't match filing metadata", "HIGH",
len(company_name_mismatches), total, company_name_mismatches)
if company_name_mismatches:
print("\n Sample mismatches (mentioned vs filing):")
for ex in company_name_mismatches[:15]:
print(f" Mentioned: '{ex['_mentioned']}' | Filing: '{ex['_filing_company']}'")
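The meaningful-word overlap test in check 5a can be sketched in isolation. This is a minimal, self-contained version (the company names below are hypothetical; the `GENERIC` set mirrors the one in the script):

```python
import re

# Words too generic to signal identity, mirroring the audit's filter set.
GENERIC = {"inc", "corp", "corporation", "incorporated", "company", "group",
           "holdings", "the", "and", "llc", "ltd", "partners", "new"}

def meaningful_words(name: str) -> set[str]:
    """Lowercased words of 3+ letters, minus generic corporate terms."""
    return {w.lower() for w in re.findall(r"[A-Za-z]{3,}", name)} - GENERIC

def names_mismatch(mentioned: str, filing_company: str) -> bool:
    """True when a mentioned company shares no meaningful words with the
    filing company (a likely cross-contamination signal)."""
    m, f = meaningful_words(mentioned), meaningful_words(filing_company)
    return bool(m) and not (m & f)

print(names_mismatch("Acme Widgets Corp", "Globex Holdings Inc"))   # True
print(names_mismatch("Globex Capital LLC", "Globex Holdings Inc"))  # False
```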
# 5b. No cybersecurity keywords at all
no_cyber = []
for i, p in enumerate(paragraphs):
if not has_cyber_relevance(texts_lower[i]):
no_cyber.append(p)
print_finding("No cybersecurity keywords at all", "HIGH",
len(no_cyber), total, no_cyber)
if no_cyber:
# Show word count distribution of non-cyber paragraphs
wc_dist = Counter()
for p in no_cyber:
bucket = (p["wordCount"] // 50) * 50
wc_dist[f"{bucket}-{bucket+49}"] += 1
print("\n Word count distribution of non-cyber paragraphs:")
for bucket, cnt in sorted(wc_dist.items()):
print(f" {bucket} words: {cnt:,}")
# ════════════════════════════════════════════════════════════════════
print_section("BONUS: ADDITIONAL NOVEL CHECKS")
# ════════════════════════════════════════════════════════════════════
# 6a. Paragraphs that are mostly a URL or contain URLs
url_re = re.compile(r"https?://\S+|www\.\S+")
url_paras = []
for p in paragraphs:
urls = url_re.findall(p["text"])
if urls:
url_ratio = sum(len(u) for u in urls) / len(p["text"])
url_paras.append({**p, "_urls": urls, "_ratio": url_ratio})
url_paras.sort(key=lambda x: x["_ratio"], reverse=True)
print_finding("Paragraphs containing URLs", "MEDIUM",
len(url_paras), total, url_paras)
# 6b. Paragraphs with parenthetical references that look like citations/footnotes
footnote_re = re.compile(r"\(\d+\)|\[\d+\]|(?:footnote|fn\.?)\s*\d+", re.I)
footnote_paras = []
for p in paragraphs:
if footnote_re.search(p["text"]):
footnote_paras.append(p)
print_finding("Paragraphs with footnote/citation references", "LOW",
len(footnote_paras), total, footnote_paras)
# 6c. Paragraphs that look like table data (multiple numeric values separated by whitespace)
table_re = re.compile(r"(?:\d[\d,.]*\s+){3,}")
table_paras = []
for p in paragraphs:
if table_re.search(p["text"]):
table_paras.append(p)
print_finding("Paragraphs that look like table/numeric data", "HIGH",
len(table_paras), total, table_paras)
# 6d. Encoding artifacts (replacement chars, zero-width spaces, BOM, etc.)
encoding_re = re.compile(r"[\ufffd\u200b\u200c\u200d\ufeff\u00a0]")
encoding_paras = []
for p in paragraphs:
matches = encoding_re.findall(p["text"])
if matches:
encoding_paras.append({**p, "_artifacts": Counter(f"U+{ord(c):04X} ({c!r})" for c in matches)})
print_finding("Encoding artifacts (replacement chars, NBSP, zero-width, BOM)", "HIGH",
len(encoding_paras), total, encoding_paras)
if encoding_paras:
all_artifacts = Counter()
for ep in encoding_paras:
all_artifacts.update(ep["_artifacts"])
print("\n Artifact frequency:")
for art, cnt in all_artifacts.most_common():
print(f" {art}: {cnt:,}")
# 6e. Repeated sentences within a paragraph
repeated_sent_paras = []
for p in paragraphs:
t = p["text"]
# Split on sentence boundaries
sentences = re.split(r'(?<=[.!?])\s+', t)
if len(sentences) < 3:
continue
sent_counter = Counter(s.strip().lower() for s in sentences if len(s.strip()) > 20)
dupes = {s: c for s, c in sent_counter.items() if c > 1}
if dupes:
repeated_sent_paras.append({**p, "_dupes": dupes})
print_finding("Paragraphs with repeated sentences", "HIGH",
len(repeated_sent_paras), total, repeated_sent_paras)
# ════════════════════════════════════════════════════════════════════
print_section("SUMMARY")
# ════════════════════════════════════════════════════════════════════
print(f"\n Total paragraphs analyzed: {total:,}")
print(f"\n HIGH concern findings:")
print(f" - Cross-references to non-1C items: {len(cross_ref_paras):,}")
print(f" - Non-cyber legal boilerplate: {len(boilerplate_paras):,}")
print(f" - Extremely long paragraphs (>400 words): {len(long_paras):,}")
print(f" - Company name mismatches: {len(company_name_mismatches):,}")
print(f" - No cybersecurity keywords: {len(no_cyber):,}")
print(f" - Table/numeric data: {len(table_paras):,}")
print(f" - Encoding artifacts: {len(encoding_paras):,}")
print(f" - Repeated sentences: {len(repeated_sent_paras):,}")
print(f" - Exact substring matches (sampled): {len(substring_matches):,}")
print(f"\n MEDIUM concern findings:")
print(f" - High uppercase ratio: {len(high_upper):,}")
print(f" - Non-ASCII characters: {len(non_ascii_paras):,}")
print(f" - Unusual whitespace: {len(whitespace_issues):,}")
print(f" - Dollar amounts: {len(dollar_paras):,}")
print(f" - Bullet points mid-text: {len(bullet_paras):,}")
print(f" - Embedded newlines: {len(newline_paras):,}")
print(f" - Mid-paragraph headings: {len(mid_heading_paras):,}")
print(f" - URLs in text: {len(url_paras):,}")
print(f"\n LOW concern findings:")
print(f" - High punctuation density: {len(high_punct):,}")
print(f" - Date mentions: {len(date_paras):,}")
print(f" - Low information density: {len(low_info_paras):,}")
print(f" - Footnote references: {len(footnote_paras):,}")
if __name__ == "__main__":
main()
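The core of the cross-filing substring check in section 4c reduces to a single predicate. A minimal, self-contained sketch (the paragraphs and accession numbers below are hypothetical; the full audit additionally caps both candidate lists for performance):

```python
def is_cross_filing_substring(short: tuple[str, str], long: tuple[str, str]) -> bool:
    """True when a short paragraph appears verbatim inside a longer one
    from a *different* filing (same accession number is skipped)."""
    s_text, s_acc = short
    l_text, l_acc = long
    return s_acc != l_acc and s_text in l_text

# Hypothetical (text, accessionNumber) pairs:
a = ("We maintain a cybersecurity risk program.", "0001-24-000001")
b = ("As disclosed, We maintain a cybersecurity risk program. It is reviewed annually.",
     "0002-24-000002")
print(is_cross_filing_substring(a, b))  # True: duplicate across filings
print(is_cross_filing_substring(a, a))  # False: same filing is skipped
```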


@ -0,0 +1,537 @@
#!/usr/bin/env python3
"""
Detect HTML generators for all SEC filing HTML files.
Phase 1: Exhaustive signature detection
Phase 2: Cluster remaining unknowns
Phase 3: Summary statistics
"""
import os
import re
import sys
from collections import defaultdict, Counter
from pathlib import Path
HTML_DIR = Path("/home/joey/Documents/sec-cyBERT/data/raw/html")
READ_BYTES = 20_000
# Known SEC filing agent CIKs (accession number prefixes)
FILING_AGENT_CIKS = {
"0000950170": "Donnelley Financial Solutions",
"0001193125": "Donnelley Financial Solutions",
"0001558370": "Toppan Merrill",
"0001654954": "Toppan Merrill",
}
def detect_generator(filepath: str) -> tuple[str, str]:
"""Read first 20KB of file and detect generator. Returns (generator, evidence)."""
with open(filepath, "rb") as f:
raw = f.read(READ_BYTES)
text = raw.decode("utf-8", errors="replace")
text_lower = text.lower()
# --- Explicit generator metadata ---
# 1. <meta name="generator" content="..."> (both attribute orderings)
m = re.search(
r'<meta\s+name\s*=\s*["\']generator["\']\s+content\s*=\s*["\']([^"\']+)["\']',
text, re.I,
)
if not m:
m = re.search(
r'<meta\s+content\s*=\s*["\']([^"\']+)["\']\s+name\s*=\s*["\']generator["\']',
text, re.I,
)
if m:
return _normalize_generator(m.group(1)), f'meta generator: {m.group(1)}'
# 2. <meta name="Creator" content="...">
m = re.search(
r'<meta\s+name\s*=\s*["\']Creator["\']\s+content\s*=\s*["\']([^"\']+)["\']',
text, re.I,
)
if m:
return _normalize_generator(m.group(1)), f'meta Creator: {m.group(1)}'
# 4. <meta name="Producer" content="...">
m = re.search(
r'<meta\s+name\s*=\s*["\']Producer["\']\s+content\s*=\s*["\']([^"\']+)["\']',
text, re.I,
)
if m:
return _normalize_generator(m.group(1)), f'meta Producer: {m.group(1)}'
# 15. ProgId meta tag (Word, Excel converters)
m = re.search(
r'<meta\s+name\s*=\s*["\']ProgId["\']\s+content\s*=\s*["\']([^"\']+)["\']',
text, re.I,
)
if m:
progid = m.group(1)
if "word" in progid.lower():
return "Microsoft Word", f"ProgId: {progid}"
if "excel" in progid.lower():
return "Microsoft Excel", f"ProgId: {progid}"
return _normalize_generator(progid), f"ProgId: {progid}"
# --- HTML comment signatures (search full 20KB) ---
# Workiva / Wdesk
if re.search(r"<!--.*Created with the Workiva Platform.*-->", text, re.I):
return "Workiva", "comment: Created with the Workiva Platform"
if re.search(r"<!--.*Copyright\s+\d{4}\s+Workiva.*-->", text, re.I):
return "Workiva", "comment: Copyright Workiva"
if re.search(r"<!--.*Document created using Wdesk.*-->", text, re.I):
return "Workiva", "comment: Document created using Wdesk"
# Toppan Merrill / Bridge
if re.search(r"<!--.*(?:Toppan\s*Merrill|iXBRL document created with.*Toppan).*-->", text, re.I):
return "Toppan Merrill", "comment: Toppan Merrill"
if re.search(r"<!--.*Merrill\s*Bridge.*-->", text, re.I):
return "Toppan Merrill", "comment: Merrill Bridge"
# Donnelley Financial Solutions / RR Donnelley
if re.search(r"<!--.*Donnelley Financial Solutions.*-->", text, re.I):
return "Donnelley Financial Solutions", "comment: Donnelley Financial Solutions"
if re.search(r"<!--.*RR\s*Donnelley.*-->", text, re.I):
return "Donnelley Financial Solutions", "comment: RR Donnelley"
# Broadridge PROfile
if re.search(r"<!--.*Broadridge\s+PROfile.*-->", text, re.I):
return "Broadridge PROfile", "comment: Broadridge PROfile"
# Also match "Licensed to: ... Document created using Broadridge PROfile"
if "broadridge" in text_lower:
return "Broadridge PROfile", "keyword: broadridge"
# SEC Publisher (in title or comment)
m_title = re.search(r"<title[^>]*>([^<]+)</title>", text, re.I)
title_text = m_title.group(1).strip() if m_title else ""
if "sec publisher" in text_lower or "sec publisher" in title_text.lower():
return "SEC Publisher", "title/keyword: SEC Publisher"
# IRIS Carbon (various filing agents using IRIS Carbon platform)
m = re.search(r"<!--.*Powered by IRIS Carbon.*-->", text, re.I)
if m:
# Extract the filing agent name before "Powered by IRIS Carbon"
m2 = re.search(r"<!--\s*([^,]+),\s*Powered by IRIS Carbon", text, re.I)
agent = m2.group(1).strip() if m2 else "Unknown agent"
return "IRIS Carbon", f"comment: {agent} via IRIS Carbon"
# Certent Disclosure Management
if re.search(r"<!--.*Certent\s+Disclosure\s+Management.*-->", text, re.I):
return "Certent", "comment: Certent Disclosure Management"
if "certent" in text_lower:
return "Certent", "keyword: certent"
# CompSci Resources, LLC
if re.search(r"<!--.*CompSci Resources.*-->", text, re.I):
return "CompSci Transform", "comment: CompSci Resources"
# RDG Portal
if re.search(r"<!--.*RDG Portal.*-->", text, re.I):
return "RDG Portal", "comment: RDG Portal"
# PDF to EDGAR
if title_text.lower() == "pdf to edgar" or "pdf to edgar" in text_lower[:2000]:
return "PDF to EDGAR", "title/keyword: PDF to EDGAR"
# Generic generated/created by comments (but NOT bare dates)
m = re.search(r"<!--\s*Generated\s+by\s+([^-]+?)-->", text, re.I)
if m:
val = m.group(1).strip()
if not re.match(r"^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}", val):
return _normalize_generator(val), f"comment: Generated by {val}"
m = re.search(r"<!--\s*Created\s+(?:by|with)\s+([^-]+?)-->", text, re.I)
if m:
val = m.group(1).strip()
if not re.match(r"^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}", val):
return _normalize_generator(val), f"comment: Created by/with {val}"
# --- Keyword signatures in full text ---
# 5. Workiva
if re.search(r"\bwdesk\b", text_lower):
return "Workiva", "keyword: wdesk"
if re.search(r"\bworkiva\b", text_lower):
return "Workiva", "keyword: workiva"
# 6. Donnelley/DFIN
if re.search(r"\brrdonnelley\b", text_lower):
return "Donnelley Financial Solutions", "keyword: rrdonnelley"
if re.search(r"\bedgar-online\b", text_lower):
return "Donnelley Financial Solutions", "keyword: edgar-online"
# 7. Toppan Merrill
if re.search(r"\btoppan\b", text_lower):
return "Toppan Merrill", "keyword: toppan"
if re.search(r"\bmerrill\b", text_lower) and re.search(r"\b(?:bridge|ixbrl|xbrl)\b", text_lower):
return "Toppan Merrill", "keyword: merrill + bridge/xbrl"
if re.search(r"\bbowne\b", text_lower):
return "Toppan Merrill", "keyword: bowne"
# 8. CompSci Transform
if re.search(r"\bcompsci\b", text_lower):
return "CompSci Transform", "keyword: compsci"
# 9. ThunderDome
if re.search(r"\bthunderdome\b", text_lower):
return "ThunderDome", "keyword: thunderdome"
# 10. GoXBRL
if re.search(r"\bgoxbrl\b", text_lower):
return "GoXBRL", "keyword: goxbrl"
# 16. CSS class naming patterns
if re.search(r'class\s*=\s*["\'][^"\']*\bwk_\w+', text_lower):
return "Workiva", "CSS class prefix: wk_"
# --- SGML document wrapper detection ---
has_sgml = re.search(r"<DOCUMENT>\s*\n?\s*<TYPE>", text, re.I)
if has_sgml:
m_fn = re.search(r"<FILENAME>\s*([\w\-\.]+)", text, re.I)
if m_fn:
filename = m_fn.group(1).lower()
# d + digits = Donnelley Financial Solutions
if re.match(r"d\d+", filename):
return "Donnelley Financial Solutions", f"SGML filename: {m_fn.group(1)}"
# tm + digits = Toppan Merrill
if re.match(r"tm\d+", filename):
return "Toppan Merrill", f"SGML filename: {m_fn.group(1)}"
# ea + digits = EFiling/EDGAR Agent
if re.match(r"ea\d+", filename):
return "EFiling/EDGAR Agent", f"SGML filename: {m_fn.group(1)}"
# SGML-wrapped but no known filename pattern — check for other signals inside
# Rule-Page comments = Broadridge/EFiling variant
if "<!-- field: rule-page" in text_lower or "rule-page" in text_lower[:5000]:
return "Broadridge PROfile", "SGML + Rule-Page field comments"
# Field: Set comments with xdx = EFiling XDX tool
if "field: set; name: xdx" in text_lower:
return "EFiling XDX", "SGML + xdx Field:Set comments"
# <!-- Field: Set --> or <!-- Field: Rule --> without xdx
if "<!-- field:" in text_lower[:5000]:
return "EFiling/EDGAR Agent", "SGML + Field comments"
# Donnelley structural pattern: Center/DIV 8.5in
if re.search(r'<Center><DIV STYLE="width:8\.5in"', text):
return "Donnelley Financial Solutions", "SGML + Center/DIV 8.5in layout"
# Check accession prefix for known filing agents
basename = os.path.basename(filepath)
accession_prefix = basename.split("-")[0]
if accession_prefix in FILING_AGENT_CIKS:
return FILING_AGENT_CIKS[accession_prefix], f"SGML + filing agent CIK {accession_prefix}"
# Remaining SGML-wrapped: classify by structural patterns
font_count = text_lower.count("<font")
if font_count > 5:
return "SGML-wrapped (legacy/font-based)", f"SGML + {font_count} <font> tags"
return "SGML-wrapped (unknown)", "SGML wrapper, no specific generator"
# --- Inline XBRL detection for non-SGML files ---
has_ix_ns = "xmlns:ix=" in text_lower or "<ix:header" in text_lower
# 12. Structural: Donnelley uppercase P STYLE + Center DIV 8.5in
if re.search(r'<P STYLE="[^"]*font-family:Times New Roman"', text) and re.search(
r'<Center><DIV STYLE="width:8\.5in"', text
):
return "Donnelley Financial Solutions", "structural: uppercase P STYLE + Center DIV 8.5in"
# 14. Title tag tool names
if title_text:
title_lower = title_text.lower()
if "workiva" in title_lower or "wdesk" in title_lower:
return "Workiva", f"title: {title_text}"
if has_ix_ns:
# 11. ix:header with tool info / Field comments
if "field: set; name: xdx" in text_lower:
return "EFiling XDX", "iXBRL + xdx Field:Set comments"
if "<!-- field: rule" in text_lower:
return "Broadridge PROfile", "iXBRL + Rule-Page field comments"
if "<!-- field:" in text_lower[:5000]:
return "EFiling/EDGAR Agent", "iXBRL + Field comments"
# Filing agent CIK-based detection
basename = os.path.basename(filepath)
accession_prefix = basename.split("-")[0]
if accession_prefix in FILING_AGENT_CIKS:
agent = FILING_AGENT_CIKS[accession_prefix]
return agent, f"iXBRL + filing agent CIK {accession_prefix}"
# 13. XML declaration encoding as structural signal
if '<?xml version="1.0" encoding="utf-8"' in text_lower[:200]:
return "Inline XBRL (utf-8 toolchain)", "iXBRL + utf-8 XML declaration"
if "<?xml version='1.0' encoding='ascii'?>" in text_lower[:200]:
if re.search(r'<div style="display:none"><ix:header>', text_lower[:3000]):
return "Inline XBRL (SEC/EDGAR standard)", "iXBRL + ASCII XML + hidden ix:header"
return "Inline XBRL (SEC/EDGAR standard)", "iXBRL + ASCII XML declaration"
# Generic inline XBRL with no other signal
return "Inline XBRL (tool unresolved)", "iXBRL namespace only"
# --- Structural fallbacks for non-XBRL files ---
font_count = text_lower.count("<font")
td_count = text_lower.count("<td")
span_count = text_lower.count("<span")
if font_count > 20:
return "Legacy generator (font-based)", f"structural: {font_count} <font> tags"
if td_count > 50 and span_count < 10:
return "Table-based generator", f"structural: {td_count} <td> tags"
data_attr_count = len(re.findall(r"\bdata-\w+", text_lower))
if data_attr_count > 10:
return "Modern web tooling", f"structural: {data_attr_count} data- attributes"
return "Unknown", "no signature detected"
def _normalize_generator(raw: str) -> str:
"""Normalize generator names to canonical forms."""
r = raw.strip().lower()
if "workiva" in r or "wdesk" in r:
return "Workiva"
if "donnelley" in r or "dfin" in r or "rrdonnelley" in r:
return "Donnelley Financial Solutions"
if ("toppan" in r) or ("merrill" in r and "bridge" in r):
return "Toppan Merrill"
if "word" in r and "microsoft" in r:
return "Microsoft Word"
if "excel" in r and "microsoft" in r:
return "Microsoft Excel"
if "thunderdome" in r:
return "ThunderDome"
if "goxbrl" in r:
return "GoXBRL"
if "compsci" in r:
return "CompSci Transform"
if "certent" in r:
return "Certent"
if "iris carbon" in r:
return "IRIS Carbon"
if "broadridge" in r or "profile" in r:
return "Broadridge PROfile"
if "sec publisher" in r:
return "SEC Publisher"
return raw.strip()
def extract_body_snippet(filepath: str) -> str:
"""Extract first 200 bytes after <body> tag."""
with open(filepath, "rb") as f:
raw = f.read(READ_BYTES)
text = raw.decode("utf-8", errors="replace")
m = re.search(r"<body[^>]*>(.*)", text, re.I | re.S)
if m:
body = m.group(1)[:200].strip()
return re.sub(r"\s+", " ", body)
return re.sub(r"\s+", " ", text[:200])
def extract_class_names(filepath: str, max_elements: int = 10) -> list[str]:
"""Extract CSS class names from first N elements."""
with open(filepath, "rb") as f:
raw = f.read(READ_BYTES)
text = raw.decode("utf-8", errors="replace")
classes = re.findall(r'class\s*=\s*["\']([^"\']+)["\']', text, re.I)
return classes[:max_elements]
def main():
files = sorted(HTML_DIR.glob("*.html"))
total = len(files)
print(f"Processing {total} HTML files...\n")
results: dict[str, tuple[str, str]] = {}
generator_examples: dict[str, list[str]] = defaultdict(list)
generator_methods: dict[str, set[str]] = defaultdict(set)
for i, fp in enumerate(files):
accession = fp.stem
gen, evidence = detect_generator(str(fp))
results[accession] = (gen, evidence)
generator_examples[gen].append(accession)
method = evidence.split(":")[0].strip()
generator_methods[gen].add(method)
if (i + 1) % 2000 == 0:
print(f" Processed {i + 1}/{total}...", file=sys.stderr)
# --- Phase 1 output ---
print("=" * 110)
print("PHASE 1: Generator Detection Results")
print("=" * 110)
gen_counts = Counter(gen for gen, _ in results.values())
for gen, count in gen_counts.most_common():
pct = count / total * 100
examples = generator_examples[gen][:3]
methods = ", ".join(sorted(generator_methods[gen]))
print(f"\n {gen}")
print(f" Count: {count:,} ({pct:.1f}%)")
print(f" Methods: {methods}")
print(f" Examples: {', '.join(examples)}")
# --- Phase 2: Cluster unknowns ---
unknowns = [acc for acc, (gen, _) in results.items() if gen == "Unknown"]
print(f"\n\n{'=' * 110}")
print(f"PHASE 2: Clustering {len(unknowns)} Unknown Files")
print("=" * 110)
if unknowns:
fingerprints: dict[str, list[str]] = defaultdict(list)
for acc in unknowns:
fp = HTML_DIR / f"{acc}.html"
with open(fp, "rb") as f:
raw_bytes = f.read(READ_BYTES)
text = raw_bytes.decode("utf-8", errors="replace")
text_lower = text.lower()
has_xml_decl = text.startswith("<?xml")
has_doctype = "<!doctype" in text_lower[:500]
first_tag_m = re.search(r"<(\w+)", text)
first_tag = first_tag_m.group(1).lower() if first_tag_m else ""
td_c = text_lower.count("<td")
span_c = text_lower.count("<span")
div_c = text_lower.count("<div")
p_c = text_lower.count("<p ")
font_c = text_lower.count("<font")
counts = {"td": td_c, "span": span_c, "div": div_c, "p": p_c, "font": font_c}
dominant = max(counts, key=counts.get) if max(counts.values()) > 0 else "empty"
classes = re.findall(r'class\s*=\s*["\']([^"\']+)["\']', text[:5000], re.I)
class_prefix = ""
if classes and classes[0].strip():  # guard against whitespace-only class attributes
fc = classes[0].split()[0]
if "_" in fc:
class_prefix = fc.split("_")[0] + "_"
elif "-" in fc:
class_prefix = fc.split("-")[0] + "-"
else:
class_prefix = fc[:4]
fingerprint = (
f"xml={has_xml_decl}|doctype={has_doctype}|first={first_tag}"
f"|layout={dominant}|cls={class_prefix}"
)
fingerprints[fingerprint].append(acc)
for idx, (fp_key, accs) in enumerate(
sorted(fingerprints.items(), key=lambda x: -len(x[1]))
):
print(f"\n Cluster {idx + 1} ({len(accs)} files): {fp_key}")
for acc in accs[:5]:
filepath = HTML_DIR / f"{acc}.html"
snippet = extract_body_snippet(str(filepath))
cls = extract_class_names(str(filepath), 5)
print(f" {acc}:")
print(f" Snippet: {snippet[:120]}")
if cls:
print(f" Classes: {cls[:5]}")
if len(accs) > 5:
print(f" ... and {len(accs) - 5} more files")
else:
print(" No truly unknown files remain!")
# --- Phase 3: Summary ---
print(f"\n\n{'=' * 110}")
print("PHASE 3: Summary Statistics")
print("=" * 110)
header = (
f"\n{'Generator':<45} {'Count':>7} {'%':>7} "
f"{'Detection Methods':<50} {'Examples (up to 3)'}"
)
print(header)
print("-" * 170)
for gen, count in gen_counts.most_common():
pct = count / total * 100
examples = ", ".join(generator_examples[gen][:3])
methods = ", ".join(sorted(generator_methods[gen]))
if len(methods) > 50:
methods = methods[:47] + "..."
print(f"{gen:<45} {count:>7} {pct:>6.1f}% {methods:<50} {examples}")
print("-" * 170)
print(f"{'TOTAL':<45} {total:>7} {100.0:>6.1f}%")
unknown_count = gen_counts.get("Unknown", 0)
identified = total - unknown_count
print(f"\nIdentified: {identified:,} / {total:,} ({identified / total * 100:.1f}%)")
print(f"Truly unidentified: {unknown_count:,} / {total:,} ({unknown_count / total * 100:.1f}%)")
# Consolidated view: group by parent tool family
print(f"\n\n{'=' * 110}")
print("CONSOLIDATED VIEW (grouped by tool family)")
print("=" * 110)
family_map = {
"Workiva": "Workiva",
"Donnelley Financial Solutions": "Donnelley Financial Solutions",
"Toppan Merrill": "Toppan Merrill",
"CompSci Transform": "CompSci Transform",
"ThunderDome": "ThunderDome",
"EFiling/EDGAR Agent": "EFiling/EDGAR Agent",
"EFiling XDX": "EFiling/EDGAR Agent",
"Broadridge PROfile": "Broadridge PROfile",
"SEC Publisher": "SEC Publisher",
"IRIS Carbon": "IRIS Carbon",
"RDG Portal": "RDG Portal",
"Certent": "Certent",
"PDF to EDGAR": "PDF to EDGAR",
"GoXBRL": "GoXBRL",
"Microsoft Word": "Microsoft Word",
"Microsoft Excel": "Microsoft Excel",
"Inline XBRL (SEC/EDGAR standard)": "Inline XBRL (unattributed)",
"Inline XBRL (utf-8 toolchain)": "Inline XBRL (unattributed)",
"Inline XBRL (tool unresolved)": "Inline XBRL (unattributed)",
"SGML-wrapped (legacy/font-based)": "SGML-wrapped (unattributed)",
"SGML-wrapped (unknown)": "SGML-wrapped (unattributed)",
"Legacy generator (font-based)": "Other/Legacy",
"Table-based generator": "Other/Legacy",
"Modern web tooling": "Other/Legacy",
"Unknown": "Unknown",
}
family_counts: Counter = Counter()
family_examples: dict[str, list[str]] = defaultdict(list)
for gen, count in gen_counts.items():
family = family_map.get(gen, gen)
family_counts[family] += count
family_examples[family].extend(generator_examples[gen][:3])
print(f"\n{'Tool Family':<45} {'Count':>7} {'%':>7}")
print("-" * 65)
for family, count in family_counts.most_common():
pct = count / total * 100
examples = ", ".join(family_examples[family][:3])
print(f"{family:<45} {count:>7} {pct:>6.1f}% {examples}")
print("-" * 65)
print(f"{'TOTAL':<45} {total:>7} {100.0:>6.1f}%")
if __name__ == "__main__":
main()
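The SGML `<FILENAME>` prefix heuristic in `detect_generator` can be isolated into a small lookup. A self-contained sketch (the filenames below are hypothetical; prefixes and agent names are the ones used in the script):

```python
import re

# Prefix -> agent mapping, mirroring the SGML <FILENAME> checks above.
FILENAME_PREFIXES = (
    (r"d\d+", "Donnelley Financial Solutions"),
    (r"tm\d+", "Toppan Merrill"),
    (r"ea\d+", "EFiling/EDGAR Agent"),
)

def agent_from_filename(filename: str):
    """Guess the filing agent from a document filename prefix, or None."""
    name = filename.lower()
    for pattern, agent in FILENAME_PREFIXES:
        if re.match(pattern, name):  # anchored at the start of the name
            return agent
    return None

print(agent_from_filename("d123456d10k.htm"))      # Donnelley Financial Solutions
print(agent_from_filename("tm2412345-1_10k.htm"))  # Toppan Merrill
print(agent_from_filename("exhibit99.htm"))        # None
```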


@ -0,0 +1,511 @@
"""
Heading candidate detection in SEC-cyBERT paragraph data.
Searches for inlined section headings that previous passes missed.
READ-ONLY: does not modify data, prints analysis to stdout.
"""
import json
import re
from collections import Counter, defaultdict
from pathlib import Path
DATA_PATH = Path(__file__).resolve().parent.parent / "data" / "paragraphs" / "paragraphs-clean.jsonl"
# ── Load data ──────────────────────────────────────────────────────────────────
print(f"Loading data from {DATA_PATH} ...")
paragraphs = []
with open(DATA_PATH) as f:
for line in f:
paragraphs.append(json.loads(line))
print(f"Loaded {len(paragraphs):,} paragraphs.\n")
# ── Helpers ────────────────────────────────────────────────────────────────────
def preview(text: str, n: int = 150) -> str:
"""First n chars, single-line."""
return text[:n].replace("\n", " ").strip()
COMMON_SENTENCE_STARTERS = {
"we", "our", "the", "a", "an", "as", "in", "on", "to", "for", "if",
"this", "these", "that", "those", "it", "its", "such", "no", "not",
"with", "from", "at", "by", "or", "and", "all", "any", "each",
"while", "when", "where", "although", "because", "since", "after",
"before", "during", "under", "over", "between", "through", "into",
"upon", "about", "there", "here", "however", "additionally",
"furthermore", "moreover", "also", "finally", "similarly",
"accordingly", "consequently", "therefore", "thus", "nonetheless",
"notwithstanding", "specifically", "generally", "currently",
"recently", "historically", "collectively", "certain",
}
HEADING_KEYWORDS = {
"oversight", "framework", "assessment", "compliance", "integration",
"governance", "strategy", "management", "disclosure", "reporting",
"response", "recovery", "prevention", "detection", "monitoring",
"awareness", "training", "policy", "policies", "procedures",
"controls", "cybersecurity", "information", "security", "risk",
"board", "committee", "audit", "technology", "infrastructure",
"incident", "incidents", "threat", "threats", "vulnerability",
"program", "processes", "overview", "background", "introduction",
"summary", "conclusion", "material", "materiality",
}
HEADING_GERUNDS = {
"protecting", "monitoring", "assessing", "managing", "overseeing",
"implementing", "establishing", "maintaining", "identifying",
"evaluating", "mitigating", "addressing", "enhancing", "ensuring",
"integrating", "reporting", "disclosing", "detecting", "preventing",
"responding", "recovering", "training", "educating", "reviewing",
"governing", "supervising", "coordinating", "leveraging",
"strengthening", "safeguarding", "securing",
}
SEPARATOR_LINE = "=" * 100
def print_section(title: str):
print(f"\n{SEPARATOR_LINE}")
print(f" {title}")
print(SEPARATOR_LINE)
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 1: First-sentence grammatical analysis
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 1: First-clause looks like a heading (title case prefix → sentence body)")
# Pattern: first N words are in title case, then a transition to normal
# sentence text. E.g. "Risk Management and Strategy We have..."
approach1_hits = []
for p in paragraphs:
text = p["text"].strip()
words = text.split()
if len(words) < 6:
continue
# Find the transition point: where title-case words stop
title_words = 0
for w in words:
# Strip punctuation for checking
clean = re.sub(r"[^a-zA-Z]", "", w)
if not clean:
title_words += 1
continue
# "and", "of", "the", "for", "in", "on" can be lowercase in titles
if clean.lower() in {"and", "of", "the", "for", "in", "on", "a", "an", "or", "to", "by", "with"}:
title_words += 1
continue
if clean[0].isupper():
title_words += 1
else:
break
# We want 3+ title-case words at the start, then a transition
if title_words >= 3 and title_words < len(words) - 2:
# Check that the word after the title block starts lowercase (sentence body)
rest_start = words[title_words] if title_words < len(words) else ""
rest_clean = re.sub(r"[^a-zA-Z]", "", rest_start)
if rest_clean and rest_clean[0].islower():
heading_part = " ".join(words[:title_words])
# Skip if heading part is just common sentence starters
if heading_part.lower().split()[0] not in COMMON_SENTENCE_STARTERS:
approach1_hits.append({
"id": p["id"],
"heading_words": title_words,
"heading": heading_part,
"preview": preview(text),
})
# Count heading patterns
heading_counter = Counter(h["heading"] for h in approach1_hits)
print(f"\nFound {len(approach1_hits):,} paragraphs with title-case prefix → lowercase body.")
print(f"Unique heading prefixes: {len(heading_counter):,}")
print(f"\nTOP 30 most common heading prefixes:")
for heading, count in heading_counter.most_common(30):
# Find an example
ex = next(h for h in approach1_hits if h["heading"] == heading)
print(f" [{count:4d}x] \"{heading}\"")
print(f" Example: {ex['preview']}")
print(f"\nSample of UNIQUE (1x) heading prefixes (first 30):")
unique_headings = [(h, next(x for x in approach1_hits if x["heading"] == h)) for h, c in heading_counter.items() if c == 1]
for heading, ex in unique_headings[:30]:
print(f" \"{heading}\"")
print(f"      {ex['preview']}")
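The title-case transition logic of Approach 1 condenses to one function. A minimal, self-contained sketch (the input text is hypothetical; the sentence-starter filter applied in the full loop is omitted here for brevity):

```python
import re

# Words allowed lowercase inside a title, as in the loop above.
MINOR_WORDS = {"and", "of", "the", "for", "in", "on", "a", "an", "or", "to", "by", "with"}

def split_title_prefix(text: str):
    """Return (heading, body) when text opens with 3+ title-case words
    followed by a lowercase sentence body; otherwise None."""
    words = text.split()
    n = 0
    for w in words:
        clean = re.sub(r"[^a-zA-Z]", "", w)
        if not clean or clean.lower() in MINOR_WORDS or clean[0].isupper():
            n += 1
        else:
            break  # first lowercase, non-minor word ends the title block
    if 3 <= n < len(words) - 2:
        return " ".join(words[:n]), " ".join(words[n:])
    return None

print(split_title_prefix("Risk Management and Strategy we assess threats annually"))
# ('Risk Management and Strategy', 'we assess threats annually')
```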
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 2: Capitalization anomalies
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 2: Capitalization anomalies")
# 2a: ALL CAPS at start
allcaps_hits = []
for p in paragraphs:
text = p["text"].strip()
words = text.split()
if len(words) < 4:
continue
# Check first 3+ words are ALL CAPS
caps_count = 0
for w in words:
clean = re.sub(r"[^a-zA-Z]", "", w)
if not clean:
caps_count += 1
continue
if clean.isupper() and len(clean) > 1:
caps_count += 1
else:
break
if caps_count >= 3:
allcaps_hits.append({
"id": p["id"],
"caps_words": caps_count,
"preview": preview(text),
})
print(f"\n2a. ALL CAPS for first 3+ words: {len(allcaps_hits):,} paragraphs")
for h in allcaps_hits[:20]:
print(f" [{h['caps_words']} caps words] {h['preview']}")
# 2b: First word is capitalized but NOT a common sentence starter
# and looks like a heading keyword
heading_start_hits = []
for p in paragraphs:
text = p["text"].strip()
words = text.split()
if len(words) < 4:
continue
first_word = re.sub(r"[^a-zA-Z]", "", words[0]).lower()
if first_word in HEADING_KEYWORDS and first_word not in COMMON_SENTENCE_STARTERS:
heading_start_hits.append({
"id": p["id"],
"first_word": first_word,
"preview": preview(text),
})
heading_start_counter = Counter(h["first_word"] for h in heading_start_hits)
print(f"\n2b. First word is a heading keyword (not a sentence starter): {len(heading_start_hits):,} paragraphs")
print("Breakdown by keyword:")
for kw, count in heading_start_counter.most_common(30):
ex = next(h for h in heading_start_hits if h["first_word"] == kw)
print(f" [{count:4d}x] \"{kw}\" → {ex['preview']}")
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 3: Separator patterns
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 3: Separator patterns (heading followed by separator then sentence)")
separator_patterns = {
"period": re.compile(r"^([A-Z][A-Za-z\s,&]{3,60})\.\s+([A-Z][a-z])"),
"dash/em-dash": re.compile(r"^([A-Z][A-Za-z\s,&]{3,60})\s*[–—-]\s*([A-Z][a-z])"),
"semicolon": re.compile(r"^([A-Z][A-Za-z\s,&]{3,60});\s*([A-Z][a-z])"),
"double space": re.compile(r"^([A-Z][A-Za-z\s,&]{3,60})\s{2,}([A-Z][a-z])"),
"colon": re.compile(r"^([A-Z][A-Za-z\s,&]{3,60}):\s*([A-Z][a-z])"),
"parenthetical prefix": re.compile(r"^\([a-z0-9ivx]+\)\s*([A-Z][A-Za-z\s,&]{3,60})\s+([a-z])"),
"bullet/pipe prefix": re.compile(r"^[•●■▪◦‣|]\s*([A-Z][A-Za-z\s,&]{3,60})\s+([a-z])"),
"tab separator": re.compile(r"^([A-Z][A-Za-z\s,&]{3,60})\t+(.+)"),
}
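As a quick sanity check on the patterns above (the sentence is invented for illustration, not drawn from the corpus), the "colon" pattern captures an inline heading before the colon because `:` is excluded from the heading character class:

```python
import re

# Same regex as the "colon" entry in separator_patterns above.
colon = re.compile(r"^([A-Z][A-Za-z\s,&]{3,60}):\s*([A-Z][a-z])")

# Invented filing-style sentence for illustration only.
m = colon.match("Risk Management: The Company maintains a cybersecurity program")
print(m.group(1))  # prints "Risk Management"
```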
for sep_name, pattern in separator_patterns.items():
hits = []
for p in paragraphs:
text = p["text"].strip()
m = pattern.match(text)
if m:
heading_candidate = m.group(1).strip() if m.lastindex >= 1 else ""
# Filter: heading should have at least 2 words
if len(heading_candidate.split()) >= 2:
hits.append({
"id": p["id"],
"heading": heading_candidate,
"preview": preview(text),
})
heading_counts = Counter(h["heading"] for h in hits)
print(f"\n Separator: {sep_name} → {len(hits):,} hits")
if hits:
for heading, count in heading_counts.most_common(20):
ex = next(h for h in hits if h["heading"] == heading)
print(f" [{count:4d}x] \"{heading}\"")
print(f" {ex['preview']}")
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 4: Repeated first-3-words analysis
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 4: Repeated first-3-word phrases")
first3_counter = Counter()
first3_examples = {}
for p in paragraphs:
text = p["text"].strip()
words = text.split()
if len(words) < 4:
continue
first3 = " ".join(words[:3])
first3_counter[first3] += 1
if first3 not in first3_examples:
first3_examples[first3] = preview(text)
# Filter to phrases appearing 5+ times that look heading-like
# (not common sentence starters)
common_starts = {
"we have implemented", "we have established", "we have adopted",
"we have not", "we do not", "we are not", "we believe that",
"we use a", "we rely on", "we have a", "we also have",
"our board of", "the board of", "the company has",
"the audit committee", "in addition to", "as part of",
"as a result", "in the event", "as of the",
"in accordance with", "with respect to",
}
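The heading-likeness test applied in the loop below can be shown in isolation (phrases invented; the real check additionally strips punctuation before testing the first character):

```python
SMALL_WORDS = {"and", "of", "the", "for", "in", "on", "a", "or", "to"}

def looks_heading_like(phrase: str) -> bool:
    # Every word must be capitalized or be a small connector word.
    return all(w[0].isupper() or w in SMALL_WORDS for w in phrase.split())

print(looks_heading_like("Board of Directors"))  # True
print(looks_heading_like("we have adopted"))     # False
```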
print(f"\nFirst-3-word phrases appearing 5+ times (excluding common sentence starts):")
for phrase, count in first3_counter.most_common(200):
if count < 5:
break
if phrase.lower() in common_starts:
continue
# Check if it looks heading-like: title case or contains heading keywords
words_lower = phrase.lower().split()
is_heading_like = (
all(w[0].isupper() or w in {"and", "of", "the", "for", "in", "on", "a", "or", "to"}
for w in phrase.split() if re.sub(r"[^a-zA-Z]", "", w))
and words_lower[0] not in COMMON_SENTENCE_STARTERS
)
label = " [HEADING-LIKE]" if is_heading_like else ""
print(f" [{count:4d}x] \"{phrase}\"{label}")
print(f" {first3_examples[phrase]}")
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 5: Cross-paragraph heading detection (short para → sentence para)
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 5: Cross-paragraph heading detection (standalone short headings)")
# Group paragraphs by accession number, sorted by index
by_filing = defaultdict(list)
for p in paragraphs:
acc = p["filing"]["accessionNumber"]
by_filing[acc].append(p)
for acc in by_filing:
by_filing[acc].sort(key=lambda x: x["paragraphIndex"])
standalone_headings = []
for acc, pars in by_filing.items():
for i in range(len(pars) - 1):
curr = pars[i]
nxt = pars[i + 1]
curr_text = curr["text"].strip()
curr_words = curr_text.split()
nxt_text = nxt["text"].strip()
# Current paragraph is short (< 10 words)
if len(curr_words) > 10 or len(curr_words) < 2:
continue
# Current paragraph looks like a heading:
# - Title case or all caps
# - No period at end (headings rarely end with period)
# - Not a single common word
if curr_text.endswith(".") and not curr_text.endswith("etc."):
continue
# Check title-case-ish
alpha_words = [w for w in curr_words if re.sub(r"[^a-zA-Z]", "", w)]
if not alpha_words:
continue
title_case_ratio = sum(
1 for w in alpha_words
if re.sub(r"[^a-zA-Z]", "", w)[0].isupper()
or re.sub(r"[^a-zA-Z]", "", w).lower() in {"and", "of", "the", "for", "in", "on", "a", "or", "to", "by", "with"}
) / len(alpha_words)
if title_case_ratio < 0.8:
continue
# Next paragraph should be long enough to read as body text
nxt_words = nxt_text.split()
if len(nxt_words) < 3:
continue
standalone_headings.append({
"id": curr["id"],
"heading_text": curr_text,
"next_preview": preview(nxt_text),
"accession": acc,
"company": curr["filing"]["companyName"],
})
heading_text_counter = Counter(h["heading_text"] for h in standalone_headings)
print(f"\nFound {len(standalone_headings):,} potential standalone heading paragraphs.")
print(f"Unique heading texts: {len(heading_text_counter):,}")
print(f"\nTOP 30 most common standalone headings:")
for heading, count in heading_text_counter.most_common(30):
ex = next(h for h in standalone_headings if h["heading_text"] == heading)
print(f" [{count:4d}x] \"{heading}\"")
print(f" Next para: {ex['next_preview']}")
print(f"\nSample of UNIQUE standalone headings (first 30):")
unique_standalone = [h for h in standalone_headings if heading_text_counter[h["heading_text"]] == 1]
for h in unique_standalone[:30]:
print(f" \"{h['heading_text']}\" ({h['company']})")
print(f" Next: {h['next_preview']}")
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 6: Unusual word patterns at paragraph start
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 6: Unusual starting words (gerunds, heading nouns)")
# 6a: Gerunds at start
gerund_hits = []
for p in paragraphs:
text = p["text"].strip()
words = text.split()
if len(words) < 4:
continue
first_word = re.sub(r"[^a-zA-Z]", "", words[0]).lower()
if first_word.endswith("ing") and len(first_word) > 4:
if first_word in HEADING_GERUNDS or first_word not in COMMON_SENTENCE_STARTERS:
gerund_hits.append({
"id": p["id"],
"first_word": first_word,
"preview": preview(text),
})
gerund_counter = Counter(h["first_word"] for h in gerund_hits)
print(f"\n6a. Paragraphs starting with gerunds: {len(gerund_hits):,}")
print("TOP 20 gerunds:")
for word, count in gerund_counter.most_common(20):
ex = next(h for h in gerund_hits if h["first_word"] == word)
print(f" [{count:4d}x] \"{word}\" → {ex['preview']}")
# 6b: Heading nouns at start (already covered in 2b, but let's look at
# multi-word patterns starting with heading nouns)
noun_phrase_hits = []
for p in paragraphs:
text = p["text"].strip()
words = text.split()
if len(words) < 4:
continue
first_word = re.sub(r"[^a-zA-Z]", "", words[0]).lower()
if first_word in HEADING_KEYWORDS:
# Take the first four words as the candidate heading phrase
first_few = " ".join(words[:min(4, len(words))])
noun_phrase_hits.append({
"id": p["id"],
"first_few": first_few,
"preview": preview(text),
})
noun_counter = Counter(h["first_few"] for h in noun_phrase_hits)
print(f"\n6b. Paragraphs starting with heading keyword nouns: {len(noun_phrase_hits):,}")
print("TOP 20 opening phrases:")
for phrase, count in noun_counter.most_common(20):
ex = next(h for h in noun_phrase_hits if h["first_few"] == phrase)
print(f" [{count:4d}x] \"{phrase}\" → {ex['preview']}")
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 7: Numbers/letters at start (list items / numbered headings)
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 7: Numbered/lettered items at paragraph start")
numbered_patterns = {
"roman_paren": re.compile(r"^\((?:i{1,3}|iv|v|vi{0,3}|ix|x)\)\s"),
"letter_paren": re.compile(r"^\([a-z]\)\s"),
"number_paren": re.compile(r"^\(\d+\)\s"),
"number_dot": re.compile(r"^\d+\.\s"),
"letter_dot": re.compile(r"^[a-z]\.\s"),
"roman_dot": re.compile(r"^(?:i{1,3}|iv|v|vi{0,3}|ix|x)\.\s"),
"bullet_chars": re.compile(r"^[•●■▪◦‣►▸→·]\s"),
"dash_bullet": re.compile(r"^[-–—]\s+[A-Z]"),
}
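To make the list-marker patterns concrete (test strings invented for illustration), `roman_paren` accepts a parenthesized roman numeral followed by whitespace but rejects ordinary parentheticals:

```python
import re

# Same regex as the "roman_paren" entry above; covers (i) through (x).
roman_paren = re.compile(r"^\((?:i{1,3}|iv|v|vi{0,3}|ix|x)\)\s")

print(bool(roman_paren.match("(iv) oversee third-party cyber risk")))  # True
print(bool(roman_paren.match("(see below) for details")))              # False
```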
for pattern_name, pattern in numbered_patterns.items():
hits = []
for p in paragraphs:
text = p["text"].strip()
if pattern.match(text):
hits.append({
"id": p["id"],
"preview": preview(text),
"wordCount": p["wordCount"],
})
print(f"\n Pattern: {pattern_name} → {len(hits):,} hits")
if hits:
# Show word count distribution
short = sum(1 for h in hits if h["wordCount"] < 15)
medium = sum(1 for h in hits if 15 <= h["wordCount"] < 50)
long = sum(1 for h in hits if h["wordCount"] >= 50)
print(f" Length distribution: <15 words: {short}, 15-49: {medium}, 50+: {long}")
print(f" Examples (first 10):")
for h in hits[:10]:
print(f" [{h['wordCount']:3d}w] {h['preview']}")
# ══════════════════════════════════════════════════════════════════════════════
# APPROACH 8 (BONUS): Known heading phrases at paragraph start
# ══════════════════════════════════════════════════════════════════════════════
print_section("APPROACH 8 (BONUS): Known heading phrases at the start of a paragraph")
# Check for known SEC 1C heading phrases appearing at the start of a paragraph
# even if not perfectly title-cased
known_heading_phrases = [
"risk management", "risk assessment", "risk factors",
"governance", "board oversight", "board of directors",
"incident response", "third party", "third-party",
"cybersecurity program", "cybersecurity risk", "cybersecurity governance",
"information security", "data protection", "data privacy",
"security operations", "security awareness",
"management oversight", "committee oversight",
"risk management and strategy", "risk management, strategy",
"material cybersecurity", "materiality assessment",
"disclosure controls",
]
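Because the matching below lowercases the first 80 characters, an ALL-CAPS variant of a heading phrase is still caught (paragraph text invented for illustration):

```python
# Invented paragraph opening, in the ALL-CAPS style some filings use.
text = "RISK MANAGEMENT AND STRATEGY We assess cybersecurity risks annually."

print(text[:80].lower().startswith("risk management and strategy"))  # True
```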
phrase_hits = defaultdict(list)
for p in paragraphs:
text = p["text"].strip()
# Only look at the first ~80 chars
first_part = text[:80].lower()
for phrase in known_heading_phrases:
if first_part.startswith(phrase):
phrase_hits[phrase].append({
"id": p["id"],
"preview": preview(text),
})
print(f"\nParagraphs starting with known heading phrases:")
for phrase in sorted(phrase_hits.keys(), key=lambda x: -len(phrase_hits[x])):
hits = phrase_hits[phrase]
print(f"\n \"{phrase}\" → {len(hits)} hits")
for h in hits[:5]:
print(f" {h['preview']}")
# ══════════════════════════════════════════════════════════════════════════════
# SUMMARY
# ══════════════════════════════════════════════════════════════════════════════
print_section("SUMMARY")
print(f"""
Approach 1 (title-case prefix body): {len(approach1_hits):,} hits
Approach 2a (ALL CAPS start): {len(allcaps_hits):,} hits
Approach 2b (heading keyword start): {len(heading_start_hits):,} hits
Approach 3 (separator patterns): see above per-separator
Approach 4 (repeated first-3 words):    see above per-phrase
Approach 5 (standalone short headings): {len(standalone_headings):,} hits
Approach 6a (gerund starts): {len(gerund_hits):,} hits
Approach 6b (heading noun starts): {len(noun_phrase_hits):,} hits
Approach 7 (numbered/lettered): see above per-pattern
Approach 8 (known phrase starts): {sum(len(v) for v in phrase_hits.values()):,} hits
""")
print("Done.")


@@ -0,0 +1,471 @@
"""
Investigate whether certain SEC filing generators produce systematically worse
text extraction in the SEC-cyBERT corpus. READ-ONLY analysis.
"""
import json
import os
import random
import re
from collections import Counter, defaultdict
from pathlib import Path
random.seed(42)
HTML_DIR = Path("data/raw/html")
PARAGRAPHS_FILE = Path("data/paragraphs/paragraphs-clean.jsonl")
# ─────────────────────────────────────────────────────────────────────────────
# Helpers
# ─────────────────────────────────────────────────────────────────────────────
def extract_generator(header_bytes: bytes) -> str:
"""Extract generator from first ~5KB of an HTML file."""
text = header_bytes.decode("utf-8", errors="replace")
# 1. <meta name="generator" content="...">
m = re.search(
r'<meta\s+name\s*=\s*["\']generator["\']\s+content\s*=\s*["\']([^"\']+)["\']',
text, re.IGNORECASE
)
if m:
return m.group(1).strip()
# Also try the reversed attribute order (content= before name=)
m = re.search(
r'<meta\s+content\s*=\s*["\']([^"\']+)["\']\s+name\s*=\s*["\']generator["\']',
text, re.IGNORECASE
)
if m:
return m.group(1).strip()
# 2. <!-- Generated by ... -->
m = re.search(r'<!--\s*Generated\s+by\s+([^->]+)', text, re.IGNORECASE)
if m:
return m.group(1).strip()
# 3. Distinctive patterns
if "Workiva" in text or "wkiva" in text.lower():
return "Workiva (pattern)"
if "ix:header" in text.lower() or "ix:hidden" in text.lower():
# iXBRL inline — common but not a specific generator
pass
if "toppanmerrill" in text.lower() or "toppan" in text.lower():
return "Toppan Merrill (pattern)"
if "donnelley" in text.lower() or "edgar online" in text.lower():
return "Donnelley/EDGAR Online (pattern)"
if "GoXBRL" in text:
return "GoXBRL (pattern)"
return "UNKNOWN"
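The primary meta-tag branch of `extract_generator` can be exercised against a minimal synthetic header (HTML invented for illustration):

```python
import re

# Same regex as the first <meta name="generator"> lookup above.
meta_re = re.compile(
    r'<meta\s+name\s*=\s*["\']generator["\']\s+content\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)

html = '<html><head><meta name="generator" content="Workiva"></head>'
m = meta_re.search(html)
print(m.group(1))  # "Workiva"
```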
def normalize_generator(raw: str) -> str:
"""Normalize generator strings to canonical names."""
low = raw.lower()
if "workiva" in low or "wdesk" in low or "wkiva" in low:
return "Workiva"
if "toppan" in low or "merrill" in low:
return "Toppan Merrill"
if "donnelley" in low or "edgar online" in low:
return "Donnelley"
if "goxbrl" in low:
return "GoXBRL"
if "word" in low or "microsoft" in low:
return "Microsoft Word"
if "webfilings" in low:
return "WebFilings"
if "novaworks" in low:
return "Novaworks"
if "ez-xbrl" in low or "ezxbrl" in low:
return "EZ-XBRL"
if "ixbrl" in low or "inline xbrl" in low:
return "iXBRL Generator"
if "vintage" in low:
return "Vintage (Donnelley)"
if "edgar" in low:
return "EDGAR"
if raw == "UNKNOWN":
return "UNKNOWN"
return raw # keep as-is if no match
def read_generator_for_file(filepath: Path) -> str:
"""Read the first 5KB and extract the generator."""
try:
with open(filepath, "rb") as f:
header = f.read(5000)
return normalize_generator(extract_generator(header))
except Exception:
return "ERROR"
# ─────────────────────────────────────────────────────────────────────────────
# Step 0: Load paragraphs
# ─────────────────────────────────────────────────────────────────────────────
print("Loading paragraphs...")
paragraphs = []
filing_paragraphs = defaultdict(list) # accession -> [paragraph dicts]
with open(PARAGRAPHS_FILE) as f:
for line in f:
p = json.loads(line)
paragraphs.append(p)
acc = p["filing"]["accessionNumber"]
filing_paragraphs[acc].append(p)
print(f" Loaded {len(paragraphs):,} paragraphs from {len(filing_paragraphs):,} filings\n")
# ─────────────────────────────────────────────────────────────────────────────
# Step 1: Identify filing generators (500 random HTML files)
# ─────────────────────────────────────────────────────────────────────────────
print("=" * 80)
print("STEP 1: IDENTIFY FILING GENERATORS (500-file sample)")
print("=" * 80)
all_html_files = sorted(HTML_DIR.glob("*.html"))
sample_files = random.sample(all_html_files, min(500, len(all_html_files)))
sample_generators = {} # filename_stem -> generator
raw_generator_strings = []
for f in sample_files:
try:
with open(f, "rb") as fh:
header = fh.read(5000)
raw = extract_generator(header)
raw_generator_strings.append(raw)
gen = normalize_generator(raw)
sample_generators[f.stem] = gen
except Exception:
sample_generators[f.stem] = "ERROR"
gen_counts = Counter(sample_generators.values())
print(f"\nGenerator distribution (500-file sample):\n")
print(f" {'Generator':<30} {'Count':>6} {'%':>7}")
print(f" {'-'*30} {'-'*6} {'-'*7}")
for gen, count in gen_counts.most_common():
print(f" {gen:<30} {count:>6} {count/len(sample_files)*100:>6.1f}%")
print(f"\nRaw generator strings (unique):")
raw_counts = Counter(raw_generator_strings)
for raw, count in raw_counts.most_common(20):
print(f" [{count:>4}] {raw[:80]}")
# ─────────────────────────────────────────────────────────────────────────────
# Step 2: Generator-specific quality metrics
# ─────────────────────────────────────────────────────────────────────────────
print("\n" + "=" * 80)
print("STEP 2: GENERATOR-SPECIFIC QUALITY METRICS")
print("=" * 80)
# Major generators: those with >20 filings in sample
major_gens = {g for g, c in gen_counts.items() if c > 20}
print(f"\nMajor generators (>20 in sample): {sorted(major_gens)}\n")
# For each sampled filing that has paragraphs, compute metrics
gen_metrics = defaultdict(lambda: {
"filing_count": 0,
"para_counts": [],
"word_counts": [],
"lowercase_starts": 0,
"total_paras": 0,
"short_paras": 0, # <25 words
"html_sizes": [],
"text_sizes": [],
})
for stem, gen in sample_generators.items():
if gen not in major_gens:
continue
acc = stem # filename stem is the accession number
paras = filing_paragraphs.get(acc, [])
m = gen_metrics[gen]
m["filing_count"] += 1
m["para_counts"].append(len(paras))
# HTML file size
html_path = HTML_DIR / f"{stem}.html"
try:
html_size = html_path.stat().st_size
except Exception:
html_size = 0
m["html_sizes"].append(html_size)
total_text_len = 0
for p in paras:
wc = p.get("wordCount", len(p["text"].split()))
m["word_counts"].append(wc)
m["total_paras"] += 1
total_text_len += len(p["text"])
if p["text"] and p["text"][0].islower():
m["lowercase_starts"] += 1
if wc < 25:
m["short_paras"] += 1
m["text_sizes"].append(total_text_len)
# Print table
print(f" {'Generator':<22} {'Files':>5} {'Avg ¶':>7} {'Avg WC':>7} {'%lc':>6} {'%short':>7} {'ExtRatio':>9}")
print(f" {'-'*22} {'-'*5} {'-'*7} {'-'*7} {'-'*6} {'-'*7} {'-'*9}")
for gen in sorted(major_gens):
m = gen_metrics[gen]
n = m["filing_count"]
if n == 0:
continue
avg_paras = sum(m["para_counts"]) / n if n else 0
avg_wc = sum(m["word_counts"]) / len(m["word_counts"]) if m["word_counts"] else 0
pct_lc = (m["lowercase_starts"] / m["total_paras"] * 100) if m["total_paras"] else 0
pct_short = (m["short_paras"] / m["total_paras"] * 100) if m["total_paras"] else 0
# Extraction ratio: total text bytes / html bytes
total_html = sum(m["html_sizes"])
total_text = sum(m["text_sizes"])
ext_ratio = (total_text / total_html * 100) if total_html else 0
print(f" {gen:<22} {n:>5} {avg_paras:>7.1f} {avg_wc:>7.1f} {pct_lc:>5.1f}% {pct_short:>6.1f}% {ext_ratio:>8.2f}%")
# ─────────────────────────────────────────────────────────────────────────────
# Step 3: HTML structure analysis — representative snippets
# ─────────────────────────────────────────────────────────────────────────────
print("\n" + "=" * 80)
print("STEP 3: HTML STRUCTURE ANALYSIS (paragraph encoding by generator)")
print("=" * 80)
top5_gens = [g for g, _ in gen_counts.most_common(5)]
for gen in top5_gens:
# Find a sample file for this generator
sample_acc = None
for stem, g in sample_generators.items():
if g == gen:
sample_acc = stem
break
if not sample_acc:
continue
html_path = HTML_DIR / f"{sample_acc}.html"
try:
with open(html_path, "r", errors="replace") as fh:
content = fh.read(50000) # read enough to find a paragraph
# Find a <p> tag or similar paragraph structure
# Look for a <p tag with content
m = re.search(r'(<p\b[^>]*>[^<]{20,})', content, re.IGNORECASE)
if m:
snippet = m.group(1)[:200]
else:
# Try <div> or <span> with text
m = re.search(r'(<(?:div|span)\b[^>]*>[^<]{20,})', content, re.IGNORECASE)
if m:
snippet = m.group(1)[:200]
else:
snippet = "(no paragraph tag found in first 50KB)"
except Exception as e:
snippet = f"(error: {e})"
print(f"\n Generator: {gen}")
print(f" File: {sample_acc}.html")
print(f" Snippet: {snippet}")
print()
# ─────────────────────────────────────────────────────────────────────────────
# Step 4: Generator fingerprinting of problem paragraphs
# ─────────────────────────────────────────────────────────────────────────────
print("=" * 80)
print("STEP 4: GENERATOR FINGERPRINTING OF PROBLEM PARAGRAPHS")
print("=" * 80)
# Identify problem paragraphs
lowercase_paras = []
long_paras = [] # >300 words
short_paras = [] # <25 words
for p in paragraphs:
wc = p.get("wordCount", len(p["text"].split()))
if p["text"] and p["text"][0].islower():
lowercase_paras.append(p)
if wc > 300:
long_paras.append(p)
if wc < 25:
short_paras.append(p)
print(f"\n Problem paragraph counts:")
print(f" Lowercase starts: {len(lowercase_paras):,}")
print(f" Long (>300 words): {len(long_paras):,}")
print(f" Short (<25 words): {len(short_paras):,}")
print(f" Total paragraphs: {len(paragraphs):,}")
# For each category, sample up to 200 and look up generators
# We need a cache of accession -> generator since we may need to read many files
print("\n Building generator cache for problem filings...")
problem_accessions = set()
for p in lowercase_paras:
problem_accessions.add(p["filing"]["accessionNumber"])
for p in long_paras:
problem_accessions.add(p["filing"]["accessionNumber"])
for p in short_paras:
problem_accessions.add(p["filing"]["accessionNumber"])
# Also get generators for ALL filings to compute baseline
print(" Reading generators for ALL filings in the corpus...")
all_accessions = set(filing_paragraphs.keys())
acc_generator = {}
for acc in all_accessions:
html_path = HTML_DIR / f"{acc}.html"
if html_path.exists():
acc_generator[acc] = read_generator_for_file(html_path)
else:
acc_generator[acc] = "FILE_MISSING"
# Baseline distribution
baseline_gen_counts = Counter(acc_generator.values())
print(f"\n Full corpus generator distribution ({len(acc_generator):,} filings):\n")
print(f" {'Generator':<30} {'Count':>6} {'%':>7}")
print(f" {'-'*30} {'-'*6} {'-'*7}")
total_filings = len(acc_generator)
for gen, count in baseline_gen_counts.most_common(15):
print(f" {gen:<30} {count:>6} {count/total_filings*100:>6.1f}%")
def analyze_problem_category(name, problem_list, acc_generator, baseline_gen_counts, total_filings):
"""Analyze which generators are over-represented in a problem category."""
print(f"\n --- {name} ({len(problem_list):,} paragraphs) ---")
# Count generators for problem paragraphs (by paragraph, not by filing)
gen_para_counts = Counter()
for p in problem_list:
acc = p["filing"]["accessionNumber"]
gen = acc_generator.get(acc, "UNKNOWN")
gen_para_counts[gen] += 1
total_problem = len(problem_list)
total_all = len(paragraphs)
print(f" {'Generator':<30} {'# Problem':>9} {'% of Prob':>9} {'% of All':>9} {'Over-rep':>9}")
print(f" {'-'*30} {'-'*9} {'-'*9} {'-'*9} {'-'*9}")
# Compute total paragraphs per generator
gen_all_para_counts = Counter()
for p in paragraphs:
acc = p["filing"]["accessionNumber"]
gen = acc_generator.get(acc, "UNKNOWN")
gen_all_para_counts[gen] += 1
for gen, prob_count in gen_para_counts.most_common(10):
pct_of_problem = prob_count / total_problem * 100 if total_problem else 0
all_count = gen_all_para_counts.get(gen, 1)
pct_of_all = all_count / total_all * 100 if total_all else 0
over_rep = pct_of_problem / pct_of_all if pct_of_all else 0
print(f" {gen:<30} {prob_count:>9,} {pct_of_problem:>8.1f}% {pct_of_all:>8.1f}% {over_rep:>8.2f}x")
# Show a few example problem texts
print(f"\n Example texts:")
for p in problem_list[:3]:
text = p["text"][:120].replace("\n", " ")
acc = p["filing"]["accessionNumber"]
gen = acc_generator.get(acc, "?")
print(f" [{gen}] {text}...")
analyze_problem_category("Lowercase starts (orphan words)", lowercase_paras, acc_generator, baseline_gen_counts, total_filings)
analyze_problem_category("Long paragraphs (>300 words, potential merges)", long_paras, acc_generator, baseline_gen_counts, total_filings)
analyze_problem_category("Short paragraphs (<25 words, potential fragments)", short_paras, acc_generator, baseline_gen_counts, total_filings)
# ─────────────────────────────────────────────────────────────────────────────
# Step 5: Filing size vs extraction quality
# ─────────────────────────────────────────────────────────────────────────────
print("\n" + "=" * 80)
print("STEP 5: FILING SIZE vs EXTRACTION QUALITY")
print("=" * 80)
# Compute HTML size and paragraph count for all filings
size_para_data = []
for acc, paras_list in filing_paragraphs.items():
html_path = HTML_DIR / f"{acc}.html"
try:
html_size = html_path.stat().st_size
except Exception:
continue
size_para_data.append({
"acc": acc,
"html_size": html_size,
"para_count": len(paras_list),
"generator": acc_generator.get(acc, "UNKNOWN"),
})
# Bin by size ranges
size_bins = [
(0, 50_000, "<50KB"),
(50_000, 200_000, "50-200KB"),
(200_000, 500_000, "200-500KB"),
(500_000, 1_000_000, "500KB-1MB"),
(1_000_000, 5_000_000, "1-5MB"),
(5_000_000, float("inf"), ">5MB"),
]
print(f"\n HTML Size vs Extracted Paragraphs:\n")
print(f" {'Size Range':<15} {'Files':>6} {'Avg ¶':>7} {'Med ¶':>7} {'Min ¶':>6} {'Max ¶':>6}")
print(f" {'-'*15} {'-'*6} {'-'*7} {'-'*7} {'-'*6} {'-'*6}")
for lo, hi, label in size_bins:
in_bin = [d for d in size_para_data if lo <= d["html_size"] < hi]
if not in_bin:
continue
counts = sorted([d["para_count"] for d in in_bin])
avg = sum(counts) / len(counts)
med = counts[len(counts) // 2]
print(f" {label:<15} {len(in_bin):>6} {avg:>7.1f} {med:>7} {min(counts):>6} {max(counts):>6}")
# Large HTML files with very few paragraphs — likely extraction failures
print(f"\n Potential extraction failures (HTML >1MB but ≤2 paragraphs):\n")
big_few = [d for d in size_para_data if d["html_size"] > 1_000_000 and d["para_count"] <= 2]
big_few.sort(key=lambda d: d["html_size"], reverse=True)
if not big_few:
# Relax threshold
print(" (None found with >1MB and ≤2 paragraphs. Relaxing to >500KB and ≤3 paragraphs)\n")
big_few = [d for d in size_para_data if d["html_size"] > 500_000 and d["para_count"] <= 3]
big_few.sort(key=lambda d: d["html_size"], reverse=True)
print(f" {'Accession':<30} {'HTML Size':>12} {'Paras':>6} {'Generator':<25}")
print(f" {'-'*30} {'-'*12} {'-'*6} {'-'*25}")
for d in big_few[:10]:
size_str = f"{d['html_size']/1024/1024:.2f} MB" if d['html_size'] > 1_000_000 else f"{d['html_size']/1024:.0f} KB"
print(f" {d['acc']:<30} {size_str:>12} {d['para_count']:>6} {d['generator']:<25}")
# Also show the reverse: small HTML with many paragraphs
print(f"\n Unusual: Small HTML (<50KB) with many paragraphs (>15):\n")
small_many = [d for d in size_para_data if d["html_size"] < 50_000 and d["para_count"] > 15]
small_many.sort(key=lambda d: d["para_count"], reverse=True)
print(f" {'Accession':<30} {'HTML Size':>12} {'Paras':>6} {'Generator':<25}")
print(f" {'-'*30} {'-'*12} {'-'*6} {'-'*25}")
for d in small_many[:10]:
size_str = f"{d['html_size']/1024:.0f} KB"
print(f" {d['acc']:<30} {size_str:>12} {d['para_count']:>6} {d['generator']:<25}")
# ─────────────────────────────────────────────────────────────────────────────
# Summary
# ─────────────────────────────────────────────────────────────────────────────
print("\n" + "=" * 80)
print("SUMMARY")
print("=" * 80)
print("""
Key findings are printed above. Look for:
1. Which generators dominate the corpus
2. Whether any generator has notably worse extraction metrics (low para count,
high % lowercase starts, low extraction ratio)
3. Whether problem paragraphs cluster around specific generators (over-rep > 1.5x)
4. Whether large-HTML / few-paragraph cases cluster on a specific generator
""")


@@ -0,0 +1,627 @@
#!/usr/bin/env python3
"""
Cross-reference SEC filing generators with paragraph quality metrics.
Reuses detection logic from detect_generators.py, then computes quality
metrics per generator from paragraphs-clean.jsonl.
"""
import json
import os
import re
import sys
import statistics
from collections import defaultdict, Counter
from pathlib import Path
HTML_DIR = Path("/home/joey/Documents/sec-cyBERT/data/raw/html")
PARAGRAPHS_FILE = Path("/home/joey/Documents/sec-cyBERT/data/paragraphs/paragraphs-clean.jsonl")
READ_BYTES = 20_000
# ── Generator detection (copied from detect_generators.py) ──
FILING_AGENT_CIKS = {
"0000950170": "Donnelley Financial Solutions",
"0001193125": "Donnelley Financial Solutions",
"0001558370": "Toppan Merrill",
"0001654954": "Toppan Merrill",
}
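The keys above are the leading CIK segments of accession numbers; the fallback at the end of `detect_generator` splits the filename stem on `-` and looks the prefix up in this map (the accession number below is made up for illustration):

```python
# Hypothetical accession-number filename stem.
acc = "0001193125-24-000123"
prefix = acc.split("-")[0]
print(prefix)  # "0001193125"
# FILING_AGENT_CIKS.get(prefix) would then yield "Donnelley Financial Solutions".
```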
def _normalize_generator(raw: str) -> str:
r = raw.strip().lower()
if "workiva" in r or "wdesk" in r:
return "Workiva"
if "donnelley" in r or "dfin" in r or "rrdonnelley" in r:
return "Donnelley Financial Solutions"
if ("toppan" in r) or ("merrill" in r and "bridge" in r):
return "Toppan Merrill"
if "word" in r and "microsoft" in r:
return "Microsoft Word"
if "excel" in r and "microsoft" in r:
return "Microsoft Excel"
if "thunderdome" in r:
return "ThunderDome"
if "goxbrl" in r:
return "GoXBRL"
if "compsci" in r:
return "CompSci Transform"
if "certent" in r:
return "Certent"
if "iris carbon" in r:
return "IRIS Carbon"
if "broadridge" in r or "profile" in r:
return "Broadridge PROfile"
if "sec publisher" in r:
return "SEC Publisher"
return raw.strip()
def detect_generator(filepath: str) -> str:
"""Read first 20KB and return generator name."""
with open(filepath, "rb") as f:
raw = f.read(READ_BYTES)
text = raw.decode("utf-8", errors="replace")
text_lower = text.lower()
# meta generator
m = re.search(r'<meta\s+name\s*=\s*["\']generator["\']\s+content\s*=\s*["\']([^"\']+)["\']', text, re.I)
if not m:
m = re.search(r'<meta\s+content\s*=\s*["\']([^"\']+)["\']\s+name\s*=\s*["\']generator["\']', text, re.I)
if m:
return _normalize_generator(m.group(1))
m = re.search(r'<meta\s+name\s*=\s*["\']Creator["\']\s+content\s*=\s*["\']([^"\']+)["\']', text, re.I)
if m:
return _normalize_generator(m.group(1))
m = re.search(r'<meta\s+name\s*=\s*["\']Producer["\']\s+content\s*=\s*["\']([^"\']+)["\']', text, re.I)
if m:
return _normalize_generator(m.group(1))
m = re.search(r'<meta\s+name\s*=\s*["\']ProgId["\']\s+content\s*=\s*["\']([^"\']+)["\']', text, re.I)
if m:
progid = m.group(1)
if "word" in progid.lower():
return "Microsoft Word"
if "excel" in progid.lower():
return "Microsoft Excel"
return _normalize_generator(progid)
# Comment signatures
if re.search(r"<!--.*Created with the Workiva Platform.*-->", text, re.I):
return "Workiva"
if re.search(r"<!--.*Copyright\s+\d{4}\s+Workiva.*-->", text, re.I):
return "Workiva"
if re.search(r"<!--.*Document created using Wdesk.*-->", text, re.I):
return "Workiva"
if re.search(r"<!--.*(?:Toppan\s*Merrill|iXBRL document created with.*Toppan).*-->", text, re.I):
return "Toppan Merrill"
if re.search(r"<!--.*Merrill\s*Bridge.*-->", text, re.I):
return "Toppan Merrill"
if re.search(r"<!--.*Donnelley Financial Solutions.*-->", text, re.I):
return "Donnelley Financial Solutions"
if re.search(r"<!--.*RR\s*Donnelley.*-->", text, re.I):
return "Donnelley Financial Solutions"
if re.search(r"<!--.*Broadridge\s+PROfile.*-->", text, re.I):
return "Broadridge PROfile"
if "broadridge" in text_lower:
return "Broadridge PROfile"
m_title = re.search(r"<title[^>]*>([^<]+)</title>", text, re.I)
title_text = m_title.group(1).strip() if m_title else ""
if "sec publisher" in text_lower or "sec publisher" in title_text.lower():
return "SEC Publisher"
m = re.search(r"<!--.*Powered by IRIS Carbon.*-->", text, re.I)
if m:
return "IRIS Carbon"
if re.search(r"<!--.*Certent\s+Disclosure\s+Management.*-->", text, re.I):
return "Certent"
if "certent" in text_lower:
return "Certent"
if re.search(r"<!--.*CompSci Resources.*-->", text, re.I):
return "CompSci Transform"
if re.search(r"<!--.*RDG Portal.*-->", text, re.I):
return "RDG Portal"
if title_text.lower() == "pdf to edgar" or "pdf to edgar" in text_lower[:2000]:
return "PDF to EDGAR"
m = re.search(r"<!--\s*Generated\s+by\s+([^-]+?)-->", text, re.I)
if m:
val = m.group(1).strip()
if not re.match(r"^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}", val):
return _normalize_generator(val)
m = re.search(r"<!--\s*Created\s+(?:by|with)\s+([^-]+?)-->", text, re.I)
if m:
val = m.group(1).strip()
if not re.match(r"^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}", val):
return _normalize_generator(val)
# Keyword signatures
if re.search(r"\bwdesk\b", text_lower):
return "Workiva"
if re.search(r"\bworkiva\b", text_lower):
return "Workiva"
if re.search(r"\brrdonnelley\b", text_lower):
return "Donnelley Financial Solutions"
if re.search(r"\bedgar-online\b", text_lower):
return "Donnelley Financial Solutions"
if re.search(r"\btoppan\b", text_lower):
return "Toppan Merrill"
if re.search(r"\bmerrill\b", text_lower) and re.search(r"\b(?:bridge|ixbrl|xbrl)\b", text_lower):
return "Toppan Merrill"
if re.search(r"\bbowne\b", text_lower):
return "Toppan Merrill"
if re.search(r"\bcompsci\b", text_lower):
return "CompSci Transform"
if re.search(r"\bthunderdome\b", text_lower):
return "ThunderDome"
if re.search(r"\bgoxbrl\b", text_lower):
return "GoXBRL"
if re.search(r'class\s*=\s*["\'][^"\']*\bwk_\w+', text_lower):
return "Workiva"
# SGML document wrapper
has_sgml = re.search(r"<DOCUMENT>\s*\n?\s*<TYPE>", text, re.I)
if has_sgml:
m_fn = re.search(r"<FILENAME>\s*([\w\-\.]+)", text, re.I)
if m_fn:
filename = m_fn.group(1).lower()
if re.match(r"d\d+", filename):
return "Donnelley Financial Solutions"
if re.match(r"tm\d+", filename):
return "Toppan Merrill"
if re.match(r"ea\d+", filename):
return "EFiling/EDGAR Agent"
if "<!-- field: rule-page" in text_lower or "rule-page" in text_lower[:5000]:
return "Broadridge PROfile"
if "field: set; name: xdx" in text_lower:
return "EFiling XDX"
if "<!-- field:" in text_lower[:5000]:
return "EFiling/EDGAR Agent"
if re.search(r'<Center><DIV STYLE="width:8\.5in"', text):
return "Donnelley Financial Solutions"
basename = os.path.basename(filepath)
accession_prefix = basename.split("-")[0]
if accession_prefix in FILING_AGENT_CIKS:
return FILING_AGENT_CIKS[accession_prefix]
font_count = text_lower.count("<font")
if font_count > 5:
return "SGML-wrapped (legacy)"
return "SGML-wrapped (unknown)"
# Inline XBRL
has_ix_ns = "xmlns:ix=" in text_lower or "<ix:header" in text_lower
if re.search(r'<P STYLE="[^"]*font-family:Times New Roman"', text) and re.search(
r'<Center><DIV STYLE="width:8\.5in"', text
):
return "Donnelley Financial Solutions"
if title_text:
title_lower = title_text.lower()
if "workiva" in title_lower or "wdesk" in title_lower:
return "Workiva"
if has_ix_ns:
if "field: set; name: xdx" in text_lower:
return "EFiling XDX"
if "<!-- field: rule" in text_lower:
return "Broadridge PROfile"
if "<!-- field:" in text_lower[:5000]:
return "EFiling/EDGAR Agent"
basename = os.path.basename(filepath)
accession_prefix = basename.split("-")[0]
if accession_prefix in FILING_AGENT_CIKS:
return FILING_AGENT_CIKS[accession_prefix]
if '<?xml version="1.0" encoding="utf-8"' in text_lower[:200]:
return "Inline XBRL (utf-8 toolchain)"
if "<?xml version='1.0' encoding='ascii'?>" in text_lower[:200]:
return "Inline XBRL (SEC/EDGAR standard)"
return "Inline XBRL (tool unresolved)"
# Structural fallbacks
font_count = text_lower.count("<font")
td_count = text_lower.count("<td")
span_count = text_lower.count("<span")
if font_count > 20:
return "Legacy generator (font-based)"
if td_count > 50 and span_count < 10:
return "Table-based generator"
data_attr_count = len(re.findall(r"\bdata-\w+", text_lower))
if data_attr_count > 10:
return "Modern web tooling"
return "Unknown"
# ── Consolidate to ~14 families ──
FAMILY_MAP = {
"Workiva": "Workiva",
"Donnelley Financial Solutions": "Donnelley Financial Solutions",
"Toppan Merrill": "Toppan Merrill",
"CompSci Transform": "CompSci Transform",
"ThunderDome": "ThunderDome",
"EFiling/EDGAR Agent": "EFiling/EDGAR Agent",
"EFiling XDX": "EFiling/EDGAR Agent",
"Broadridge PROfile": "Broadridge PROfile",
"SEC Publisher": "SEC Publisher",
"IRIS Carbon": "IRIS Carbon",
"RDG Portal": "RDG Portal",
"Certent": "Certent",
"PDF to EDGAR": "PDF to EDGAR",
"GoXBRL": "GoXBRL",
"Microsoft Word": "Microsoft Word",
"Microsoft Excel": "Microsoft Excel",
"Inline XBRL (SEC/EDGAR standard)": "Inline XBRL (unattributed)",
"Inline XBRL (utf-8 toolchain)": "Inline XBRL (unattributed)",
"Inline XBRL (tool unresolved)": "Inline XBRL (unattributed)",
"SGML-wrapped (legacy)": "SGML-wrapped (unattributed)",
"SGML-wrapped (unknown)": "SGML-wrapped (unattributed)",
"Legacy generator (font-based)": "Other/Legacy",
"Table-based generator": "Other/Legacy",
"Modern web tooling": "Other/Legacy",
"Unknown": "Unknown",
}
# ── Quality metric helpers ──
# Common non-heading start words to exclude from title-case detection
NON_HEADING_STARTS = {
"we", "our", "the", "in", "a", "an", "as", "to", "on", "at", "by",
"for", "it", "is", "if", "or", "no", "so", "do", "its", "this",
"that", "with", "from", "has", "had", "have", "will", "may", "can",
"all", "any", "are", "was", "were", "been", "not", "but", "each",
"such", "these", "those", "also", "when", "there", "their",
"they", "them", "than", "who", "what", "how", "where",
}
# Section name fragments for Item 1C
SECTION_KEYWORDS = [
"risk management", "board oversight", "governance", "incident",
"strategy", "third party", "management role", "cybersecurity",
"risk factors", "material", "overview",
]
RE_ALLCAPS_HEADER = re.compile(r"^[A-Z][A-Z\s,&\-]{10,}[a-z]")
def is_inlined_header(text: str) -> bool:
"""Check if paragraph starts with an inlined heading pattern."""
# ALL-CAPS header followed by body text
if RE_ALLCAPS_HEADER.match(text):
return True
# Title-case heading: 2+ consecutive capitalized words at start (not common sentence starters)
words = text.split()
if len(words) < 4:
return False
cap_count = 0
for w in words:
clean = w.strip(".,;:!?()\"'")
if not clean:
continue
if clean[0].isupper() and clean.lower() not in NON_HEADING_STARTS:
cap_count += 1
else:
break
if cap_count >= 2:
# Check rest of text continues as a sentence (not just a short title)
remaining = " ".join(words[cap_count:])
if len(remaining) > 20:
return True
# Section keyword match at start
text_lower = text[:80].lower()
for kw in SECTION_KEYWORDS:
if text_lower.startswith(kw):
# Must have more text after the heading
if len(text) > len(kw) + 10:
return True
return False
def is_orphan_word(text: str) -> bool:
"""Check if paragraph starts with lowercase (excluding list patterns)."""
if not text:
return False
first_char = text[0]
if not first_char.islower():
return False
# Exclude list pattern starters
list_starters = ["and ", "or ", "including ", "i.e.", "e.g."]
text_lower = text[:15].lower()
for starter in list_starters:
if text_lower.startswith(starter):
return False
# Exclude bullet-like patterns
if text[0] in "•·-–—":
return False
return True
RE_TERMINAL = re.compile(r'[.!?;")]\s*$')
def is_truncated(text: str) -> bool:
"""Paragraph NOT ending with terminal punctuation."""
return not RE_TERMINAL.search(text)
def is_fragment(text: str) -> bool:
return len(text.split()) < 25
def main():
# ── Step 1: Detect generators for all HTML files ──
print("Step 1: Detecting generators for all HTML files...", file=sys.stderr)
accession_to_generator = {}
files = sorted(HTML_DIR.glob("*.html"))
for i, fp in enumerate(files):
accession = fp.stem
gen_raw = detect_generator(str(fp))
gen_family = FAMILY_MAP.get(gen_raw, gen_raw)
accession_to_generator[accession] = gen_family
if (i + 1) % 3000 == 0:
print(f" {i+1}/{len(files)} files processed...", file=sys.stderr)
print(f" Done: {len(files)} files, {len(set(accession_to_generator.values()))} generator families", file=sys.stderr)
# ── Step 2: Load paragraphs and compute per-filing stats ──
print("Step 2: Loading paragraphs...", file=sys.stderr)
# Per-filing data
filing_paragraphs = defaultdict(list) # accession -> list of paragraph dicts
text_hash_counts = Counter() # textHash -> count of filings containing it
# First pass: collect all textHashes and their filing counts
text_hash_filings = defaultdict(set) # textHash -> set of accessions
all_paragraphs = []
with open(PARAGRAPHS_FILE) as f:
for line in f:
p = json.loads(line)
acc = p["filing"]["accessionNumber"]
all_paragraphs.append(p)
filing_paragraphs[acc].append(p)
text_hash_filings[p["textHash"]].add(acc)
print(f" {len(all_paragraphs)} paragraphs across {len(filing_paragraphs)} filings", file=sys.stderr)
# Boilerplate: textHash appearing in 3+ filings
boilerplate_hashes = {h for h, accs in text_hash_filings.items() if len(accs) >= 3}
print(f" {len(boilerplate_hashes)} boilerplate hashes (in 3+ filings)", file=sys.stderr)
# ── Step 3: Compute metrics per generator ──
print("Step 3: Computing metrics...", file=sys.stderr)
# Per-generator aggregate
gen_stats = defaultdict(lambda: {
"total_paragraphs": 0,
"total_filings": 0,
"paragraphs_per_filing": [],
"word_counts": [],
"inlined_header": 0,
"orphan_word": 0,
"fragment": 0,
"truncated": 0,
"boilerplate": 0,
})
# Per-filing issue rates for "most problematic" analysis
filing_issue_rates = {} # accession -> {metrics..., combined_rate}
# Filings not in HTML dir (no generator detected)
missing_gen = 0
for acc, paragraphs in filing_paragraphs.items():
gen = accession_to_generator.get(acc)
if gen is None:
missing_gen += 1
gen = "(no HTML file)"
stats = gen_stats[gen]
stats["total_filings"] += 1
stats["total_paragraphs"] += len(paragraphs)
stats["paragraphs_per_filing"].append(len(paragraphs))
# Per-filing counters for issue rate
f_inlined = 0
f_orphan = 0
f_fragment = 0
f_truncated = 0
f_boilerplate = 0
for p in paragraphs:
text = p["text"]
wc = p.get("wordCount", len(text.split()))
stats["word_counts"].append(wc)
if is_inlined_header(text):
stats["inlined_header"] += 1
f_inlined += 1
if is_orphan_word(text):
stats["orphan_word"] += 1
f_orphan += 1
if is_fragment(text):
stats["fragment"] += 1
f_fragment += 1
if is_truncated(text):
stats["truncated"] += 1
f_truncated += 1
if p["textHash"] in boilerplate_hashes:
stats["boilerplate"] += 1
f_boilerplate += 1
n = len(paragraphs)
if n > 0:
filing_issue_rates[acc] = {
"generator": gen,
"n_paragraphs": n,
"inlined_header_rate": f_inlined / n,
"orphan_word_rate": f_orphan / n,
"fragment_rate": f_fragment / n,
"truncation_rate": f_truncated / n,
"boilerplate_rate": f_boilerplate / n,
"combined_rate": (f_inlined + f_orphan + f_fragment + f_truncated + f_boilerplate) / (5 * n),
}
if missing_gen:
print(f" Note: {missing_gen} filings had no matching HTML file", file=sys.stderr)
# ── Step 4: Output ──
# Compute corpus-wide averages for flagging
corpus_total = sum(s["total_paragraphs"] for s in gen_stats.values())
corpus_inlined = sum(s["inlined_header"] for s in gen_stats.values())
corpus_orphan = sum(s["orphan_word"] for s in gen_stats.values())
corpus_fragment = sum(s["fragment"] for s in gen_stats.values())
corpus_truncated = sum(s["truncated"] for s in gen_stats.values())
corpus_boilerplate = sum(s["boilerplate"] for s in gen_stats.values())
corpus_avg_wc = statistics.mean(
wc for s in gen_stats.values() for wc in s["word_counts"]
) if corpus_total > 0 else 0
avg_rates = {
"inlined_header": corpus_inlined / corpus_total if corpus_total else 0,
"orphan_word": corpus_orphan / corpus_total if corpus_total else 0,
"fragment": corpus_fragment / corpus_total if corpus_total else 0,
"truncated": corpus_truncated / corpus_total if corpus_total else 0,
"boilerplate": corpus_boilerplate / corpus_total if corpus_total else 0,
}
print()
print("=" * 180)
print("GENERATOR QUALITY CROSS-REFERENCE: SEC-cyBERT CORPUS")
print("=" * 180)
print(f"\nCorpus totals: {corpus_total:,} paragraphs across {sum(s['total_filings'] for s in gen_stats.values()):,} filings")
print(f"Corpus averages: InlinedHdr={avg_rates['inlined_header']:.1%} Orphan={avg_rates['orphan_word']:.1%} "
f"Fragment={avg_rates['fragment']:.1%} Truncated={avg_rates['truncated']:.1%} "
f"Boilerplate={avg_rates['boilerplate']:.1%} AvgWC={corpus_avg_wc:.1f}")
print("(Cells marked with ** are >2x the corpus average)")
# Sort by total paragraphs descending
sorted_gens = sorted(gen_stats.items(), key=lambda x: x[1]["total_paragraphs"], reverse=True)
# Header
print()
hdr = (
f"{'Generator':<35} {'Files':>6} {'Paras':>7} {'Mean/F':>7} {'Med/F':>6} "
f"{'AvgWC':>6} {'InlHdr%':>8} {'Orphan%':>8} {'Frag%':>8} {'Trunc%':>8} {'Boiler%':>8}"
)
print(hdr)
print("-" * len(hdr))
for gen, s in sorted_gens:
n = s["total_paragraphs"]
if n == 0:
continue
nf = s["total_filings"]
mean_ppf = n / nf if nf else 0
med_ppf = statistics.median(s["paragraphs_per_filing"]) if s["paragraphs_per_filing"] else 0
avg_wc = statistics.mean(s["word_counts"]) if s["word_counts"] else 0
inl_r = s["inlined_header"] / n
orp_r = s["orphan_word"] / n
fra_r = s["fragment"] / n
tru_r = s["truncated"] / n
boi_r = s["boilerplate"] / n
# Flag if >2x corpus average
def fmt_rate(val, avg_key):
pct = f"{val:.1%}"
if avg_rates[avg_key] > 0 and val > 2 * avg_rates[avg_key]:
return f"{pct:>6}**"
return f"{pct:>8}"
row = (
f"{gen:<35} {nf:>6} {n:>7} {mean_ppf:>7.1f} {med_ppf:>6.0f} "
f"{avg_wc:>6.1f} {fmt_rate(inl_r, 'inlined_header')} {fmt_rate(orp_r, 'orphan_word')} "
f"{fmt_rate(fra_r, 'fragment')} {fmt_rate(tru_r, 'truncated')} {fmt_rate(boi_r, 'boilerplate')}"
)
print(row)
print("-" * len(hdr))
# Corpus average row
corpus_med_ppf = statistics.median(
ppf for s in gen_stats.values() for ppf in s["paragraphs_per_filing"]
)
corpus_mean_ppf = corpus_total / sum(s["total_filings"] for s in gen_stats.values())
print(
f"{'CORPUS AVERAGE':<35} "
f"{sum(s['total_filings'] for s in gen_stats.values()):>6} "
f"{corpus_total:>7} "
f"{corpus_mean_ppf:>7.1f} {corpus_med_ppf:>6.0f} "
f"{corpus_avg_wc:>6.1f} "
f"{avg_rates['inlined_header']:>7.1%} "
f"{avg_rates['orphan_word']:>7.1%} "
f"{avg_rates['fragment']:>7.1%} "
f"{avg_rates['truncated']:>7.1%} "
f"{avg_rates['boilerplate']:>7.1%}"
)
# ── 10 Most Problematic Filings ──
print()
print("=" * 180)
print("10 MOST PROBLEMATIC FILINGS (highest combined issue rate across all 5 metrics)")
print("=" * 180)
# Only consider filings with at least 3 paragraphs to avoid noisy tiny filings
eligible = {acc: fr for acc, fr in filing_issue_rates.items() if fr["n_paragraphs"] >= 3}
worst = sorted(eligible.items(), key=lambda x: x[1]["combined_rate"], reverse=True)[:10]
print()
hdr2 = (
f"{'Accession':<30} {'Generator':<35} {'Paras':>5} "
f"{'InlHdr':>7} {'Orphan':>7} {'Frag':>7} {'Trunc':>7} {'Boiler':>7} {'Combined':>8}"
)
print(hdr2)
print("-" * len(hdr2))
for acc, fr in worst:
print(
f"{acc:<30} {fr['generator']:<35} {fr['n_paragraphs']:>5} "
f"{fr['inlined_header_rate']:>6.1%} {fr['orphan_word_rate']:>6.1%} "
f"{fr['fragment_rate']:>6.1%} {fr['truncation_rate']:>6.1%} "
f"{fr['boilerplate_rate']:>6.1%} {fr['combined_rate']:>7.1%}"
)
# ── Per-metric worst generators summary ──
print()
print("=" * 180)
print("GENERATORS >2x CORPUS AVERAGE (flagged metrics)")
print("=" * 180)
metric_names = {
"inlined_header": "Inlined Header",
"orphan_word": "Orphan Word",
"fragment": "Fragment",
"truncated": "Truncation",
"boilerplate": "Boilerplate",
}
for metric_key, metric_label in metric_names.items():
flagged = []
for gen, s in sorted_gens:
n = s["total_paragraphs"]
if n < 10:
continue
rate = s[metric_key] / n
if avg_rates[metric_key] > 0 and rate > 2 * avg_rates[metric_key]:
flagged.append((gen, rate, s[metric_key], n))
if flagged:
print(f"\n {metric_label} rate (corpus avg: {avg_rates[metric_key]:.1%}, threshold >2x = {2*avg_rates[metric_key]:.1%}):")
for gen, rate, count, total in sorted(flagged, key=lambda x: -x[1]):
print(f" {gen:<35} {rate:.1%} ({count}/{total})")
else:
print(f"\n {metric_label}: No generators >2x corpus average")
if __name__ == "__main__":
main()
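The paragraph-quality heuristics above (`is_orphan_word`, `is_truncated`, `is_fragment`) are plain string checks. A standalone sketch of the same logic, with illustrative inputs, shows how a typical mid-sentence split gets flagged:

```python
import re

RE_TERMINAL = re.compile(r'[.!?;")]\s*$')

def is_truncated(text: str) -> bool:
    # A paragraph that does not end with terminal punctuation
    # was probably cut off by the extractor.
    return not RE_TERMINAL.search(text)

def is_fragment(text: str) -> bool:
    # Under 25 words is too short to be a full disclosure paragraph.
    return len(text.split()) < 25

def is_orphan_word(text: str) -> bool:
    # A lowercase first character suggests the paragraph was split
    # mid-sentence, unless it is a legitimate list continuation.
    if not text or not text[0].islower():
        return False
    starters = ("and ", "or ", "including ", "i.e.", "e.g.")
    return not text[:15].lower().startswith(starters)

para = "the Board receives quarterly briefings from the CISO."
assert is_orphan_word(para)    # lowercase start, not a list continuation
assert not is_truncated(para)  # ends with a period
assert is_fragment(para)       # only 8 words
assert not is_orphan_word("e.g. encryption standards")  # list-style continuation
```

Note that one paragraph can trip several flags at once, which is why the per-filing `combined_rate` averages across all five metrics rather than counting distinct "bad" paragraphs.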


@@ -0,0 +1,164 @@
/**
* Analyze the 348 annotated paragraphs with no cybersecurity keywords.
* Reports label distribution to decide: keep or exclude from training.
*
* Usage: bun ts/scripts/analyze-no-cyber.ts
*/
import { readFileSync } from "node:fs";
const DATA_DIR = new URL("../../data", import.meta.url).pathname;
const QUALITY_PATH = `${DATA_DIR}/paragraphs/quality/quality-scores.jsonl`;
const ANNOTATIONS_PATH = `${DATA_DIR}/annotations/stage1.jsonl`;
const TRAINING_PATH = `${DATA_DIR}/paragraphs/training.patched.jsonl`;
interface QualityScore {
id: string;
issues: string[];
quality_tier: string;
}
interface Annotation {
paragraphId: string;
label: {
content_category: string;
specificity_level: number;
category_confidence: string;
specificity_confidence: string;
reasoning: string;
};
provenance: { modelId: string };
}
// Load quality scores — find no-cyber paragraphs
const noCyberIds = new Set<string>();
for (const line of readFileSync(QUALITY_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const q = JSON.parse(line) as QualityScore;
if (q.issues.includes("no_cyber_keywords")) {
noCyberIds.add(q.id);
}
}
console.error(`No-cyber paragraphs (all): ${noCyberIds.size}`);
// Load training set IDs
const trainingIds = new Set<string>();
for (const line of readFileSync(TRAINING_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const p = JSON.parse(line) as { id: string };
trainingIds.add(p.id);
}
// Filter to annotated no-cyber paragraphs
const annotatedNoCyber = new Set([...noCyberIds].filter((id) => trainingIds.has(id)));
console.error(`No-cyber paragraphs (annotated): ${annotatedNoCyber.size}`);
// Load annotations for these paragraphs
const annotations = new Map<string, Annotation[]>();
for (const line of readFileSync(ANNOTATIONS_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const ann = JSON.parse(line) as Annotation;
if (annotatedNoCyber.has(ann.paragraphId)) {
if (!annotations.has(ann.paragraphId)) annotations.set(ann.paragraphId, []);
annotations.get(ann.paragraphId)!.push(ann);
}
}
console.error(`Paragraphs with annotations: ${annotations.size}\n`);
// Majority vote per paragraph
function majority<T>(items: T[]): { value: T; count: number } {
const counts = new Map<T, number>();
for (const item of items) counts.set(item, (counts.get(item) ?? 0) + 1);
let best: T = items[0]!;
let bestCount = 0;
for (const [v, c] of counts) {
if (c > bestCount) { best = v; bestCount = c; }
}
return { value: best, count: bestCount };
}
// Category distribution (consensus)
const catDist = new Map<string, number>();
const specDist = new Map<number, number>();
const confDist = new Map<string, number>();
let conflicts = 0;
// Per-paragraph details for interesting cases
const nonOther: { pid: string; cat: string; spec: number; anns: Annotation[] }[] = [];
for (const [pid, anns] of annotations) {
const catVote = majority(anns.map((a) => a.label.content_category));
const specVote = majority(anns.map((a) => a.label.specificity_level));
catDist.set(catVote.value, (catDist.get(catVote.value) ?? 0) + 1);
specDist.set(specVote.value, (specDist.get(specVote.value) ?? 0) + 1);
if (catVote.count < 2) conflicts++;
// Track confidence
for (const ann of anns) {
confDist.set(ann.label.category_confidence, (confDist.get(ann.label.category_confidence) ?? 0) + 1);
}
if (catVote.value !== "None/Other") {
nonOther.push({ pid, cat: catVote.value, spec: specVote.value, anns });
}
}
// ── Report ──────────────────────────────────────────────────────────────
console.log("═══ NO-CYBER-KEYWORD PARAGRAPH ANALYSIS ═══\n");
console.log(`Total annotated no-cyber paragraphs: ${annotations.size}`);
console.log(`Conflicts (no majority): ${conflicts}\n`);
console.log("─── Category Distribution (Consensus) ───");
for (const [cat, count] of [...catDist.entries()].sort((a, b) => b[1] - a[1])) {
console.log(` ${cat.padEnd(30)} ${count} (${((count / annotations.size) * 100).toFixed(1)}%)`);
}
console.log("\n─── Specificity Distribution (Consensus) ───");
for (const level of [1, 2, 3, 4]) {
const count = specDist.get(level) ?? 0;
console.log(` Level ${level}: ${count} (${((count / annotations.size) * 100).toFixed(1)}%)`);
}
console.log("\n─── Confidence Distribution (All Models) ───");
for (const conf of ["high", "medium", "low"]) {
const count = confDist.get(conf) ?? 0;
const total = [...confDist.values()].reduce((a, b) => a + b, 0);
console.log(` ${conf}: ${count} (${((count / total) * 100).toFixed(1)}%)`);
}
console.log(`\n─── Non-"None/Other" Paragraphs: ${nonOther.length} ───`);
if (nonOther.length > 0) {
console.log("These are the concerning ones — labeled as real categories but have no cyber keywords.\n");
// Load actual paragraph text for these
const textMap = new Map<string, string>();
const noCyberPidSet = new Set(nonOther.map((n) => n.pid));
for (const line of readFileSync(TRAINING_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const p = JSON.parse(line) as { id: string; text: string };
if (noCyberPidSet.has(p.id)) textMap.set(p.id, p.text);
}
// Show samples
for (const item of nonOther.slice(0, 10)) {
const text = textMap.get(item.pid) ?? "(text not found)";
const modelVotes = item.anns.map((a) => `${a.provenance.modelId.split("/")[1]}: ${a.label.content_category}`).join(", ");
console.log(` [${item.cat} / Spec ${item.spec}] ${item.pid}`);
console.log(` Models: ${modelVotes}`);
console.log(` Text: ${text.substring(0, 150)}...`);
console.log();
}
}
// Summary recommendation
const noneOtherCount = catDist.get("None/Other") ?? 0;
const noneOtherPct = ((noneOtherCount / annotations.size) * 100).toFixed(1);
console.log("─── RECOMMENDATION ───");
if (nonOther.length < 50) {
console.log(` ${noneOtherPct}% labeled None/Other. Only ${nonOther.length} labeled as real categories.`);
console.log(` → EXCLUDE ${nonOther.length} non-None/Other paragraphs from training (likely section bleed).`);
console.log(` → KEEP ${noneOtherCount} None/Other paragraphs (correct labels for non-cyber content).`);
} else {
console.log(` WARNING: ${nonOther.length} paragraphs labeled as real categories — investigate further.`);
}
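The consensus logic in `majority()` tallies per-label counts and keeps the first value that reaches the highest count, so a 1-1 tie between two annotators resolves deterministically to the label encountered first. A Python sketch of the same behavior (label names are illustrative):

```python
from collections import Counter

def majority(votes):
    # Tally the votes, then keep the first value (in encounter order)
    # that reaches the highest count; this mirrors the Map-insertion-order
    # tie behavior of the TypeScript majority() above.
    counts = Counter()
    for v in votes:
        counts[v] += 1
    best, best_count = votes[0], 0
    for v, c in counts.items():  # insertion order == first-occurrence order
        if c > best_count:
            best, best_count = v, c
    return best, best_count

assert majority(["Board Governance", "Board Governance", "None/Other"]) == ("Board Governance", 2)
# 1-1 tie: the label seen first wins
assert majority(["Incident Disclosure", "Risk Management Process"])[0] == "Incident Disclosure"
```

Because ties yield `count == 1`, the `catVote.count < 2` check above is what surfaces them as conflicts rather than silently trusting the first annotator.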


@@ -0,0 +1,203 @@
/**
* DAPT corpus analytics: document length distribution, token estimates,
* quality checks, and filter candidates.
*
* Usage: bun ts/scripts/dapt-corpus-analytics.ts
*
* Input: data/dapt-corpus/shard-*.jsonl
*/
import { readFileSync, readdirSync } from "node:fs";
const CORPUS_DIR = new URL("../../data/dapt-corpus", import.meta.url).pathname;
const CHARS_PER_TOKEN = 4.72; // empirical from ModernBERT tokenizer
interface Doc {
accession: string;
text: string;
}
// ── Load all documents ──────────────────────────────────────────────────
console.error("Loading corpus...");
const shards = readdirSync(CORPUS_DIR)
.filter((f) => f.endsWith(".jsonl"))
.sort();
const docs: { accession: string; chars: number; lines: number; words: number }[] = [];
let totalChars = 0;
for (const shard of shards) {
const path = `${CORPUS_DIR}/${shard}`;
for (const line of readFileSync(path, "utf-8").split("\n")) {
if (!line.trim()) continue;
const doc = JSON.parse(line) as Doc;
const chars = doc.text.length;
const lines = doc.text.split("\n").length;
const words = doc.text.split(/\s+/).filter(Boolean).length;
docs.push({ accession: doc.accession, chars, lines, words });
totalChars += chars;
}
}
console.error(` ${docs.length} documents loaded from ${shards.length} shards\n`);
// ── Basic stats ─────────────────────────────────────────────────────────
const charsSorted = docs.map((d) => d.chars).sort((a, b) => a - b);
const wordsSorted = docs.map((d) => d.words).sort((a, b) => a - b);
function percentile(arr: number[], p: number): number {
const idx = Math.ceil((p / 100) * arr.length) - 1;
return arr[Math.max(0, idx)]!;
}
function mean(arr: number[]): number {
return arr.reduce((a, b) => a + b, 0) / arr.length;
}
const totalTokens = Math.round(totalChars / CHARS_PER_TOKEN);
console.log("═══ DAPT CORPUS ANALYTICS ═══\n");
console.log("─── Overview ───");
console.log(` Documents: ${docs.length.toLocaleString()}`);
console.log(` Shards: ${shards.length}`);
console.log(` Total chars: ${(totalChars / 1e9).toFixed(3)}B`);
console.log(` Total tokens (est): ${(totalTokens / 1e6).toFixed(1)}M (@ ${CHARS_PER_TOKEN} chars/token)`);
console.log("\n─── Document Length Distribution (chars) ───");
console.log(` Min: ${percentile(charsSorted, 0).toLocaleString()}`);
console.log(` P5: ${percentile(charsSorted, 5).toLocaleString()}`);
console.log(` P10: ${percentile(charsSorted, 10).toLocaleString()}`);
console.log(` P25: ${percentile(charsSorted, 25).toLocaleString()}`);
console.log(` Median: ${percentile(charsSorted, 50).toLocaleString()}`);
console.log(` Mean: ${Math.round(mean(charsSorted)).toLocaleString()}`);
console.log(` P75: ${percentile(charsSorted, 75).toLocaleString()}`);
console.log(` P90: ${percentile(charsSorted, 90).toLocaleString()}`);
console.log(` P95: ${percentile(charsSorted, 95).toLocaleString()}`);
console.log(` Max: ${percentile(charsSorted, 100).toLocaleString()}`);
console.log("\n─── Document Length Distribution (words) ───");
console.log(` Min: ${percentile(wordsSorted, 0).toLocaleString()}`);
console.log(` P5: ${percentile(wordsSorted, 5).toLocaleString()}`);
console.log(` Median: ${percentile(wordsSorted, 50).toLocaleString()}`);
console.log(` Mean: ${Math.round(mean(wordsSorted)).toLocaleString()}`);
console.log(` P95: ${percentile(wordsSorted, 95).toLocaleString()}`);
console.log(` Max: ${percentile(wordsSorted, 100).toLocaleString()}`);
// ── Token length distribution ───────────────────────────────────────────
const tokensSorted = docs.map((d) => Math.round(d.chars / CHARS_PER_TOKEN)).sort((a, b) => a - b);
console.log("\n─── Token Length Distribution (estimated) ───");
console.log(` Min: ${percentile(tokensSorted, 0).toLocaleString()}`);
console.log(` P5: ${percentile(tokensSorted, 5).toLocaleString()}`);
console.log(` P10: ${percentile(tokensSorted, 10).toLocaleString()}`);
console.log(` P25: ${percentile(tokensSorted, 25).toLocaleString()}`);
console.log(` Median: ${percentile(tokensSorted, 50).toLocaleString()}`);
console.log(` Mean: ${Math.round(mean(tokensSorted)).toLocaleString()}`);
console.log(` P75: ${percentile(tokensSorted, 75).toLocaleString()}`);
console.log(` P90: ${percentile(tokensSorted, 90).toLocaleString()}`);
console.log(` P95: ${percentile(tokensSorted, 95).toLocaleString()}`);
console.log(` Max: ${percentile(tokensSorted, 100).toLocaleString()}`);
// ── Sequence count at different max_seq_length ──────────────────────────
console.log("\n─── Training Sequences by max_seq_length ───");
for (const seqLen of [512, 1024, 2048, 4096, 8192]) {
let totalSeqs = 0;
for (const d of docs) {
const tokens = Math.round(d.chars / CHARS_PER_TOKEN);
totalSeqs += Math.ceil(tokens / seqLen);
}
const docsExceeding = docs.filter((d) => Math.round(d.chars / CHARS_PER_TOKEN) > seqLen).length;
console.log(
` ${String(seqLen).padStart(5)}: ${totalSeqs.toLocaleString().padStart(10)} sequences` +
` (${docsExceeding.toLocaleString()} docs exceed, ${((docsExceeding / docs.length) * 100).toFixed(1)}%)`,
);
}
// ── Filter candidates ───────────────────────────────────────────────────
const tiny = docs.filter((d) => d.chars < 10_000);
const small = docs.filter((d) => d.chars < 50_000);
const empty = docs.filter((d) => d.chars < 100);
const huge = docs.filter((d) => d.chars > 5_000_000);
console.log("\n─── Filter Candidates ───");
console.log(` <100 chars (empty): ${empty.length}`);
console.log(` <10K chars (cover pages): ${tiny.length} (${(tiny.reduce((s, d) => s + d.chars, 0) / totalChars * 100).toFixed(3)}% of corpus)`);
console.log(` <50K chars (small): ${small.length} (${(small.reduce((s, d) => s + d.chars, 0) / totalChars * 100).toFixed(3)}% of corpus)`);
console.log(` >5M chars (huge): ${huge.length}`);
if (tiny.length > 0 && tiny.length <= 20) {
console.log("\n Tiny documents (<10K chars):");
for (const d of tiny.sort((a, b) => a.chars - b.chars)) {
console.log(` ${d.accession}: ${d.chars.toLocaleString()} chars, ${d.words.toLocaleString()} words`);
}
}
// ── Content quality spot checks ─────────────────────────────────────────
console.log("\n─── Content Quality Checks ───");
// Check for residual HTML tags
let docsWithHtml = 0;
let docsWithXbrl = 0;
let docsWithPageNums = 0;
let docsWithUrls = 0;
let singleBlockDocs = 0;
for (const shard of shards) {
const path = `${CORPUS_DIR}/${shard}`;
for (const line of readFileSync(path, "utf-8").split("\n")) {
if (!line.trim()) continue;
const doc = JSON.parse(line) as Doc;
if (/<[a-z][^>]*>/i.test(doc.text)) docsWithHtml++;
if (/ix:|xbrl|xmlns/i.test(doc.text)) docsWithXbrl++;
if (/\n\s*(?:\d{1,3}|[- ]\d{1,3}[- ]|F-\d+)\s*\n/.test(doc.text)) docsWithPageNums++;
if (/https?:\/\//.test(doc.text)) docsWithUrls++;
if (doc.text.split("\n\n").length < 3) singleBlockDocs++;
}
}
console.log(` Residual HTML tags: ${docsWithHtml} docs (${((docsWithHtml / docs.length) * 100).toFixed(1)}%)`);
console.log(` XBRL/xmlns traces: ${docsWithXbrl} docs (${((docsWithXbrl / docs.length) * 100).toFixed(1)}%)`);
console.log(` Page number traces: ${docsWithPageNums} docs (${((docsWithPageNums / docs.length) * 100).toFixed(1)}%)`);
console.log(` URLs present: ${docsWithUrls} docs (${((docsWithUrls / docs.length) * 100).toFixed(1)}%)`);
console.log(` Single-block (<3¶): ${singleBlockDocs} docs`);
// ── Shard distribution ──────────────────────────────────────────────────
console.log("\n─── Shard Distribution ───");
for (const shard of shards) {
  const path = `${CORPUS_DIR}/${shard}`;
  const content = readFileSync(path, "utf-8");
  const lines = content.split("\n").filter((l) => l.trim()).length;
  const sizeBytes = Buffer.byteLength(content);
  console.log(
    ` ${shard}: ${lines.toLocaleString().padStart(6)} docs, ${(sizeBytes / 1e6).toFixed(0).padStart(4)} MB`,
  );
}
// ── Post-filter stats ───────────────────────────────────────────────────
const filtered = docs.filter((d) => d.chars >= 10_000);
const filteredChars = filtered.reduce((s, d) => s + d.chars, 0);
const filteredTokens = Math.round(filteredChars / CHARS_PER_TOKEN);
console.log("\n─── After Filtering <10K chars ───");
console.log(` Documents: ${filtered.length.toLocaleString()} (removed ${docs.length - filtered.length})`);
console.log(` Total chars: ${(filteredChars / 1e9).toFixed(3)}B`);
console.log(` Total tokens (est): ${(filteredTokens / 1e6).toFixed(1)}M`);
console.log(` Token loss: ${((1 - filteredTokens / totalTokens) * 100).toFixed(3)}%`);
// ── Training time estimates ─────────────────────────────────────────────
console.log("\n─── Training Time Estimates (RTX 3090, bf16, grad_checkpoint) ───");
for (const { seqLen, batchSize, gradAccum, secPerStepRange } of [
{ seqLen: 2048, batchSize: 4, gradAccum: 8, secPerStepRange: [1.0, 1.5, 2.0] },
{ seqLen: 8192, batchSize: 1, gradAccum: 32, secPerStepRange: [3.0, 5.0, 7.0] },
]) {
const totalSeqs = filtered.reduce((s, d) => s + Math.ceil(Math.round(d.chars / CHARS_PER_TOKEN) / seqLen), 0);
const effectiveBatch = batchSize * gradAccum;
const stepsPerEpoch = Math.ceil(totalSeqs / effectiveBatch);
console.log(`\n seq_len=${seqLen}, batch=${batchSize}, grad_accum=${gradAccum} (eff=${effectiveBatch})`);
console.log(` Sequences: ${totalSeqs.toLocaleString()}, Steps/epoch: ${stepsPerEpoch.toLocaleString()}`);
for (const secPerStep of secPerStepRange) {
const hoursPerEpoch = (stepsPerEpoch * secPerStep) / 3600;
console.log(` @ ${secPerStep}s/step: ${hoursPerEpoch.toFixed(1)}h/epoch`);
}
}
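The sequence-count and steps-per-epoch estimates above reduce to two ceiling divisions: tokens per document (chars divided by the empirical chars-per-token ratio) chunked into fixed-length sequences, then sequences divided by the effective batch size. A worked Python example with made-up document sizes:

```python
import math

CHARS_PER_TOKEN = 4.72  # the script's empirical chars-per-token estimate

def sequences_needed(doc_chars: int, seq_len: int) -> int:
    # Each document is chunked independently; a partial final chunk
    # still costs one full training sequence.
    tokens = round(doc_chars / CHARS_PER_TOKEN)
    return math.ceil(tokens / seq_len)

# Hypothetical corpus of three documents (char counts are made up)
docs = [120_000, 47_000, 9_500]
total_seqs = sum(sequences_needed(c, 2048) for c in docs)
assert total_seqs == 13 + 5 + 1

# Steps per epoch at batch_size=4, grad_accum=8, i.e. effective batch of 32
steps_per_epoch = math.ceil(total_seqs / (4 * 8))
assert steps_per_epoch == 1
```

The per-document `ceil` is why shrinking `max_seq_length` inflates the sequence count faster than linearly: every document pays the rounding cost once per chunk boundary.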


@ -40,15 +40,20 @@ function cleanForDapt(raw: string): string {
     if (trimmed.length === 0) { cleaned.push(""); continue; }
+    // Page numbers: bare digits, "Page N", F-N financial page markers
     if (/^\d{1,3}$/.test(trimmed)) continue;
-    if (/^(page\s+\d+|[-–—]\s*\d+\s*[-–—])$/i.test(trimmed)) continue;
+    if (/^(page\s+\d+)$/i.test(trimmed)) continue;
+    if (/^F-\d{1,3}$/.test(trimmed)) continue;
     if (/^table\s+of\s+contents?\s*$/i.test(trimmed)) continue;
-    // XBRL metadata
+    // XBRL metadata lines
     if (/^(0000\d{6}\s|xbrli:|iso4217:|http:\/\/fasb\.org|http:\/\/xbrl\.)/.test(trimmed)) continue;
     if (/^[a-z]{1,5}-\d{8}\s/.test(trimmed)) continue;
     if (/http:\/\/fasb\.org\/us-gaap/.test(trimmed) && trimmed.length > 100) continue;
     if (/^(FY|CY)\d{4,}/.test(trimmed) && /http:/.test(trimmed)) continue;
+    // XBRL exhibit listing lines (101.CAL, 101.DEF, cover page XBRL, etc.)
+    if (/xbrl/i.test(trimmed) && !/cyber|secur|risk|board|manage|disclos/i.test(trimmed)) continue;
+    // Lines that are majority XBRL namespace tokens
     if (trimmed.length > 20) {
       const tokens = trimmed.split(/\s+/);
       const xbrlCount = tokens.filter(t =>
@ -57,13 +62,16 @@ function cleanForDapt(raw: string): string {
       if (tokens.length > 3 && xbrlCount / tokens.length > 0.5) continue;
     }
+    // URLs — strip inline URLs (company sites, SEC, investor relations)
+    if (/^https?:\/\/\S+$/.test(trimmed)) continue; // standalone URL lines
     // SEC boilerplate / filenames
     if (/^(10-K|10-Q|8-K)\s*$/i.test(trimmed)) continue;
     if (/generated by sec publisher/i.test(trimmed)) continue;
     if (/^\S+\.(htm|html)\s*$/i.test(trimmed)) continue;
     if (/^\S+\.(htm|html)\s+-\s+Generated/i.test(trimmed)) continue;
-    // Repeated headers
+    // Repeated headers (running headers/footers)
     if (trimmed.length > 5 && trimmed.length < 80) {
       if ((shortLineCounts.get(trimmed) ?? 0) >= 5) continue;
     }
@ -72,7 +80,8 @@ function cleanForDapt(raw: string): string {
     if (/^\(?\s*back\s+to\s+(index|top|toc)\s*\)?$/i.test(trimmed)) continue;
     if (/^index$/i.test(trimmed)) continue;
-    cleaned.push(line);
+    // Strip inline URLs from prose (replace with empty string)
+    cleaned.push(line.replace(/https?:\/\/\S+/g, ""));
   }
   return cleaned.join("\n").replace(/\n{3,}/g, "\n\n").trim();
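For orientation, here is a runnable sketch of what this precleaning pass does to a fragment of filing text. The sample text is invented, and only the page-number and URL rules from the hunk above are re-implemented; the real `cleanForDapt` applies many more filters:

```typescript
// Sketch of two precleaning rules: drop bare page-number lines and
// standalone URL lines, and strip inline URLs out of remaining prose.
function sketchClean(raw: string): string {
  const cleaned: string[] = [];
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (/^\d{1,3}$/.test(trimmed)) continue;        // bare page numbers
    if (/^(page\s+\d+)$/i.test(trimmed)) continue;  // "Page N" markers
    if (/^https?:\/\/\S+$/.test(trimmed)) continue; // standalone URL lines
    cleaned.push(line.replace(/https?:\/\/\S+/g, "")); // inline URLs
  }
  return cleaned.join("\n").replace(/\n{3,}/g, "\n\n").trim();
}

const sample = [
  "Our board oversees cybersecurity risk.",
  "42",
  "Page 43",
  "https://investor.example.com",
  "See https://www.sec.gov for details.",
].join("\n");

console.log(sketchClean(sample));
```

Prose survives; navigation and page artifacts are dropped rather than passed to the DAPT corpus.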


@ -0,0 +1,210 @@
/**
* Diff original vs re-run annotations for orphan-word paragraphs.
*
* Compares stage1.jsonl (original) against stage1-orphan-rerun.jsonl (patched text)
* to measure label changes, bias correction, and conflict resolution.
*
* Usage: bun ts/scripts/diff-orphan-annotations.ts
*/
import { readFileSync } from "node:fs";
const DATA_DIR = new URL("../../data", import.meta.url).pathname;
const ORIG_PATH = `${DATA_DIR}/annotations/stage1.jsonl`;
const RERUN_PATH = `${DATA_DIR}/annotations/stage1-orphan-rerun.jsonl`;
const PATCHES_PATH = `${DATA_DIR}/paragraphs/patches/orphan-word-patches.jsonl`;
interface Annotation {
paragraphId: string;
label: {
content_category: string;
specificity_level: number;
category_confidence: string;
specificity_confidence: string;
};
provenance: {
modelId: string;
};
}
function loadAnnotations(path: string): Map<string, Annotation[]> {
const map = new Map<string, Annotation[]>();
for (const line of readFileSync(path, "utf-8").split("\n")) {
if (!line.trim()) continue;
const ann = JSON.parse(line) as Annotation;
const key = ann.paragraphId;
if (!map.has(key)) map.set(key, []);
map.get(key)!.push(ann);
}
return map;
}
function majorityVote(annotations: Annotation[], field: "content_category" | "specificity_level"): { value: string | number; unanimous: boolean; count: number } {
const counts = new Map<string | number, number>();
for (const ann of annotations) {
const v = ann.label[field];
counts.set(v, (counts.get(v) ?? 0) + 1);
}
let best: string | number = "";
let bestCount = 0;
for (const [v, c] of counts) {
if (c > bestCount) { best = v; bestCount = c; }
}
return { value: best, unanimous: bestCount === annotations.length, count: bestCount };
}
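The consensus rule can be restated in isolation (a sketch, not the script's own helper: `majorityVote` above operates on full annotation records, while this strips it to bare values):

```typescript
// Pick the most frequent value; with three annotators, "no majority"
// (count < 2, the conflict condition used later) happens only when all
// three disagree. Ties resolve to the first value seen, as above.
function vote<T>(values: T[]): { value: T; count: number } {
  const counts = new Map<T, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  let best = values[0]!;
  let bestCount = 0;
  for (const [v, c] of counts) {
    if (c > bestCount) { best = v; bestCount = c; }
  }
  return { value: best, count: bestCount };
}

console.log(vote(["Board Governance", "Board Governance", "Management Role"]));
// → { value: "Board Governance", count: 2 }
console.log(vote([1, 2, 3]).count); // → 1 (all disagree: a conflict)
```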
// ── Main ────────────────────────────────────────────────────────────────
const patchIds = new Set<string>();
for (const line of readFileSync(PATCHES_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
patchIds.add((JSON.parse(line) as { id: string }).id);
}
const origAll = loadAnnotations(ORIG_PATH);
const rerunAll = loadAnnotations(RERUN_PATH);
// Filter original annotations to only orphan-word paragraphs
const origFiltered = new Map<string, Annotation[]>();
for (const [pid, anns] of origAll) {
if (patchIds.has(pid)) origFiltered.set(pid, anns);
}
console.error(`Orphan-word paragraphs: ${patchIds.size}`);
console.error(`Original annotations found: ${origFiltered.size} paragraphs`);
console.error(`Re-run annotations found: ${rerunAll.size} paragraphs`);
// Compare paragraphs that have BOTH original and re-run annotations
const comparable = [...rerunAll.keys()].filter((pid) => origFiltered.has(pid));
console.error(`Comparable paragraphs: ${comparable.length}\n`);
// Track changes
let catChanged = 0;
let specChanged = 0;
let eitherChanged = 0;
// Per-model changes
const perModelCatChanges = new Map<string, number>();
const perModelSpecChanges = new Map<string, number>();
// Category transition matrix
const catTransitions = new Map<string, Map<string, number>>();
// Consensus changes
let origConflicts = 0;
let rerunConflicts = 0;
let conflictsResolved = 0;
let consensusBroken = 0;
// Category distribution
const origCatDist = new Map<string, number>();
const rerunCatDist = new Map<string, number>();
// Specificity distribution
const origSpecDist = new Map<number, number>();
const rerunSpecDist = new Map<number, number>();
for (const pid of comparable) {
const origAnns = origFiltered.get(pid)!;
const rerunAnns = rerunAll.get(pid)!;
// Per-model comparison
for (const rerunAnn of rerunAnns) {
const modelId = rerunAnn.provenance.modelId;
const origAnn = origAnns.find((a) => a.provenance.modelId === modelId);
if (!origAnn) continue;
if (origAnn.label.content_category !== rerunAnn.label.content_category) {
perModelCatChanges.set(modelId, (perModelCatChanges.get(modelId) ?? 0) + 1);
// Track transition
const from = origAnn.label.content_category;
const to = rerunAnn.label.content_category;
if (!catTransitions.has(from)) catTransitions.set(from, new Map());
catTransitions.get(from)!.set(to, (catTransitions.get(from)!.get(to) ?? 0) + 1);
}
if (origAnn.label.specificity_level !== rerunAnn.label.specificity_level) {
perModelSpecChanges.set(modelId, (perModelSpecChanges.get(modelId) ?? 0) + 1);
}
}
// Consensus comparison (majority vote)
const origCatVote = majorityVote(origAnns, "content_category");
const rerunCatVote = majorityVote(rerunAnns, "content_category");
const origSpecVote = majorityVote(origAnns, "specificity_level");
const rerunSpecVote = majorityVote(rerunAnns, "specificity_level");
origCatDist.set(origCatVote.value as string, (origCatDist.get(origCatVote.value as string) ?? 0) + 1);
rerunCatDist.set(rerunCatVote.value as string, (rerunCatDist.get(rerunCatVote.value as string) ?? 0) + 1);
origSpecDist.set(origSpecVote.value as number, (origSpecDist.get(origSpecVote.value as number) ?? 0) + 1);
rerunSpecDist.set(rerunSpecVote.value as number, (rerunSpecDist.get(rerunSpecVote.value as number) ?? 0) + 1);
if (origCatVote.value !== rerunCatVote.value) catChanged++;
if (origSpecVote.value !== rerunSpecVote.value) specChanged++;
if (origCatVote.value !== rerunCatVote.value || origSpecVote.value !== rerunSpecVote.value) eitherChanged++;
// Conflict tracking (no majority = conflict)
const origHasConflict = origCatVote.count < 2 || origSpecVote.count < 2;
const rerunHasConflict = rerunCatVote.count < 2 || rerunSpecVote.count < 2;
if (origHasConflict) origConflicts++;
if (rerunHasConflict) rerunConflicts++;
if (origHasConflict && !rerunHasConflict) conflictsResolved++;
if (!origHasConflict && rerunHasConflict) consensusBroken++;
}
// ── Report ──────────────────────────────────────────────────────────────
console.log("═══ ORPHAN WORD RE-ANNOTATION DIFF REPORT ═══\n");
console.log(`Paragraphs compared: ${comparable.length}`);
console.log(` Category consensus changed: ${catChanged} (${((catChanged / comparable.length) * 100).toFixed(1)}%)`);
console.log(` Specificity consensus changed: ${specChanged} (${((specChanged / comparable.length) * 100).toFixed(1)}%)`);
console.log(` Either dimension changed: ${eitherChanged} (${((eitherChanged / comparable.length) * 100).toFixed(1)}%)`);
console.log(`\n─── Per-Model Category Changes ───`);
for (const [model, count] of [...perModelCatChanges.entries()].sort((a, b) => b[1] - a[1])) {
const short = model.split("/")[1] ?? model;
console.log(` ${short}: ${count} (${((count / comparable.length) * 100).toFixed(1)}%)`);
}
console.log(`\n─── Per-Model Specificity Changes ───`);
for (const [model, count] of [...perModelSpecChanges.entries()].sort((a, b) => b[1] - a[1])) {
const short = model.split("/")[1] ?? model;
console.log(` ${short}: ${count} (${((count / comparable.length) * 100).toFixed(1)}%)`);
}
console.log(`\n─── Conflict Resolution ───`);
console.log(` Original conflicts: ${origConflicts}`);
console.log(` Re-run conflicts: ${rerunConflicts}`);
console.log(` Conflicts resolved (orig conflict → rerun consensus): ${conflictsResolved}`);
console.log(` Consensus broken (orig consensus → rerun conflict): ${consensusBroken}`);
console.log(` Net conflict change: ${conflictsResolved - consensusBroken > 0 ? "-" : "+"}${Math.abs(conflictsResolved - consensusBroken)}`);
console.log(`\n─── Category Distribution (Consensus) ───`);
console.log(` ${"Category".padEnd(30)} ${"Original".padStart(8)} ${"Re-run".padStart(8)} ${"Delta".padStart(8)}`);
const allCats = new Set([...origCatDist.keys(), ...rerunCatDist.keys()]);
for (const cat of [...allCats].sort()) {
const orig = origCatDist.get(cat) ?? 0;
const rerun = rerunCatDist.get(cat) ?? 0;
const delta = rerun - orig;
const sign = delta > 0 ? "+" : "";
console.log(` ${cat.padEnd(30)} ${String(orig).padStart(8)} ${String(rerun).padStart(8)} ${(sign + delta).padStart(8)}`);
}
console.log(`\n─── Specificity Distribution (Consensus) ───`);
console.log(` ${"Level".padEnd(10)} ${"Original".padStart(8)} ${"Re-run".padStart(8)} ${"Delta".padStart(8)}`);
for (const level of [1, 2, 3, 4]) {
const orig = origSpecDist.get(level) ?? 0;
const rerun = rerunSpecDist.get(level) ?? 0;
const delta = rerun - orig;
const sign = delta > 0 ? "+" : "";
console.log(` ${String(level).padEnd(10)} ${String(orig).padStart(8)} ${String(rerun).padStart(8)} ${(sign + delta).padStart(8)}`);
}
console.log(`\n─── Top Category Transitions ───`);
const transitions: [string, string, number][] = [];
for (const [from, tos] of catTransitions) {
for (const [to, count] of tos) {
transitions.push([from, to, count]);
}
}
transitions.sort((a, b) => b[2] - a[2]);
for (const [from, to, count] of transitions.slice(0, 15)) {
console.log(` ${from}${to}: ${count}`);
}


@ -0,0 +1,190 @@
/**
* Extract styled headings (bold, underline, h-tags) from SEC filing HTML.
* Produces a per-filing heading cache for paragraph heading detection.
*
* Usage: bun run ts/scripts/extract-html-headings.ts
*
* Input: data/raw/html/*.html + data/paragraphs/quality/ambiguous-filings.txt
* Output: data/paragraphs/quality/filing-headings.jsonl
* Each line: {"accession": "...", "headings": ["heading1", "heading2", ...]}
*/
import { readFileSync, writeFileSync, existsSync } from "node:fs";
import { cpus } from "node:os";
const HTML_DIR = "data/raw/html";
const FILING_LIST = "data/paragraphs/quality/ambiguous-filings.txt";
const OUTPUT = "data/paragraphs/quality/filing-headings.jsonl";
/**
* Extract styled text (bold, underline, h-tags) from HTML within Item 1C.
* Returns an array of heading strings found.
*/
function extractStyledHeadings(html: string): string[] {
// Find Item 1C region (rough — look for "Item 1C" and take the next ~200KB)
const item1cMatch = html.match(/item\s*1c/i);
if (!item1cMatch || item1cMatch.index === undefined) return [];
const startIdx = item1cMatch.index;
// Look for next Item boundary or end of filing
const nextItemMatch = html.slice(startIdx + 20).match(/item\s+(?:2|1[a-bd-z]|[3-9])/i);
  const endIdx = nextItemMatch?.index !== undefined // check explicitly: index 0 is falsy but valid
    ? startIdx + 20 + nextItemMatch.index
    : Math.min(startIdx + 200000, html.length);
const section = html.slice(startIdx, endIdx);
const headings: string[] = [];
// Pattern 1: <b> or <strong> tags
const boldRegex = /<(?:b|strong)[^>]*>([\s\S]*?)<\/(?:b|strong)>/gi;
for (const m of section.matchAll(boldRegex)) {
const text = stripTags(m[1]!).trim();
if (isHeadingCandidate(text)) headings.push(text);
}
// Pattern 2: font-weight: bold or font-weight: 700 in inline styles
const boldStyleRegex = /<[^>]+font-weight\s*:\s*(?:bold|[6-9]00)[^>]*>([\s\S]*?)<\/[^>]+>/gi;
for (const m of section.matchAll(boldStyleRegex)) {
const text = stripTags(m[1]!).trim();
if (isHeadingCandidate(text)) headings.push(text);
}
// Pattern 3: text-decoration: underline
const underlineRegex = /<[^>]+text-decoration\s*:\s*underline[^>]*>([\s\S]*?)<\/[^>]+>/gi;
for (const m of section.matchAll(underlineRegex)) {
const text = stripTags(m[1]!).trim();
if (isHeadingCandidate(text)) headings.push(text);
}
// Pattern 4: h1-h6 tags
const hRegex = /<h[1-6][^>]*>([\s\S]*?)<\/h[1-6]>/gi;
for (const m of section.matchAll(hRegex)) {
const text = stripTags(m[1]!).trim();
if (isHeadingCandidate(text)) headings.push(text);
}
// Deduplicate and normalize
const seen = new Set<string>();
const unique: string[] = [];
for (const h of headings) {
const normalized = h.replace(/\s+/g, " ").trim();
if (normalized.length < 3) continue;
const key = normalized.toLowerCase();
if (!seen.has(key)) {
seen.add(key);
unique.push(normalized);
}
}
return unique;
}
/** Strip HTML tags from a string. */
function stripTags(html: string): string {
return html
.replace(/<[^>]+>/g, " ")
.replace(/&nbsp;|&#160;/gi, " ")
.replace(/&amp;/g, "&")
.replace(/&lt;/g, "<")
.replace(/&gt;/g, ">")
.replace(/&quot;/g, '"')
.replace(/&#39;|&apos;/g, "'")
.replace(/&mdash;|&#8212;/g, "—")
    .replace(/&ndash;|&#8211;/g, "–")
.replace(/\s+/g, " ")
.trim();
}
/** Check if extracted styled text looks like a heading (not body text). */
function isHeadingCandidate(text: string): boolean {
if (text.length < 3 || text.length > 150) return false;
const words = text.split(/\s+/);
if (words.length > 15) return false;
// Must contain at least one heading-like keyword
if (!/(?:risk|management|strategy|cybersecurity|cyber|governance|oversight|board|directors?|incident|response|recovery|planning|detection|program|process|third[- ]party|security|threats?|assessment|compliance|safeguards?|awareness|training|education|monitoring|integration|framework|practices|personnel|role|controls|policies|procedures|reporting|identification|disclosure|material|enterprise|technology|overview|impact|effects?|vulnerabilit)/i.test(text)) {
return false;
}
return true;
}
// ─── Worker mode ───
const args = process.argv.slice(2);
if (args[0] === "--worker") {
const startIdx = parseInt(args[1]!);
const endIdx = parseInt(args[2]!);
const outFile = args[3]!;
const filings = readFileSync(FILING_LIST, "utf-8").trim().split("\n").slice(startIdx, endIdx);
const results: string[] = [];
for (const acc of filings) {
const htmlPath = `${HTML_DIR}/${acc}.html`;
if (!existsSync(htmlPath)) continue;
const html = readFileSync(htmlPath, "utf-8");
const headings = extractStyledHeadings(html);
results.push(JSON.stringify({ accession: acc, headings }));
}
writeFileSync(outFile, results.join("\n") + (results.length > 0 ? "\n" : ""));
process.exit(0);
}
// ─── Main mode ───
const start = Date.now();
const filings = readFileSync(FILING_LIST, "utf-8").trim().split("\n");
const nproc = cpus().length;
const chunkSize = Math.ceil(filings.length / nproc);
process.stderr.write(` ${filings.length} filings, ${nproc} workers\n`);
const tmpFiles: string[] = [];
const workers: ReturnType<typeof Bun.spawn>[] = [];
for (let i = 0; i < nproc; i++) {
const s = i * chunkSize;
const e = Math.min(s + chunkSize, filings.length);
if (s >= filings.length) break;
const tmpFile = `${OUTPUT}.tmp-${i}`;
tmpFiles.push(tmpFile);
workers.push(
Bun.spawn(
["bun", "run", import.meta.filename, "--worker", String(s), String(e), tmpFile],
{ stderr: "inherit" },
)
);
}
for (const w of workers) await w.exited;
// Merge
const allResults: string[] = [];
for (const tmp of tmpFiles) {
if (existsSync(tmp)) {
const content = readFileSync(tmp, "utf-8").trim();
if (content) allResults.push(content);
try { require("node:fs").unlinkSync(tmp); } catch {}
}
}
writeFileSync(OUTPUT, allResults.join("\n") + "\n");
const elapsed = ((Date.now() - start) / 1000).toFixed(1);
// Count stats
let totalHeadings = 0;
let filingsWithHeadings = 0;
for (const line of allResults.join("\n").split("\n")) {
if (!line.trim()) continue;
const d = JSON.parse(line);
if (d.headings.length > 0) {
filingsWithHeadings++;
totalHeadings += d.headings.length;
}
}
process.stderr.write(
`\n Done in ${elapsed}s\n` +
` ${filings.length} filings processed\n` +
` ${filingsWithHeadings} filings with styled headings\n` +
` ${totalHeadings} total heading instances\n` +
` Output: ${OUTPUT}\n`,
);


@ -0,0 +1,73 @@
/**
* Merge original Stage 1 annotations with orphan-word re-run annotations.
*
* For paragraphs that were re-annotated, replaces original annotations with
* re-run annotations. For all other paragraphs, keeps original annotations.
* Original stage1.jsonl is NOT modified.
*
* Usage: bun ts/scripts/merge-annotations.ts
*
* Output: data/annotations/stage1.patched.jsonl
*/
import { readFileSync, writeFileSync } from "node:fs";
const DATA_DIR = new URL("../../data", import.meta.url).pathname;
const ORIG_PATH = `${DATA_DIR}/annotations/stage1.jsonl`;
const RERUN_PATH = `${DATA_DIR}/annotations/stage1-orphan-rerun.jsonl`;
const OUTPUT_PATH = `${DATA_DIR}/annotations/stage1.patched.jsonl`;
interface Annotation {
paragraphId: string;
provenance: { modelId: string };
[key: string]: unknown;
}
// Load re-run annotations, keyed by paragraphId|modelId
const rerunMap = new Map<string, string>(); // key -> raw JSON line
const rerunPids = new Set<string>();
for (const line of readFileSync(RERUN_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const ann = JSON.parse(line) as Annotation;
const key = `${ann.paragraphId}|${ann.provenance.modelId}`;
rerunMap.set(key, line);
rerunPids.add(ann.paragraphId);
}
console.error(`Re-run annotations: ${rerunMap.size} (${rerunPids.size} paragraphs)`);
// Stream through original, replacing where re-run exists
let kept = 0;
let replaced = 0;
const output: string[] = [];
for (const line of readFileSync(ORIG_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const ann = JSON.parse(line) as Annotation;
const key = `${ann.paragraphId}|${ann.provenance.modelId}`;
if (rerunMap.has(key)) {
output.push(rerunMap.get(key)!);
rerunMap.delete(key); // mark as used
replaced++;
} else {
output.push(line);
kept++;
}
}
// Any re-run annotations not matched to originals (shouldn't happen, but be safe)
let added = 0;
for (const [, line] of rerunMap) {
output.push(line);
added++;
}
writeFileSync(OUTPUT_PATH, output.join("\n") + "\n");
console.error(
`\nMerge complete:` +
`\n ${kept} original annotations kept` +
`\n ${replaced} annotations replaced with re-run` +
`\n ${added} new annotations added` +
`\n ${output.length} total annotations` +
`\n Output: ${OUTPUT_PATH}`,
);
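The replacement rule is easy to state in miniature (hypothetical records, with a `v` field standing in for the full annotation payload):

```typescript
// A re-run annotation replaces the original that shares its
// paragraphId|modelId key; unmatched originals pass through unchanged.
type Rec = { paragraphId: string; provenance: { modelId: string }; v: number };
const keyOf = (r: Rec) => `${r.paragraphId}|${r.provenance.modelId}`;

const rerun: Rec[] = [{ paragraphId: "p1", provenance: { modelId: "m1" }, v: 2 }];
const orig: Rec[] = [
  { paragraphId: "p1", provenance: { modelId: "m1" }, v: 1 },
  { paragraphId: "p2", provenance: { modelId: "m1" }, v: 1 },
];

const rerunMap = new Map(rerun.map((r) => [keyOf(r), r]));
const merged = orig.map((r) => rerunMap.get(keyOf(r)) ?? r);

console.log(merged.map((r) => `${r.paragraphId}:v${r.v}`).join(","));
// → p1:v2,p2:v1
```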


@ -0,0 +1,174 @@
/**
* Expanded orphan word patch: recover dropped leading words for all
* paragraphs that start with lowercase (non-list patterns).
*
* For each candidate paragraph:
* 1. Read the source HTML for the filing
* 2. Strip HTML to plain text
* 3. Find the paragraph text in the stripped output
* 4. Look backwards to find the orphaned word on its own line
* 5. Validate: orphaned word must be short (1-3 words), start with uppercase
* 6. Output patch record
*
* Usage: bun run ts/scripts/patch-orphan-words.ts
* Input: data/paragraphs/paragraphs-clean.jsonl
* Output: data/paragraphs/patches/orphan-word-patches.jsonl
*/
import { readFileSync, writeFileSync, mkdirSync, existsSync } from "node:fs";
import { stripHtml } from "../src/extract/html-cleaner.ts";
const PARAGRAPHS_PATH = "data/paragraphs/paragraphs-clean.jsonl";
const HTML_DIR = "data/raw/html";
const OUTPUT_PATH = "data/paragraphs/patches/orphan-word-patches.jsonl";
// List patterns to exclude (legitimate lowercase starts)
const LIST_PATTERNS = /^(and |or |including |such as |as well as |along with |that |which |where |whether |as described |for example|for more |pursuant to |in addition )/i;
interface Paragraph {
id: string;
text: string;
textHash: string;
wordCount: number;
paragraphIndex: number;
filing: {
accessionNumber: string;
companyName: string;
[key: string]: unknown;
};
}
interface PatchRecord {
id: string;
accession: string;
paragraphIndex: number;
orphanWord: string;
originalStart: string;
patchedStart: string;
method: string;
}
// Cache stripped HTML per filing
const strippedCache = new Map<string, string>();
function getStrippedHtml(accession: string): string | null {
if (strippedCache.has(accession)) return strippedCache.get(accession)!;
const htmlPath = `${HTML_DIR}/${accession}.html`;
if (!existsSync(htmlPath)) return null;
const html = readFileSync(htmlPath, "utf-8");
const stripped = stripHtml(html);
strippedCache.set(accession, stripped);
return stripped;
}
function findOrphanWord(stripped: string, paragraphText: string): string | null {
// Use first 80 chars to search — avoids paragraph-end differences
const searchText = paragraphText.substring(0, Math.min(80, paragraphText.length));
const idx = stripped.indexOf(searchText);
if (idx === -1) return null;
// Look backwards to find the orphaned word
const before = stripped.substring(Math.max(0, idx - 200), idx);
const lines = before.split("\n");
const candidates = lines.filter((l) => l.trim().length > 0);
if (candidates.length === 0) return null;
const lastLine = candidates[candidates.length - 1]!.trim();
// Validate: short (1-3 words), starts with uppercase
const words = lastLine.split(/\s+/);
if (words.length > 3 || words.length === 0) return null;
if (!/^[A-Z]/.test(words[0]!)) return null;
// Reject all-caps headings (>15 chars)
if (lastLine === lastLine.toUpperCase() && lastLine.length > 15) return null;
// Reject section/item references and page artifacts
if (/^(item|part|section)\s/i.test(lastLine)) return null;
if (/^\d+[\.\)]/.test(lastLine)) return null;
if (/^table of contents$/i.test(lastLine)) return null;
return lastLine;
}
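Steps 3 and 4 of the recovery can be traced on a toy fragment (invented text; the real `findOrphanWord` additionally applies the word-count, capitalization, and heading rejections shown above):

```typescript
// A paragraph whose leading word ("Our") was split onto its own line in
// the stripped HTML; the lookback re-attaches it to the paragraph text.
const stripped =
  "Item 1C. Cybersecurity\nOur\nboard of directors oversees cybersecurity risk through its audit committee.";
const paragraphText =
  "board of directors oversees cybersecurity risk through its audit committee.";

// Locate the paragraph in the stripped output, then scan backwards.
const idx = stripped.indexOf(paragraphText.substring(0, 80));
const before = stripped.substring(Math.max(0, idx - 200), idx);
const lines = before.split("\n").filter((l) => l.trim().length > 0);
const orphan = lines[lines.length - 1]!.trim();

console.log(orphan + " " + paragraphText);
// → Our board of directors oversees cybersecurity risk through its audit committee.
```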
// ─── Main ───
const start = Date.now();
mkdirSync("data/paragraphs/patches", { recursive: true });
process.stderr.write(" Loading paragraphs...\n");
const paragraphs: Paragraph[] = [];
for (const line of readFileSync(PARAGRAPHS_PATH, "utf-8").split("\n")) {
if (line.trim()) paragraphs.push(JSON.parse(line));
}
process.stderr.write(` ${paragraphs.length} paragraphs loaded\n`);
// Find candidates
const candidateParas = paragraphs.filter((p) => {
if (!p.text || p.text.length === 0) return false;
if (!/^[a-z]/.test(p.text)) return false;
if (LIST_PATTERNS.test(p.text)) return false;
return true;
});
process.stderr.write(` ${candidateParas.length} orphan word candidates\n\n`);
// Process
const patches: PatchRecord[] = [];
let notFound = 0;
let noOrphan = 0;
let lastAcc = "";
for (let i = 0; i < candidateParas.length; i++) {
const p = candidateParas[i]!;
const acc = p.filing.accessionNumber;
if (acc !== lastAcc) {
if (strippedCache.size > 20) strippedCache.clear();
lastAcc = acc;
}
const stripped = getStrippedHtml(acc);
if (!stripped) { notFound++; continue; }
const orphan = findOrphanWord(stripped, p.text);
if (!orphan) { noOrphan++; continue; }
patches.push({
id: p.id,
accession: acc,
paragraphIndex: p.paragraphIndex,
orphanWord: orphan,
originalStart: p.text.substring(0, 60),
patchedStart: orphan + " " + p.text.substring(0, 60),
method: "html-lookback",
});
if ((i + 1) % 200 === 0) {
process.stderr.write(
`\x1b[2K\r ${i + 1}/${candidateParas.length} | ${patches.length} patched | ${noOrphan} no orphan | ${notFound} no HTML`,
);
}
}
writeFileSync(OUTPUT_PATH, patches.map((p) => JSON.stringify(p)).join("\n") + "\n");
const elapsed = ((Date.now() - start) / 1000).toFixed(1);
process.stderr.write(
`\n\n Done in ${elapsed}s\n` +
` ${candidateParas.length} candidates → ${patches.length} patches found\n` +
` ${noOrphan} candidates: no orphan word found in HTML\n` +
` ${notFound} candidates: HTML file not found\n` +
` Output: ${OUTPUT_PATH}\n`,
);
// Word frequency summary
const wordCounts = new Map<string, number>();
for (const p of patches) {
wordCounts.set(p.orphanWord, (wordCounts.get(p.orphanWord) ?? 0) + 1);
}
const sorted = [...wordCounts.entries()].sort((a, b) => b[1] - a[1]);
process.stderr.write("\n Top orphan words:\n");
for (const [word, count] of sorted.slice(0, 15)) {
process.stderr.write(` ${word}: ${count}\n`);
}


@ -0,0 +1,175 @@
/**
* Re-run Stage 1 annotations on orphan-word-patched paragraphs.
*
* Loads paragraphs that had orphan words restored, runs all 3 Stage 1 models
* on the PATCHED text, and saves to a separate annotation file.
* Original annotations in stage1.jsonl are NOT modified.
*
* Usage:
* bun ts/scripts/rerun-orphan-stage1.ts [--concurrency 60]
*
* Input:
* data/paragraphs/training.patched.jsonl patched paragraph text
* data/paragraphs/patches/orphan-word-patches.jsonl patch records (for ID filtering)
*
* Output:
* data/annotations/stage1-orphan-rerun.jsonl new annotations (separate file)
*/
import { readJsonl, readJsonlRaw, appendJsonl } from "../src/lib/jsonl.ts";
import { Paragraph } from "@sec-cybert/schemas/paragraph.ts";
import { STAGE1_MODELS } from "../src/lib/openrouter.ts";
import { annotateParagraph, type AnnotateOpts } from "../src/label/annotate.ts";
import { PROMPT_VERSION } from "../src/label/prompts.ts";
import { v4 as uuidv4 } from "uuid";
import { mkdir } from "node:fs/promises";
import { existsSync, readFileSync } from "node:fs";
import pLimit from "p-limit";
// ── Args ────────────────────────────────────────────────────────────────
const args = process.argv.slice(2);
function flag(name: string): string | undefined {
const idx = args.indexOf(`--${name}`);
return idx === -1 ? undefined : args[idx + 1];
}
const CONCURRENCY = parseInt(flag("concurrency") ?? "60", 10);
const DATA_DIR = new URL("../../data", import.meta.url).pathname;
const TRAINING_PATH = `${DATA_DIR}/paragraphs/training.patched.jsonl`;
const PATCHES_PATH = `${DATA_DIR}/paragraphs/patches/orphan-word-patches.jsonl`;
const OUTPUT_DIR = `${DATA_DIR}/annotations`;
const OUTPUT_PATH = `${OUTPUT_DIR}/stage1-orphan-rerun.jsonl`;
// ── Main ────────────────────────────────────────────────────────────────
async function main() {
if (!existsSync(OUTPUT_DIR)) await mkdir(OUTPUT_DIR, { recursive: true });
// Load orphan-word patch IDs
console.error("Loading orphan-word patch IDs...");
const patchIds = new Set<string>();
for (const line of readFileSync(PATCHES_PATH, "utf-8").split("\n")) {
if (!line.trim()) continue;
const rec = JSON.parse(line) as { id: string };
patchIds.add(rec.id);
}
console.error(` ${patchIds.size} patched paragraph IDs`);
// Load patched training data, filter to orphan-word paragraphs only
console.error(`Loading patched paragraphs from ${TRAINING_PATH}...`);
const { records: allParagraphs, skipped } = await readJsonl(TRAINING_PATH, Paragraph);
if (skipped > 0) console.error(` ⚠ Skipped ${skipped} invalid lines`);
const paragraphs = allParagraphs.filter((p) => patchIds.has(p.id));
console.error(` ${paragraphs.length} orphan-word paragraphs in training set`);
console.error(` Models: ${STAGE1_MODELS.join(", ")}`);
console.error(` Prompt: ${PROMPT_VERSION}`);
console.error(` Concurrency: ${CONCURRENCY}`);
const totalJobs = paragraphs.length * STAGE1_MODELS.length;
console.error(` Total annotations needed: ${totalJobs.toLocaleString()}`);
// Load existing results for resume
const doneKeys = new Set<string>();
let resumedCost = 0;
if (existsSync(OUTPUT_PATH)) {
const { records: existing } = await readJsonlRaw(OUTPUT_PATH);
for (const rec of existing) {
const r = rec as { paragraphId?: string; provenance?: { modelId?: string; costUsd?: number } };
if (r.paragraphId && r.provenance?.modelId) {
doneKeys.add(`${r.paragraphId}|${r.provenance.modelId}`);
resumedCost += r.provenance.costUsd ?? 0;
}
}
if (doneKeys.size > 0) {
console.error(` Resuming: ${doneKeys.size} already done ($${resumedCost.toFixed(2)}), ${totalJobs - doneKeys.size} remaining`);
}
}
if (doneKeys.size >= totalJobs) {
console.error(" All annotations already complete!");
return;
}
// Build job list
type Job = { paragraph: Paragraph; modelId: string };
const jobs: Job[] = [];
for (const paragraph of paragraphs) {
for (const modelId of STAGE1_MODELS) {
if (!doneKeys.has(`${paragraph.id}|${modelId}`)) {
jobs.push({ paragraph, modelId });
}
}
}
console.error(` Jobs to run: ${jobs.length.toLocaleString()}\n`);
// Run with concurrency limiter
const runId = uuidv4();
const limit = pLimit(CONCURRENCY);
let completed = doneKeys.size;
let failed = 0;
let sessionCost = 0;
const startTime = Date.now();
// Progress logging
const logInterval = setInterval(() => {
const elapsed = (Date.now() - startTime) / 1000;
const done = completed - doneKeys.size;
const rate = done / elapsed;
const remaining = totalJobs - completed;
const eta = rate > 0 ? remaining / rate : Infinity;
const etaMin = Math.floor(eta / 60);
const etaSec = Math.round(eta % 60);
process.stderr.write(
`\x1b[2K\r ${completed.toLocaleString()}/${totalJobs.toLocaleString()} (${((completed / totalJobs) * 100).toFixed(1)}%)` +
` $${(resumedCost + sessionCost).toFixed(4)}` +
` ${rate.toFixed(1)}/s` +
` ETA ${etaMin}m${etaSec.toString().padStart(2, "0")}s` +
` ${failed} failed`,
);
}, 2000);
const tasks = jobs.map((job) =>
limit(async () => {
const opts: AnnotateOpts = {
modelId: job.modelId,
stage: "stage1",
runId,
promptVersion: PROMPT_VERSION,
reasoningEffort: "low",
};
try {
const ann = await annotateParagraph(job.paragraph, opts);
await appendJsonl(OUTPUT_PATH, ann);
sessionCost += ann.provenance.costUsd;
completed++;
} catch (error) {
failed++;
const msg = error instanceof Error ? error.message : String(error);
console.error(`\n ✖ ${job.modelId} × ${job.paragraph.id}: ${msg}`);
}
}),
);
await Promise.all(tasks);
clearInterval(logInterval);
const elapsed = ((Date.now() - startTime) / 1000).toFixed(0);
console.error(
`\n\n ═══ ORPHAN WORD RE-ANNOTATION COMPLETE ═══` +
`\n Annotations: ${completed.toLocaleString()}/${totalJobs.toLocaleString()}` +
`\n Failed: ${failed}` +
`\n Session cost: $${sessionCost.toFixed(4)}` +
`\n Total cost: $${(resumedCost + sessionCost).toFixed(4)}` +
`\n Wall time: ${elapsed}s` +
`\n Output: ${OUTPUT_PATH}`,
);
if (failed > 0) {
console.error(`\n ⚠ ${failed} failures — re-run this script to retry them.`);
}
}
main().catch((err) => {
console.error(err);
process.exit(1);
});


@ -0,0 +1,393 @@
/**
* Tag every SEC filing HTML with its generator tool.
*
* Usage: bun run ts/scripts/tag-generators.ts
*
* Reads first 20KB of each HTML file in data/raw/html/, detects the
* generator using heuristics ported from scripts/detect_generators.py,
* and writes a mapping JSONL to data/paragraphs/quality/generator-tags.jsonl.
*
* Uses Bun.spawn worker parallelism (same pattern as dapt-corpus-prep.ts).
*/
import {
readdirSync,
readFileSync,
writeFileSync,
mkdirSync,
unlinkSync,
createReadStream,
} from "node:fs";
import { createInterface } from "node:readline";
import { cpus } from "node:os";
import { basename } from "node:path";
const HTML_DIR = "data/raw/html";
const OUTPUT_DIR = "data/paragraphs/quality";
const OUTPUT_FILE = `${OUTPUT_DIR}/generator-tags.jsonl`;
const READ_BYTES = 20_000;
// Known SEC filing agent CIKs (accession number prefixes)
const FILING_AGENT_CIKS: Record<string, string> = {
"0000950170": "Donnelley Financial Solutions",
"0001193125": "Donnelley Financial Solutions",
"0001558370": "Toppan Merrill",
"0001654954": "Toppan Merrill",
"0001104659": "Toppan Merrill",
};
// ─── Generator normalization ───
function normalizeGenerator(raw: string): string {
const r = raw.trim().toLowerCase();
if (r.includes("workiva") || r.includes("wdesk")) return "Workiva";
if (r.includes("donnelley") || r.includes("dfin") || r.includes("rrdonnelley"))
return "Donnelley Financial Solutions";
if (r.includes("toppan") || (r.includes("merrill") && r.includes("bridge")))
return "Toppan Merrill";
if (r.includes("word") && r.includes("microsoft")) return "Microsoft Word";
if (r.includes("excel") && r.includes("microsoft")) return "Microsoft Excel";
if (r.includes("thunderdome")) return "ThunderDome";
if (r.includes("goxbrl")) return "GoXBRL";
if (r.includes("compsci")) return "CompSci Transform";
if (r.includes("certent")) return "Certent";
if (r.includes("iris carbon")) return "IRIS Carbon";
if (r.includes("broadridge") || r.includes("profile")) return "Broadridge PROfile";
if (r.includes("sec publisher")) return "SEC Publisher";
return raw.trim();
}
// ─── Generator detection (ported from detect_generators.py) ───
function detectGenerator(filepath: string): string {
// Use sync read for worker perf; only the first READ_BYTES are inspected
const raw = readFileSync(filepath);
const text = raw.subarray(0, READ_BYTES).toString("utf-8");
const textLower = text.toLowerCase();
// --- Explicit generator metadata ---
// 1. <meta name="generator" content="...">
let m: RegExpMatchArray | null;
m =
text.match(
/<meta\s+name\s*=\s*["']generator["']\s+content\s*=\s*["']([^"']+)["']/i,
) ??
text.match(
/<meta\s+content\s*=\s*["']([^"']+)["']\s+name\s*=\s*["']generator["']/i,
);
if (m) return normalizeGenerator(m[1]!);
// 2. <meta name="Creator" content="...">
m = text.match(
/<meta\s+name\s*=\s*["']Creator["']\s+content\s*=\s*["']([^"']+)["']/i,
);
if (m) return normalizeGenerator(m[1]!);
// 3. <meta name="Producer" content="...">
m = text.match(
/<meta\s+name\s*=\s*["']Producer["']\s+content\s*=\s*["']([^"']+)["']/i,
);
if (m) return normalizeGenerator(m[1]!);
// 4. ProgId meta tag
m = text.match(
/<meta\s+name\s*=\s*["']ProgId["']\s+content\s*=\s*["']([^"']+)["']/i,
);
if (m) {
const progid = m[1]!;
if (/word/i.test(progid)) return "Microsoft Word";
if (/excel/i.test(progid)) return "Microsoft Excel";
return normalizeGenerator(progid);
}
// --- HTML comment signatures ---
// Workiva / Wdesk
if (/<!--.*Created with the Workiva Platform.*-->/i.test(text)) return "Workiva";
if (/<!--.*Copyright\s+\d{4}\s+Workiva.*-->/i.test(text)) return "Workiva";
if (/<!--.*Document created using Wdesk.*-->/i.test(text)) return "Workiva";
// Toppan Merrill / Bridge
if (/<!--.*(?:Toppan\s*Merrill|iXBRL document created with.*Toppan).*-->/i.test(text))
return "Toppan Merrill";
if (/<!--.*Merrill\s*Bridge.*-->/i.test(text)) return "Toppan Merrill";
// Donnelley Financial Solutions / RR Donnelley
if (/<!--.*Donnelley Financial Solutions.*-->/i.test(text))
return "Donnelley Financial Solutions";
if (/<!--.*RR\s*Donnelley.*-->/i.test(text)) return "Donnelley Financial Solutions";
// Broadridge PROfile
if (/<!--.*Broadridge\s+PROfile.*-->/i.test(text)) return "Broadridge PROfile";
if (textLower.includes("broadridge")) return "Broadridge PROfile";
// SEC Publisher
const titleMatch = text.match(/<title[^>]*>([^<]+)<\/title>/i);
const titleText = titleMatch ? titleMatch[1]!.trim() : "";
if (textLower.includes("sec publisher") || titleText.toLowerCase().includes("sec publisher"))
return "SEC Publisher";
// IRIS Carbon
if (/<!--.*Powered by IRIS Carbon.*-->/i.test(text)) return "IRIS Carbon";
// Certent
if (/<!--.*Certent\s+Disclosure\s+Management.*-->/i.test(text)) return "Certent";
if (textLower.includes("certent")) return "Certent";
// CompSci Resources
if (/<!--.*CompSci Resources.*-->/i.test(text)) return "CompSci Transform";
// RDG Portal
if (/<!--.*RDG Portal.*-->/i.test(text)) return "RDG Portal";
// PDF to EDGAR
if (titleText.toLowerCase() === "pdf to edgar" || textLower.slice(0, 2000).includes("pdf to edgar"))
return "PDF to EDGAR";
// Generic generated/created by comments
m = text.match(/<!--\s*Generated\s+by\s+([^-]+?)-->/i);
if (m) {
const val = m[1]!.trim();
if (!/^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}/.test(val)) return normalizeGenerator(val);
}
m = text.match(/<!--\s*Created\s+(?:by|with)\s+([^-]+?)-->/i);
if (m) {
const val = m[1]!.trim();
if (!/^\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}/.test(val)) return normalizeGenerator(val);
}
// --- Keyword signatures ---
if (/\bwdesk\b/.test(textLower)) return "Workiva";
if (/\bworkiva\b/.test(textLower)) return "Workiva";
if (/\brrdonnelley\b/.test(textLower)) return "Donnelley Financial Solutions";
if (/\bedgar-online\b/.test(textLower)) return "Donnelley Financial Solutions";
if (/\btoppan\b/.test(textLower)) return "Toppan Merrill";
if (/\bmerrill\b/.test(textLower) && /\b(?:bridge|ixbrl|xbrl)\b/.test(textLower))
return "Toppan Merrill";
if (/\bbowne\b/.test(textLower)) return "Toppan Merrill";
if (/\bcompsci\b/.test(textLower)) return "CompSci Transform";
if (/\bthunderdome\b/.test(textLower)) return "ThunderDome";
if (/\bgoxbrl\b/.test(textLower)) return "GoXBRL";
// CSS class naming patterns
if (/class\s*=\s*["'][^"']*\bwk_\w+/.test(textLower)) return "Workiva";
// --- SGML document wrapper detection ---
const hasSgml = /<DOCUMENT>\s*\n?\s*<TYPE>/i.test(text);
if (hasSgml) {
const fnMatch = text.match(/<FILENAME>\s*([\w\-\.]+)/i);
if (fnMatch) {
const filename = fnMatch[1]!.toLowerCase();
if (/^d\d+/.test(filename)) return "Donnelley Financial Solutions";
if (/^tm\d+/.test(filename)) return "Toppan Merrill";
if (/^ea\d+/.test(filename)) return "EFiling/EDGAR Agent";
}
if (
textLower.includes("<!-- field: rule-page") ||
textLower.slice(0, 5000).includes("rule-page")
)
return "Broadridge PROfile";
if (textLower.includes("field: set; name: xdx")) return "EFiling XDX";
if (textLower.slice(0, 5000).includes("<!-- field:")) return "EFiling/EDGAR Agent";
if (/<Center><DIV STYLE="width:8\.5in"/.test(text))
return "Donnelley Financial Solutions";
// Check accession prefix
const bn = basename(filepath);
const accessionPrefix = bn.split("-")[0]!;
if (accessionPrefix in FILING_AGENT_CIKS)
return FILING_AGENT_CIKS[accessionPrefix]!;
// Legacy font-based
const fontCount = (textLower.match(/<font/g) ?? []).length;
if (fontCount > 5) return "SGML-wrapped (legacy/font-based)";
return "SGML-wrapped (unknown)";
}
// --- Inline XBRL detection ---
const hasIxNs =
textLower.includes("xmlns:ix=") || textLower.includes("<ix:header");
// Structural: Donnelley uppercase P STYLE + Center DIV 8.5in
if (
/<P STYLE="[^"]*font-family:Times New Roman"/.test(text) &&
/<Center><DIV STYLE="width:8\.5in"/.test(text)
)
return "Donnelley Financial Solutions";
// Title tag tool names
if (titleText) {
const tl = titleText.toLowerCase();
if (tl.includes("workiva") || tl.includes("wdesk")) return "Workiva";
}
if (hasIxNs) {
if (textLower.includes("field: set; name: xdx")) return "EFiling XDX";
if (textLower.includes("<!-- field: rule")) return "Broadridge PROfile";
if (textLower.slice(0, 5000).includes("<!-- field:")) return "EFiling/EDGAR Agent";
// Filing agent CIK-based
const bn = basename(filepath);
const accessionPrefix = bn.split("-")[0]!;
if (accessionPrefix in FILING_AGENT_CIKS)
return FILING_AGENT_CIKS[accessionPrefix]!;
// XML declaration encoding
if (textLower.slice(0, 200).includes('<?xml version="1.0" encoding="utf-8"'))
return "Inline XBRL (utf-8 toolchain)";
if (textLower.slice(0, 200).includes("<?xml version='1.0' encoding='ascii'?>"))
return "Inline XBRL (SEC/EDGAR standard)";
return "Inline XBRL (tool unresolved)";
}
// --- Structural fallbacks ---
const fontCount = (textLower.match(/<font/g) ?? []).length;
const tdCount = (textLower.match(/<td/g) ?? []).length;
const spanCount = (textLower.match(/<span/g) ?? []).length;
if (fontCount > 20) return "Legacy generator (font-based)";
if (tdCount > 50 && spanCount < 10) return "Table-based generator";
const dataAttrCount = (textLower.match(/\bdata-\w+/g) ?? []).length;
if (dataAttrCount > 10) return "Modern web tooling";
return "Unknown";
}
// ─── Worker mode ───
const args = process.argv.slice(2);
if (args[0] === "--worker") {
const startIdx = parseInt(args[1]!);
const endIdx = parseInt(args[2]!);
const outFile = args[3]!;
const htmlFiles = readdirSync(HTML_DIR)
.filter((f: string) => f.endsWith(".html"))
.sort()
.slice(startIdx, endIdx);
const records: string[] = [];
for (const file of htmlFiles) {
const accession = file.replace(".html", "");
const generator = detectGenerator(`${HTML_DIR}/${file}`);
records.push(JSON.stringify({ accession, generator }));
}
writeFileSync(outFile, records.join("\n") + (records.length > 0 ? "\n" : ""));
process.exit(0);
}
// ─── Main mode: orchestrate workers ───
const start = Date.now();
mkdirSync(OUTPUT_DIR, { recursive: true });
const htmlFiles = readdirSync(HTML_DIR)
.filter((f: string) => f.endsWith(".html"))
.sort();
const nproc = cpus().length;
const chunkSize = Math.ceil(htmlFiles.length / nproc);
process.stderr.write(
` Tagging generators for ${htmlFiles.length} HTML files with ${nproc} workers...\n\n`,
);
const tmpFiles: string[] = [];
const workers: ReturnType<typeof Bun.spawn>[] = [];
for (let i = 0; i < nproc; i++) {
const startIdx = i * chunkSize;
const endIdx = Math.min(startIdx + chunkSize, htmlFiles.length);
if (startIdx >= htmlFiles.length) break;
const tmpFile = `${OUTPUT_DIR}/.tmp-gen-${i}.jsonl`;
tmpFiles.push(tmpFile);
workers.push(
Bun.spawn(
[
"bun",
"run",
import.meta.filename,
"--worker",
String(startIdx),
String(endIdx),
tmpFile,
],
{ stderr: "inherit" },
),
);
}
for (const worker of workers) {
await worker.exited;
}
process.stderr.write(` Workers done, merging results...\n`);
// Merge and sort
type TagRecord = { accession: string; generator: string };
const allRecords: TagRecord[] = [];
for (const tmpFile of tmpFiles) {
const rl = createInterface({ input: createReadStream(tmpFile) });
for await (const line of rl) {
if (line.trim()) allRecords.push(JSON.parse(line));
}
}
allRecords.sort((a, b) => a.accession.localeCompare(b.accession));
// Write final output
writeFileSync(
OUTPUT_FILE,
allRecords.map((r) => JSON.stringify(r)).join("\n") + "\n",
);
// Cleanup
for (const tmpFile of tmpFiles) {
try {
unlinkSync(tmpFile);
} catch {}
}
// Print summary
const counts = new Map<string, number>();
for (const r of allRecords) {
counts.set(r.generator, (counts.get(r.generator) ?? 0) + 1);
}
const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
const elapsed = ((Date.now() - start) / 1000).toFixed(1);
const total = allRecords.length;
console.log(`\n${"=".repeat(70)}`);
console.log(`Generator Tags Summary (${total} files, ${elapsed}s)`);
console.log(`${"=".repeat(70)}`);
console.log(`${"Generator".padEnd(45)} ${"Count".padStart(7)} ${" %".padStart(7)}`);
console.log("-".repeat(70));
for (const [gen, count] of sorted) {
const pct = ((count / total) * 100).toFixed(1);
console.log(`${gen.padEnd(45)} ${String(count).padStart(7)} ${(pct + "%").padStart(7)}`);
}
console.log("-".repeat(70));
console.log(`${"TOTAL".padEnd(45)} ${String(total).padStart(7)} ${"100.0%".padStart(7)}`);
console.log(`\nOutput: ${OUTPUT_FILE}`);
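The normalization cascade at the top of this script is an ordered keyword match over the raw generator string. A condensed sketch of that idea (`normalizeGeneratorSketch` covers only a three-vendor subset of the real table):

```typescript
// Ordered keyword matching: earlier, more specific vendor checks win;
// unrecognized generator strings pass through trimmed but otherwise verbatim.
function normalizeGeneratorSketch(raw: string): string {
  const r = raw.trim().toLowerCase();
  if (r.includes("workiva") || r.includes("wdesk")) return "Workiva";
  if (r.includes("donnelley") || r.includes("dfin")) return "Donnelley Financial Solutions";
  if (r.includes("word") && r.includes("microsoft")) return "Microsoft Word";
  return raw.trim(); // unknown tools surface as-is in the summary table
}

console.log(normalizeGeneratorSketch("Wdesk 5.1"));         // → Workiva
console.log(normalizeGeneratorSketch("Microsoft Word 15")); // → Microsoft Word
console.log(normalizeGeneratorSketch("SomeTool 2.0"));      // → SomeTool 2.0
```

Passing unknown strings through unchanged is deliberate: the final frequency summary then reveals any new vendor strings worth adding as explicit rules.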


@@ -4,59 +4,13 @@
 */
 import { readdirSync, readFileSync, writeFileSync } from "node:fs";
 import { segmentParagraphs } from "./segment.ts";
-import type { FilingMeta, Paragraph } from "@sec-cybert/schemas/paragraph.ts";
+import { stripHtml } from "./html-cleaner.ts";
+import type { FilingMeta } from "@sec-cybert/schemas/paragraph.ts";
 const HTML_CACHE_DIR = "../data/raw/html";
 const OUTPUT_PATH = "../data/paragraphs/paragraphs.jsonl";
 const ACCESSION_META_PATH = "../data/bulk/accession-meta.json";
-// ─── Fast HTML→text (regex, no DOM) ───
-function stripHtml(html: string): string {
-return html
-.replace(/<script[\s\S]*?<\/script>/gi, "")
-.replace(/<style[\s\S]*?<\/style>/gi, "")
-.replace(/<noscript[\s\S]*?<\/noscript>/gi, "")
-// Collapse adjacent inline element boundaries to prevent word splitting
-.replace(/<\/(span|a|b|i|u|em|strong|font)>(\s*)<(?:span|a|b|i|u|em|strong|font)[^>]*>/gi, (_m, _tag, ws) => ws.length > 0 ? " " : "")
-.replace(/<\/ix:[a-z]+>(\s*)<ix:[a-z]+[^>]*>/gi, (_m, ws) => ws.length > 0 ? " " : "")
-.replace(/<\/(p|div|tr|li|h[1-6]|td|th)>/gi, "\n")
-.replace(/<(br|hr)\s*\/?>/gi, "\n")
-.replace(/<[^>]+>/g, " ")
-.replace(/&nbsp;|&#160;|&#xa0;/gi, " ")
-.replace(/&amp;/g, "&")
-.replace(/&lt;/g, "<")
-.replace(/&gt;/g, ">")
-.replace(/&quot;|&ldquo;|&rdquo;|&#8220;|&#8221;|&#147;|&#148;/g, '"')
-.replace(/&#39;|&apos;|&rsquo;|&lsquo;|&#8216;|&#8217;|&#146;/g, "'")
-.replace(/&mdash;|&#8212;|&#151;/g, "—")
-.replace(/&ndash;|&#8211;|&#150;/g, "–")
-.replace(/&bull;|&#8226;|&#149;/g, "•")
-.replace(/&minus;|&#8722;/g, "-")
-.replace(/&sect;|&#167;/g, "§")
-.replace(/&#153;/g, "™")
-.replace(/&#x([0-9a-fA-F]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
-.replace(/&#\d+;/g, " ")
-.replace(/&\w+;/g, " ")
-.replace(/[^\S\n]+/g, " ")
-.replace(/([a-z])\.([A-Z])/g, "$1. $2")
-.replace(/([a-z]),([A-Z])/g, "$1, $2")
-.replace(/([a-z]);([A-Z])/g, "$1; $2")
-.replace(/•([A-Za-z])/g, "• $1")
-.replace(/\b([a-z])\.([A-Z])/g, "$1. $2")
-// Greek question mark (U+037E) → semicolon
-.replace(/\u037e/g, ";")
-// Fix inline element joins that created camelCase with common English words
-.replace(/([a-z])(The|Our|We|This|These|That|Its|His|Her|In|As|For|And|Or|If|An|It|To|By|On|At|No|Of|All|Any|Has|Was|Is|Are|Not|May|Can|Will|Such|Also|But|Each|New|So|Up|With|From)\b/g, "$1 $2")
-// Fix colon-joins: word:Word → word: Word (exclude URLs)
-.replace(/([a-z]):([A-Z])/g, "$1: $2")
-// Fix ISO standard joins: ISO/IEC27001 → ISO/IEC 27001, ISO27001 → ISO 27001
-.replace(/\b(ISO(?:\/IEC)?)(\d)/g, "$1 $2")
-.replace(/(Standardization)(\d)/g, "$1 $2")
-// Fix PDF extraction artifact: space before punctuation ("Director ," → "Director,")
-.replace(/ ([,;:.!?)])/g, "$1");
-}
 // ─── Item 1C extraction (regex on stripped text) ───
 const ITEM_1C = /^\s*(\u2022\s*)?item\s*1c[\.\s\u00a0—:-]/i;


@@ -0,0 +1,50 @@
/**
* HTML plain text cleaning for SEC filings.
* Used by both paragraph extraction (fast-reparse) and DAPT corpus preparation.
*/
/** Strip HTML tags, decode entities, fix word-boundary artifacts from SEC EDGAR HTML. */
export function stripHtml(html: string): string {
return html
.replace(/<script[\s\S]*?<\/script>/gi, "")
.replace(/<style[\s\S]*?<\/style>/gi, "")
.replace(/<noscript[\s\S]*?<\/noscript>/gi, "")
// Collapse adjacent inline element boundaries to prevent word splitting
.replace(/<\/(span|a|b|i|u|em|strong|font)>(\s*)<(?:span|a|b|i|u|em|strong|font)[^>]*>/gi, (_m, _tag, ws) => ws.length > 0 ? " " : "")
.replace(/<\/ix:[a-z]+>(\s*)<ix:[a-z]+[^>]*>/gi, (_m, ws) => ws.length > 0 ? " " : "")
.replace(/<\/(p|div|tr|li|h[1-6]|td|th)>/gi, "\n")
.replace(/<(br|hr)\s*\/?>/gi, "\n")
.replace(/<[^>]+>/g, " ")
.replace(/&nbsp;|&#160;|&#xa0;/gi, " ")
.replace(/&amp;/g, "&")
.replace(/&lt;/g, "<")
.replace(/&gt;/g, ">")
.replace(/&quot;|&ldquo;|&rdquo;|&#8220;|&#8221;|&#147;|&#148;/g, '"')
.replace(/&#39;|&apos;|&rsquo;|&lsquo;|&#8216;|&#8217;|&#146;/g, "'")
.replace(/&mdash;|&#8212;|&#151;/g, "—")
.replace(/&ndash;|&#8211;|&#150;/g, "–")
.replace(/&bull;|&#8226;|&#149;/g, "•")
.replace(/&minus;|&#8722;/g, "-")
.replace(/&sect;|&#167;/g, "§")
.replace(/&#153;/g, "™")
.replace(/&#x([0-9a-fA-F]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
.replace(/&#\d+;/g, " ")
.replace(/&\w+;/g, " ")
.replace(/[^\S\n]+/g, " ")
.replace(/([a-z])\.([A-Z])/g, "$1. $2")
.replace(/([a-z]),([A-Z])/g, "$1, $2")
.replace(/([a-z]);([A-Z])/g, "$1; $2")
.replace(/•([A-Za-z])/g, "• $1")
.replace(/\b([a-z])\.([A-Z])/g, "$1. $2")
// Greek question mark (U+037E) → semicolon
.replace(/\u037e/g, ";")
// Fix inline element joins that created camelCase with common English words
.replace(/([a-z])(The|Our|We|This|These|That|Its|His|Her|In|As|For|And|Or|If|An|It|To|By|On|At|No|Of|All|Any|Has|Was|Is|Are|Not|May|Can|Will|Such|Also|But|Each|New|So|Up|With|From)\b/g, "$1 $2")
// Fix colon-joins: word:Word → word: Word (exclude URLs)
.replace(/([a-z]):([A-Z])/g, "$1: $2")
// Fix ISO standard joins: ISO/IEC27001 → ISO/IEC 27001, ISO27001 → ISO 27001
.replace(/\b(ISO(?:\/IEC)?)(\d)/g, "$1 $2")
.replace(/(Standardization)(\d)/g, "$1 $2")
// Fix PDF extraction artifact: space before punctuation ("Director ," → "Director,")
.replace(/ ([,;:.!?)])/g, "$1");
}
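The cleaner above handles two main fix classes: entity decoding and inline-element joins that glue words together. A trimmed sketch showing both in isolation (`stripHtmlMini` keeps only five of the rules; the real function applies many more):

```typescript
// Five-rule subset of the SEC HTML cleaner: collapse inline-tag boundaries,
// strip tags, decode one entity, normalize spaces, re-split glued sentences.
function stripHtmlMini(html: string): string {
  return html
    // inline-element boundary: keep a space only if whitespace separated them
    .replace(/<\/(span|b)>(\s*)<(?:span|b)[^>]*>/gi, (_m, _tag, ws) => (ws.length > 0 ? " " : ""))
    // drop remaining tags
    .replace(/<[^>]+>/g, " ")
    // decode the one entity this sketch handles
    .replace(/&amp;/g, "&")
    // collapse runs of spaces
    .replace(/[^\S\n]+/g, " ")
    // re-split sentences glued by tag removal: "plan.The" → "plan. The"
    .replace(/([a-z])\.([A-Z])/g, "$1. $2")
    .trim();
}

const sample =
  '<span>Risk</span> <span>&amp;</span> <span>Compliance.</span><span>The Board oversees cyber risk.</span>';
console.log(stripHtmlMini(sample));
// → "Risk & Compliance. The Board oversees cyber risk."
```

The whitespace-sensitive boundary rule is the subtle part: `</span> <span>` becomes one space, but `</span><span>` mid-word becomes nothing, which is what prevents both joined and split words.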


@@ -177,12 +177,34 @@ export function segmentParagraphs(
 }
 }
+// Buffer for orphan first-words: SEC HTML sometimes splits the first word of a
+// sentence onto its own line within a <span> tag. These single-word blocks are
+// below MIN_WORDS and would be dropped. Instead, buffer them and prepend to the
+// next block so the sentence stays intact.
+let orphanBuffer = "";
 for (const block of blocks) {
-const stripped = block.replace(LEADING_PUNCT, "");
+let stripped = block.replace(LEADING_PUNCT, "");
 if (stripped.length === 0) continue;
+// Prepend any buffered orphan word, but only if this block starts lowercase
+// (confirming it's a sentence continuation, not a new heading)
+if (orphanBuffer) {
+if (STARTS_LOWERCASE.test(stripped)) {
+stripped = orphanBuffer + " " + stripped;
+}
+// Either way, clear the buffer — don't carry it across multiple blocks
+orphanBuffer = "";
+}
 const wc = wordCount(stripped);
+// Single-word orphan: buffer for prepending to the next block
+if (wc === 1 && /^[A-Za-z]/.test(stripped) && !TERMINAL_PUNCT.test(stripped)) {
+orphanBuffer = stripped;
+continue;
+}
 // Short blocks: append to previous paragraph instead of dropping,
 // but only if it completes a sentence or previous was already broken
 if (wc < MIN_WORDS && paragraphs.length > 0) {
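The orphan-buffer pass in this hunk can be exercised standalone. A condensed re-implementation (`mergeOrphans` is hypothetical; the real code's MIN_WORDS merging and named regex constants are simplified away):

```typescript
// Hold single-word blocks (SEC HTML sometimes isolates a sentence's first
// word in its own <span>) and prepend them to the next block when that
// block starts lowercase, i.e. reads as a sentence continuation.
function mergeOrphans(blocks: string[]): string[] {
  const out: string[] = [];
  let orphan = "";
  for (let block of blocks) {
    if (orphan) {
      if (/^[a-z]/.test(block)) block = orphan + " " + block;
      orphan = ""; // never carry an orphan across more than one block
    }
    const isSingleWord = block.split(/\s+/).length === 1;
    if (isSingleWord && /^[A-Za-z]/.test(block) && !/[.!?]$/.test(block)) {
      orphan = block;
      continue;
    }
    out.push(block);
  }
  return out;
}

console.log(mergeOrphans(["The", "company maintains an incident response plan."]));
// → ["The company maintains an incident response plan."]
```

Note the asymmetry: a capitalized next block (a likely heading) discards the orphan rather than gluing it on, trading a lost word for never corrupting a heading.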


@@ -117,8 +117,9 @@ const CATEGORY_GUIDANCE: Record<string, string> = {
 The paragraph must be PRIMARILY about managing vendor/supplier cyber risk to qualify as Third-Party Risk.`,
 "None/Other|Strategy Integration": `NONE/OTHER vs STRATEGY INTEGRATION — ask: is there substantive cybersecurity disclosure?
-None/Other = NO substantive disclosure at all: section headers, disclaimers, generic IT-dependence language ("our IT systems are important to operations"), forward-looking boilerplate.
+None/Other = NO substantive disclosure at all: section headers, disclaimers, generic IT-dependence language ("our IT systems are important to operations"), forward-looking boilerplate, generic regulatory compliance language ("subject to various regulatory requirements... non-compliance could result in penalties").
 Strategy Integration = actual discussion of business/financial impact, cyber insurance, budget allocation, or materiality assessment.
+Generic regulatory risk language (acknowledging regulations exist, non-compliance would be bad) is None/Other → it makes no materiality assessment and describes no strategy. It only becomes Strategy Integration if it explicitly assesses whether regulatory risks have "materially affected" the business.
 If the paragraph only establishes that the company has IT systems and data without describing any program, process, or strategy → None/Other.`,
 "Board Governance|Management Role": `BOARD GOVERNANCE vs MANAGEMENT ROLE — ask: who is the grammatical subject?
@@ -133,7 +134,8 @@ const CATEGORY_GUIDANCE: Record<string, string> = {
 "None/Other|Risk Management Process": `NONE/OTHER vs RISK MANAGEMENT PROCESS — ask: does the paragraph describe actual cybersecurity activities?
 Describing actual processes (monitoring, assessment, vulnerability management, training programs) → RMP.
-Only stating the company has IT systems, collects data, or faces cyber risks without describing what it DOES about them → None/Other.`,
+Only stating the company has IT systems, collects data, or faces cyber risks without describing what it DOES about them → None/Other.
+Generic regulatory compliance language ("subject to various regulations... non-compliance could result in penalties") is None/Other → it describes no actual compliance activities. If a specific regulation is named (GDPR, HIPAA, PCI DSS) but no company-specific program is described → RMP at Specificity 2 (named standard).`,
 "Risk Management Process|Strategy Integration": `RISK MANAGEMENT PROCESS vs STRATEGY INTEGRATION — ask: operational or strategic?
 Describing HOW risks are assessed, monitored, mitigated → Risk Management Process.