# Data Quality Audit — SEC-cyBERT Corpus

**Date:** 2026-03-29
**Scope:** Full audit of DAPT corpus (14,756 docs) and paragraph data (72,045 paragraphs)
**Method:** 6 automated agents + manual investigation

---

## 1. Executive Summary

The data is in better shape than initially feared, but two significant issues were uncovered:

1. **Inlined section headings affect ~22% of paragraphs** across all generators. These are section titles ("Risk Management and Strategy", "Board Oversight") prepended to paragraph body text with no separator. The rate is consistent across generators, which points to our extraction pipeline's heading detection rather than a generator-specific HTML quirk.

2. **EFiling/EDGAR Agent (GoFiler/Novaworks XDX)** produces severely degraded extraction quality: a 36.8% orphan-word rate (roughly 8x the corpus average), a 5.9% fragment rate, and the lowest paragraphs-per-filing. This generator was hidden in a 45% "UNKNOWN" bucket until we identified it. It affects 1,014 filings and 5,779 paragraphs.

**Decision:** Strip inlined headers from the fine-tuning data. Expand orphan-word patching to cover EFiling/XDX paragraphs. Tag all paragraphs with generator metadata for quality-aware training.

---

## 2. Generator Landscape

### Identification

We identified **14 distinct filing generators** covering 99.99% of all 14,759 HTML files. Only 2 files remain unidentified (both 0-byte empty files). Detection used a combination of HTML meta tags, comments, namespace declarations, CSS class patterns, and CIK-based filing agent identification.

Full reference: `docs/EDGAR-FILING-GENERATORS.md`

### Generator Distribution

| Generator | Files | % | Paragraphs | Quality Tier |
|-----------|-------|---|------------|--------------|
| Workiva | 3,592 | 24.3% | 22,407 | Clean |
| Inline XBRL (unattributed) | 2,417 | 16.4% | 15,233 | Clean |
| Donnelley Financial Solutions | 2,327 | 15.8% | 13,153 | Clean |
| EFiling/EDGAR Agent (XDX) | 1,997 | 13.5% | 5,779 | **Bad** |
| Toppan Merrill | 1,378 | 9.3% | 7,332 | OK |
| CompSci Transform | 879 | 6.0% | 3,287 | **Degraded** |
| SEC Publisher | 793 | 5.4% | — | — |
| ThunderDome | 732 | 5.0% | 3,581 | OK |
| Broadridge PROfile | 465 | 3.2% | 772 | OK |
| Certent | 86 | 0.6% | — | — |
| SGML-wrapped | 58 | 0.4% | — | — |
| IRIS Carbon | 20 | 0.1% | — | — |
| RDG Portal | 12 | 0.1% | — | — |
| PDF to EDGAR | 1 | <0.1% | — | — |

Note: Not all HTML files produced paragraphs (some lack Item 1C, some are 8-Ks or amendments).

### Quality Metrics by Generator

| Generator | Orphan% | Fragment% | Trunc% | InlHdr% | AvgWC | Paras/Filing |
|-----------|---------|-----------|--------|---------|-------|--------------|
| Workiva | 0.6% | 1.2% | 0.5% | 21.9% | 99.7 | 8.4 |
| Donnelley | 0.5% | 1.4% | 0.5% | 21.8% | 92.7 | 7.9 |
| Inline XBRL | 0.9% | 1.5% | 0.6% | 21.8% | 98.4 | 8.1 |
| Toppan Merrill | 3.2% | 3.0% | 1.4% | 23.1% | 84.7 | 8.1 |
| ThunderDome | 3.0% | 4.3% | 1.8% | 24.4% | 83.0 | 7.7 |
| Broadridge | 3.4% | 3.5% | 2.1% | 21.5% | 84.4 | 7.8 |
| **CompSci Transform** | **14.8%** | **5.8%** | 1.7% | 15.4% | 72.1 | 5.6 |
| **EFiling/XDX** | **36.8%** | **5.9%** | **2.1%** | 16.5% | 69.8 | 5.7 |
| *Corpus average* | *4.7%* | *2.3%* | *0.9%* | *21.5%* | *91.9* | *7.7* |

**Bold** = >2x corpus average.

Key observations:
- Inlined headers (~22%) are consistent across ALL generators → extraction pipeline issue, not generator-specific
- Orphan words are highly concentrated: EFiling/XDX (36.8%) and CompSci Transform (14.8%) account for the vast majority
- Workiva, Donnelley, and unattributed Inline XBRL produce the cleanest output (together >70% of paragraphs)
- EFiling/XDX also has the lowest paragraphs-per-filing (5.7 vs 7.7 avg), suggesting extraction misses content
- CompSci Transform was acquired by Broadridge in July 2024; newer filings may appear as Broadridge PROfile

---

## 3. Issue Inventory

### 3.1 Inlined Section Headings (~22% of paragraphs)

**What:** Section headings like "Risk Management and Strategy", "Board Oversight", "Cybersecurity Governance" are prepended to paragraph body text with no separator.

**Example:**
```
Risk Management and Strategy We have designed our cybersecurity risk management program to identify,
assess, and manage risks from cybersecurity threats...
```

**Cause:** The `extractItem1C()` function in `fast-reparse.ts` extracts the full Item 1C text including sub-section headings, and the paragraph segmenter doesn't strip them. The headings become the first "sentence" of the paragraph.

**Impact on classification:**
- The heading is a near-perfect predictor of `content_category` — creates shortcut learning risk
- The heading tells you nothing about `specificity_level` — the model still has to read the body text
- At inference time, heading presence will be inconsistent across filings
- **Decision: Strip from fine-tuning data.** Headings are consistent across generators, so a single detection heuristic works.

**Detection heuristic:**
- Common Item 1C sub-headings: "Risk Management and Strategy", "Risk Management", "Board Oversight", "Governance", "Management('s) Role", "Cybersecurity Governance", "Incident Detection", "Incident Response", "Strategy", "Third Party", "Third-Party"
- Structural: 2-5 title-cased words at paragraph start, followed by sentence text starting with "We", "Our", "The", a pronoun, or an article
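The structural heuristic above can be sketched as a regex. A rough illustration (pattern and function names are ours, not the pipeline's actual code; the starter list is a subset):

```python
import re

# Known sentence-starting words used to validate the split (subset shown).
SENTENCE_STARTERS = r"We|Our|The|This|A|An|It|As|In"

# 2-5 capitalized words (allowing "and"/"of"/"for" connectors), then a
# sentence-starting word. Illustrative only; tune against real headings.
HEADING_RE = re.compile(
    rf"^([A-Z][A-Za-z'()-]*(?:\s+(?:and|of|for|[A-Z][A-Za-z'()-]*)){{1,4}})"
    rf"\s+({SENTENCE_STARTERS})\b"
)

def split_inlined_heading(text):
    """Return (heading, body) when the paragraph starts with an inlined
    heading, else (None, text)."""
    m = HEADING_RE.match(text)
    if m:
        return m.group(1), text[m.start(2):]
    return None, text
```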

### 3.2 Orphan Words (4.7% overall, concentrated in 2 generators)

**What:** The first word of a paragraph is dropped during extraction, leaving a paragraph that starts mid-sentence with a lowercase word.

**Example:**
```
sole executive officer and director is responsible for assessing and managing cybersecurity risks...
```
(should be: "Our sole executive officer...")

**Cause:** HTML source wraps text at a fixed column width. The `<span>` opening tag consumes most of a line, so only the first word fits before a source newline. `stripHtml()` preserves that newline, and downstream processing drops the single-word fragment.
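A quick way to surface candidates for this issue is to flag paragraphs that start with a lowercase word but are not list items. A minimal sketch (the list-starter set is an assumption, not the audit's actual filter):

```python
# Prefixes that legitimately start lowercase or with a marker; assumed set.
LIST_STARTERS = ("•", "-", "–", "e.g.", "i.e.")

def is_orphan_candidate(text):
    """True when the paragraph starts lowercase and is not a list item,
    i.e., its leading word was likely dropped during extraction."""
    t = text.lstrip()
    if not t or t.startswith(LIST_STARTERS):
        return False
    return t[0].islower()
```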

**Scope by generator:**
- EFiling/XDX: 36.8% of its paragraphs (2,127 affected)
- CompSci Transform: 14.8% (487 affected)
- All others: <3.5%
- Total: ~3,400 paragraphs corpus-wide

**Already patched:** 215 paragraphs were surgically patched in `paragraphs-clean.patched.jsonl`. The remaining ~3,185 need the same treatment.

**Impact on classification:** Meaning is preserved — annotators and models can infer the missing word from context. But systematically missing subjects ("We", "Our") could subtly bias specificity assessment.
### 3.3 Orphaned Fragments (2.3% overall)

**What:** List items split from their parent paragraph, creating very short standalone paragraphs.

**Example:**
```
the use of external service providers, where appropriate, to assess, test or otherwise assist with
aspects of our security controls;
```

**Cause:** Semicolon-terminated list items are treated as paragraph boundaries by the segmenter.

**Scope:** 250 paragraphs identified in the narrower audit; ~1,660 total with <25 words.
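Fragments of this kind can be flagged cheaply by length plus the semicolon/lowercase tells described above. A rough sketch (the 25-word threshold comes from the scope note; the rest is illustrative):

```python
def is_fragment(text, max_words=25):
    """Flag very short paragraphs that end with a semicolon or start
    lowercase, the signature of a split-off list item. Heuristic sketch."""
    t = text.strip()
    if len(t.split()) >= max_words:
        return False
    return t.endswith(";") or t[:1].islower()
```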

**Impact:** These are classifiable in isolation (the content is clear) but lack the framing context of the parent list. They are likely annotated correctly but may draw lower model confidence.

### 3.4 Truncated Paragraphs (0.37%)

**What:** Paragraphs ending mid-sentence without terminal punctuation.

**Two patterns:**
1. Paragraph absorbed the start of the next section's heading (ends with "Governance", "Identify")
2. True truncation — a cross-reference sentence cut off mid-phrase (e.g., a paragraph ending with `"Risk Factors" in this`)

**Scope:** 264 paragraphs.

**Impact:** Low — 0.37% of paragraphs, and meaning is usually recoverable from context.

### 3.5 Cross-Filing Boilerplate (53.6%)

**What:** Paragraphs with identical text appearing in multiple filings, driven by law firms and compliance consultants providing template language.

**Scope:** 38,601 paragraphs share text with at least one other filing. 1,705 unique boilerplate texts appear in 3+ filings. The most-duplicated text appears in 138 filings across 84 companies.

**Impact:** This IS the construct being measured. Boilerplate paragraphs should be classified as Specificity Level 1 (Generic Boilerplate). Not a quality issue — it's the signal.

---

## 4. DAPT Corpus Audit

### 4.1 Corpus Stats

- **14,756 documents**, 15 shards
- **~1.06 billion tokens** (ModernBERT tokenizer; chars/4.72, not chars/4.0)
- **Median doc length:** 347K chars (~73K tokens)
- **90.8% of docs exceed 8,192 tokens** — chunking is mandatory (handled by the training pipeline)

### 4.2 Issues Found

| Issue | Scope | Verdict |
|-------|-------|---------|
| 188 docs < 10K chars (cover pages) | 0.04% of tokens | Filter out |
| XBRL preambles (8% of docs) | 0.18% of chars | Negligible |
| Financial table fragments (~25% of lines) | Widespread | Acceptable — SEC domain includes numbers |
| URLs in 80% of docs (~4 per doc) | Low | Optional cleanup |
| 64 8-K filings mixed in | Tiny | Keep — domain-relevant |
| 1,470 amendments (median 94K chars) | Substantial content | Keep |
| 2 single-block docs (no paragraph breaks) | 2 docs | Filter out |
| 242 near-duplicate cross-year filings | 1.6% | Keep — different content |
| 0 garbled text, 0 HTML artifacts | — | Clean |
| 0 sentence boundary violations | — | Clean |

### 4.3 Decision

Filter out the <10K-char docs and the 2 structureless docs. Everything else is acceptable for unsupervised MLM. The model will learn SEC language including financial notation, legal boilerplate, and cybersecurity terminology.

---

## 5. Patch History

### Patch 1: Orphan Word Fix (2026-03-29)

- **Scope:** 215 paragraphs, 77 filings
- **Method:** Detect the orphan word in the raw HTML and prepend it to the paragraph text
- **Validation:** All changes are prefix additions; 0 boundary changes, 0 text shrinkages
- **Files:** `paragraphs-clean.patched.jsonl`, `training.patched.jsonl`
- **Annotation impact:** 142 annotated paragraphs affected (0.28%), meaning preserved

### Patch 2: Expanded Orphan Word Fix (2026-03-29)

- **Scope:** 2,233 paragraphs (includes Patch 1's 215; net 2,026 new)
- **Method:** HTML lookback — find the paragraph text in the stripped HTML and extract the immediately preceding word
- **Top orphan words:** We (632), Our (403), As (152), The (91), To (84), In (78), Cybersecurity (64)
- **Validation:** 0 false positives after filtering "Table of Contents" artifacts; 1,122 candidates rejected (legitimate list items starting with lowercase)
- **Annotation impact:** 1,400 annotated paragraphs affected. Label bias detected: Strategy Integration 1.55x over-represented, Management Role 0.49x under-represented in orphan-word paragraphs. **Recommended: re-run Stage 1 on the patched text (~$15-20; may resolve conflicts).**
- **Script:** `ts/scripts/patch-orphan-words.ts`
- **Patch file:** `data/paragraphs/patches/orphan-word-patches.jsonl`
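The HTML lookback can be sketched roughly as follows (a Python illustration, not the actual `patch-orphan-words.ts` implementation; anchoring on the first 80 characters is an assumption):

```python
import re

def recover_orphan_word(stripped_html, paragraph):
    """Locate the paragraph's start in the stripped HTML and return the
    single word immediately preceding it, if any. Illustrative sketch."""
    idx = stripped_html.find(paragraph[:80])  # anchor on the paragraph start
    if idx <= 0:
        return None
    before = stripped_html[:idx].rstrip()
    m = re.search(r"(\S+)$", before)
    word = m.group(1) if m else None
    # Accept only capitalized words, mirroring the reported sentence
    # subjects (We, Our, As, The, ...).
    if word and word[0].isupper():
        return word
    return None
```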

### Patch 3: Heading Stripping (2026-03-29)

- **Scope:** 7,514 paragraphs (10.4%)
- **Method:** Explicit pattern matching against known Item 1C sub-section headings (71 unique headings), validated by confirming the body text starts with a sentence-starting word
- **Top headings stripped:** Risk Management and Strategy (2,453), Cybersecurity Risk Management and Strategy (1,281), Cybersecurity Governance (1,208), Governance (301), Third-Party Risk Management (224)
- **Annotation impact:** 5,013 annotated paragraphs. Heading removal eliminates the shortcut learning risk (the heading was a near-perfect predictor of content_category).
- **Script:** Inline Python (see audit process notes)
- **Patch file:** `data/paragraphs/patches/heading-strip-patches.jsonl`

### Patch 4: Colon-Headed Paragraphs (2026-03-29)

- **Scope:** 370 paragraphs
- **Method:** Regex match for "Heading Text: Sentence..." patterns; only fires when the colon is followed by a known sentence-starting word
- **Top headings stripped:** Education and Awareness (97), Safeguards (18), Management (15), Approach (13), Training (11)
- **Annotation impact:** 227 annotated paragraphs
- **Patch file:** `data/paragraphs/patches/colon-heading-patches.jsonl`
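The colon-heading rule can be sketched as a single regex: a short title-cased run, a colon, then a known sentence-starting word. The pattern below is illustrative, not the patch script's actual regex:

```python
import re

# A capitalized run (no sentence punctuation), a colon, then a lookahead
# for a sentence-starting word. Starter list is a subset; sketch only.
COLON_HEADING_RE = re.compile(
    r"^([A-Z][^:.!?]{0,60}):\s+(?=(?:We|Our|The|This|A|An|It|As|In)\b)"
)

def strip_colon_heading(text):
    """Remove a leading 'Heading: ' when a sentence starter follows."""
    return COLON_HEADING_RE.sub("", text, count=1)
```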

### Patch 5: Extended Separator Headings (2026-03-29)

- **Scope:** 184 paragraphs
- **Method:** Detect headings with period, dash/em-dash, semicolon, or ALL-CAPS separators that Patches 3-4 missed
- **Annotation impact:** 133 annotated paragraphs
- **Patch file:** `data/paragraphs/patches/heading-strip-v2-patches.jsonl`

### Patch 6: HTML-Confirmed Headings (2026-03-29)

- **Scope:** 343 paragraphs
- **Method:** Extract bold/underline/h-tag styled text from the source HTML (cached in `filing-headings.jsonl`), match it against paragraph starts, and validate with the sentence-start check. Zero false positives — if the HTML says it's bold, it's a heading.
- **855 ambiguous cases rejected** where the styled text was a sentence subject (e.g., bold "Cybersecurity" starting "Cybersecurity is a critical component...")
- **Annotation impact:** 270 annotated paragraphs
- **Script:** `ts/scripts/extract-html-headings.ts` (1.7s for 6,341 filings with 32 workers)
- **Patch file:** `data/paragraphs/patches/heading-strip-html-patches.jsonl`
- **Cache:** `data/paragraphs/quality/filing-headings.jsonl`

### Cumulative Heading Strip Summary

| Pass | Method | Count | Cumulative |
|------|--------|-------|------------|
| Patch 3 | Explicit heading patterns (space separator) | 7,514 | 7,514 |
| Patch 4 | Colon separator | 370 | 7,884 |
| Patch 5 | Period/dash/caps/semicolon | 184 | 8,068 |
| Patch 6 | HTML bold/underline confirmed | 343 | 8,411 |
| **Total** | | **8,411** | **11.7% of corpus** |

---

## 6. Data Integrity Rules

1. **`paragraphs-clean.jsonl` is FROZEN.** Never modify it. It is the original extraction output and the source of truth for reproducibility.

2. **All fixes go through `.patched.jsonl` files.** The patched file has the same schema and IDs as the original; text may differ, and the text hash is updated.

3. **Annotations link by paragraph `id` (UUID).** This linkage is stable across patches — IDs never change.

4. **Never re-run extraction from HTML.** Cascade effects from merge-logic changes cause thousands of ripple-effect text changes (documented in `docs/SEC-HTML-CLEANING.md`). Surgical JSONL patching is the only safe approach.

5. **Every patch is documented** with scope, method, validation, and annotation impact.

6. **Quality metadata is separate from text data.** Per-paragraph quality scores live in a separate file, not embedded in the paragraph data. This keeps the data schema stable.
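Under these rules, applying a patch file is a pure id-keyed overlay on the frozen JSONL. A minimal sketch, assuming record shapes of `{id, text, textHash, ...}` for paragraphs and `{id, new_text}` for patch records (field names are assumptions, not the project's actual schema):

```python
import hashlib
import json

def apply_patches(original_path, patch_path, out_path):
    """Overlay patch records onto the frozen JSONL by paragraph id,
    recomputing the text hash for patched rows. IDs are never changed."""
    patches = {}
    with open(patch_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            patches[rec["id"]] = rec["new_text"]

    with open(original_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            para = json.loads(line)
            if para["id"] in patches:
                para["text"] = patches[para["id"]]
                # keep the hash in sync with the patched text
                para["textHash"] = hashlib.sha256(
                    para["text"].encode("utf-8")).hexdigest()
            fout.write(json.dumps(para) + "\n")
```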

---

## 7. Quality Tier System

Each paragraph gets a quality tier based on detected issues:

| Tier | Criteria | Count | % | Training Action |
|------|----------|-------|---|-----------------|
| **clean** | No detected issues | 58,165 | 80.7% | Full weight (1.0) |
| **headed** | Had inlined section heading (now stripped) | 7,402 | 10.3% | Full weight (1.0) — heading removed |
| **degraded** | Embedded bullets (1,941), invisible merges (222), fragments, truncations, no-cyber | 4,331 | 6.0% | Downweight (0.5) — content preserved but structure degraded |
| **minor** | Had orphan word (now fixed) | 2,147 | 3.0% | Full weight (1.0) — word restored |

Note: Tiers reflect the most severe issue. A paragraph can have multiple issues. All "headed" and "minor" paragraphs have been patched — the tier records what WAS wrong, not what IS wrong.
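Most-severe-wins tier assignment can be sketched from per-paragraph issue flags. Flag names below are illustrative; the actual quality-scores schema may differ:

```python
# Severity order: degraded > headed > minor > clean. Flag names assumed.
DEGRADED_FLAGS = {"embedded_bullets", "invisible_merge", "fragment",
                  "truncated", "no_cyber"}

def assign_tier(issues):
    """Map a set of detected issue flags to a single quality tier,
    keeping only the most severe issue."""
    if issues & DEGRADED_FLAGS:
        return "degraded"
    if "inlined_heading" in issues:
        return "headed"
    if "orphan_word" in issues:
        return "minor"
    return "clean"
```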

### Sample Weighting Strategy

During fine-tuning, each training sample is weighted by quality tier to reduce the influence of structurally degraded paragraphs without discarding them entirely:

- **clean + headed + minor (1.0 weight):** Content is correct and the text is clean (after patching). These form the reliable training signal.
- **degraded (0.5 weight):** Content is present, but structural issues (concatenated list items, fragments, truncations) may cause the text to misrepresent paragraph-level semantics. The labels are likely correct (models can infer meaning despite structural noise), but the text doesn't match what the model will see at inference time on clean filings. Downweighting reduces overfitting to degraded patterns without losing the content signal.

Sample weighting is applied via a custom loss that multiplies the per-sample cross-entropy by the tier weight (e.g., by overriding `compute_loss` in the HuggingFace `Trainer`).
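A minimal NumPy sketch of that weighting, assuming a tier label per sample (an illustrative stand-in for the actual training loss, which would operate on framework tensors):

```python
import numpy as np

TIER_WEIGHTS = {"clean": 1.0, "headed": 1.0, "minor": 1.0, "degraded": 0.5}

def weighted_cross_entropy(logits, labels, tiers):
    """Per-sample cross-entropy scaled by the quality-tier weight, then
    averaged. logits: (N, C) array; labels: (N,) int array."""
    # numerically stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    weights = np.array([TIER_WEIGHTS[t] for t in tiers])
    return float((per_sample * weights).mean())
```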

### Additional Findings (from anomaly detection)

| Finding | Count | Concern |
|---------|-------|---------|
| Embedded bullet points mid-text | 1,941 (flagged degraded) | MEDIUM — semicolon-separated list items without bullet markers |
| Invisible merges (no separators) | 222 (flagged degraded) | MEDIUM — list items concatenated with no trace of structure (e.g., Bancorp 34) |
| No cybersecurity keywords at all | 528 (348 annotated) | LOW — investigated; keyword filter was too narrow, labels correct |
| Cross-references to other SEC items | 5,750 | LOW — mostly legitimate "see Item 1A" refs |
| Dollar amounts in text | 46 | LOW — mostly legitimate incident costs |
| Paragraphs >400 words | 149 | LOW — possible failed splits |
| Repeated sentences within paragraph | 9 | LOW — copy-paste artifacts |

---

## 8. Annotation Impact (Quantified)

Of 49,795 annotated paragraphs:

### Annotated set by generator

| Generator | Annotated Paras | % of Annotated Set |
|-----------|-----------------|--------------------|
| Workiva | ~15,300 | 30.7% |
| Inline XBRL | ~10,500 | 21.1% |
| Donnelley | ~9,000 | 18.1% |
| Toppan Merrill | ~5,900 | 11.8% |
| EFiling/XDX | 3,562 | 7.2% |
| ThunderDome | ~2,500 | 5.0% |
| CompSci Transform | 2,288 | 4.6% |
| Others | ~700 | 1.4% |

### Orphan words in annotated set

**2,178 annotated paragraphs (4.37%)** start with lowercase (non-list) — orphan word candidates.

| Generator | Orphan Paras | % of Generator's Annotated | % of All Orphans |
|-----------|--------------|----------------------------|------------------|
| EFiling/XDX | 1,389 | 39.0% | 63.8% |
| CompSci Transform | 401 | 17.5% | 18.4% |
| All others | 388 | <5% each | 17.8% |

EFiling/XDX alone accounts for 63.8% of all orphan-word paragraphs in the annotated set.

### Label bias in orphan-word paragraphs

- **Strategy Integration** is over-represented at 1.55x its base rate (16.1% of orphan paras vs 10.4% overall)
- **Board Governance** and **Management Role** are under-represented (0.60x and 0.49x) — likely because governance headings and lead-in sentences get split off, leaving the orphan fragment without governance context

This suggests orphan words may cause subtle category misclassification, not just missing text.

### Inlined headers in annotated set

**4,513 annotated paragraphs (9.06%)** have section headings merged into the text. The rate is relatively uniform across generators (~9-10%) but notably lower for EFiling/XDX (5.3%) and CompSci Transform (5.6%) — these generators split at headers rather than merging them.

### Combined impact

**6,691 annotated paragraphs (13.44%)** have either orphan-word OR inlined-header issues.

Per generator:
- EFiling/XDX: 1,577 of 3,562 (44.3%) affected
- CompSci Transform: ~600 of 2,288 (~26%) affected
- All others: <15% affected

---

## 9. Summary of Changes to Annotated Data

| Change | Annotated Paragraphs Affected | Semantic Impact |
|--------|-------------------------------|-----------------|
| Orphan word restored | 1,400 | Label bias detected (Strategy 1.55x, Management 0.49x) |
| Heading stripped (all passes) | ~5,643 | Removes shortcut learning signal |
| No-cyber flagged as degraded | 348 | May warrant exclusion from training |
| **Total modified** | **~7,100 of 49,795 (14.3%)** | |

## 10. Remaining Questions / Next Steps

- **Re-run Stage 1 on orphan-word paragraphs** (~$15-20 for 1,400 paragraphs). Label bias suggests some misclassification; re-running may resolve conflicts and save Stage 2 judge costs.
- **Heading-stripped paragraphs:** Existing labels are likely still valid — annotators classified the body text, not the heading. Could re-run if budget allows.
- **Exclude the 348 no-cyber-keyword annotated paragraphs?** If labeled "None/Other" they're fine; if labeled with other categories, they're noise from section bleed.
- **855 ambiguous HTML heading cases** — bold/underline text at paragraph start that is also a valid sentence subject. Would need manual review to resolve.
- **Run DAPT** — filter the <10K-char docs from the DAPT corpus, then start training.

---

## 11. Artifacts Produced

### Data Files

```
data/paragraphs/
├── paragraphs-clean.jsonl          ← FROZEN original (72,045 paragraphs)
├── paragraphs-clean.patched.jsonl  ← All 6 patches applied (orphan + heading)
├── training.patched.jsonl          ← Training subset, all patches applied (49,795)
├── patches/
│   ├── orphan-word-patches.jsonl        ← 2,233 orphan word recovery records
│   ├── heading-strip-patches.jsonl      ← 7,514 heading strip records (space sep)
│   ├── colon-heading-patches.jsonl      ← 370 colon-heading strip records
│   ├── heading-strip-v2-patches.jsonl   ← 184 period/dash/caps/semicolon headings
│   └── heading-strip-html-patches.jsonl ← 343 HTML bold/underline confirmed headings
└── quality/
    ├── generator-tags.jsonl   ← 14,759 accession → generator mappings
    ├── quality-scores.jsonl   ← 72,045 per-paragraph quality metadata
    ├── filing-headings.jsonl  ← Cached styled headings from HTML (3,459 filings)
    └── ambiguous-filings.txt  ← Filing list used for HTML heading extraction
```

### Scripts

| Script | Purpose |
|--------|---------|
| `ts/scripts/patch-orphan-words.ts` | Detect and recover orphan words from HTML source |
| `ts/scripts/tag-generators.ts` | Identify filing generator from HTML signatures |
| `ts/scripts/extract-html-headings.ts` | Extract bold/underline headings from HTML (32-worker parallel, 1.7s) |
| `ts/scripts/dapt-corpus-prep.ts` | DAPT corpus preparation (HTML → clean JSONL, 32-worker parallel) |
| `scripts/detect_generators.py` | Python generator detection (initial analysis) |
| `scripts/generator_quality_analysis.py` | Generator × quality metrics cross-reference |
| `scripts/analyze_generator_quality.py` | Annotation impact analysis by generator |
| `scripts/find_heading_candidates.py` | Creative heading pattern hunt (7 approaches) |
| `scripts/data_quality_audit.py` | Statistical anomaly detection (content, structure, outliers) |
| `scripts/audit_corpus.py` | Text corruption checks |
| `scripts/audit_paragraphs.py` | Boundary audit (per-filing stats, coherence, duplicates) |

### Documentation

| Doc | Content |
|-----|---------|
| `docs/DATA-QUALITY-AUDIT.md` | This document — full audit findings, patch history, quality tiers |
| `docs/EDGAR-FILING-GENERATORS.md` | Generator reference — 14 vendors, signatures, market share, quality issues |
| `docs/SEC-HTML-CLEANING.md` | HTML cleaning lessons and pitfalls |