SEC-cyBERT/docs/data-pipeline/SEC-HTML-CLEANING.md
2026-04-05 21:00:40 -04:00

SEC Filing HTML Cleaning — Lessons & Pitfalls

Everything we've learned about cleaning SEC EDGAR HTML for text extraction, specifically for Item 1C (Cybersecurity) from 10-K filings. These lessons likely apply to any SEC filing text extraction pipeline.

The HTML landscape

SEC filings come from thousands of different filers using dozens of different tools (Workiva, Toppan Merrill, Donnelley Financial, various legal/accounting software). There is no standard HTML structure. The same semantic content, a paragraph of body text, can appear as:

  • <p><span style="...">Text here</span></p>
  • <div><font face="..." size="...">Text here</font></div>
  • Nested XBRL inline tags: <ix:nonNumeric><p><span>Text</span></p></ix:nonNumeric>
  • Table-based layouts: <table><tr><td><span>Text</span></td></tr></table>
  • Deeply nested <div> structures with inline styles

The only constant: it will be ugly.

Inline element newlines (the orphan word problem)

The bug: Many filing generators produce HTML where the first word of a paragraph is on its own line within a <span> tag:

<p><span style="font-family: Times New Roman; font-size: 10pt">Our
sole executive officer and director is responsible for assessing and
managing cybersecurity risks...</span></p>

When this is stripped to plain text, "Our" ends up on its own line. If downstream processing splits on newlines and filters short lines (< 20 words), "Our" is silently dropped. The paragraph becomes "sole executive officer and director is responsible...", missing its subject.

Prevalence: ~1.4% of filings (156/11,299) have this pattern in their Item 1C section. It produces ~2,500 affected paragraphs across the corpus.

Common orphaned words: We (73), Our (37), The (5), To (17), As (15), In (13), Cybersecurity (10), Management (6), Following (6). Basically any sentence-starting word.

Why it happens: The filing generator wraps text at a fixed column width in the HTML source. If the <span> opening tag + attributes eat most of a line, only the first word fits before the line break. The browser renders this identically (HTML treats source newlines as whitespace), but text extraction that preserves newlines from inline elements breaks.

Detection (for patching existing data): Match the pattern <span...>Word\nlowercase continuation... directly in the raw HTML. Three validation layers are needed:

  1. Same-tag check: The orphan word and continuation must be within the same inline element (<span>, <a>, <font>, etc.). This distinguishes orphan first-words from section headings above paragraphs. Critically, exclude <ix:...> XBRL tags — these are structural, not inline, and their first text is often a section title.

  2. Bold/underline filter: Skip matches inside <b>, <strong>, or text-decoration: underline. These are section headings that happen to have a line break mid-heading (e.g., <b>Risk\nManagement and Strategy</b>). Without this filter, headings get inlined into body text.

  3. Stripped-text validation: After finding an orphan word in the raw HTML, confirm it exists as a standalone word in the stripHtml() output. This catches mid-word splits across adjacent spans (see below).

Case-sensitivity matters: If using a regex with the i (case-insensitive) flag for tag name matching, the [a-z] check on the continuation text becomes meaningless — it will match uppercase too, letting headings through. Either drop the i flag (and match tags as [Ss][Pp][Aa][Nn] etc.) or validate continuation case separately.
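The three validation layers and the case-sensitivity point can be sketched roughly as follows. This is a minimal illustration, not the project's patching script: the function names, the tag list, and the bold lookup are assumptions. The regex deliberately omits the i flag, so the [a-z] continuation check stays meaningful, and ix: tags are simply left out of the tag list.

```typescript
// Hypothetical names throughout; only the overall pattern comes from the text above.
interface OrphanCandidate {
  word: string;
  continuation: string;
}

// No `i` flag: tag names are spelled in both cases so [a-z] stays case-sensitive.
const ORPHAN_RE =
  /(<(?:span|SPAN|font|FONT|a|A)\b[^>]*>)([A-Z][A-Za-z]*)\r?\n([a-z][^<]*)/;

// Rough check: is the nearest preceding <b>/<strong> still open at this offset?
function insideBold(html: string, index: number): boolean {
  const before = html.slice(0, index).toLowerCase();
  const lastOpen = Math.max(
    before.lastIndexOf("<b>"),
    before.lastIndexOf("<b "),
    before.lastIndexOf("<strong"),
  );
  const lastClose = Math.max(before.lastIndexOf("</b>"), before.lastIndexOf("</strong>"));
  return lastOpen > lastClose;
}

function findOrphanCandidates(html: string): OrphanCandidate[] {
  const out: OrphanCandidate[] = [];
  const re = new RegExp(ORPHAN_RE.source, "g"); // fresh object, so lastIndex starts at 0
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    const openTag = m[1];
    const underlined = /text-decoration:\s*underline/i.test(openTag);
    if (underlined || insideBold(html, m.index)) continue; // heading, not an orphan
    out.push({ word: m[2], continuation: m[3].trim() });
  }
  return out;
}
```

Candidates returned here would still need the stripped-text validation (layer 3) before any patch is applied.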

Prevention (for future extractions): In the paragraph segmenter, buffer single-word blocks that would otherwise be dropped (below minimum word count) and prepend them to the next block when it starts lowercase. This must happen at the segmentation stage, not in the extraction merge logic — changes to merge behavior cascade through downstream paragraph boundary decisions.
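A minimal sketch of that buffering step, assuming the segmenter receives an ordered list of text blocks and a minimum word count. The function name, the signature, and the choice to discard a buffered word when the next block starts uppercase are assumptions, not the project's actual segmenter.

```typescript
// Hypothetical segmenter pass: buffer single-word blocks instead of dropping them.
function segmentWithOrphanBuffer(blocks: string[], minWords: number): string[] {
  const out: string[] = [];
  let pending: string | null = null; // single-word block waiting for its paragraph

  for (const raw of blocks) {
    const block = raw.trim();
    if (block.length === 0) continue;

    let text = block;
    if (pending !== null) {
      // Prepend only when the next block continues in lowercase; otherwise the
      // buffered word is discarded, as it would have been without the buffer.
      if (/^[a-z]/.test(block)) text = pending + " " + block;
      pending = null;
    }

    const wordCount = text.split(/\s+/).length;
    if (wordCount === 1) {
      pending = text; // would fall below the minimum, so hold it back
      continue;
    }
    if (wordCount >= minWords) out.push(text);
  }
  return out;
}
```

For example, segmentWithOrphanBuffer(["Our", "sole executive officer and director is responsible"], 5) keeps one paragraph that starts with "Our sole executive".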

Mid-word splits across adjacent spans

The bug: Some filing generators split a single word across multiple <span> tags, sometimes with empty formatting spans between them:

<span style="font-size: 10pt">B</span>
<span style="font-size: 8pt"></span>
<span style="font-size: 10pt">lackrock
maintains a comprehensive cybersecurity risk management program...</span>

The HTML cleaner's adjacent-inline-boundary collapse correctly joins B + lackrock into Blackrock in the stripped text. But if a patching script operates on raw HTML (to find orphan patterns), it sees <span>lackrock\nmaintains... and incorrectly treats lackrock as an orphan word, prepending it to produce lackrock maintains... instead of the correct Blackrock maintains....

Detection: After finding a candidate orphan word in raw HTML, verify it exists as a standalone word (surrounded by whitespace or at line boundaries) in the stripped text. If stripHtml() produces Blackrock (not lackrock), the candidate is a word fragment, not an orphan.

Root cause: The filing generator uses separate spans for styling changes (font-size) that happen to fall at character boundaries within words. The empty <span style="font-size: 8pt"></span> is a zero-width formatting artifact.
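The standalone-word check described under Detection could look roughly like this. The helper name is hypothetical, and strippedText is assumed to be the output of stripHtml() for the same filing.

```typescript
// Returns true only if `candidate` appears as a whole word in the stripped text,
// i.e. bounded by whitespace or by the start/end of a line.
function isStandaloneWord(strippedText: string, candidate: string): boolean {
  const escaped = candidate.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const re = new RegExp("(?:^|\\s)" + escaped + "(?=\\s|$)", "m");
  return re.test(strippedText);
}
```

With stripped text "Blackrock\nmaintains a program", the candidate "lackrock" is rejected as a fragment, while "Our" in "Our\nsole executive officer" is accepted.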

Adjacent inline element boundaries

The bug: Different formatting applied to adjacent text creates word-joining when tags are stripped:

<span style="color: black">word</span><span style="color: blue">The next word</span>

Naively stripping tags produces "wordThe next word". The words at the span boundary merge.

Fix: Before stripping tags, collapse adjacent inline element boundaries to spaces:

// \b after the tag name keeps <b from also matching <br>, and <i from matching <img>.
.replace(/<\/(span|a|b|i|u|em|strong|font)>(\s*)<(?:span|a|b|i|u|em|strong|font)\b[^>]*>/gi,
  (_m, _tag, ws) => ws.length > 0 ? " " : "")

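This replaces a close/open pair such as </span><span> with a single space when any whitespace (including a source newline) separates the two tags, and with nothing when the tags abut directly. Directly abutting fragments are therefore joined, which is what reassembles mid-word splits such as B + lackrock. A direct join that should have been two words ("wordThe") is not separated by this step and still needs the CamelCase repair listed under PDF extraction artifacts.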

Same treatment needed for XBRL inline tags (</ix:nonNumeric><ix:nonNumeric>).
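One way to cover both cases in a single pass is to add ix: element names to the tag alternation. This is a sketch under assumptions: the function name and the ix:[A-Za-z]+ pattern are illustrative, and the callback keeps the same rule as the snippet above (a space when whitespace separates the tags, nothing otherwise).

```typescript
// Tag alternation shared by the closing and opening side of the boundary.
const INLINE_TAGS = "span|a|b|i|u|em|strong|font|ix:[A-Za-z]+";

const BOUNDARY_RE = new RegExp(
  "</(?:" + INLINE_TAGS + ")>(\\s*)<(?:" + INLINE_TAGS + ")\\b[^>]*>",
  "gi",
);

function collapseInlineBoundaries(html: string): string {
  // Whitespace between the tags collapses to one space; abutting tags join.
  return html.replace(BOUNDARY_RE, (_m: string, ws: string) => (ws.length > 0 ? " " : ""));
}
```

The \b after the opening tag name is there so that <br> and <img> are not mistaken for <b> and <i>.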

Source newlines vs block-element breaks

The issue: HTML source files contain newlines in two semantically different roles:

  1. Block-element breaks: </p>, </div>, <br> — these are paragraph boundaries
  2. Source line wrapping: Newlines within inline elements from the filing generator's line-length limit — these are meaningless whitespace

Both become \n in the stripped text. The extraction pipeline relies on newlines to separate paragraphs, so collapsing all newlines breaks paragraph detection. But preserving all newlines creates the orphan word problem.

The tradeoff: We chose to preserve newlines (they're needed for paragraph boundary detection in the extraction pass). The orphan word problem is handled downstream in the segmenter. An alternative (sentinel-based) approach — using \x00 for block breaks, collapsing source newlines to spaces, then restoring sentinels — was tested but caused too many changes to paragraph segmentation across the corpus (18,589 paragraphs changed text in regression testing).
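For reference, the sentinel-based alternative that was tested and not adopted can be sketched like this. The block-tag list and function name are assumptions; the point is only the order of operations: mark block breaks, collapse all remaining whitespace, then restore the marks as newlines.

```typescript
const BLOCK_BREAK = "\x00"; // sentinel; \s does not match it

function stripWithSentinels(html: string): string {
  return html
    .replace(/<\/(?:p|div|li|tr|h[1-6])>|<br\s*\/?>/gi, BLOCK_BREAK) // block breaks -> sentinel
    .replace(/<[^>]+>/g, "")   // drop every remaining tag
    .replace(/\s+/g, " ")      // source newlines become plain spaces
    .split(BLOCK_BREAK)        // restore paragraph boundaries
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .join("\n");
}
```

This removes orphan words at the source, but as noted above it shifted too many paragraph boundaries in regression testing to be adopted.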

XBRL inline tags (iXBRL / ix: namespace)

What they are: Inline XBRL (iXBRL) tags structured data directly in the filing HTML. It has been phased in for 10-K filings since 2019, and tagging of cybersecurity disclosures with the cyd: taxonomy is required for fiscal years ending on or after December 15, 2024. Tags like <ix:nonNumeric name="cyd:CybersecurityRiskManagementProcessesIntegratedTextBlock"> wrap entire sections.

Pitfalls:

  • Not inline formatting: Despite being inline XML elements, ix: tags are structural — they wrap paragraphs, sections, even entire Items. Treating them like <span> for orphan detection will match section headings.
  • XBRL metadata leaks into text: CIK numbers (0000123456), namespace prefixes and URIs (xbrli:, fasb.org), and ticker-date identifiers (ae-20231231) can appear in the text stream. Filter lines where more than 50% of tokens look like XBRL metadata.
  • continuedAt chains: Long sections are split across multiple ix:continuation blocks. These can interrupt the visual flow of text.
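The 50% token filter could be approximated as below. The token patterns are assumptions chosen to match the examples in the list (a zero-padded 10-digit CIK, a ticker-date identifier, a prefixed name, a namespace host); a real corpus would likely need more.

```typescript
// Heuristic patterns for tokens that look like XBRL metadata (all assumptions).
const XBRL_TOKEN_RES: RegExp[] = [
  /^\d{10}$/,              // zero-padded CIK, e.g. 0000123456
  /^[a-z]{1,6}-\d{8}$/i,   // ticker-date identifier, e.g. ae-20231231
  /^[a-z][\w-]*:\w+$/i,    // prefixed name, e.g. xbrli:shares
  /fasb\.org|xbrl\.org/i,  // namespace hosts
];

function looksLikeXbrlMetadata(line: string): boolean {
  const tokens = line.trim().split(/\s+/).filter((t) => t.length > 0);
  if (tokens.length === 0) return false;
  const hits = tokens.filter((t) => XBRL_TOKEN_RES.some((re) => re.test(t))).length;
  return hits / tokens.length > 0.5; // strictly more than half
}
```

A sentence that merely mentions a CIK stays below the threshold and is kept.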

Running headers/footers and page artifacts

SEC HTML often retains print-formatting artifacts:

| Pattern | Example | Detection |
|---|---|---|
| Page numbers | 17, - 17 -, Page 17 | Regex: /^[-–—\s]*[A-Za-z]?[-–—]?\s*\d+[-–—\s]*$/ (as written this matches 17, - 17 - and single-letter prefixes such as F-17; Page 17 needs its own alternative) |
| Running headers | ACME CORP FORM 10-K | Short line + company name + form type |
| Table of contents markers | Table of Contents | Exact match, strip trailing content |
| Back-to-top links | (Back to Index) | Regex: `/back\s+to\s+(index |
| Part headings | PART II | Short line, roman numerals |

These appear mid-text because they're print-layout remnants. Filter them in the extraction pass, before paragraph segmentation.
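A line-level filter for some of these artifacts might look like the sketch below. Only the page-number regex comes from the table; the Part-heading and table-of-contents patterns and the function name are assumptions. Running headers and back-to-top links are left out because they need filing-specific input or the full pattern.

```typescript
const PAGE_NUMBER_RE = /^[-–—\s]*[A-Za-z]?[-–—]?\s*\d+[-–—\s]*$/; // from the table
const PART_HEADING_RE = /^PART\s+[IVX]+\.?$/i;                    // assumption
const TOC_RE = /^table\s+of\s+contents$/i;                        // assumption

function isPageArtifact(line: string): boolean {
  const t = line.trim();
  if (t.length === 0) return false;
  return PAGE_NUMBER_RE.test(t) || PART_HEADING_RE.test(t) || TOC_RE.test(t);
}

// Drop artifact lines before the text reaches paragraph segmentation.
function dropPageArtifacts(lines: string[]): string[] {
  return lines.filter((line) => !isPageArtifact(line));
}
```

Note that the page-number regex allows at most one letter before the digits, so a line such as "Page 17" passes through this filter unchanged.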

Subsidiary headers in combined filings

Holding companies file combined 10-Ks covering multiple subsidiaries. Each subsidiary section repeats a header:

ENTERGY ARKANSAS, LLC AND SUBSIDIARIES

These are ALL-CAPS, contain entity suffixes (LLC, INC, CORP, L.P.), and include "AND SUBSIDIARIES". Filter with:

/^[A-Z][A-Z\s,.'&-]{5,}(?:LLC|INC|CORP|COMPANY|L\.?P\.?)\b.*\bAND\s+SUBSIDIARIES\b/
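Applied as a line filter, the regex above might be used like this (the function name is hypothetical):

```typescript
const SUBSIDIARY_HEADER_RE =
  /^[A-Z][A-Z\s,.'&-]{5,}(?:LLC|INC|CORP|COMPANY|L\.?P\.?)\b.*\bAND\s+SUBSIDIARIES\b/;

// Remove repeated subsidiary headers; ordinary sentences fail the ALL-CAPS prefix.
function dropSubsidiaryHeaders(lines: string[]): string[] {
  return lines.filter((line) => !SUBSIDIARY_HEADER_RE.test(line.trim()));
}
```

Because the pattern has no i flag, a body sentence that mentions "our subsidiaries" in mixed case is not affected.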

PDF extraction artifacts

Some filings are PDF-converted-to-HTML, producing:

  • Missing spaces: word.Next → fix with /([a-z])\.([A-Z])/g
  • CamelCase joins: wordThe next → fix common English words: /([a-z])(The|Our|We|This|...)\b/g
  • Orphaned punctuation: Director , → fix with / ([,;:.!?)])/g
  • Colon joins: word:Word → fix with /([a-z]):([A-Z])/g
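The four repairs can be chained in one pass, as in this sketch. The word list in the CamelCase pattern is elided in the list above, so only the four words that are actually named are used here; the function name is an assumption.

```typescript
function repairPdfArtifacts(text: string): string {
  return text
    .replace(/([a-z])\.([A-Z])/g, "$1. $2")           // word.Next -> word. Next
    .replace(/([a-z]):([A-Z])/g, "$1: $2")            // word:Word -> word: Word
    .replace(/([a-z])(The|Our|We|This)\b/g, "$1 $2")  // wordThe -> word The
    .replace(/ ([,;:.!?)])/g, "$1");                  // "Director ," -> "Director,"
}
```

These patterns are broad, so they are best limited to filings already identified as PDF conversions.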

Entity decoding

SEC HTML uses a mix of named entities, decimal entities, and hex entities. Common ones to handle:

&nbsp; &#160; &#xa0;  →  space
&amp;                 →  &
&mdash; &#8212; &#151; →  —
&ndash; &#8211; &#150; →  –
&rsquo; &#8217; &#146; →  '  (right single quote, used as apostrophe)
&ldquo; &rdquo;        →  "  (curly quotes)
&bull; &#8226; &#149;   →  •
&#153;                  →  ™

Some filings use the Greek question mark (U+037E) instead of a semicolon — looks identical but breaks regex.
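A small decoder covering the entities listed above might look like the following. It is a sketch, not a full HTML entity decoder: the function name and table layout are assumptions, unknown entities are left untouched, and the Greek question mark is normalized first so that the entity pattern can rely on a real semicolon.

```typescript
const NAMED = new Map<string, string>([
  ["nbsp", " "], ["amp", "&"], ["mdash", "\u2014"], ["ndash", "\u2013"],
  ["rsquo", "'"], ["ldquo", '"'], ["rdquo", '"'], ["bull", "\u2022"],
]);

// Numeric references, including the Windows-1252 code points (146-153) from the table.
const BY_CODE = new Map<number, string>([
  [160, " "], [8212, "\u2014"], [151, "\u2014"], [8211, "\u2013"], [150, "\u2013"],
  [8217, "'"], [146, "'"], [8226, "\u2022"], [149, "\u2022"], [153, "\u2122"],
]);

function decodeEntities(text: string): string {
  return text
    .replace(/\u037e/g, ";") // Greek question mark -> ASCII semicolon
    .replace(/&(#[xX][0-9a-fA-F]+|#\d+|[A-Za-z]+);/g, (whole: string, body: string) => {
      if (body.charAt(0) !== "#") {
        const named = NAMED.get(body);
        return named !== undefined ? named : whole;
      }
      const isHex = body.charAt(1) === "x" || body.charAt(1) === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      const decoded = BY_CODE.get(code);
      return decoded !== undefined ? decoded : whole;
    });
}
```

Decoding happens in a single pass, so text such as &amp;nbsp; becomes the literal string &nbsp; and is not decoded a second time.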

Truncation detection

The extraction pipeline caps output at 50 blocks / 15,000 words. Filings that hit this cap may be truncated. Detection: check if the last paragraph of each filing ends with terminal punctuation ([.!?;")]\s*$). If not, the filing was likely cut mid-sentence — remove all its paragraphs from the training corpus.

Limitation: This only catches truncation that falls mid-sentence. If the cap happens to fall at a sentence end, the filing appears complete even though content was lost. No fix for this without comparing against the full filing length.
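The terminal-punctuation check can be sketched as follows, assuming each paragraph record carries accessionNumber, paragraphIndex, and text fields. The interface and function name are assumptions; the regex is the one quoted above.

```typescript
interface Paragraph {
  accessionNumber: string;
  paragraphIndex: number;
  text: string;
}

const TERMINAL_RE = /[.!?;")]\s*$/;

// Returns the accession numbers whose last paragraph does not end a sentence.
function findLikelyTruncated(paragraphs: Paragraph[]): Set<string> {
  const last = new Map<string, Paragraph>();
  for (const p of paragraphs) {
    const cur = last.get(p.accessionNumber);
    if (cur === undefined || p.paragraphIndex > cur.paragraphIndex) {
      last.set(p.accessionNumber, p);
    }
  }
  const truncated = new Set<string>();
  last.forEach((p, accession) => {
    if (!TERMINAL_RE.test(p.text)) truncated.add(accession);
  });
  return truncated;
}
```

Every paragraph belonging to a returned accession number would then be removed from the training corpus.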

Merge logic and cascade effects

The extraction pipeline merges short/broken lines in multiple passes. Any change to merge logic cascades: merging two lines changes the resulting line's length, which affects whether subsequent lines trigger length-based merge thresholds, which changes the next merge decision, etc.

In regression testing, a single-word forward-merge change in the extraction pass caused 1,812 ripple-effect text changes across the corpus. Moving the fix to the segmentation stage (after all extraction merges complete) reduced ripples but still affected ~800 paragraphs.

Lesson: For retroactive data fixes, prefer surgical data patching (find-and-prepend on the JSONL) over re-running extraction. For future extraction, place fixes as late in the pipeline as possible to minimize cascade.

Testing extraction changes

When modifying the HTML cleaner, extraction, or segmentation code, regression test against the full corpus:

  1. Re-extract all cached HTML files with the modified code
  2. Compare against existing paragraphs by (accessionNumber, paragraphIndex)
  3. Classify changes:
    • Clean prefix (new text ends with old text) — orphan word recovered
    • Clean suffix (new text starts with old text) — fragment absorbed
    • Re-merge (text differs in other ways) — cascade/ripple effect
    • Paragraph count change — boundary shift, highest-risk regression
  4. Investigate any paragraph count decreases and text shrinkages — these are the most likely regressions

For the orphan word fix, acceptable results were: 215 clean prefix fixes, 0 paragraph count changes, 0 text shrinkages.
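The per-paragraph classification in step 3 could be implemented along these lines. The type and function names are assumptions; paragraph count changes are a per-filing comparison and are not covered by this helper.

```typescript
type ChangeKind = "unchanged" | "clean-prefix" | "clean-suffix" | "re-merge";

// Compare the old and new text of the same (accessionNumber, paragraphIndex).
function classifyChange(oldText: string, newText: string): ChangeKind {
  if (oldText === newText) return "unchanged";
  const grew = newText.length > oldText.length;
  if (grew && newText.endsWith(oldText)) return "clean-prefix";   // orphan word recovered
  if (grew && newText.startsWith(oldText)) return "clean-suffix"; // fragment absorbed
  return "re-merge";                                              // cascade/ripple effect
}

// Text shrinkage is one of the signals worth investigating first.
function isShrinkage(oldText: string, newText: string): boolean {
  return newText.length < oldText.length;
}
```

Counting the results per category across the corpus gives a summary in the same shape as the figures above.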