From 1f2d748a1d2ad40b611aa92b6ba237c9a0f543df Mon Sep 17 00:00:00 2001 From: Joey Eamigh <55670930+JoeyEamigh@users.noreply.github.com> Date: Sat, 4 Apr 2026 15:01:20 -0400 Subject: [PATCH] new codebook and ethos --- CLAUDE.md | 4 + docs/CODEBOOK-ETHOS.md | 319 +++++++++ docs/LABELING-CODEBOOK-v1.md | 871 +++++++++++++++++++++++ docs/LABELING-CODEBOOK.md | 977 ++++++++----------------- docs/NARRATIVE-v1.md | 1292 ++++++++++++++++++++++++++++++++++ docs/NARRATIVE.md | 1262 +++------------------------------ docs/STATUS.md | 338 ++++----- 7 files changed, 3038 insertions(+), 2025 deletions(-) create mode 100644 docs/CODEBOOK-ETHOS.md create mode 100644 docs/LABELING-CODEBOOK-v1.md create mode 100644 docs/NARRATIVE-v1.md diff --git a/CLAUDE.md b/CLAUDE.md index 78379c2..d45c17c 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -12,7 +12,11 @@ Bun workspace monorepo. Three packages: |------|-------| | Shared schemas (Zod) | `packages/schemas/src/` | | Labeling codebook (source of truth for all category/specificity definitions) | `docs/LABELING-CODEBOOK.md` | +| Codebook ethos (reasoning behind every codebook decision) | `docs/CODEBOOK-ETHOS.md` | | Project narrative (decisions, roadblocks, lessons) | `docs/NARRATIVE.md` | +| Project status & todo list | `docs/STATUS.md` | +| v1 codebook (preserved) | `docs/LABELING-CODEBOOK-v1.md` | +| v1 narrative (preserved) | `docs/NARRATIVE-v1.md` | | Implementation plan for labelapp | `docs/labelapp-plan.md` | | Labelapp-specific agent guide | `labelapp/AGENTS.md` | | Docker compose (Postgres) | `docker-compose.yaml` (root) | diff --git a/docs/CODEBOOK-ETHOS.md b/docs/CODEBOOK-ETHOS.md new file mode 100644 index 0000000..89ac02c --- /dev/null +++ b/docs/CODEBOOK-ETHOS.md @@ -0,0 +1,319 @@ +# Codebook Ethos — Design Reasoning & Edge Case Analysis + +This document explains the reasoning behind every decision in the labeling codebook. It is the training companion for human annotators and the design record for the project. 
If you read this document and disagree with anything in the codebook, flag it — we want to resolve disagreements here, not at labeling time. + +--- + +## Why This Document Exists + +The codebook (LABELING-CODEBOOK.md) tells you WHAT to do. This document tells you WHY. The distinction matters because: + +1. **Models need clean instructions.** The codebook is designed to go directly into an LLM system prompt. Extra explanation creates context pollution and can cause models to overfit to edge case reasoning rather than applying general rules. +2. **Humans need understanding.** A human annotator who understands the reasoning behind a rule will correctly handle novel edge cases that the rule doesn't explicitly cover. A human who only knows the rule will freeze on ambiguity or make inconsistent judgment calls. +3. **Decisions need documentation.** Every bright line in the codebook represents a deliberate choice. Documenting the reasoning makes those choices auditable, revisable, and defensible in the final paper. + +--- + +## Why v2? What Changed from v1 + +The v1 codebook (preserved at `docs/LABELING-CODEBOOK-v1.md`) was built over 12+ prompt iterations and served through 150K Stage 1 annotations, a 6-person human labeling round, and a 10-model benchmark. It worked — but it had structural problems that became visible only at evaluation time: + +### Problem 1: Specificity Level 2 was too narrow + +The professor's construct defines Level 2 as "Sector-adapted — references industry but no firm-specific details." Our v1 codebook interpreted this as "names a specific recognized standard (NIST, ISO 27001, SOC 2, etc.)." That interpretation was too literal. Things like penetration testing, vulnerability scanning, SIEM, phishing simulations — these are all cybersecurity industry practices that a security professional instantly recognizes as domain-specific. 
Our codebook classified them as Level 1 (generic boilerplate), which squeezed Level 2 down to 3.9% of the holdout (47 samples). + +At 47 samples, a swing of just ±3 correct predictions moves F1 by ~0.06. The measurement is too noisy for reliable per-class evaluation. + +**v2 fix:** Level 2 is now "Domain-Adapted" — uses cybersecurity domain terminology recognizable to a security professional, not just named standards. The projected distribution shifts from ~44/4/37/14 to ~25/20/37/18. Every class has real mass. + +### Problem 2: Level 4 required 2+ QV facts (counting problem) + +The professor's construct says: "(4) Quantified and verifiable — includes specific metrics, dollar amounts, incident timelines, or third-party audit references." That's a list of qualifying facts, not a "count two" rule. Our v1 codebook added the 2-fact threshold, which created a narrow Level 4 (14.1%) and forced annotators into a counting exercise that was error-prone and contentious. + +**v2 fix:** 1+ QV-eligible fact → Level 4. No counting. The bright line is: "Can an external party independently verify this claim?" One verifiable dollar amount, one named third-party firm, one specific date — any of these is already more informative than a paragraph without them. + +### Problem 3: The BG/MR/RMP triangle was patched, not fixed + +v1 accumulated six decision rules and ten borderline cases — many were patches for systemic ambiguity rather than clean rules. The v3.0 person-vs-function test and v3.5 three-step decision chain were good ideas, but they were bolted on as rulings to an unchanged set of definitions. Models had to process increasingly complex instructions with diminishing returns. + +**v2 fix:** The "What question does this paragraph answer?" test replaces the patchwork. MR's headline is now "How is management organized to handle cybersecurity?"
— broader than "who a specific person is" (which missed paragraphs about management structure without named individuals) and clearer than a multi-step mechanical test. The person-removal test survives as a confirmation tool, not the primary rule. + +### Problem 4: The holdout was adversarial by design + +v1's holdout was stratified to OVER-SAMPLE confusion-axis paragraphs. This was great for codebook development (stress-testing rules on hard cases) but terrible for evaluation (inflating error rates and depressing F1). Combined with the narrow Level 2, this created a structurally unfavorable evaluation set. + +**v2 fix:** Random stratified sample — equal per category class, random within each stratum. Hard cases are represented at their natural frequency, not overweighted. + +--- + +## Category Reasoning + +### Why "What Question Does This Paragraph Answer?" + +Previous approaches tried to classify based on surface features: grammatical subjects, keyword presence, mechanical tests. These worked for clear-cut cases but failed on the governance chain (Board → Committee → Officer → Program) that appears in thousands of SEC filings. + +The "what question?" test works because it asks about communicative PURPOSE, not surface features. A paragraph that chains "The Audit Committee oversees... our CISO reports quarterly... the program includes penetration testing" has keywords from all three of BG, MR, and RMP. The question test cuts through: what is this paragraph TRYING TO TELL YOU? It's trying to tell you how oversight works. → BG. + +This is also the test that humans naturally apply. When you read a paragraph and "just know" it's about governance vs. process, you're implicitly asking what the paragraph's purpose is. The codebook now makes that implicit test explicit. + +### The Board Governance / Management Role Boundary + +**The core issue:** SEC Item 106(c) has two parts — (c)(1) covers board oversight and (c)(2) covers management's role. 
Many filings interleave them in a single paragraph. + +**The rule:** Governance-chain paragraphs default to BG. They become MR only when management's organizational role is the primary content. + +**Why this default?** Because the governance chain exists TO DESCRIBE OVERSIGHT. When a paragraph says "The Audit Committee oversees our cybersecurity program. Our CISO reports quarterly to the Committee on threat landscape and program effectiveness," the paragraph is explaining how oversight works. The CISO is the mechanism through which the board gets information — the paragraph is about the board's oversight structure, not about the CISO as a person or management's organizational role. + +MR captures something different: it answers "how is management organized?" This includes: +- Who holds cybersecurity responsibilities and how those responsibilities are divided +- What qualifies those people (credentials, experience, background) +- How management-level structures work (steering committees, reporting lines between officers) +- The identity and background of specific individuals + +A paragraph about the CISO's 20 years of experience, CISSP certification, and team of 12 → MR. A paragraph about the board receiving quarterly reports from the CISO → BG. Same person mentioned, different purpose. + +**The directionality heuristic (confirmation tool, not primary rule):** +- Board → Management (describing governance structure flowing down) → BG +- Management → Board (describing reporting relationship flowing up) → usually BG (the board is still the focus as the recipient) +- Management → Management (how roles are divided, who reports to whom in management) → MR +- Either mentioned, but most content is about actual processes → RMP + +### The Management Role / Risk Management Process Boundary + +**The core issue:** This was the #1 disagreement axis in v1 (2,290 disputes). 
The pattern is always the same: a paragraph names a CISO/CIO/CTO in the opening clause, then describes what the cybersecurity program does. Is it about the person or the program? + +**The person-removal test:** Remove all person-specific content. If a substantive description remains → RMP. This works because: +- If the paragraph is ABOUT the program, removing the person who oversees it leaves the program description intact +- If the paragraph is ABOUT the person, removing their details leaves nothing meaningful + +**Why this test and not a noun count or keyword list:** We tried mechanical approaches in v1 (step-by-step decision chains, grammatical subject tests). They worked for easy cases but made hard cases harder — annotators had to run through a mental flowchart instead of reading the paragraph naturally. The person-removal test is a single thought experiment that maps to what humans already do intuitively. + +**The remaining hard case — management committee with process details:** + +> "Our Cybersecurity Steering Committee, comprising the CISO, CIO, CFO, and General Counsel, meets monthly to review cybersecurity risks, assess emerging threats, and oversee our vulnerability management and incident response programs." + +Person-removal test: remove committee membership → "monthly to review cybersecurity risks, assess emerging threats, and oversee vulnerability management and incident response programs." Still has content, but it's thin — the committee structure IS the primary content. → MR. + +If the paragraph instead spent three more sentences describing how the vulnerability management program works → RMP (process becomes dominant). The test scales with paragraph length naturally. + +### The Strategy Integration / None/Other Boundary + +**The core issue:** v1 had 1,094 disputes on this axis, almost all from materiality disclaimers. 
The sentence "risks have not materially affected our business strategy, results of operations, or financial condition" appears in thousands of filings. Is it SI (a materiality assessment) or N/O (boilerplate)? + +**The rule:** It's SI. Even though the language is generic, the company IS fulfilling its SEC Item 106(b)(2) obligation to assess whether cyber risks affect business strategy. Category captures WHAT the paragraph discloses (a materiality assessment). Specificity captures HOW specific it is (generic boilerplate = Level 1). These are independent dimensions. + +**The "could" vs. "have not" distinction:** This is a linguistic bright line, not a judgment call. +- "Have not materially affected" → past tense, definitive statement → assessment → SI +- "Are reasonably likely to materially affect" → SEC's required forward-looking language → assessment → SI +- "Could have a material adverse effect" → conditional, hypothetical → speculation → N/O (or classify by other content) + +The keyword is "reasonably likely" — that's the SEC's Item 106(b)(2) threshold. "Could" is the generic risk-factor language that appears in every 10-K regardless of actual risk level. + +**Cross-references with materiality language:** "For risks that may materially affect us, see Item 1A" is N/O. The paragraph's purpose is pointing elsewhere. The word "materially" describes what Item 1A discusses, not the company's own conclusion. But: "Risks have not materially affected us. See Item 1A" is SI — the first sentence IS an assessment, and the cross-reference is subordinate. + +--- + +## Specificity Reasoning + +### Why Broaden Level 2: The ERM Test + +The v1 definition of Level 2 ("names a specific recognized standard") was too narrow because it conflated "domain-specific" with "names a formal standard." 
A paragraph that says "we conduct penetration testing and vulnerability assessments" is clearly more informative than "we have processes to manage cybersecurity risks" — the first uses domain vocabulary, the second uses generic business language. But v1 classified both as Level 1. + +The v2 test: **"Would this term appear naturally in a generic enterprise risk management document?"** If yes, the term is generic business language (Level 1); if no, it signals domain adaptation (Level 2). This captures the construct's intent — "references industry" means using the industry's vocabulary, not just citing its standards. + +**Why "incident response plan" stays at Level 1:** IRP is used across all risk management domains — cybersecurity, physical security, natural disasters, supply chain disruptions. A non-security ERM professional would use this term naturally. By contrast, "penetration testing" is uniquely cybersecurity — you don't penetration-test a supply chain or a natural disaster response. + +**Why "security awareness training" is Level 2:** This is borderline. A businessperson might say "we train employees on security." But the specific phrase "security awareness training" is a recognized cybersecurity program type. The term itself references a domain-specific practice, even though it's become common. A non-security person would say "we train our employees" (Level 1), not "we provide security awareness training" (Level 2). The difference IS the domain vocabulary. + +**Why "tabletop exercises" stays at Level 1:** Tabletop exercises are used in emergency management, business continuity, and general risk management — not just cybersecurity. "Cybersecurity tabletop exercises simulating ransomware scenarios" → Level 2 (the qualifier makes it domain-specific). But bare "tabletop exercises" could refer to any risk domain. + +### Why 1+ QV Fact: The External Verifiability Test + +The v1 rule was 2+ QV facts. This created problems: +1. **Counting is error-prone.** Annotators and models disagree on what counts. Is "CISO" a QV fact? Is "quarterly" a fact?
The counting itself became a source of disagreement. +2. **The construct doesn't require counting.** The professor's Level 4 definition lists types of qualifying facts, not a minimum count. +3. **One verifiable fact IS quantified and verifiable.** A paragraph that says "We maintain $100M in cyber insurance coverage" is genuinely more informative and verifiable than one without dollar amounts. The 2-fact threshold was artificial. + +The v2 test asks: **Can an external party independently verify at least one claim in this paragraph?** One specific number, one named third-party firm, one named certification held by an individual — any of these crosses the threshold. + +**Why named roles (CISO) are NOT QV:** A role title tells you something about the company's structure (firm-specific, Level 3) but is not a quantified claim an outsider can verify. "Our CISO" is identification. "Our CISO holds CISSP certification" adds a verifiable claim (CISSP holders are in a public registry). The role gets you to Level 3; the certification pushes to Level 4. + +**Why named individuals alone are NOT QV:** "Our CISO, Jane Smith" is firm-specific (Level 3). You could look her up, but the NAME itself isn't a quantified claim about cybersecurity posture. "Jane Smith, who has 20 years of cybersecurity experience" adds a verifiable quantity. The name identifies; the experience quantifies. + +**The certification trilogy — a critical distinction:** +1. "Our program is aligned with ISO 27001" → **Level 2** (references a standard, no firm-specific claim) +2. "We are working toward ISO 27001 certification" → **Level 3** (firm-specific intent, but no verifiable achievement) +3. "We maintain ISO 27001 certification" → **Level 4** (verifiable claim — you can check if a company holds this certification) + +The difference between "aligned with" and "maintain certification" is the difference between aspiration and audited fact. 
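The external-verifiability test is ultimately a human judgment, but the most common QV fact types — dollar amounts, experience quantities, specific dates, named certifications, headcounts — follow recognizable surface patterns. The sketch below is a hypothetical triage aid, not part of the codebook: `has_qv_fact` and `QV_PATTERNS` are illustrative names, and named vendors or audit firms (also QV-eligible) would require entity recognition that regexes cannot provide.

```python
import re

# Hypothetical surface patterns for common QV-eligible fact types.
# A triage aid only — the codebook's actual test remains the annotator's
# judgment: "can an external party independently verify this claim?"
QV_PATTERNS = [
    r"\$\s?\d[\d.,]*\s*(?:million|billion)?",       # dollar amounts ($100 million, $8.5 million)
    r"\b\d+\s+years?\b",                            # experience quantities (20 years)
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b",  # specific dates
    r"\b(?:CISSP|CISM|CISA)\b",                     # named, registry-checkable certifications
    r"\bteam of \d+\b",                             # headcounts
]

def has_qv_fact(paragraph: str) -> bool:
    """Flag paragraphs containing at least one QV-eligible surface pattern."""
    return any(re.search(p, paragraph) for p in QV_PATTERNS)
```

Note what the patterns deliberately exclude: a bare role title ("our CISO") or reporting frequency ("quarterly") never fires, mirroring the rule that roles identify (Level 3) while certifications and quantities verify (Level 4).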
+ +--- + +## Worked Edge Cases + +### Case 1: The Governance Chain + +> "The Board of Directors, through its Audit Committee, oversees the Company's cybersecurity risk management program. The Audit Committee receives regular updates from the CISO on the results of penetration testing and vulnerability assessments." + +**"What question?" test:** "How does the board oversee cybersecurity?" → **BG** +**Specificity:** "penetration testing," "vulnerability assessments" = domain terminology → **Level 2** +**Why not RMP?** The process details (pen testing, vuln assessments) are subordinate to the reporting structure. The paragraph exists to tell you that the Audit Committee oversees things and receives reports — the program details are examples of WHAT is reported. + +### Case 2: CISO Attribution + Program Description + +> "Our CISO oversees our cybersecurity program, which includes regular risk assessments, penetration testing, vulnerability scanning, and incident response planning aligned with the NIST CSF framework." + +**Person-removal test:** "cybersecurity program, which includes regular risk assessments, penetration testing, vulnerability scanning, and incident response planning aligned with the NIST CSF framework" → complete program description → **RMP** +**Specificity:** Domain terms (pen testing, vuln scanning) + named standard (NIST CSF) → **Level 2** +**Why not MR?** The paragraph tells you nothing about the CISO as a person — no qualifications, no experience, no reporting line, no team. The CISO is an attribution tag, like a byline on a news article. The content is the program. + +### Case 3: CISO Qualifications + +> "Our Vice President of Information Security, who holds CISSP and CISM certifications and has over 20 years of experience in cybersecurity, reports directly to our Chief Information Officer. She leads a team of 12 dedicated cybersecurity professionals." + +**"What question?" test:** "How is management organized / who is this person?" 
→ **MR** +**Specificity:** CISSP/CISM (named certifications, QV), 20 years (specific number, QV), 12 professionals (headcount, QV) — any one of these → **Level 4** +**Why not RMP?** Every sentence is about the person: their title, credentials, experience, reporting line, team. Remove the person-specific content and nothing remains. + +### Case 4: CFO/VP Role Allocation (No Named Individuals) + +> "Our CFO and VP of IT jointly oversee our cybersecurity program. The CFO is responsible for risk governance and insurance, while the VP of IT manages technical operations. They report to the board quarterly on cybersecurity matters." + +**"What question?" test:** "How is management organized?" → **MR** +**Person-removal test:** Remove all role content → "report to the board quarterly on cybersecurity matters" → barely anything → **MR confirmed** +**Specificity:** Named roles with a firm-specific division of responsibilities → **Level 3** (firm-specific, but nothing an external party could independently verify) +**Why this is MR without named individuals:** MR isn't "who a specific person is" — it's "how management is organized." This paragraph describes role allocation and reporting structure. The roles are named, the responsibilities are divided, the governance chain is defined. This is organizational disclosure. + +### Case 5: Management Committee with Process Details + +> "Our Cybersecurity Steering Committee, comprising the CISO, CIO, CFO, and General Counsel, meets monthly to review cybersecurity risks, assess emerging threats, and oversee our vulnerability management and incident response programs." + +**"What question?" test:** "How is management organized?" → **MR** +**Person-removal test:** Remove committee membership → thin but the activities remain → borderline +**Tiebreak:** The paragraph's FRAME is the committee — it introduces the committee and describes what it does. The activities listed (review, assess, oversee) are verbs of management oversight, not operational descriptions of HOW those programs work.
→ **MR, Specificity 3** (named committee + composition = firm-specific) +**When this flips to RMP:** If the paragraph spent most of its length describing how the vulnerability management program works (tools, methodology, frequency, findings), with the committee mentioned only as context → RMP. + +### Case 6: Materiality Assessment (Backward-Looking) + +> "Risks from cybersecurity threats have not materially affected, and are not reasonably likely to materially affect, our business strategy, results of operations, or financial condition." + +**Materiality test:** Company stating a conclusion → **SI** +**Specificity:** Boilerplate language (every company says this) → **Level 1** +**Why this is SI and not N/O:** The company is fulfilling its SEC obligation to assess materiality. The fact that the language is generic makes it low-specificity, but the CATEGORY is about what the paragraph discloses (a materiality assessment), not how specific it is. + +### Case 7: Materiality Speculation + +> "Cybersecurity risks could have a material adverse effect on our business, financial condition, and results of operations." + +**Materiality test:** "Could" = speculation, not a conclusion → **N/O** +**Specificity:** N/O always gets **Level 1** +**Why this is N/O and not SI:** This is generic risk-factor language that appears in virtually every 10-K, regardless of whether the company has ever experienced a cybersecurity incident. The company is not stating a conclusion about its cybersecurity posture — it's acknowledging that cybersecurity risks exist. This carries zero informational content about THIS company's cybersecurity situation. + +### Case 8: Forward-Looking Assessment (SEC Qualifier) + +> "We face risks from cybersecurity threats that, if realized and material, are reasonably likely to materially affect us, including our operations, business strategy, results of operations, or financial condition." 
+ +**Materiality test:** "Reasonably likely to materially affect" = SEC's Item 106(b)(2) threshold → **SI** +**Specificity:** Boilerplate → **Level 1** +**Why "reasonably likely" is different from "could":** "Reasonably likely" is the SEC's required assessment language. A company using this phrase is making a forward-looking materiality assessment, not idly speculating. It's still boilerplate (Spec 1), but it IS an assessment (SI). + +### Case 9: Cross-Reference with vs. without Assessment + +> **N/O:** "For a description of the risks from cybersecurity threats that may materially affect the Company, see Item 1A, 'Risk Factors.'" +> → The paragraph points elsewhere. "May materially affect" describes what Item 1A discusses. → **N/O, Level 1** + +> **SI:** "We have not identified any cybersecurity incidents or threats that have materially affected us. For more information, see Item 1A, Risk Factors." +> → The first sentence IS an assessment. The cross-reference is subordinate. → **SI, Level 1** + +The test: does the paragraph MAKE a materiality conclusion, or only REFERENCE one that exists elsewhere? + +### Case 10: SPAC / No-Operations Company + +> "We are a special purpose acquisition company with no business operations. We have not adopted any cybersecurity risk management program. Our Board of Directors is generally responsible for oversight of cybersecurity risks, if any." + +**→ N/O, Level 1.** The board mention is perfunctory ("generally responsible... if any"). The company explicitly has no program. The absence of a program is not a disclosure of a program, and an incidental governance mention in the context of "we have nothing" does not constitute substantive board governance disclosure. + +### Case 11: Named Tool as QV Fact + +> "We utilize CrowdStrike Falcon for endpoint detection and response across our enterprise." + +**Category:** "What does the program do?" 
→ **RMP** +**Specificity:** CrowdStrike Falcon = named product = QV-eligible fact → **Level 4** +**Why this is Level 4:** A company naming its specific EDR tool is genuinely more transparent and verifiable than "we use endpoint detection tools." You could confirm this claim. This is exactly what the construct means by "quantified and verifiable." + +### Case 12: Single Named Tool (v1 was Level 3, v2 is Level 4) + +Under v1's 2-fact rule, a paragraph with only one named product was Level 3. Under v2's 1-fact rule, it's Level 4. This is intentional — the 2-fact threshold was artificial. One verifiable external reference IS "quantified and verifiable." + +### Case 13: Insurance with Dollar Amount + +> "We maintain cybersecurity insurance coverage with $100 million in aggregate coverage and a $5 million deductible per incident." + +**"What question?" test:** "How does cybersecurity affect the business?" → **SI** (insurance is a financial/business-impact response) +**Specificity:** $100M and $5M = dollar amounts (QV) → **Level 4** + +### Case 14: Regulatory Compliance — Three Variants + +> **N/O:** "The Company is subject to various regulatory requirements related to cybersecurity, data protection, and privacy." +> → A truism. No disclosure of what the company DOES. → **N/O, Level 1** + +> **RMP, Level 2:** "We maintain compliance with PCI DSS, HIPAA, and GDPR through regular audits and monitoring of our security controls." +> → Names specific standards + describes compliance activities → **RMP, Level 2** + +> **RMP, Level 4:** "We passed our PCI DSS Level 1 audit in March 2024, conducted by Trustwave." +> → Names standard + specific date + named third-party auditor → **RMP, Level 4** + +### Case 15: "Under the Direction of" Attribution + +> "Under the direction of our CISO, the Company has implemented a comprehensive cybersecurity program including penetration testing, vulnerability assessments, and 24/7 security monitoring." 
+ +**Person-removal test:** "The Company has implemented a comprehensive cybersecurity program including penetration testing, vulnerability assessments, and 24/7 security monitoring." → Complete program description → **RMP, Level 2** + +### Case 16: ERM Integration + +> "Our cybersecurity risk management program is integrated into our overall enterprise risk management framework." + +**Category:** This describes a program characteristic → **RMP** +**Specificity:** "Enterprise risk management" and "integrated" are generic business language → **Level 1** +**Why not Level 2:** "Enterprise risk management" is a general business concept, not cybersecurity domain terminology. The ERM test: would this sentence appear in a generic ERM document? Yes, it could describe integrating ANY risk program into ERM. → Level 1. + +### Case 17: "Dedicated Cybersecurity Team" + +> "We have a dedicated cybersecurity team that is responsible for managing our cybersecurity risks." + +**Category:** RMP (what the team does — manages cyber risks) +**Specificity:** "Dedicated cybersecurity team" = domain-adapted organizational approach → **Level 2** +**Why Level 2 and not Level 3:** Many companies claim "dedicated" teams. The term describes a general organizational approach (having people dedicated to cybersecurity), not a fact unique to THIS company. Compare: "a dedicated team of 12 cybersecurity professionals" → Level 4 (the headcount is QV). The word "dedicated" itself doesn't differentiate. + +### Case 18: Multiple Category Paragraph — Incident + Cost + +> "On January 15, 2024, we detected unauthorized access to our customer support portal. We estimate the total cost of remediation at approximately $8.5 million." + +**Both ID and SI content.** Which dominates? The incident (what happened) is the frame; the cost is a detail within the incident narrative. 
→ **ID, Level 4** (January 15, 2024 + $8.5M = QV facts) + +If the paragraph were primarily financial analysis with one sentence mentioning an incident → SI. + +### Case 19: Negative Incident Assertion + +> "We have not experienced any material cybersecurity incidents during the reporting period." + +**Materiality test:** Negative assertion with materiality framing → **SI, Level 1** +**Why SI and not N/O:** The company is STATING A CONCLUSION about the absence of material incidents. This is a materiality assessment even though it's negative. +**Why not ID:** No incident is described. The paragraph assesses business impact (no material incidents), not incident details. + +--- + +## What We Preserved from v1 + +Not everything changed. The following were validated through 150K annotations, 10-model benchmarks, and human labeling: + +1. **7 content categories** mapped to SEC rule structure — the construct is sound +2. **4 specificity levels** as an ordinal scale — the graduated concept works +3. **IS/NOT list pattern** — the single most effective prompt engineering technique from v1. Lists beat rules for specificity. +4. **Validation step** — "review your facts, remove NOT-list items" catches model self-correction +5. **Materiality assessment vs. speculation** — linguistic bright line, well-calibrated in v3.5 +6. **SPAC/no-operations rule** — resolved cleanly +7. **TP vs RMP distinction** — "who is being assessed?" test works +8. **ID for actual incidents only** — hypothetical language doesn't trigger ID + +These are proven components. v2 changes the boundaries and definitions around them, not the components themselves. diff --git a/docs/LABELING-CODEBOOK-v1.md b/docs/LABELING-CODEBOOK-v1.md new file mode 100644 index 0000000..b591c15 --- /dev/null +++ b/docs/LABELING-CODEBOOK-v1.md @@ -0,0 +1,871 @@ +# Labeling Codebook — SEC Cybersecurity Disclosure Quality + +This codebook is the authoritative reference for all human and GenAI labeling. 
Every annotator (human or model) must follow these definitions exactly. The LLM system prompt is generated directly from this document. + +--- + +## Classification Design + +**Unit of analysis:** One paragraph from an SEC filing (Item 1C of 10-K, or Item 1.05/8.01/7.01 of 8-K). + +**Classification type:** Multi-class (single-label), NOT multi-label. Each paragraph receives exactly one content category. + +**Each paragraph receives two labels:** +1. **Content Category** — single-label, one of 7 mutually exclusive classes +2. **Specificity Level** — ordinal integer 1-4 + +**None/Other policy:** Required. Since this is multi-class (not multi-label), we need a catch-all for paragraphs that don't fit the 6 substantive categories. A paragraph receives None/Other when it contains no cybersecurity-specific disclosure content (e.g., forward-looking statement disclaimers, section headers, general business language). + +--- + +## Dimension 1: Content Category + +Each paragraph is assigned exactly **one** content category. If a paragraph spans multiple categories, assign the **dominant** category — the one that best describes the paragraph's primary communicative purpose. + +### Board Governance + +- **SEC basis:** Item 106(c)(1) +- **Covers:** Board or committee oversight of cybersecurity risks, briefing frequency, board member cybersecurity expertise +- **Key markers:** "Audit Committee," "Board of Directors oversees," "quarterly briefings," "board-level expertise," "board committee" +- **Assign when:** The grammatical subject performing the primary action is the board or a board committee + +**Example texts:** + +> *"The Board of Directors oversees the Company's management of cybersecurity risks. 
The Board has delegated oversight of cybersecurity and data privacy matters to the Audit Committee."* +> → Board Governance, Specificity 3 (names Audit Committee — firm-specific delegation) + +> *"Our Board of Directors recognizes the critical importance of maintaining the trust and confidence of our customers and stakeholders, and cybersecurity risk is an area of increasing focus for our Board."* +> → Board Governance, Specificity 1 (could apply to any company — generic statement of intent) + +> *"The Audit Committee, which includes two members with significant technology and cybersecurity expertise, receives quarterly reports from the CISO and conducts an annual deep-dive review of the Company's cybersecurity program, threat landscape, and incident response readiness."* +> → Board Governance, Specificity 3 (names specific committee, describes specific briefing cadence and scope) + +### Management Role + +- **SEC basis:** Item 106(c)(2) +- **Covers:** The specific *person* filling a cybersecurity leadership position: their name, qualifications, career history, credentials, tenure, reporting lines, management committees responsible for cybersecurity +- **Key markers:** "Chief Information Security Officer," "reports to," "years of experience," "management committee," "CISSP," "CISM," named individuals, career background +- **Assign when:** The paragraph tells you something about *who the person is* — their background, credentials, experience, or reporting structure. A paragraph that names a CISO/CIO/CTO and then describes what the cybersecurity *program* does is NOT Management Role — it is Risk Management Process with an incidental role attribution. The test is whether the paragraph is about the **person** or about the **function**. 
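The person-vs-function test above is ultimately a judgment call, but its clear-cut branches can be sketched as a toy keyword heuristic. Everything in this sketch — the pattern lists, the function name, the keyword choices — is an illustrative assumption, not codebook content; it models the test's logic and is not a substitute for reading the paragraph.

```python
# Toy sketch of the person-vs-function test (assumed keyword lists, not codebook rules).
import re

# Person-specific content: credentials, tenure, team size, career history.
PERSON_PATTERNS = [
    r"\b(CISSP|CISM|CISA)\b",
    r"\b\d+\+? years? of experience\b",
    r"\bteam of \d+\b",
    r"\bprior to joining\b|\bpreviously served\b",
]

# Function/program content: activities, tools, frameworks.
PROCESS_PATTERNS = [
    r"\brisk assessments?\b",
    r"\bvulnerability (?:scanning|management)\b",
    r"\bpenetration testing\b",
    r"\bincident response\b",
    r"\bNIST\b",
    r"\bISO 27001\b",
]

ROLE_TITLES = r"\b(?:CISO|CIO|CTO|Chief Information Security Officer|VP of (?:Information )?Security)\b"


def person_vs_function(paragraph: str) -> str:
    """Return 'Management Role' or 'Risk Management Process' for the easy cases."""
    person_hits = sum(bool(re.search(p, paragraph, re.I)) for p in PERSON_PATTERNS)
    process_hits = sum(bool(re.search(p, paragraph, re.I)) for p in PROCESS_PATTERNS)
    if process_hits and not person_hits:
        # Person-removal test: the process content stands on its own,
        # so any role title is incidental attribution.
        return "Risk Management Process"
    if person_hits:
        # Qualifications tiebreaker: credentials/experience/team size present.
        return "Management Role"
    # Bare title with no person or process detail: treat the title as the substance.
    return "Management Role" if re.search(ROLE_TITLES, paragraph, re.I) else "Risk Management Process"
```

Note the limits of the sketch: it reproduces the codebook's "CISO oversees the program..." (→ RMP) and "CISO has 20 years of experience..." (→ MR) examples, but it will misfire on paragraphs like "Management is responsible for assessing cybersecurity risks," which the codebook labels Management Role. Real annotation requires the full person-removal judgment, not keyword matching.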
+ +**The person-vs-function test:** If you removed the role holder's name, title, qualifications, and background from the paragraph and the remaining content still describes substantive cybersecurity activities, processes, or oversight → the paragraph is about the function (Risk Management Process), not the person (Management Role). Management Role requires the person's identity or credentials to be the primary content, not just a brief attribution of who runs the program. + +**Example texts:** + +> *"Our Vice President of Information Security, who holds CISSP and CISM certifications and has over 20 years of experience in cybersecurity, reports directly to our Chief Information Officer and is responsible for leading our cybersecurity program."* +> → Management Role, Specificity 3 — The paragraph is about the person: their credentials, experience, and reporting line. (named role, certifications, reporting line — all firm-specific) + +> *"Management is responsible for assessing and managing cybersecurity risks within the organization."* +> → Management Role, Specificity 1 (generic, no named roles or structure) + +> *"Our CISO, Sarah Chen, leads a dedicated cybersecurity team of 35 professionals and presents monthly threat briefings to the executive leadership team. Ms. Chen joined the Company in 2019 after serving as Deputy CISO at a Fortune 100 financial services firm."* +> → Management Role, Specificity 4 — The paragraph is about the person: their name, team size, background, prior role. (named individual, team size, specific frequency, prior employer — multiple verifiable facts) + +> *"Our CISO oversees the Company's cybersecurity program, which includes risk assessments, vulnerability scanning, penetration testing, and incident response planning aligned with the NIST CSF framework."* +> → **Risk Management Process**, NOT Management Role — The CISO is mentioned once as attribution, but the paragraph is about what the program does. 
Remove "Our CISO oversees" and the paragraph still makes complete sense as a process description. + +### Risk Management Process + +- **SEC basis:** Item 106(b) +- **Covers:** Risk assessment methodology, framework adoption (NIST, ISO, etc.), vulnerability management, monitoring, incident response planning, tabletop exercises, ERM integration +- **Key markers:** "NIST CSF," "ISO 27001," "risk assessment," "vulnerability management," "tabletop exercises," "incident response plan," "SOC," "SIEM" +- **Assign when:** The paragraph primarily describes the company's internal cybersecurity processes, tools, or methodologies + +**Example texts:** + +> *"We maintain a cybersecurity risk management program that is integrated into our overall enterprise risk management framework. Our program is designed to identify, assess, and manage material cybersecurity risks to our business."* +> → Risk Management Process, Specificity 1 (generic, could apply to any company) + +> *"Our cybersecurity program is aligned with the NIST Cybersecurity Framework and incorporates elements of ISO 27001. We conduct regular risk assessments, vulnerability scanning, and penetration testing as part of our continuous monitoring approach."* +> → Risk Management Process, Specificity 2 (names frameworks but no firm-specific detail) + +> *"We operate a 24/7 Security Operations Center that uses Splunk SIEM and CrowdStrike Falcon endpoint detection. 
Our incident response team conducts quarterly tabletop exercises simulating ransomware, supply chain compromise, and insider threat scenarios."* +> → Risk Management Process, Specificity 4 (named tools, named vendor, specific exercise frequency and scenarios — verifiable) + +### Third-Party Risk + +- **SEC basis:** Item 106(b) +- **Covers:** Vendor/supplier risk oversight, external assessor engagement, contractual security requirements, supply chain risk management +- **Key markers:** "third-party," "service providers," "vendor risk," "external auditors," "supply chain," "SOC 2 report," "contractual requirements" +- **Assign when:** The central topic is oversight of external parties' cybersecurity, not the company's own internal processes + +**Example texts:** + +> *"We face cybersecurity risks associated with our use of third-party service providers who may have access to our systems and data."* +> → Third-Party Risk, Specificity 1 (generic risk statement) + +> *"Our vendor risk management program requires all third-party service providers with access to sensitive data to meet minimum security standards, including SOC 2 Type II certification or equivalent third-party attestation."* +> → Third-Party Risk, Specificity 2 (names SOC 2 standard but no firm-specific detail about which vendors or how many) + +> *"We assessed 312 vendors in fiscal 2024 through our Third-Party Risk Management program. All Tier 1 vendors (those with access to customer PII or financial data) are required to provide annual SOC 2 Type II reports. 
In fiscal 2024, 14 vendors were placed on remediation plans and 3 vendor relationships were terminated for non-compliance."* +> → Third-Party Risk, Specificity 4 (specific numbers, specific actions, specific criteria — all verifiable) + +### Incident Disclosure + +- **SEC basis:** 8-K Item 1.05 (and 8.01/7.01 post-May 2024) +- **Covers:** Description of cybersecurity incidents — nature, scope, timing, impact assessment, remediation actions, ongoing investigation +- **Key markers:** "unauthorized access," "detected," "incident," "remediation," "impacted," "forensic investigation," "breach," "compromised" +- **Assign when:** The paragraph primarily describes what happened in a cybersecurity incident + +**Example texts:** + +> *"We have experienced, and may in the future experience, cybersecurity incidents that could have a material adverse effect on our business, results of operations, and financial condition."* +> → Incident Disclosure, Specificity 1 (hypothetical — no actual incident described. Note: if this appears in Item 1C rather than an 8-K, consider None/Other instead since it's generic risk language) + +> *"On January 15, 2024, we detected unauthorized access to our customer support portal. The threat actor exploited a known vulnerability in a third-party software component. Upon detection, we activated our incident response plan, contained the intrusion, and engaged Mandiant for forensic investigation."* +> → Incident Disclosure, Specificity 4 (specific date, specific system, named forensic firm, specific attack vector — all verifiable) + +> *"In December 2023, the Company experienced a cybersecurity incident involving unauthorized access to certain internal systems. 
The Company promptly took steps to contain and remediate the incident, including engaging third-party cybersecurity experts."* +> → Incident Disclosure, Specificity 3 (specific month, specific action — but no named firms or quantified impact) + +### Strategy Integration + +- **SEC basis:** Item 106(b)(2) +- **Covers:** Material impact (or lack thereof) on business strategy or financials, cybersecurity insurance, investment/resource allocation, cost of incidents +- **Key markers:** "business strategy," "insurance," "investment," "material," "financial condition," "budget," "not materially affected," "results of operations" +- **Assign when:** The paragraph primarily discusses business/financial consequences or strategic response to cyber risk, not the risk management activities themselves +- **Includes materiality ASSESSMENTS:** A materiality assessment is the company stating a conclusion about whether cybersecurity has or will affect business outcomes. Backward-looking ("have not materially affected"), forward-looking with SEC qualifier ("reasonably likely to materially affect"), and negative assertions ("have not experienced material incidents") are all assessments → SI. Generic risk warnings ("could have a material adverse effect") are NOT assessments — they are boilerplate speculation that appears in every 10-K → classify by primary content. "Material" as an adjective ("managing material risks") is also not an assessment. + +**Example texts:** + +> *"Cybersecurity risks, including those described above, have not materially affected, and are not reasonably likely to materially affect, our business strategy, results of operations, or financial condition."* +> → Strategy Integration, Specificity 1 (boilerplate materiality statement — nearly identical language appears across thousands of filings, but it IS a materiality assessment) + +> *"We have not identified any cybersecurity incidents or threats that have materially affected us. 
For more information, see Item 1A, Risk Factors."* +> → Strategy Integration, Specificity 1 — The materiality assessment is the substantive content. The cross-reference is noise and does not pull the paragraph to None/Other. + +> *"We maintain cybersecurity insurance coverage as part of our overall risk management strategy to help mitigate potential financial losses from cybersecurity incidents."* +> → Strategy Integration, Specificity 2 (mentions insurance but no specifics) + +> *"We increased our cybersecurity budget by 32% to $45M in fiscal 2024, representing 0.8% of revenue. We maintain cyber liability insurance with $100M in aggregate coverage through AIG and Chubb, with a $5M deductible per incident."* +> → Strategy Integration, Specificity 4 (dollar amounts, percentages, named insurers, specific deductible — all verifiable) + +### None/Other + +- **Covers:** Forward-looking statement disclaimers, section headers, cross-references to other filing sections, general business language that mentions cybersecurity incidentally, text erroneously extracted from outside Item 1C/1.05 +- **No specificity scoring needed:** Always assign Specificity 1 for None/Other paragraphs (since there is no cybersecurity disclosure to rate) +- **SPACs and shell companies:** Companies that explicitly state they have no operations, no cybersecurity program, or no formal processes receive None/Other regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program. Paragraphs like "We have not adopted any cybersecurity risk management program. Our board is generally responsible for oversight" are None/Other — the board mention is perfunctory, not substantive governance disclosure. +- **Distinguishing from Strategy Integration:** A pure cross-reference ("See Item 1A, Risk Factors") with no materiality assessment is None/Other. 
But if the paragraph includes an explicit materiality conclusion ("have not materially affected our business strategy"), it becomes Strategy Integration even if a cross-reference is also present. The test: does the paragraph make a substantive claim about cybersecurity's impact on the business? If yes → Strategy Integration. If it only points elsewhere → None/Other. + +**Example texts:** + +> *"This Annual Report on Form 10-K contains forward-looking statements within the meaning of Section 27A of the Securities Act of 1933, as amended, and Section 21E of the Securities Exchange Act of 1934, as amended."* +> → None/Other, Specificity 1 + +> *"Item 1C. Cybersecurity"* +> → None/Other, Specificity 1 (section header only) + +> *"For additional information about risks related to our information technology systems, see Part I, Item 1A, 'Risk Factors.'"* +> → None/Other, Specificity 1 (cross-reference, no disclosure content) + +> *"We are a special purpose acquisition company with no business operations. We have not adopted any cybersecurity risk management program. Our board of directors is generally responsible for oversight of cybersecurity risks, if any."* +> → None/Other, Specificity 1 — No substantive disclosure. The board mention is incidental; the company explicitly has no program to disclose. + +> *"We do not consider that we face significant cybersecurity risk and have not adopted any formal processes for assessing cybersecurity risk."* +> → None/Other, Specificity 1 — Absence of a program is not a program description. + +--- + +## Category Decision Rules + +### Rule 1: Dominant Category +If a paragraph spans multiple categories, assign the one whose topic occupies the most text or is the paragraph's primary communicative purpose. + +### Rule 2: Board vs. Management (the board-line test) + +**Core principle:** The governance hierarchy has distinct layers — board/committee oversight at the top, management execution below. 
The paragraph's category depends on which layer is the primary focus. + +| Layer | Category | Key signals | +|-------|----------|-------------| +| Board/committee directing, receiving reports, or overseeing | Board Governance | "Board oversees," "Committee reviews," "reports to the Board" (board is recipient) | +| Named officer's qualifications, responsibilities, reporting lines | Management Role | "CISO has 20 years experience," "responsible for," credentials | +| Program/framework/controls described | Risk Management Process | "program is designed to," "framework includes," "controls aligned with" | + +**When a paragraph spans layers** (governance chain paragraphs): apply the **purpose test** — what is the paragraph's communicative purpose? + +- **Purpose = describing oversight/reporting structure** (who reports to whom, briefing cadence, committee responsibilities, how information flows to the board) → **Board Governance**, even if officers appear as grammatical subjects. The officers are intermediaries in the governance chain, not the focus. +- **Purpose = describing who a person is** (qualifications, credentials, experience, career history) → **Management Role**. +- **Governance-chain paragraphs are almost always Board Governance.** They become Management Role ONLY when the officer's personal qualifications/credentials are the dominant content. + +| Signal | Category | +|--------|----------| +| Board/committee is the grammatical subject | Board Governance | +| Board delegates responsibility to management | Board Governance | +| Management role reports TO the board (describing reporting structure) | Board Governance (the purpose is describing how oversight works) | +| Management role's qualifications, experience, credentials described | Management Role | +| "Board oversees... CISO reports to Board quarterly" | Board Governance (oversight structure) | +| "CISO reports quarterly to the Board on..." 
| Board Governance (reporting structure, not about who the CISO is) | +| "The CISO has 20 years of experience and reports to the CIO" | Management Role (person's qualifications are the content) | +| Governance overview spanning board → committee → officer → program | **Board Governance** (purpose is describing the structure) | + +### Rule 2b: Management Role vs. Risk Management Process (three-step decision chain) + +This is the single most common source of annotator disagreement. Apply the following tests in order — stop at the first decisive result. + +**Step 1 — Subject test:** What is the paragraph's grammatical subject? +- Clear process/framework/program as subject with no person detail → **Risk Management Process**. Stop. +- Person/role as subject → this is a **signal**, not decisive. Always continue to Step 2. Many SEC disclosures name an officer then describe the program — Step 2 determines which is the actual content. + +**Step 2 — Person-removal test:** Could you delete all named roles, titles, qualifications, experience descriptions, and credentials from the paragraph and still have a coherent cybersecurity disclosure? +- **YES** → **Risk Management Process** (the process stands on its own; people are incidental) +- **NO** → **Management Role** (the paragraph is fundamentally about who these people are) +- Borderline → continue to Step 3 + +**Step 3 — Qualifications tiebreaker:** Does the paragraph include experience (years), certifications (CISSP, CISM), education, team size, or career history for named individuals? 
+- **YES** → **Management Role** (qualifications are MR-specific content; the SEC requires management role disclosure specifically because investors want to know WHO is responsible) +- **NO** → **Risk Management Process** (no person-specific content beyond a title attribution) + +| Signal | Category | +|--------|----------| +| The person's background, credentials, tenure, experience, education, career history | Management Role | +| The person's name is given | Management Role (strong signal) | +| Reporting lines as primary content (who reports to whom, management committee structure) | Management Role | +| Role title mentioned as attribution ("Our CISO oversees...") followed by process description | **Risk Management Process** | +| Activities, tools, methodologies, frameworks as the primary content | **Risk Management Process** | +| The paragraph would still make sense if you removed the role title and replaced it with "the Company" | **Risk Management Process** | + +**Key principle:** Naming a cybersecurity leadership title (CISO, CIO, CTO, VP of Security) does not make a paragraph Management Role. The title is often an incidental attribution — the paragraph names who is responsible then describes what the program does. If the paragraph's substantive content is about processes, activities, or tools, it is Risk Management Process regardless of how many times a role title appears. Management Role requires the paragraph's content to be about the *person* — who they are, what makes them qualified, how long they've served, what their background is. + +### Rule 3: Risk Management vs. 
Third-Party +| Signal | Category | +|--------|----------| +| Company's own internal processes, tools, teams | Risk Management Process | +| Third parties mentioned as ONE component of internal program | Risk Management Process | +| Vendor oversight is the CENTRAL topic | Third-Party Risk | +| External assessor hired to test the company | Risk Management Process (they serve the company) | +| Requirements imposed ON vendors | Third-Party Risk | + +### Rule 4: Incident vs. Strategy +| Signal | Category | +|--------|----------| +| Describes what happened (timeline, scope, response) | Incident Disclosure | +| Describes business impact of an incident (costs, revenue, insurance claim) | Strategy Integration | +| Mixed: "We detected X... at a cost of $Y" | Assign based on which is dominant — if cost is one sentence in a paragraph about the incident → Incident Disclosure | + +### Rule 5: None/Other Threshold +Assign None/Other ONLY when the paragraph contains no substantive cybersecurity disclosure content. If a paragraph mentions cybersecurity even briefly in service of a disclosure obligation, assign the relevant content category. + +**Exception — SPACs and no-operations companies:** A paragraph that explicitly states the company has no cybersecurity program, no operations, or no formal processes is None/Other even if it perfunctorily mentions board oversight or risk acknowledgment. The absence of a program is not substantive disclosure. + +### Rule 6: Materiality Language → Strategy Integration +Any paragraph that explicitly connects cybersecurity to business materiality is **Strategy Integration** — regardless of tense, mood, or how generic the language is. 
This includes: + +- **Backward-looking assessments:** "have not materially affected our business strategy, results of operations, or financial condition" +- **Forward-looking assessments with SEC qualifier:** "are reasonably likely to materially affect," "if realized, are reasonably likely to materially affect" +- **Negative assertions with materiality framing:** "we have not experienced any material cybersecurity incidents" + +**The test:** Is the company STATING A CONCLUSION about materiality? + +- "Risks have not materially affected our business strategy" → YES, conclusion → SI +- "Risks are reasonably likely to materially affect us" → YES, forward-looking conclusion → SI +- "Risks could have a material adverse effect on our business" → NO, speculation → not SI (classify by primary content) +- "Managing material risks associated with cybersecurity" → NO, adjective → not SI + +The key word is "reasonably likely" — that's the SEC's Item 106(b)(2) threshold for forward-looking materiality. Bare "could" is speculation, not an assessment. + +**Why this is SI and not N/O:** The company is fulfilling its SEC Item 106(b)(2) obligation to assess whether cyber risks affect business strategy. The fact that the language is generic makes it Specificity 1, not None/Other. Category captures WHAT the paragraph discloses (a materiality assessment); specificity captures HOW specific that disclosure is (generic boilerplate = Spec 1). + +**What remains N/O:** A cross-reference is N/O even if it contains materiality language — "For a description of the risks from cybersecurity threats that may materially affect the Company, see Item 1A" is N/O because the paragraph's purpose is pointing the reader elsewhere, not making an assessment. The word "materially" here describes what Item 1A discusses, not the company's own conclusion. 
Also N/O: generic IT-dependence language ("our IT systems are important to operations") with no materiality claim, and forward-looking boilerplate about risks generally without invoking materiality ("we face various risks"). + +**The distinction:** "Risks that may materially affect us — see Item 1A" = N/O (cross-reference). "Risks have not materially affected us. See Item 1A" = SI (the first sentence IS an assessment). The test is whether the company is MAKING a materiality conclusion vs DESCRIBING what another section covers. + +--- + +## Borderline Cases + +### Case 1: Framework mention + firm-specific fact +> *"We follow NIST CSF and our CISO oversees the program."* + +The NIST mention → Level 2 anchor. The CISO reference → firm-specific. **Apply boundary rule 2→3: "Does it mention anything unique to THIS company?" Yes (CISO role exists at this company) → Level 3.** + +### Case 2: Named role but generic description +> *"Our Chief Information Security Officer is responsible for managing cybersecurity risks."* + +Names a role (CISO) → potentially Level 3. But the description is completely generic. **Apply judgment: the mere existence of a CISO title is firm-specific (not all companies have one). → Level 3.** If the paragraph said "a senior executive is responsible" without naming the role → Level 1. + +### Case 3: Specificity-rich None/Other +> *"On March 15, 2025, we filed a Current Report on Form 8-K disclosing a cybersecurity incident. For details, see our Form 8-K filed March 15, 2025, accession number 0001193125-25-012345."* + +Contains specific dates and filing numbers, but the paragraph itself contains no disclosure content — it's a cross-reference. → **None/Other, Specificity 1.** Specificity only applies to disclosure substance, not to metadata. + +### Case 4: Hypothetical incident language in 10-K +> *"We may experience cybersecurity incidents that could disrupt our operations."* + +This appears in Item 1C, not an 8-K. It describes no actual incident. 
→ **Risk Management Process or Strategy Integration (depending on context), NOT Incident Disclosure.** Incident Disclosure is reserved for descriptions of events that actually occurred. + +### Case 5: Dual-category paragraph +> *"The Audit Committee oversees our cybersecurity program, which is led by our CISO who holds CISSP certification and reports quarterly to the Committee."* + +Board (Audit Committee oversees) + Management (CISO qualifications, reporting). The opening clause sets the frame: this is about the Audit Committee's oversight, and the CISO detail is subordinate. → **Board Governance, Specificity 3.** + +### Case 6: Management Role vs. Risk Management Process — the person-vs-function test +> *"Our CISO oversees the Company's cybersecurity program, which includes risk assessments, vulnerability scanning, and incident response planning. The program is aligned with the NIST CSF framework and integrated into our enterprise risk management process."* + +The CISO is named as attribution, but the paragraph is about what the program does — assessments, scanning, response planning, framework alignment, ERM integration. Remove "Our CISO oversees" and it still makes complete sense as a process description. → **Risk Management Process, Specificity 2** (NIST CSF framework, no firm-specific facts beyond that). + +> *"Our CISO has over 20 years of experience in cybersecurity and holds CISSP and CISM certifications. She reports directly to the CIO and oversees a team of 12 security professionals. Prior to joining the Company in 2019, she served as VP of Security at a Fortune 500 technology firm."* + +The entire paragraph is about the person: experience, certifications, reporting line, team size, tenure, prior role. → **Management Role, Specificity 4** (years of experience + team headcount + named certifications = multiple QV-eligible facts). + +### Case 7: Materiality disclaimer — Strategy Integration vs. 
None/Other +> *"We have not identified any cybersecurity incidents or threats that have materially affected our business strategy, results of operations, or financial condition. However, like other companies, we have experienced threats from time to time. For more information, see Item 1A, Risk Factors."* + +Contains an explicit materiality assessment ("materially affected... business strategy, results of operations, or financial condition"). The cross-reference and generic threat mention are noise. → **Strategy Integration, Specificity 1.** + +> *"For additional information about risks related to our information technology systems, see Part I, Item 1A, 'Risk Factors.'"* + +No materiality assessment. Pure cross-reference. → **None/Other, Specificity 1.** + +### Case 8: SPAC / no-operations company +> *"We are a special purpose acquisition company with no business operations. We have not adopted any cybersecurity risk management program or formal processes. Our Board of Directors is generally responsible for oversight of cybersecurity risks, if any. We have not encountered any cybersecurity incidents since our IPO."* + +Despite touching RMP (no program), Board Governance (board is responsible), and Strategy Integration (no incidents), the paragraph contains no substantive disclosure. The company explicitly has no program, and the board mention is perfunctory ("generally responsible... if any"). The absence of a program is not a program description. → **None/Other, Specificity 1.** + +### Case 9: Materiality language — assessment vs. speculation (v3.5 revision) +> *"We face risks from cybersecurity threats that, if realized and material, are reasonably likely to materially affect us, including our operations, business strategy, results of operations, or financial condition."* + +The phrase "reasonably likely to materially affect" is the SEC's Item 106(b)(2) qualifier — this is a forward-looking materiality **assessment**, not speculation. 
→ **Strategy Integration, Specificity 1.** + +> *"We have not identified any risks from cybersecurity threats that have materially affected or are reasonably likely to materially affect the Company."* + +Backward-looking negative assertion + SEC-qualified forward-looking assessment. → **Strategy Integration, Specificity 1.** + +> *"Information systems can be vulnerable to a range of cybersecurity threats that could potentially have a material impact on our business strategy, results of operations and financial condition."* + +Despite mentioning "material impact" and "business strategy," the operative verb is "could" — this is boilerplate **speculation** present in virtually every 10-K risk factor section. The company is not stating a conclusion about whether cybersecurity HAS or IS REASONABLY LIKELY TO affect them; it is describing a hypothetical. → **None/Other, Specificity 1.** (Per Rule 6: "could have a material adverse effect" = speculation, not assessment.) + +> *"We face various risks related to our IT systems."* + +No materiality language, no connection to business strategy/financial condition. This is generic IT-dependence language. → **None/Other, Specificity 1.** + +**The distinction:** "reasonably likely to materially affect" (SEC qualifier, forward-looking assessment) ≠ "could potentially have a material impact" (speculation). The former uses the SEC's required assessment language; the latter uses conditional language that every company uses regardless of actual risk. + +### Case 10: Generic regulatory compliance language +> *"Regulatory Compliance: The Company is subject to various regulatory requirements related to cybersecurity, data protection, and privacy. Non-compliance with these regulations could result in financial penalties, legal liabilities, and reputational damage."* + +This acknowledges that regulations exist and non-compliance would be bad — a truism for every public company. 
It does not describe any process, program, or framework the company uses to comply. It does not make a materiality assessment. It names no specific regulation. → **None/Other, Specificity 1.** + +The key distinctions: +- If the paragraph names a specific regulation (GDPR, HIPAA, PCI DSS, CCPA) but still describes no company-specific program → **Risk Management Process, Specificity 2** (named standard triggers Sector-Adapted) +- If the paragraph assesses whether regulatory non-compliance has "materially affected" the business → **Strategy Integration** (materiality assessment per Rule 6) +- If the paragraph describes what the company *does* to comply (audits, controls, certifications) → **Risk Management Process** at appropriate specificity + +--- + +## Dimension 2: Specificity Level + +Each paragraph receives a specificity level (1-4) indicating how company-specific the disclosure is. Apply the decision test in order — stop at the first "yes." + +### Decision Test + +1. **Count hard verifiable facts ONLY** (specific dates, dollar amounts, headcounts/percentages, named third-party firms, named products/tools, named certifications). TWO or more? → **Quantified-Verifiable (4)** +2. **Does it contain at least one fact from the IS list below?** → **Firm-Specific (3)** +3. **Does it name a recognized standard** (NIST, ISO 27001, SOC 2, CIS, GDPR, PCI DSS, HIPAA)? → **Sector-Adapted (2)** +4. **None of the above?** → **Generic Boilerplate (1)** + +None/Other paragraphs always receive Specificity 1. + +### Level Definitions + +| Level | Name | Description | +|-------|------|-------------| +| 1 | Generic Boilerplate | Could paste into any company's filing unchanged. No named entities, frameworks, roles, dates, or specific details. | +| 2 | Sector-Adapted | Names a specific recognized standard (NIST, ISO 27001, SOC 2, etc.) but contains nothing unique to THIS company. 
General practices (pen testing, vulnerability scanning, tabletop exercises) do NOT qualify — only named standards. | +| 3 | Firm-Specific | Contains at least one fact from the IS list that identifies something unique to THIS company's disclosure. | +| 4 | Quantified-Verifiable | Contains TWO or more hard verifiable facts (see QV-eligible list). One fact = Firm-Specific, not QV. | + +### ✓ IS a Specific Fact (any ONE → at least Firm-Specific) + +- **Cybersecurity-specific titles:** CISO, CTO, CIO, VP of IT/Security, Information Security Officer, Director of IT Security, HSE Director overseeing cybersecurity, Chief Digital Officer (when overseeing cyber), Cybersecurity Director +- **Named non-generic committees:** Technology Committee, Cybersecurity Committee, Risk Committee, ERM Committee (NOT "Audit Committee" — that exists at every public company) +- **Specific team/department compositions:** "Legal, Compliance, and Finance" (but NOT just "a cross-functional team") +- **Specific dates:** "In December 2023", "On May 6, 2024", "fiscal 2025" +- **Named internal programs with unique identifiers:** "Cyber Incident Response Plan (CIRP)" (must have a distinguishing name/abbreviation — generic "incident response plan" does not qualify) +- **Named products, systems, tools:** Splunk, CrowdStrike Falcon, Azure Sentinel, ServiceNow +- **Named third-party firms:** Mandiant, Deloitte, CrowdStrike, PwC +- **Specific numbers:** headcounts, dollar amounts, percentages, exact durations ("17 years", "12 professionals") +- **Certification claims:** "We maintain ISO 27001 certification" (holding a certification is more than naming a standard) +- **Named universities in credential context:** "Ph.D. 
from Princeton University" (independently verifiable) + +### ✗ IS NOT a Specific Fact (do NOT use to justify Firm-Specific) + +- **Generic governance:** "the Board", "Board of Directors", "management", "Audit Committee", "the Committee" +- **Generic C-suite:** CEO, CFO, COO, President, General Counsel — these exist at every company and are not cybersecurity-specific +- **Generic IT leadership (NOT cybersecurity-specific):** "Head of IT", "IT Manager", "Director of IT", "Chief Compliance Officer", "Associate Vice President of IT" — these are general corporate/IT titles, not cybersecurity roles per the IS list +- **Unnamed entities:** "third-party experts", "external consultants", "cybersecurity firms", "managed service provider" +- **Generic cadences:** "quarterly", "annual", "periodic", "regular" — without exact dates +- **Boilerplate phrases:** "cybersecurity risks", "material adverse effect", "business operations", "financial condition" +- **Standard incident language:** "forensic investigation", "law enforcement", "regulatory obligations", "incident response protocols" +- **Vague quantifiers:** "certain systems", "some employees", "a number of", "a portion of" +- **Common practices:** "penetration testing", "vulnerability scanning", "tabletop exercises", "phishing simulations", "security awareness training" +- **Generic program names:** "incident response plan", "business continuity plan", "cybersecurity program", "Third-Party Risk Management Program", "Company-wide training" — no unique identifier or distinguishing abbreviation +- **Company self-references:** the company's own name, "the Company", "the Bank", subsidiary names, filing form types +- **Company milestones:** "since our IPO", "since inception" — not cybersecurity facts + +### QV-Eligible Facts (count toward the 2-fact threshold for Quantified-Verifiable) + +✓ Specific dates (month+year or exact date) +✓ Dollar amounts, headcounts, percentages +✓ Named third-party firms (Mandiant, CrowdStrike, 
Deloitte) +✓ Named products/tools (Splunk, Azure Sentinel) +✓ Named certifications held by individuals (CISSP, CISM, CEH) +✓ Years of experience as a specific number ("17 years", "over 20 years") +✓ Named universities in credential context + +**Do NOT count toward QV** (these trigger Firm-Specific but not QV): +✗ Named roles (CISO, CIO) +✗ Named committees +✗ Named frameworks (NIST, ISO 27001) — these trigger Sector-Adapted +✗ Team compositions, reporting structures +✗ Named internal programs +✗ Generic degrees without named university ("BS in Management") + +### Validation Step + +Before finalizing specificity, review the extracted facts. Remove any that appear on the NOT list. If no facts remain after filtering → Generic Boilerplate (or Sector-Adapted if a named standard is present). Do not let NOT-list items inflate the specificity rating. + +--- + +## LLM Response Schema + +The exact Zod schema passed to `generateObject`. This is the contract between the LLM and our pipeline. + +```typescript +import { z } from "zod"; + +export const ContentCategory = z.enum([ + "Board Governance", + "Management Role", + "Risk Management Process", + "Third-Party Risk", + "Incident Disclosure", + "Strategy Integration", + "None/Other", +]); + +export const SpecificityLevel = z.union([ + z.literal(1), + z.literal(2), + z.literal(3), + z.literal(4), +]); + +export const Confidence = z.enum(["high", "medium", "low"]); + +export const LabelOutput = z.object({ + content_category: ContentCategory + .describe("The single most applicable content category for this paragraph"), + specificity_level: SpecificityLevel + .describe("1=generic boilerplate, 2=sector-adapted, 3=firm-specific, 4=quantified-verifiable"), + category_confidence: Confidence + .describe("high=clear-cut, medium=some ambiguity, low=genuinely torn between categories"), + specificity_confidence: Confidence + .describe("high=clear-cut, medium=borderline adjacent levels, low=could argue for 2+ levels"), + reasoning: 
z.string() + .describe("Brief 1-2 sentence justification citing specific evidence from the text"), +}); +``` + +**Output example:** +```json +{ + "content_category": "Risk Management Process", + "specificity_level": 3, + "category_confidence": "high", + "specificity_confidence": "medium", + "reasoning": "Names NIST CSF (sector-adapted) and describes quarterly tabletop exercises specific to this company's program, pushing to firm-specific. Specificity borderline 2/3 — tabletop exercises could be generic or firm-specific depending on interpretation." +} +``` + +--- + +## System Prompt + +> **Note:** The system prompt below is the v1.0 template from the initial codebook. The production Stage 1 prompt is **v2.5** (in `ts/src/label/prompts.ts`), which incorporates the IS/NOT lists, calibration examples, validation step, and decision test from this codebook. The Stage 2 judge prompt (`buildJudgePrompt()` in the same file) adds dynamic disambiguation rules and confidence calibration. **This codebook is the source of truth; the prompt mirrors it.** + +The v1.0 template is preserved below for reference. See `ts/src/label/prompts.ts` for the current production prompt. + +``` +You are an expert annotator classifying paragraphs from SEC cybersecurity disclosures (Form 10-K Item 1C and Form 8-K Item 1.05 filings) under SEC Release 33-11216. + +For each paragraph, assign exactly two labels: + +(a) content_category — the single most applicable category: + - "Board Governance": Board/committee oversight of cyber risk, briefing cadence, board member cyber expertise. SEC basis: Item 106(c)(1). + - "Management Role": CISO/CTO/CIO identification, qualifications, reporting lines, management committees. SEC basis: Item 106(c)(2). + - "Risk Management Process": Risk assessment methods, framework adoption (NIST, ISO), vulnerability management, monitoring, incident response planning, tabletop exercises, ERM integration. SEC basis: Item 106(b). 
+ - "Third-Party Risk": Vendor/supplier security oversight, external assessor requirements, contractual security standards, supply chain risk. SEC basis: Item 106(b). + - "Incident Disclosure": Description of actual cybersecurity incidents — nature, scope, timing, impact, remediation. SEC basis: 8-K Item 1.05. + - "Strategy Integration": Material impact on business strategy/financials, cyber insurance, investment/resource allocation. SEC basis: Item 106(b)(2). + - "None/Other": Forward-looking disclaimers, section headers, cross-references, non-cybersecurity content. + +If a paragraph spans multiple categories, assign the DOMINANT one — the category that best describes the paragraph's primary communicative purpose. + +(b) specificity_level — integer 1 through 4: + 1 = Generic Boilerplate: Could apply to any company unchanged. Conditional language ("may," "could"). No named entities or frameworks. + 2 = Sector-Adapted: Names frameworks/standards (NIST, ISO, SOC 2) or industry-specific terms, but nothing unique to THIS company. + 3 = Firm-Specific: Contains at least one fact unique to this company — named roles, specific committees, concrete reporting lines, named programs. + 4 = Quantified-Verifiable: Two or more verifiable facts — dollar amounts, dates, headcounts, percentages, named third-party firms, audit results. + +BOUNDARY RULES (apply when torn between adjacent levels): + 1 vs 2: "Does it name ANY framework, standard, or industry-specific term?" → Yes = 2 + 2 vs 3: "Does it mention anything unique to THIS company?" → Yes = 3 + 3 vs 4: "Does it contain TWO OR MORE independently verifiable facts?" → Yes = 4 + +SPECIAL RULES: + - None/Other paragraphs always get specificity_level = 1. + - Hypothetical incident language ("we may experience...") in a 10-K is NOT Incident Disclosure. It is Risk Management Process or Strategy Integration. + - Incident Disclosure is only for descriptions of events that actually occurred. 
+ +CONFIDENCE RATINGS (per dimension): + - "high": Clear-cut classification with no reasonable alternative. + - "medium": Some ambiguity, but one option is clearly stronger. + - "low": Genuinely torn between two or more options. +Be honest — overconfident ratings on hard cases are worse than admitting uncertainty. + +Respond with valid JSON matching the required schema. The "reasoning" field should cite specific words or facts from the paragraph that justify your labels (1-2 sentences). +``` + +--- + +## User Prompt Template + +``` +Company: {company_name} ({ticker}) +Filing type: {filing_type} +Filing date: {filing_date} +Section: {sec_item} + +Paragraph: +{paragraph_text} +``` + +--- + +## Stage 2 Judge Prompt + +Used when Stage 1 annotators disagree. The judge sees the paragraph plus all three prior annotations in randomized order. + +``` +You are adjudicating a labeling disagreement among three independent annotators. Each applied the same codebook but reached different conclusions. + +Review all three opinions below, then provide YOUR OWN independent label based on the codebook definitions above. Do not default to majority vote — use your own expert judgment. If you agree with one annotator's reasoning, explain why their interpretation is correct. + +Company: {company_name} ({ticker}) +Filing type: {filing_type} +Filing date: {filing_date} +Section: {sec_item} + +Paragraph: +{paragraph_text} + +--- Prior annotations (randomized order) --- + +Annotator A: content_category="{cat_a}", specificity_level={spec_a} + Reasoning: "{reason_a}" + +Annotator B: content_category="{cat_b}", specificity_level={spec_b} + Reasoning: "{reason_b}" + +Annotator C: content_category="{cat_c}", specificity_level={spec_c} + Reasoning: "{reason_c}" +``` + +--- + +## Cost and Time Tracking + +### Per-Annotation Record + +Every API call produces an `Annotation` record with full provenance: + +```typescript +provenance: { + modelId: string, // OpenRouter model ID e.g. 
"google/gemini-3.1-flash-lite-preview" + provider: string, // Upstream provider e.g. "google", "xai", "anthropic" + generationId: string, // OpenRouter generation ID (from response id field) + stage: "stage1" | "stage2-judge" | "benchmark", + runId: string, // UUID per batch run + promptVersion: string, // "v1.0" — tracks prompt iterations + inputTokens: number, // From usage.prompt_tokens + outputTokens: number, // From usage.completion_tokens + reasoningTokens: number, // From usage.completion_tokens_details.reasoning_tokens + costUsd: number, // REAL cost from OpenRouter usage.cost (not estimated) + latencyMs: number, // Wall clock per request + requestedAt: string, // ISO datetime +} +``` + +### Cost Source + +OpenRouter returns **actual cost** in every response body under `usage.cost` (USD). No estimation needed. Each response also includes a `generationId` (the `id` field) which we store in every annotation record. This enables: +- Audit trail: look up any annotation on OpenRouter's dashboard +- Richer stats via `GET /api/v1/generation?id={generationId}` (latency breakdown, provider routing, native token counts) + +### Aggregation Levels + +| Level | What | Where | +|-------|------|-------| +| Per-annotation | Single API call cost + latency | In each Annotation JSONL record | +| Per-model | Sum across all annotations for that model | `bun sec label:cost` | +| Per-stage | Stage 1 total, Stage 2 total | `bun sec label:cost` | +| Per-phase | Labeling total, benchmarking total | `bun sec label:cost` | +| Project total | Everything | `bun sec label:cost` | + +### Time Tracking + +| Metric | How | +|--------|-----| +| Per-annotation latency | `Date.now()` before/after API call | +| Batch throughput | paragraphs/minute computed from batch start/end | +| Stage 1 wall clock | Logged at batch start and end | +| Stage 2 wall clock | Logged at batch start and end | +| Total labeling time | Sum of all batch durations | +| Per-model benchmark time | Tracked during 
benchmark runs | + +All timing is logged to `data/metadata/cost-log.jsonl` with entries like: + +```json +{ + "event": "batch_complete", + "stage": "stage1", + "modelId": "openai/gpt-oss-120b", + "paragraphsProcessed": 50000, + "wallClockSeconds": 14400, + "totalCostUsd": 38.50, + "throughputPerMinute": 208.3, + "timestamp": "2026-03-29T10:30:00Z" +} +``` + +--- + +## NIST CSF 2.0 Mapping + +For academic grounding: + +| Our Category | NIST CSF 2.0 | +|-------------|-------------| +| Board Governance | GOVERN (GV.OV, GV.RR) | +| Management Role | GOVERN (GV.RR, GV.RM) | +| Risk Management Process | IDENTIFY (ID.RA), GOVERN (GV.RM), PROTECT (all) | +| Third-Party Risk | GOVERN (GV.SC) | +| Incident Disclosure | DETECT, RESPOND, RECOVER | +| Strategy Integration | GOVERN (GV.OC, GV.RM) | + +--- + +## Prompt Versioning + +Track prompt changes so we can attribute label quality to specific prompt versions: + +| Version | Date | N | Change | +|---------|------|---|--------| +| v1.0 | 2026-03-27 | 40 | Initial codebook-aligned prompt | +| v1.1 | 2026-03-28 | 40 | Added calibration examples, category decision rules. Cat 95%, Spec 68%, Both 62%. | +| v1.2 | 2026-03-28 | 40 | Expanded "what counts as unique" + materiality rule. REGRESSED (88% cat). | +| v2.0 | 2026-03-28 | 40 | Chain-of-thought schema with specific_facts array + algorithmic specificity. Gemini/Grok 5/5, GPT-OSS broken. | +| v2.1 | 2026-03-28 | 40 | Two-tier facts (organizational vs verifiable) + text enum labels. Gemini/Grok perfect but nano overrates. | +| v2.2 | 2026-03-28 | 40 | Decision-test format, simplified facts, "NOT a fact" list. Cat 95%, Spec 68%, Both 65%, Consensus 100%. | +| v2.2 | 2026-03-28 | 500 | 500-sample baseline. Cat 85.0%, Spec 60.8%, Both 51.4%, Consensus 99.6%, Spread 0.240. | +| v2.3 | 2026-03-28 | 500 | Tightened Sector-Adapted, expanded IS/NOT lists, QV boundary rules. Spec 72.0%, Both 59.2%. [1,1,2] eliminated. 
|
+| v2.4 | 2026-03-28 | 500 | Validation step, schema constraint on specific_facts. Spec 78.6%, Both 66.8%. Nano overrating fixed. |
+| v2.5 | 2026-03-28 | 500 | Improved Inc↔Strat tiebreaker, QV calibration examples. **PRODUCTION**: Cat 86.8%, Spec 81.0%, Both 70.8%, Consensus 99.4%, Spread 0.130. Inc↔Strat eliminated. |
+| v2.6 | 2026-03-28 | 500 | Changed category defs to TEST: format. REGRESSED (Both 67.8%). |
+| v2.7 | 2026-03-28 | 500 | Added COMMON MISTAKES section. 100% consensus but Both 67.6%. |
+| v3.0 | 2026-03-29 | — | **Codebook overhaul.** Three rulings: (A) materiality disclaimers → Strategy Integration, (B) SPACs/no-ops → None/Other, (C) person-vs-function test for Mgmt Role vs RMP. Added full IS/NOT lists and QV-eligible list to codebook. Added Rule 2b, Rule 6, 4 new borderline cases. Prompt update pending. |
+| v3.5 | 2026-04-02 | 26 | **Post-gold-analysis rulings, 6 iteration rounds on 26 regression paragraphs ($1.02).** Driven by 13-signal cross-analysis + targeted prompt iteration. (A) Rule 6 refined: materiality ASSESSMENTS → SI (backward-looking conclusions + "reasonably likely" forward-looking). Generic "could have a material adverse effect" is NOT an assessment — it stays N/O/RMP. Cross-references with materiality language also stay N/O. (B) Rule 2 expanded: purpose test for BG — governance structure descriptions are BG, but a one-sentence committee mention doesn't flip the category. (C) Rule 2b expanded: three-step MR↔RMP decision chain; Step 1 only decisive for RMP (process is subject), never short-circuits to MR. (D) N/O vs RMP clarified: actual measures implemented = RMP even in risk-factor framing. Result: +4 paragraphs on the 26 hardest vs v3.0 (18→22/26, a 15.4pp gain). |
+
+When the prompt changes (after pilot testing, rubric revision, etc.), bump the version and log what changed. Every annotation record carries `promptVersion` so we can filter/compare.
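Comparing runs by `promptVersion` is a single group-and-score pass over the annotation JSONL. A minimal TypeScript sketch follows; the record and gold-label shapes are illustrative assumptions, not the production schema:

```typescript
// Per-prompt-version accuracy against a gold set. Shapes are illustrative.
interface AnnotationRecord {
  paragraphId: string;
  promptVersion: string;
  category: string;
  specificity: number;
}

interface GoldLabel {
  category: string;
  specificity: number;
}

// Fraction of annotations matching gold on BOTH dimensions, per prompt version.
function accuracyByVersion(
  annotations: AnnotationRecord[],
  gold: Map<string, GoldLabel>,
): Map<string, number> {
  const tally = new Map<string, { correct: number; total: number }>();
  for (const a of annotations) {
    const g = gold.get(a.paragraphId);
    if (g === undefined) continue; // paragraph not in the gold set
    const bucket = tally.get(a.promptVersion) ?? { correct: 0, total: 0 };
    bucket.total += 1;
    if (a.category === g.category && a.specificity === g.specificity) {
      bucket.correct += 1;
    }
    tally.set(a.promptVersion, bucket);
  }
  const accuracy = new Map<string, number>();
  for (const [version, { correct, total }] of tally) {
    accuracy.set(version, correct / total);
  }
  return accuracy;
}
```

The same pass can aggregate cost or latency per version by swapping the tallied fields.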
+ +--- + +## Iterative Prompt Tuning Protocol + +The v1.0 system prompt is built from theory and synthetic examples. Before firing the full 50K run, we iterate on real data to find and fix failure modes while it costs cents, not dollars. + +### Phase 0: Seed sample (before extraction is ready) + +Grab 20-30 real Item 1C paragraphs manually from EDGAR full-text search (`efts.sec.gov/LATEST/search-index?q="Item 1C" cybersecurity`). Paste into a JSONL by hand. This lets prompt tuning start immediately while extraction code is still being built. + +### Phase 1: Micro-pilot (30 paragraphs, all 3 Stage 1 models) + +1. Select ~30 real paragraphs covering: + - At least 2 per content category (incl. None/Other) + - At least 2 per specificity level + - Mix of industries and filing years + - 5+ deliberately tricky borderline cases + +2. Run all 3 Stage 1 models on these 30 with prompt v1.0. + +3. **You and at least one teammate independently label the same 30** using the codebook. These are your reference labels. + +4. Compare: + - Per-model accuracy vs reference + - Inter-model agreement (where do they diverge?) + - Per-category confusion (which categories do models mix up?) + - Per-specificity bias (do models systematically over/under-rate?) + - Are confidence ratings calibrated? (Do "high" labels match correct ones?) + +5. 
**Identify failure patterns.** Common ones:
+   - Models gravitating to "Risk Management Process" (the largest category exerts a pull on borderline paragraphs)
+   - Models rating specificity too high (any named entity → firm-specific)
+   - Board Governance / Management Role confusion
+   - Missing None/Other (labeling boilerplate as Strategy Integration)
+
+### Phase 2: Prompt revision (v1.1)
+
+Based on Phase 1 failures, revise the system prompt:
+- Add a "common mistakes" section with explicit corrections
+- Add few-shot examples for confused categories
+- Sharpen boundary rules where models diverge
+- Add negative examples ("This is NOT Incident Disclosure because...")
+
+**Do not change the Zod schema or category definitions** — only the system prompt text. Bump to v1.1. Re-run the same 30 paragraphs. Compare to v1.0.
+
+### Phase 3: Scale pilot (200 paragraphs)
+
+1. Extract 200 real paragraphs (stratified, from a broader set of filings).
+
+2. Run all 3 Stage 1 models with the best prompt version.
+
+3. Compute:
+   - **Inter-model Fleiss' Kappa** on category: target ≥ 0.65
+   - **Inter-model Spearman correlation** on specificity: target ≥ 0.70
+   - **Consensus rate**: % with 2/3+ agreement on both dimensions. Target ≥ 75%.
+   - **Confidence calibration**: are "high confidence" labels more likely to be agreed upon?
+
+4. If targets are not met:
+   - Analyze disagreements — genuine ambiguity or prompt failure?
+   - Prompt failure → revise to v1.2, re-run
+   - Genuine ambiguity → consider rubric adjustment (merge categories, collapse specificity)
+   - Repeat until targets are met, or document why they can't be
+
+5. **Cost check**: extrapolate from 200 to 50K. Is reasoning token usage reasonable?
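The Phase 3 consensus-rate metric (at least 2-of-3 agreement on both dimensions) can be sketched in a few lines of TypeScript; the label tuple shape is an assumption for illustration:

```typescript
// Consensus = at least 2 of 3 Stage 1 annotators agree on BOTH dimensions.
interface Label {
  category: string;
  specificity: number;
}

function hasConsensus(labels: [Label, Label, Label]): boolean {
  const counts = new Map<string, number>();
  for (const l of labels) {
    const key = `${l.category}|${l.specificity}`; // joint label on both dimensions
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let max = 0;
  for (const n of counts.values()) max = Math.max(max, n);
  return max >= 2;
}

// Share of paragraphs with 2-of-3 consensus (Phase 3 target: ≥ 0.75).
function consensusRate(perParagraph: [Label, Label, Label][]): number {
  const agreed = perParagraph.filter(hasConsensus).length;
  return agreed / perParagraph.length;
}
```

Keying on the joint (category, specificity) pair means two annotators who agree on category but split on specificity do not count as consensus, matching the "both dims" definition above.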
+ +### Phase 4: Green light + +Once scale pilot passes: +- Lock prompt version (no changes during full run) +- Lock model configuration (reasoning effort, temperature) +- Document final prompt, configs, and pilot results +- Fire the full 50K annotation run + +--- + +## Pipeline Reliability & Observability + +### Resumability + +All API-calling scripts (annotation, judging, benchmarking) use the same pattern: + +1. Load output JSONL → parse each line → collect completed paragraph IDs into a Set +2. Lines that fail `JSON.parse` are skipped (truncated from a crash) +3. Filter input to only paragraphs NOT in the completed set +4. For each completion, append one valid JSON line + `flush()` + +JSONL line-append is atomic on Linux. Worst case on crash: one truncated line, skipped on reload. No data loss, no duplicate work, no duplicate API spend. + +### Error Handling + +| Error Type | Examples | Strategy | +|------------|----------|----------| +| Transient | 429, 500, 502, 503, ECONNRESET, timeout | Exponential backoff: 1s→2s→4s→8s→16s. Max 5 retries. | +| Permanent | 400, 422 (bad request) | Log to `{output}-errors.jsonl`, skip | +| Validation | Zod parse fail on LLM response | Retry once, then log + skip | +| Budget | 402 (out of credits) | Stop immediately, write session summary, exit | +| Consecutive | 10+ errors in a row | Stop — likely systemic (model down, prompt broken) | + +Error paragraphs get their own file. Retry later with `--retry-errors`. + +### Graceful Shutdown (SIGINT/SIGTERM) + +On Ctrl+C: +1. Stop dispatching new work +2. Wait for in-flight requests to complete (already paid for) +3. Write session summary +4. 
Print final stats, exit 0 + +### Live Dashboard (stderr) + +Updates every second: + +``` + SEC-cyBERT │ label:annotate │ google/gemini-3.1-flash-lite-preview │ v1.1 + ───────────────────────────────────────────────────────────────────────── + Progress 12,847 / 50,234 (25.6%) ETA 42m 18s + Session $1.23 │ 38m 12s elapsed │ 337.4 para/min + Totals $4.56 all-time │ 3 errors (0.02%) │ 7 retries + Latency p50: 289ms │ p95: 812ms │ p99: 1,430ms + Reasoning avg 47 tokens/para │ 12.3% of output tokens +``` + +Goes to stderr so stdout stays clean. + +### Session Log + +Every run appends to `data/metadata/sessions.jsonl`: + +```json +{ + "sessionId": "a1b2c3d4", + "command": "label:annotate", + "modelId": "google/gemini-3.1-flash-lite-preview", + "stage": "stage1", + "promptVersion": "v1.1", + "startedAt": "2026-03-29T10:00:00Z", + "endedAt": "2026-03-29T10:38:12Z", + "durationSeconds": 2292, + "paragraphsTotal": 50234, + "paragraphsProcessed": 12847, + "paragraphsSkippedResume": 37384, + "paragraphsErrored": 3, + "costUsd": 1.23, + "reasoningTokensTotal": 482000, + "avgLatencyMs": 450, + "p95LatencyMs": 812, + "throughputPerMinute": 337.4, + "concurrency": 12, + "exitReason": "complete" +} +``` + +`exitReason`: `complete` | `interrupted` (Ctrl+C) | `budget_exhausted` (402) | `error_threshold` (consecutive limit) + +### OpenRouter Generation ID + +Every annotation record includes the OpenRouter `generationId` from the response `id` field. 
This enables: +- **Audit trail**: look up any annotation on OpenRouter's dashboard +- **Rich stats**: `GET /api/v1/generation?id={generationId}` returns latency breakdown, provider routing, native token counts +- **Dispute resolution**: if a label looks wrong, inspect the exact generation that produced it + +--- + +## Gold Set Protocol + +### Sampling (1,200 paragraphs minimum) + +Stratify by: +- Content category (all 7 represented, oversample rare categories) +- Specificity level (all 4 represented) +- GICS sector (financial services, tech, healthcare, manufacturing minimum) +- Filing year (FY2023 and FY2024) + +### Human Labeling Process + +Labeling is done through a purpose-built web tool that enforces quality: +1. **Rules quiz:** Every annotator must read the codebook and pass a quiz on the rules before each labeling session. The quiz tests the three most common confusion axes: Management Role vs RMP (person-vs-function test), materiality disclaimers (Strategy Integration vs None/Other), and QV fact counting. +2. **Warm-up:** First 5 paragraphs per session are warm-up (pre-labeled, with feedback). Not counted toward gold set. +3. **Independent labeling:** Three team members independently label the full gold set using this codebook. +4. Compute inter-rater reliability: + - Cohen's Kappa (for content category — nominal, pairwise) + - Krippendorff's Alpha (for specificity level — ordinal, all annotators) + - Per-class confusion matrices + - **Target: Kappa > 0.75, Alpha > 0.67** +5. Adjudicate disagreements: third annotator tiebreaker, or discussion consensus with documented rationale +6. 
Run the full GenAI pipeline on the gold set and compare to human labels + +### If Agreement Is Poor + +- If Kappa < 0.60 on any category pair: revise that category's definition and boundary rules, re-pilot +- If Alpha < 0.50 on specificity: collapse 4-point to 3-point scale (merge 1+2 into "Non-specific" or 3+4 into "Substantive") +- Document the collapse decision and rationale in this codebook diff --git a/docs/LABELING-CODEBOOK.md b/docs/LABELING-CODEBOOK.md index b591c15..026cedb 100644 --- a/docs/LABELING-CODEBOOK.md +++ b/docs/LABELING-CODEBOOK.md @@ -1,366 +1,231 @@ -# Labeling Codebook — SEC Cybersecurity Disclosure Quality +# Labeling Codebook — SEC Cybersecurity Disclosure Quality (v2) -This codebook is the authoritative reference for all human and GenAI labeling. Every annotator (human or model) must follow these definitions exactly. The LLM system prompt is generated directly from this document. +This codebook is the authoritative reference for all human and GenAI labeling. Every annotator (human or model) must follow these definitions exactly. --- ## Classification Design -**Unit of analysis:** One paragraph from an SEC filing (Item 1C of 10-K, or Item 1.05/8.01/7.01 of 8-K). - -**Classification type:** Multi-class (single-label), NOT multi-label. Each paragraph receives exactly one content category. +**Unit of analysis:** One paragraph from an SEC filing (Item 1C of 10-K, or Item 1.05/8.01/7.01 of 8-K). Each paragraph is classified in isolation — do not use context from other paragraphs in the same filing. **Each paragraph receives two labels:** -1. **Content Category** — single-label, one of 7 mutually exclusive classes -2. **Specificity Level** — ordinal integer 1-4 +1. **Content Category** — one of 7 mutually exclusive classes +2. **Specificity Level** — ordinal integer 1–4 -**None/Other policy:** Required. Since this is multi-class (not multi-label), we need a catch-all for paragraphs that don't fit the 6 substantive categories. 
A paragraph receives None/Other when it contains no cybersecurity-specific disclosure content (e.g., forward-looking statement disclaimers, section headers, general business language). +Each paragraph receives exactly one content category (single-label, not multi-label). --- ## Dimension 1: Content Category -Each paragraph is assigned exactly **one** content category. If a paragraph spans multiple categories, assign the **dominant** category — the one that best describes the paragraph's primary communicative purpose. +### Primary Test -### Board Governance +For every paragraph, ask: **"What question does this paragraph primarily answer?"** + +| Question | Category | +|----------|----------| +| How does the board oversee cybersecurity? | Board Governance | +| How is management organized to handle cybersecurity? | Management Role | +| What does the cybersecurity program do? | Risk Management Process | +| How are third-party cyber risks managed? | Third-Party Risk | +| What happened in a cybersecurity incident? | Incident Disclosure | +| How does cybersecurity affect the business or finances? | Strategy Integration | +| *(none of the above / no substantive disclosure)* | None/Other | + +If a paragraph touches multiple categories, assign the one whose question the paragraph most directly answers. When genuinely split, the category whose content occupies the most text wins. + +--- + +### Board Governance (BG) - **SEC basis:** Item 106(c)(1) -- **Covers:** Board or committee oversight of cybersecurity risks, briefing frequency, board member cybersecurity expertise -- **Key markers:** "Audit Committee," "Board of Directors oversees," "quarterly briefings," "board-level expertise," "board committee" -- **Assign when:** The grammatical subject performing the primary action is the board or a board committee +- **Answers:** "How does the board oversee cybersecurity?" 
+- **Covers:** Board or committee oversight of cybersecurity risks, briefing frequency and scope, board member cybersecurity expertise, delegation of oversight responsibilities, how information flows to the board
+- **Key markers:** "Board of Directors oversees," "Audit Committee," "quarterly briefings to the board," "board-level expertise," "board committee," "reports to the Board"
+- **Includes:** Governance-chain paragraphs (Board → Committee → Officer → Program) where the purpose is describing the oversight structure
-**Example texts:**
+
+> *"The Audit Committee receives quarterly reports from the CISO on the Company's cybersecurity posture, including threat landscape assessments and vulnerability management results."*
+> → **BG, Specificity 2** — answers "how does the board oversee?" The CISO and program details are subordinate to the reporting structure.
-> *"The Board of Directors oversees the Company's management of cybersecurity risks. The Board has delegated oversight of cybersecurity and data privacy matters to the Audit Committee."*
-> → Board Governance, Specificity 3 (names Audit Committee — firm-specific delegation)
+
+---
+
-> *"Our Board of Directors recognizes the critical importance of maintaining the trust and confidence of our customers and stakeholders, and cybersecurity risk is an area of increasing focus for our Board."*
-> → Board Governance, Specificity 1 (could apply to any company — generic statement of intent)
-
-> *"The Audit Committee, which includes two members with significant technology and cybersecurity expertise, receives quarterly reports from the CISO and conducts an annual deep-dive review of the Company's cybersecurity program, threat landscape, and incident response readiness."*
-> → Board Governance, Specificity 3 (names specific committee, describes specific briefing cadence and scope)
-
-### Management Role
+### Management Role (MR)
 - **SEC basis:** Item 106(c)(2)
-- **Covers:** The specific *person* filling a cybersecurity leadership position: their name, qualifications, career history, credentials, tenure, reporting lines, management committees responsible for cybersecurity
-- **Key markers:** "Chief Information Security Officer," "reports to," "years of experience," "management committee," "CISSP," "CISM," named individuals, career background
-- **Assign when:** The paragraph tells you something about *who the person is* — their background, credentials, experience, or reporting structure. A paragraph that names a CISO/CIO/CTO and then describes what the cybersecurity *program* does is NOT Management Role — it is Risk Management Process with an incidental role attribution. The test is whether the paragraph is about the **person** or about the **function**.
+- **Answers:** "How is management organized to handle cybersecurity?"
+- **Covers:** Cybersecurity leadership roles and responsibilities, qualifications and credentials, career history and experience, management-level committee structure and membership, reporting lines between management roles, team composition and size
+- **Key markers:** "CISO," "reports to the CIO," "years of experience," "CISSP," "management committee," named individuals, career background, team size
+- **Assign when:** The paragraph's primary content is about management's organizational role in cybersecurity — who holds responsibilities, how those responsibilities are divided, what qualifies those people, or how management-level oversight is structured
-**The person-vs-function test:** If you removed the role holder's name, title, qualifications, and background from the paragraph and the remaining content still describes substantive cybersecurity activities, processes, or oversight → the paragraph is about the function (Risk Management Process), not the person (Management Role). Management Role requires the person's identity or credentials to be the primary content, not just a brief attribution of who runs the program.
+**Person-removal test (MR vs. RMP):** Remove all person-specific content (names, titles, qualifications, experience, reporting lines, team composition, committee membership). If the remaining text still describes a substantive cybersecurity program → **RMP**. If the paragraph collapses to near-nothing → **MR**.
-**Example texts:**
+
+> *"Our CFO and VP of IT jointly oversee our cybersecurity program. The CFO is responsible for risk governance and insurance, while the VP of IT manages technical operations. They report to the board quarterly on cybersecurity matters."*
+> → **MR, Specificity 3** — answers "how is management organized?" Role allocation and reporting structure are the substance.
-> *"Our Vice President of Information Security, who holds CISSP and CISM certifications and has over 20 years of experience in cybersecurity, reports directly to our Chief Information Officer and is responsible for leading our cybersecurity program."*
-> → Management Role, Specificity 3 — The paragraph is about the person: their credentials, experience, and reporting line. (named role, certifications, reporting line — all firm-specific)
+> *"Our CISO oversees a cybersecurity program that includes vulnerability scanning, penetration testing, and incident response planning aligned with NIST CSF."*
+> → **RMP, Specificity 2** — person-removal test: "a cybersecurity program that includes vulnerability scanning, penetration testing, and incident response planning aligned with NIST CSF" → still a complete program description. The CISO is attribution, not content.
-> *"Management is responsible for assessing and managing cybersecurity risks within the organization."*
-> → Management Role, Specificity 1 (generic, no named roles or structure)
+
+---
+
-> *"Our CISO, Sarah Chen, leads a dedicated cybersecurity team of 35 professionals and presents monthly threat briefings to the executive leadership team. Ms. Chen joined the Company in 2019 after serving as Deputy CISO at a Fortune 100 financial services firm."*
-> → Management Role, Specificity 4 — The paragraph is about the person: their name, team size, background, prior role. (named individual, team size, specific frequency, prior employer — multiple verifiable facts)
-
-> *"Our CISO oversees the Company's cybersecurity program, which includes risk assessments, vulnerability scanning, penetration testing, and incident response planning aligned with the NIST CSF framework."*
-> → **Risk Management Process**, NOT Management Role — The CISO is mentioned once as attribution, but the paragraph is about what the program does. Remove "Our CISO oversees" and the paragraph still makes complete sense as a process description.
-
-### Risk Management Process
+### Risk Management Process (RMP)
 - **SEC basis:** Item 106(b)
-- **Covers:** Risk assessment methodology, framework adoption (NIST, ISO, etc.), vulnerability management, monitoring, incident response planning, tabletop exercises, ERM integration
-- **Key markers:** "NIST CSF," "ISO 27001," "risk assessment," "vulnerability management," "tabletop exercises," "incident response plan," "SOC," "SIEM"
-- **Assign when:** The paragraph primarily describes the company's internal cybersecurity processes, tools, or methodologies
+- **Answers:** "What does the cybersecurity program do?"
+- **Covers:** Risk assessment methodology, framework adoption, vulnerability management, security monitoring, incident response planning, security operations, tools and technologies, employee training programs, ERM integration
+- **Key markers:** "NIST CSF," "ISO 27001," "vulnerability management," "penetration testing," "SOC," "SIEM," "incident response," "tabletop exercises"
+- **Assign when:** The paragraph's primary content describes cybersecurity activities, processes, tools, or methodologies — regardless of who is mentioned as responsible
-**Example texts:**
+
+> *"We maintain a cybersecurity program aligned with the NIST Cybersecurity Framework. The program includes regular vulnerability assessments, penetration testing, and 24/7 monitoring through our Security Operations Center."*
+> → **RMP, Specificity 2** — answers "what does the program do?" Domain terminology present but nothing unique to THIS company.
-> *"We maintain a cybersecurity risk management program that is integrated into our overall enterprise risk management framework. Our program is designed to identify, assess, and manage material cybersecurity risks to our business."*
-> → Risk Management Process, Specificity 1 (generic, could apply to any company)
+
+---
+
-> *"Our cybersecurity program is aligned with the NIST Cybersecurity Framework and incorporates elements of ISO 27001. We conduct regular risk assessments, vulnerability scanning, and penetration testing as part of our continuous monitoring approach."*
-> → Risk Management Process, Specificity 2 (names frameworks but no firm-specific detail)
-
-> *"We operate a 24/7 Security Operations Center that uses Splunk SIEM and CrowdStrike Falcon endpoint detection. Our incident response team conducts quarterly tabletop exercises simulating ransomware, supply chain compromise, and insider threat scenarios."*
-> → Risk Management Process, Specificity 4 (named tools, named vendor, specific exercise frequency and scenarios — verifiable)
-
-### Third-Party Risk
+### Third-Party Risk (TP)
 - **SEC basis:** Item 106(b)
-- **Covers:** Vendor/supplier risk oversight, external assessor engagement, contractual security requirements, supply chain risk management
-- **Key markers:** "third-party," "service providers," "vendor risk," "external auditors," "supply chain," "SOC 2 report," "contractual requirements"
-- **Assign when:** The central topic is oversight of external parties' cybersecurity, not the company's own internal processes
+- **Answers:** "How are third-party cyber risks managed?"
+- **Covers:** Vendor/supplier cybersecurity oversight, external assessor requirements, contractual security requirements, supply chain risk management
+- **Key markers:** "third-party," "service providers," "vendor risk," "SOC 2 report," "contractual requirements," "supply chain"
+- **Assign when:** The central topic is oversight of external parties' cybersecurity
-**Example texts:**
+
+**TP vs. RMP:** A firm hired to assess the company's OWN security → RMP (the firm serves the company's program). Requirements imposed ON vendors, or assessment of vendors' cybersecurity → TP.
+
+> *"We require all critical vendors to maintain SOC 2 Type II certification and conduct annual security assessments of our top 50 service providers."*
+> → **TP, Specificity 4** — requirements imposed on vendors; "50 service providers" is a QV-eligible fact.
-> *"Our vendor risk management program requires all third-party service providers with access to sensitive data to meet minimum security standards, including SOC 2 Type II certification or equivalent third-party attestation."*
-> → Third-Party Risk, Specificity 2 (names SOC 2 standard but no firm-specific detail about which vendors or how many)
+
+---
+
-> *"We assessed 312 vendors in fiscal 2024 through our Third-Party Risk Management program. All Tier 1 vendors (those with access to customer PII or financial data) are required to provide annual SOC 2 Type II reports. In fiscal 2024, 14 vendors were placed on remediation plans and 3 vendor relationships were terminated for non-compliance."*
-> → Third-Party Risk, Specificity 4 (specific numbers, specific actions, specific criteria — all verifiable)
-
-### Incident Disclosure
+### Incident Disclosure (ID)
 - **SEC basis:** 8-K Item 1.05 (and 8.01/7.01 post-May 2024)
-- **Covers:** Description of cybersecurity incidents — nature, scope, timing, impact assessment, remediation actions, ongoing investigation
-- **Key markers:** "unauthorized access," "detected," "incident," "remediation," "impacted," "forensic investigation," "breach," "compromised"
-- **Assign when:** The paragraph primarily describes what happened in a cybersecurity incident
+- **Answers:** "What happened in a cybersecurity incident?"
+- **Covers:** Description of actual cybersecurity incidents — nature, scope, timing, impact, remediation, investigation
+- **Key markers:** "unauthorized access," "detected," "incident," "breach," "forensic investigation," "remediation," "compromised"
+- **Assign when:** The paragraph describes events that actually occurred
+- **Not for:** Hypothetical incident language ("we may experience...") in a 10-K → classify by actual content (usually RMP or SI)
-**Example texts:**
+
+> *"On January 15, 2024, we detected unauthorized access to our customer support portal. We activated our incident response plan and engaged Mandiant for forensic investigation."*
+> → **ID, Specificity 4** — describes what happened; January 15, 2024 (specific date) and Mandiant (named third-party firm) are QV-eligible facts.
-> *"We have experienced, and may in the future experience, cybersecurity incidents that could have a material adverse effect on our business, results of operations, and financial condition."*
-> → Incident Disclosure, Specificity 1 (hypothetical — no actual incident described. Note: if this appears in Item 1C rather than an 8-K, consider None/Other instead since it's generic risk language)
+
+---
+
-> *"On January 15, 2024, we detected unauthorized access to our customer support portal. The threat actor exploited a known vulnerability in a third-party software component. Upon detection, we activated our incident response plan, contained the intrusion, and engaged Mandiant for forensic investigation."*
-> → Incident Disclosure, Specificity 4 (specific date, specific system, named forensic firm, specific attack vector — all verifiable)
-
-> *"In December 2023, the Company experienced a cybersecurity incident involving unauthorized access to certain internal systems. The Company promptly took steps to contain and remediate the incident, including engaging third-party cybersecurity experts."*
-> → Incident Disclosure, Specificity 3 (specific month, specific action — but no named firms or quantified impact)
-
-### Strategy Integration
+### Strategy Integration (SI)
 - **SEC basis:** Item 106(b)(2)
-- **Covers:** Material impact (or lack thereof) on business strategy or financials, cybersecurity insurance, investment/resource allocation, cost of incidents
-- **Key markers:** "business strategy," "insurance," "investment," "material," "financial condition," "budget," "not materially affected," "results of operations"
-- **Assign when:** The paragraph primarily discusses business/financial consequences or strategic response to cyber risk, not the risk management activities themselves
-- **Includes materiality ASSESSMENTS:** A materiality assessment is the company stating a conclusion about whether cybersecurity has or will affect business outcomes. Backward-looking ("have not materially affected"), forward-looking with SEC qualifier ("reasonably likely to materially affect"), and negative assertions ("have not experienced material incidents") are all assessments → SI. Generic risk warnings ("could have a material adverse effect") are NOT assessments — they are boilerplate speculation that appears in every 10-K → classify by primary content. "Material" as an adjective ("managing material risks") is also not an assessment.
+- **Answers:** "How does cybersecurity affect the business or finances?"
+- **Covers:** Materiality assessments, cybersecurity insurance, budget/investment allocation, cost of incidents, business strategy impact
+- **Key markers:** "materially affected," "reasonably likely to materially affect," "insurance," "budget," "investment"
-**Example texts:**
+
+**Materiality assessment rule:** A paragraph that states a CONCLUSION about whether cybersecurity has or will affect business outcomes is SI — regardless of how generic the language is.
-> *"Cybersecurity risks, including those described above, have not materially affected, and are not reasonably likely to materially affect, our business strategy, results of operations, or financial condition."*
-> → Strategy Integration, Specificity 1 (boilerplate materiality statement — nearly identical language appears across thousands of filings, but it IS a materiality assessment)
+
+| Language | Type | Category |
+|----------|------|----------|
+| "Have not materially affected our business strategy, results of operations, or financial condition" | Backward-looking assessment | **SI** |
+| "Are reasonably likely to materially affect" | Forward-looking assessment (SEC's Item 106(b)(2) language) | **SI** |
+| "We have not experienced any material cybersecurity incidents" | Negative assertion with materiality framing | **SI** |
+| "Could have a material adverse effect on our business" | Speculation — every 10-K says this | **Not SI** — classify by other content, or N/O |
+| "Managing material risks" | "Material" as adjective, not an assessment | **Not SI** |
+| "Risks that may materially affect... see Item 1A" | Describes what another section covers | **N/O** |
-> *"We have not identified any cybersecurity incidents or threats that have materially affected us. For more information, see Item 1A, Risk Factors."*
-> → Strategy Integration, Specificity 1 — The materiality assessment is the substantive content. The cross-reference is noise and does not pull the paragraph to None/Other.
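The assessment-vs-speculation distinctions in the materiality table above are mechanical enough to sketch in code. The helper below is a hypothetical QA heuristic; the function name and regex patterns are illustrative and are not part of the labeling pipeline.

```typescript
// Hypothetical QA helper mirroring the materiality table: does the paragraph
// STATE a materiality conclusion (SI), or merely speculate or point elsewhere?
// All patterns are illustrative, not part of the project's labeling pipeline.
function statesMaterialityConclusion(text: string): boolean {
  const t = text.toLowerCase();
  // Cross-reference: "may materially affect ... see Item 1A" describes what
  // another section covers, not the company's own conclusion.
  if (/may materially affect/.test(t) && /see (part i, )?item 1a/.test(t)) return false;
  // Backward-looking assessment: "have/has not materially affected ..."
  if (/ha(ve|s) not materially affected/.test(t)) return true;
  // Forward-looking assessment: the SEC's Item 106(b)(2) "reasonably likely" language.
  if (/reasonably likely to materially affect/.test(t)) return true;
  // Negative assertion with materiality framing.
  if (/not experienced any material cybersecurity incident/.test(t)) return true;
  // Bare "could/may have a material adverse effect" is speculation, not an assessment.
  return false;
}
```

A helper like this could flag model labels that call a stated-conclusion paragraph anything other than SI for human review; it is a spot-check aid, not a classifier.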
+**The test:** Is the company STATING A CONCLUSION about materiality? "Reasonably likely" signals an assessment (SEC's required language). Bare "could" or "may" is speculation.
-> *"We maintain cybersecurity insurance coverage as part of our overall risk management strategy to help mitigate potential financial losses from cybersecurity incidents."*
-> → Strategy Integration, Specificity 2 (mentions insurance but no specifics)
+
+> *"Risks from cybersecurity threats have not materially affected, and are not reasonably likely to materially affect, our business strategy, results of operations, or financial condition."*
+> → **SI, Specificity 1** — materiality assessment, but boilerplate language.
-> *"We increased our cybersecurity budget by 32% to $45M in fiscal 2024, representing 0.8% of revenue. We maintain cyber liability insurance with $100M in aggregate coverage through AIG and Chubb, with a $5M deductible per incident."*
-> → Strategy Integration, Specificity 4 (dollar amounts, percentages, named insurers, specific deductible — all verifiable)
+
+---
+
-### None/Other
+### None/Other (N/O)
-
-- **Covers:** Forward-looking statement disclaimers, section headers, cross-references to other filing sections, general business language that mentions cybersecurity incidentally, text erroneously extracted from outside Item 1C/1.05
-- **No specificity scoring needed:** Always assign Specificity 1 for None/Other paragraphs (since there is no cybersecurity disclosure to rate)
-- **SPACs and shell companies:** Companies that explicitly state they have no operations, no cybersecurity program, or no formal processes receive None/Other regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program. Paragraphs like "We have not adopted any cybersecurity risk management program. Our board is generally responsible for oversight" are None/Other — the board mention is perfunctory, not substantive governance disclosure.
-- **Distinguishing from Strategy Integration:** A pure cross-reference ("See Item 1A, Risk Factors") with no materiality assessment is None/Other. But if the paragraph includes an explicit materiality conclusion ("have not materially affected our business strategy"), it becomes Strategy Integration even if a cross-reference is also present. The test: does the paragraph make a substantive claim about cybersecurity's impact on the business? If yes → Strategy Integration. If it only points elsewhere → None/Other.
+- **Answers:** None of the six substantive questions
+- **Covers:** Forward-looking disclaimers, section headers, cross-references, general business language, non-cybersecurity content, text extracted from outside Item 1C/1.05
+- **Always receives Specificity 1**
-**Example texts:**
+
+**SPACs and no-operations companies:** Companies with no cybersecurity program receive N/O regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a disclosure.
-> *"This Annual Report on Form 10-K contains forward-looking statements within the meaning of Section 27A of the Securities Act of 1933, as amended, and Section 21E of the Securities Exchange Act of 1934, as amended."*
-> → None/Other, Specificity 1
+
+**N/O vs. SI:** A cross-reference is N/O even if it mentions materiality. "For risks that may materially affect us, see Item 1A" → N/O (pointing elsewhere). "Risks have not materially affected us. See Item 1A" → SI (the first sentence IS an assessment; the cross-reference is subordinate).
-> *"Item 1C. Cybersecurity"*
-> → None/Other, Specificity 1 (section header only)
+
+**N/O vs. RMP:** Generic risk language without cybersecurity-specific content is N/O. But if a paragraph describes actual cybersecurity measures ("we have implemented network monitoring and access controls"), it is RMP — even if the framing is generic.
-> *"For additional information about risks related to our information technology systems, see Part I, Item 1A, 'Risk Factors.'"*
-> → None/Other, Specificity 1 (cross-reference, no disclosure content)
+
+> *"This Annual Report contains forward-looking statements within the meaning of Section 27A of the Securities Act."*
+> → **N/O, Specificity 1**
-> *"We are a special purpose acquisition company with no business operations. We have not adopted any cybersecurity risk management program. Our board of directors is generally responsible for oversight of cybersecurity risks, if any."*
-> → None/Other, Specificity 1 — No substantive disclosure. The board mention is incidental; the company explicitly has no program to disclose.
-
-> *"We do not consider that we face significant cybersecurity risk and have not adopted any formal processes for assessing cybersecurity risk."*
-> → None/Other, Specificity 1 — Absence of a program is not a program description.
+> *"We are a special purpose acquisition company with no business operations. We have not adopted any cybersecurity risk management program."*
+> → **N/O, Specificity 1** — no substantive disclosure; absence of a program is not a program description.
 
 ---
 
 ## Category Decision Rules
 
-### Rule 1: Dominant Category
-If a paragraph spans multiple categories, assign the one whose topic occupies the most text or is the paragraph's primary communicative purpose.
+### Rule 1: Board Governance vs. Management Role
-
-### Rule 2: Board vs. Management (the board-line test)
+
+**Governance-chain paragraphs** (Board → Committee → Officer → Program) default to **BG** when the purpose is describing the oversight structure. They become MR only when management's organizational role — responsibilities, qualifications, committee membership — is the primary content.
-**Core principle:** The governance hierarchy has distinct layers — board/committee oversight at the top, management execution below. The paragraph's category depends on which layer is the primary focus.
+
+| Pattern | Category |
+|---------|----------|
+| Board/committee oversees, receives reports, delegates | BG |
+| Management reports TO the board (describing oversight flow) | BG |
+| Management roles, responsibilities, and how they're divided | MR |
+| Person's qualifications, credentials, experience | MR |
+| Management-level committee structure and membership | MR |
+| Board-level committee (Audit Committee, Risk Committee of the Board) | BG |
+| Management-level committee (Cybersecurity Steering Committee, Security Council) | MR (if about structure/membership) or RMP (if about activities) |
-| Layer | Category | Key signals |
-|-------|----------|-------------|
-| Board/committee directing, receiving reports, or overseeing | Board Governance | "Board oversees," "Committee reviews," "reports to the Board" (board is recipient) |
-| Named officer's qualifications, responsibilities, reporting lines | Management Role | "CISO has 20 years experience," "responsible for," credentials |
-| Program/framework/controls described | Risk Management Process | "program is designed to," "framework includes," "controls aligned with" |
+
+### Rule 2: Management Role vs. Risk Management Process
-**When a paragraph spans layers** (governance chain paragraphs): apply the **purpose test** — what is the paragraph's communicative purpose?
+
+Apply the **person-removal test**: remove all person-specific content (names, titles, qualifications, experience, reporting lines, team composition, committee membership). If a substantive cybersecurity program description remains → **RMP**. If the paragraph loses its substance → **MR**.
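The person-removal test can also be sketched mechanically. The snippet below is a rough QA spot-check sketch only: the person-detection patterns and thresholds are illustrative assumptions (robust person-content detection would need NER), and real labels come from annotator judgment under the rule above.

```typescript
// Rough mechanical sketch of the person-removal test, for QA spot-checks only.
// The patterns below are illustrative; real person-content detection would
// need NER. It strips person-specific phrases and asks whether a substantive
// program description survives.
const PERSON_PATTERNS: RegExp[] = [
  /\b(our|the) (ciso|cio|cto|cfo|chief information security officer|vp of it)\b/gi, // role titles
  /\b(cissp|cism|cisa)\b/gi, // certifications
  /\b(over )?\d+ years of experience\b/gi, // experience claims
  /\breports (directly )?to [^,.]+/gi, // reporting lines
];

function personRemovalTest(paragraph: string): "MR" | "RMP" {
  let residual = paragraph;
  for (const p of PERSON_PATTERNS) residual = residual.replace(p, "");
  // Crude proxy for "a substantive program description remains": the residual
  // still mentions security activities and retains enough words.
  const substantive =
    /\b(program|process|assessment|monitoring|testing|response)\b/i.test(residual) &&
    residual.split(/\s+/).filter(Boolean).length > 12;
  return substantive ? "RMP" : "MR";
}
```

On the two MR/RMP examples from the category definitions above, the attribution-plus-program paragraph keeps its substance after stripping (RMP) while the credentials paragraph collapses (MR), which is exactly the behavior the rule asks annotators to reproduce.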
-- **Purpose = describing oversight/reporting structure** (who reports to whom, briefing cadence, committee responsibilities, how information flows to the board) → **Board Governance**, even if officers appear as grammatical subjects. The officers are intermediaries in the governance chain, not the focus.
-- **Purpose = describing who a person is** (qualifications, credentials, experience, career history) → **Management Role**.
-- **Governance-chain paragraphs are almost always Board Governance.** They become Management Role ONLY when the officer's personal qualifications/credentials are the dominant content.
+**The core distinction:** MR is about **roles** — who is responsible, how they're organized, what qualifies them. RMP is about **activities** — what the program does, how it operates, what tools and frameworks it uses.
+
 | Signal | Category |
 |--------|----------|
-| Board/committee is the grammatical subject | Board Governance |
-| Board delegates responsibility to management | Board Governance |
-| Management role reports TO the board (describing reporting structure) | Board Governance (the purpose is describing how oversight works) |
-| Management role's qualifications, experience, credentials described | Management Role |
-| "Board oversees... CISO reports to Board quarterly" | Board Governance (oversight structure) |
-| "CISO reports quarterly to the Board on..." | Board Governance (reporting structure, not about who the CISO is) |
-| "The CISO has 20 years of experience and reports to the CIO" | Management Role (person's qualifications are the content) |
-| Governance overview spanning board → committee → officer → program | **Board Governance** (purpose is describing the structure) |
+| Role title as brief attribution, then program details | RMP |
+| Role allocation, responsibilities divided among people | MR |
+| Reporting structure between management roles | MR |
+| Qualifications, credentials, experience, career history | MR |
+| Activities, tools, frameworks, processes as primary content | RMP |
+| "Under the direction of our CISO, the Company has implemented..." | RMP (program is the content) |
-
-### Rule 2b: Management Role vs. Risk Management Process (three-step decision chain)
-
-This is the single most common source of annotator disagreement. Apply the following tests in order — stop at the first decisive result.
-
-**Step 1 — Subject test:** What is the paragraph's grammatical subject?
-- Clear process/framework/program as subject with no person detail → **Risk Management Process**. Stop.
-- Person/role as subject → this is a **signal**, not decisive. Always continue to Step 2. Many SEC disclosures name an officer then describe the program — Step 2 determines which is the actual content.
-
-**Step 2 — Person-removal test:** Could you delete all named roles, titles, qualifications, experience descriptions, and credentials from the paragraph and still have a coherent cybersecurity disclosure?
-- **YES** → **Risk Management Process** (the process stands on its own; people are incidental)
-- **NO** → **Management Role** (the paragraph is fundamentally about who these people are)
-- Borderline → continue to Step 3
-
-**Step 3 — Qualifications tiebreaker:** Does the paragraph include experience (years), certifications (CISSP, CISM), education, team size, or career history for named individuals?
-- **YES** → **Management Role** (qualifications are MR-specific content; the SEC requires management role disclosure specifically because investors want to know WHO is responsible)
-- **NO** → **Risk Management Process** (no person-specific content beyond a title attribution)
+
+### Rule 3: Third-Party Risk vs. Risk Management Process
+
 | Signal | Category |
 |--------|----------|
-| The person's background, credentials, tenure, experience, education, career history | Management Role |
-| The person's name is given | Management Role (strong signal) |
-| Reporting lines as primary content (who reports to whom, management committee structure) | Management Role |
-| Role title mentioned as attribution ("Our CISO oversees...") followed by process description | **Risk Management Process** |
-| Activities, tools, methodologies, frameworks as the primary content | **Risk Management Process** |
-| The paragraph would still make sense if you removed the role title and replaced it with "the Company" | **Risk Management Process** |
+| Company's own internal processes, tools, teams | RMP |
+| Third parties hired to serve the company (assessors, consultants, pen testers) | RMP |
+| Requirements imposed on vendors | TP |
+| Assessment of vendor cybersecurity posture | TP |
+| Third parties mentioned as one component of an internal program | RMP |
+| Vendor oversight as the central topic | TP |
-**Key principle:** Naming a cybersecurity leadership title (CISO, CIO, CTO, VP of Security) does not make a paragraph Management Role. The title is often an incidental attribution — the paragraph names who is responsible then describes what the program does. If the paragraph's substantive content is about processes, activities, or tools, it is Risk Management Process regardless of how many times a role title appears. Management Role requires the paragraph's content to be about the *person* — who they are, what makes them qualified, how long they've served, what their background is.
+
+### Rule 4: Incident Disclosure vs. Strategy Integration
-
-### Rule 3: Risk Management vs. Third-Party
+
 | Signal | Category |
 |--------|----------|
-| Company's own internal processes, tools, teams | Risk Management Process |
-| Third parties mentioned as ONE component of internal program | Risk Management Process |
-| Vendor oversight is the CENTRAL topic | Third-Party Risk |
-| External assessor hired to test the company | Risk Management Process (they serve the company) |
-| Requirements imposed ON vendors | Third-Party Risk |
-
-### Rule 4: Incident vs. Strategy
-| Signal | Category |
-|--------|----------|
-| Describes what happened (timeline, scope, response) | Incident Disclosure |
-| Describes business impact of an incident (costs, revenue, insurance claim) | Strategy Integration |
-| Mixed: "We detected X... at a cost of $Y" | Assign based on which is dominant — if cost is one sentence in a paragraph about the incident → Incident Disclosure |
+| What happened (timeline, scope, response actions) | ID |
+| Business/financial impact of an incident | SI |
+| Mixed — incident with brief cost mention | ID (incident frame dominates) |
+| Mixed — financial analysis with brief incident reference | SI (business frame dominates) |
 
 ### Rule 5: None/Other Threshold
-Assign None/Other ONLY when the paragraph contains no substantive cybersecurity disclosure content. If a paragraph mentions cybersecurity even briefly in service of a disclosure obligation, assign the relevant content category.
-**Exception — SPACs and no-operations companies:** A paragraph that explicitly states the company has no cybersecurity program, no operations, or no formal processes is None/Other even if it perfunctorily mentions board oversight or risk acknowledgment. The absence of a program is not substantive disclosure.
+
+N/O only when the paragraph contains no substantive cybersecurity disclosure. If a paragraph describes any actual cybersecurity measure, process, or assessment — however generic — assign the relevant substantive category.
-
-### Rule 6: Materiality Language → Strategy Integration
-Any paragraph that explicitly connects cybersecurity to business materiality is **Strategy Integration** — regardless of tense, mood, or how generic the language is. This includes:
-
-- **Backward-looking assessments:** "have not materially affected our business strategy, results of operations, or financial condition"
-- **Forward-looking assessments with SEC qualifier:** "are reasonably likely to materially affect," "if realized, are reasonably likely to materially affect"
-- **Negative assertions with materiality framing:** "we have not experienced any material cybersecurity incidents"
-
-**The test:** Is the company STATING A CONCLUSION about materiality?
-
-- "Risks have not materially affected our business strategy" → YES, conclusion → SI
-- "Risks are reasonably likely to materially affect us" → YES, forward-looking conclusion → SI
-- "Risks could have a material adverse effect on our business" → NO, speculation → not SI (classify by primary content)
-- "Managing material risks associated with cybersecurity" → NO, adjective → not SI
-
-The key word is "reasonably likely" — that's the SEC's Item 106(b)(2) threshold for forward-looking materiality. Bare "could" is speculation, not an assessment.
-
-**Why this is SI and not N/O:** The company is fulfilling its SEC Item 106(b)(2) obligation to assess whether cyber risks affect business strategy. The fact that the language is generic makes it Specificity 1, not None/Other. Category captures WHAT the paragraph discloses (a materiality assessment); specificity captures HOW specific that disclosure is (generic boilerplate = Spec 1).
-
-**What remains N/O:** A cross-reference is N/O even if it contains materiality language — "For a description of the risks from cybersecurity threats that may materially affect the Company, see Item 1A" is N/O because the paragraph's purpose is pointing the reader elsewhere, not making an assessment. The word "materially" here describes what Item 1A discusses, not the company's own conclusion. Also N/O: generic IT-dependence language ("our IT systems are important to operations") with no materiality claim, and forward-looking boilerplate about risks generally without invoking materiality ("we face various risks").
-
-**The distinction:** "Risks that may materially affect us — see Item 1A" = N/O (cross-reference). "Risks have not materially affected us. See Item 1A" = SI (the first sentence IS an assessment). The test is whether the company is MAKING a materiality conclusion vs DESCRIBING what another section covers.
-
 ---
 
-## Borderline Cases
-
-### Case 1: Framework mention + firm-specific fact
-> *"We follow NIST CSF and our CISO oversees the program."*
-
-The NIST mention → Level 2 anchor. The CISO reference → firm-specific. **Apply boundary rule 2→3: "Does it mention anything unique to THIS company?" Yes (CISO role exists at this company) → Level 3.**
-
-### Case 2: Named role but generic description
-> *"Our Chief Information Security Officer is responsible for managing cybersecurity risks."*
-
-Names a role (CISO) → potentially Level 3. But the description is completely generic. **Apply judgment: the mere existence of a CISO title is firm-specific (not all companies have one). → Level 3.** If the paragraph said "a senior executive is responsible" without naming the role → Level 1.
-
-### Case 3: Specificity-rich None/Other
-> *"On March 15, 2025, we filed a Current Report on Form 8-K disclosing a cybersecurity incident. For details, see our Form 8-K filed March 15, 2025, accession number 0001193125-25-012345."*
-
-Contains specific dates and filing numbers, but the paragraph itself contains no disclosure content — it's a cross-reference. → **None/Other, Specificity 1.** Specificity only applies to disclosure substance, not to metadata.
-
-### Case 4: Hypothetical incident language in 10-K
-> *"We may experience cybersecurity incidents that could disrupt our operations."*
-
-This appears in Item 1C, not an 8-K. It describes no actual incident. → **Risk Management Process or Strategy Integration (depending on context), NOT Incident Disclosure.** Incident Disclosure is reserved for descriptions of events that actually occurred.
-
-### Case 5: Dual-category paragraph
-> *"The Audit Committee oversees our cybersecurity program, which is led by our CISO who holds CISSP certification and reports quarterly to the Committee."*
-
-Board (Audit Committee oversees) + Management (CISO qualifications, reporting). The opening clause sets the frame: this is about the Audit Committee's oversight, and the CISO detail is subordinate. → **Board Governance, Specificity 3.**
-
-### Case 6: Management Role vs. Risk Management Process — the person-vs-function test
-> *"Our CISO oversees the Company's cybersecurity program, which includes risk assessments, vulnerability scanning, and incident response planning. The program is aligned with the NIST CSF framework and integrated into our enterprise risk management process."*
-
-The CISO is named as attribution, but the paragraph is about what the program does — assessments, scanning, response planning, framework alignment, ERM integration. Remove "Our CISO oversees" and it still makes complete sense as a process description. → **Risk Management Process, Specificity 2** (NIST CSF framework, no firm-specific facts beyond that).
-
-> *"Our CISO has over 20 years of experience in cybersecurity and holds CISSP and CISM certifications. She reports directly to the CIO and oversees a team of 12 security professionals. Prior to joining the Company in 2019, she served as VP of Security at a Fortune 500 technology firm."*
-
-The entire paragraph is about the person: experience, certifications, reporting line, team size, tenure, prior role. → **Management Role, Specificity 4** (years of experience + team headcount + named certifications = multiple QV-eligible facts).
-
-### Case 7: Materiality disclaimer — Strategy Integration vs. None/Other
-> *"We have not identified any cybersecurity incidents or threats that have materially affected our business strategy, results of operations, or financial condition. However, like other companies, we have experienced threats from time to time. For more information, see Item 1A, Risk Factors."*
-
-Contains an explicit materiality assessment ("materially affected... business strategy, results of operations, or financial condition"). The cross-reference and generic threat mention are noise. → **Strategy Integration, Specificity 1.**
-
-> *"For additional information about risks related to our information technology systems, see Part I, Item 1A, 'Risk Factors.'"*
-
-No materiality assessment. Pure cross-reference. → **None/Other, Specificity 1.**
-
-### Case 8: SPAC / no-operations company
-> *"We are a special purpose acquisition company with no business operations. We have not adopted any cybersecurity risk management program or formal processes. Our Board of Directors is generally responsible for oversight of cybersecurity risks, if any. We have not encountered any cybersecurity incidents since our IPO."*
-
-Despite touching RMP (no program), Board Governance (board is responsible), and Strategy Integration (no incidents), the paragraph contains no substantive disclosure. The company explicitly has no program, and the board mention is perfunctory ("generally responsible... if any"). The absence of a program is not a program description.
→ **None/Other, Specificity 1.** - -### Case 9: Materiality language — assessment vs. speculation (v3.5 revision) -> *"We face risks from cybersecurity threats that, if realized and material, are reasonably likely to materially affect us, including our operations, business strategy, results of operations, or financial condition."* - -The phrase "reasonably likely to materially affect" is the SEC's Item 106(b)(2) qualifier — this is a forward-looking materiality **assessment**, not speculation. → **Strategy Integration, Specificity 1.** - -> *"We have not identified any risks from cybersecurity threats that have materially affected or are reasonably likely to materially affect the Company."* - -Backward-looking negative assertion + SEC-qualified forward-looking assessment. → **Strategy Integration, Specificity 1.** - -> *"Information systems can be vulnerable to a range of cybersecurity threats that could potentially have a material impact on our business strategy, results of operations and financial condition."* - -Despite mentioning "material impact" and "business strategy," the operative verb is "could" — this is boilerplate **speculation** present in virtually every 10-K risk factor section. The company is not stating a conclusion about whether cybersecurity HAS or IS REASONABLY LIKELY TO affect them; it is describing a hypothetical. → **None/Other, Specificity 1.** (Per Rule 6: "could have a material adverse effect" = speculation, not assessment.) - -> *"We face various risks related to our IT systems."* - -No materiality language, no connection to business strategy/financial condition. This is generic IT-dependence language. → **None/Other, Specificity 1.** - -**The distinction:** "reasonably likely to materially affect" (SEC qualifier, forward-looking assessment) ≠ "could potentially have a material impact" (speculation). 
The former uses the SEC's required assessment language; the latter uses conditional language that every company uses regardless of actual risk. - -### Case 10: Generic regulatory compliance language -> *"Regulatory Compliance: The Company is subject to various regulatory requirements related to cybersecurity, data protection, and privacy. Non-compliance with these regulations could result in financial penalties, legal liabilities, and reputational damage."* - -This acknowledges that regulations exist and non-compliance would be bad — a truism for every public company. It does not describe any process, program, or framework the company uses to comply. It does not make a materiality assessment. It names no specific regulation. → **None/Other, Specificity 1.** - -The key distinctions: -- If the paragraph names a specific regulation (GDPR, HIPAA, PCI DSS, CCPA) but still describes no company-specific program → **Risk Management Process, Specificity 2** (named standard triggers Sector-Adapted) -- If the paragraph assesses whether regulatory non-compliance has "materially affected" the business → **Strategy Integration** (materiality assessment per Rule 6) -- If the paragraph describes what the company *does* to comply (audits, controls, certifications) → **Risk Management Process** at appropriate specificity +**Exceptions:** +- SPACs and no-operations companies with no real program → N/O +- Pure speculation ("could have a material adverse effect") with no substantive disclosure → N/O +- Regulatory compliance acknowledged generically ("subject to various regulations") without describing what the company does to comply → N/O --- ## Dimension 2: Specificity Level -Each paragraph receives a specificity level (1-4) indicating how company-specific the disclosure is. Apply the decision test in order — stop at the first "yes." - ### Decision Test -1. 
**Count hard verifiable facts ONLY** (specific dates, dollar amounts, headcounts/percentages, named third-party firms, named products/tools, named certifications). TWO or more? → **Quantified-Verifiable (4)** -2. **Does it contain at least one fact from the IS list below?** → **Firm-Specific (3)** -3. **Does it name a recognized standard** (NIST, ISO 27001, SOC 2, CIS, GDPR, PCI DSS, HIPAA)? → **Sector-Adapted (2)** -4. **None of the above?** → **Generic Boilerplate (1)** +Check in order. Stop at the first "yes." + +1. **Does it contain at least one QV-eligible fact?** → **Level 4: Quantified-Verifiable** +2. **Does it contain at least one firm-specific fact (IS list)?** → **Level 3: Firm-Specific** +3. **Does it use any cybersecurity domain terminology (Domain list)?** → **Level 2: Domain-Adapted** +4. **None of the above** → **Level 1: Generic Boilerplate** None/Other paragraphs always receive Specificity 1. @@ -368,67 +233,143 @@ None/Other paragraphs always receive Specificity 1. | Level | Name | Description | |-------|------|-------------| -| 1 | Generic Boilerplate | Could paste into any company's filing unchanged. No named entities, frameworks, roles, dates, or specific details. | -| 2 | Sector-Adapted | Names a specific recognized standard (NIST, ISO 27001, SOC 2, etc.) but contains nothing unique to THIS company. General practices (pen testing, vulnerability scanning, tabletop exercises) do NOT qualify — only named standards. | -| 3 | Firm-Specific | Contains at least one fact from the IS list that identifies something unique to THIS company's disclosure. | -| 4 | Quantified-Verifiable | Contains TWO or more hard verifiable facts (see QV-eligible list). One fact = Firm-Specific, not QV. | +| 1 | Generic Boilerplate | Could paste into any company's filing unchanged. Uses only general business/risk language. No cybersecurity domain terminology, no firm-specific details, no verifiable facts. 
| +| 2 | Domain-Adapted | Uses cybersecurity domain terminology that a security professional would recognize as industry-specific, but contains nothing unique to THIS company. | +| 3 | Firm-Specific | Contains at least one fact identifying something unique to THIS company's cybersecurity posture — a named role, committee, program, or organizational detail not found at every public company. | +| 4 | Quantified-Verifiable | Contains at least one hard fact that an external party could independently verify — a specific number, date, named external entity, or externally-issued credential. | -### ✓ IS a Specific Fact (any ONE → at least Firm-Specific) +--- -- **Cybersecurity-specific titles:** CISO, CTO, CIO, VP of IT/Security, Information Security Officer, Director of IT Security, HSE Director overseeing cybersecurity, Chief Digital Officer (when overseeing cyber), Cybersecurity Director -- **Named non-generic committees:** Technology Committee, Cybersecurity Committee, Risk Committee, ERM Committee (NOT "Audit Committee" — that exists at every public company) -- **Specific team/department compositions:** "Legal, Compliance, and Finance" (but NOT just "a cross-functional team") -- **Specific dates:** "In December 2023", "On May 6, 2024", "fiscal 2025" -- **Named internal programs with unique identifiers:** "Cyber Incident Response Plan (CIRP)" (must have a distinguishing name/abbreviation — generic "incident response plan" does not qualify) -- **Named products, systems, tools:** Splunk, CrowdStrike Falcon, Azure Sentinel, ServiceNow -- **Named third-party firms:** Mandiant, Deloitte, CrowdStrike, PwC -- **Specific numbers:** headcounts, dollar amounts, percentages, exact durations ("17 years", "12 professionals") -- **Certification claims:** "We maintain ISO 27001 certification" (holding a certification is more than naming a standard) -- **Named universities in credential context:** "Ph.D. 
from Princeton University" (independently verifiable) +### Domain Terminology — any one → at least Level 2 -### ✗ IS NOT a Specific Fact (do NOT use to justify Firm-Specific) +These terms originate from the cybersecurity domain. The test: **would this term appear naturally in a generic enterprise risk management document that has nothing to do with cybersecurity?** If no → it is domain terminology. -- **Generic governance:** "the Board", "Board of Directors", "management", "Audit Committee", "the Committee" -- **Generic C-suite:** CEO, CFO, COO, President, General Counsel — these exist at every company and are not cybersecurity-specific -- **Generic IT leadership (NOT cybersecurity-specific):** "Head of IT", "IT Manager", "Director of IT", "Chief Compliance Officer", "Associate Vice President of IT" — these are general corporate/IT titles, not cybersecurity roles per the IS list -- **Unnamed entities:** "third-party experts", "external consultants", "cybersecurity firms", "managed service provider" -- **Generic cadences:** "quarterly", "annual", "periodic", "regular" — without exact dates -- **Boilerplate phrases:** "cybersecurity risks", "material adverse effect", "business operations", "financial condition" -- **Standard incident language:** "forensic investigation", "law enforcement", "regulatory obligations", "incident response protocols" -- **Vague quantifiers:** "certain systems", "some employees", "a number of", "a portion of" -- **Common practices:** "penetration testing", "vulnerability scanning", "tabletop exercises", "phishing simulations", "security awareness training" -- **Generic program names:** "incident response plan", "business continuity plan", "cybersecurity program", "Third-Party Risk Management Program", "Company-wide training" — no unique identifier or distinguishing abbreviation -- **Company self-references:** the company's own name, "the Company", "the Bank", subsidiary names, filing form types -- **Company milestones:** "since our IPO", 
"since inception" — not cybersecurity facts +**Practices and activities:** +- Penetration testing / pen testing +- Vulnerability scanning / vulnerability assessments +- Red teaming / red team exercises +- Phishing simulations +- Security awareness training +- Threat hunting / threat intelligence +- Patch management +- Identity and access management (IAM) +- Data loss prevention (DLP) +- Network segmentation +- Encryption (as a security measure) -### QV-Eligible Facts (count toward the 2-fact threshold for Quantified-Verifiable) +**Tool and infrastructure categories:** +- SIEM (Security Information and Event Management) +- SOC (Security Operations Center) +- EDR / XDR / MDR (endpoint/extended/managed detection and response) +- WAF (Web Application Firewall) +- IDS / IPS (Intrusion Detection/Prevention System) +- MFA / 2FA (Multi-Factor Authentication) +- Firewall (as a security control) +- Antivirus / anti-malware -✓ Specific dates (month+year or exact date) -✓ Dollar amounts, headcounts, percentages -✓ Named third-party firms (Mandiant, CrowdStrike, Deloitte) -✓ Named products/tools (Splunk, Azure Sentinel) -✓ Named certifications held by individuals (CISSP, CISM, CEH) -✓ Years of experience as a specific number ("17 years", "over 20 years") -✓ Named universities in credential context +**Architectural concepts:** +- Zero trust / zero trust architecture +- Defense in depth +- Least privilege / principle of least privilege -**Do NOT count toward QV** (these trigger Firm-Specific but not QV): -✗ Named roles (CISO, CIO) -✗ Named committees -✗ Named frameworks (NIST, ISO 27001) — these trigger Sector-Adapted -✗ Team compositions, reporting structures -✗ Named internal programs -✗ Generic degrees without named university ("BS in Management") +**Named standards and frameworks:** +- NIST CSF / NIST Cybersecurity Framework +- ISO 27001 / ISO 27002 +- SOC 2 (Type I / Type II) +- CIS Controls / CIS Benchmarks +- PCI DSS, HIPAA, GDPR (in cybersecurity context) +- COBIT, MITRE 
ATT&CK + +**Specific threat types:** +- Ransomware, malware, phishing (as a threat) +- DDoS (Distributed Denial of Service) +- Supply chain attack / compromise +- Social engineering +- Advanced persistent threat (APT) +- Zero-day vulnerability / exploit + +### NOT Domain Terminology — remains Level 1 + +These terms belong to general business, IT, or enterprise risk management. + +- Risk assessment / risk management / risk mitigation +- Incident response plan / IRP +- Business continuity plan / disaster recovery +- Tabletop exercises (without cybersecurity qualifier) +- Enterprise risk management (ERM) +- Internal controls / policies and procedures +- Compliance (general) +- "Processes to identify, assess, and manage risks" +- "Measures to protect our systems and data" +- "Regular monitoring" / "continuous improvement" +- "Cross-functional team" +- "Dedicated cybersecurity team" (organizational approach, not a unique fact) + +--- + +### Firm-Specific Facts — any one → at least Level 3 + +These identify something unique to THIS company's cybersecurity posture. 
+ +**IS firm-specific:** +- **Cybersecurity-specific titles:** CISO, CTO, CIO, VP of IT/Security, Information Security Officer, Director of IT Security, Cybersecurity Director, Chief Digital Officer (when overseeing cyber) +- **Named non-generic committees:** Technology Committee, Cybersecurity Committee, Cybersecurity Steering Committee, Risk Committee (NOT "Audit Committee" — every public company has one) +- **Specific team/department compositions:** "Legal, Compliance, and Finance" (NOT "a cross-functional team") +- **Named internal programs with distinguishing identifiers:** "Cyber Incident Response Plan (CIRP)" (generic "incident response plan" does not qualify) +- **Named individuals** in a cybersecurity role context +- **Specific organizational claims:** "24/7 security operations" (implies specific organizational investment beyond generic monitoring) + +**NOT firm-specific (too generic):** +- "The Board," "Board of Directors," "Audit Committee," "management" — exist at every public company +- CEO, CFO, COO, President, General Counsel — not cybersecurity-specific roles +- "Head of IT," "IT Manager," "Director of IT" — general IT titles, not cybersecurity leadership +- "Third-party experts," "external consultants," "cybersecurity firms" — unnamed entities +- "Quarterly," "annual," "regular" — generic cadences without specific dates +- "The Company," company self-references, subsidiary names +- Generic program names without distinguishing identifiers: "incident response plan," "cybersecurity program," "risk management program" + +--- + +### QV-Eligible Facts — any one → Level 4 + +A QV fact is one that an external party could independently verify using public records, certification databases, or third-party sources. 
The test: **is it both quantified (a specific number, date, or named external reference) AND independently verifiable?** + +**IS QV-eligible:** +- **Specific numbers:** dollar amounts, headcounts, percentages, years of experience (as a number), team sizes, specific durations +- **Specific dates:** month+year or exact date tied to a cybersecurity fact or event +- **Named external entities:** third-party firms (Mandiant, Deloitte, CrowdStrike), products/tools (Splunk, CrowdStrike Falcon, Azure Sentinel, ServiceNow) +- **Certifications held by individuals:** CISSP, CISM, CEH, CRISC (verifiable via issuing body) +- **Certifications/audits held by the company:** "We maintain ISO 27001 certification," "completed SOC 2 Type II audit" (verifiable external claim) +- **Named universities in credential context:** "Ph.D. from Princeton University" + +**NOT QV-eligible:** +- Named roles (CISO, CTO) → firm-specific (Level 3), not a quantified claim +- Named individuals without verifiable details → identification, not a quantified claim +- Named committees → organizational structure, not externally verifiable +- Named internal programs → firm naming, not external verification +- Named standards FOLLOWED (not certified): "aligned with NIST CSF" → Domain-Adapted (Level 2) +- Generic cadences: "quarterly," "annually" → not specific enough to verify +- Fiscal year as generic reporting context: "fiscal 2024" without a specific cybersecurity fact tied to it + +**Certification distinction:** +- "Our program is aligned with ISO 27001" → **Level 2** (references a standard) +- "We are working toward ISO 27001 certification" → **Level 3** (firm-specific intent) +- "We maintain ISO 27001 certification" → **Level 4** (verifiable claim with external body) + +--- ### Validation Step -Before finalizing specificity, review the extracted facts. Remove any that appear on the NOT list. If no facts remain after filtering → Generic Boilerplate (or Sector-Adapted if a named standard is present). 
Do not let NOT-list items inflate the specificity rating. +Before finalizing specificity: +1. Identify all facts in the paragraph +2. Check each against the NOT lists — remove any that appear +3. Classify the remaining facts: QV-eligible → Level 4, firm-specific → Level 3, domain terminology → Level 2 +4. Apply the decision test with validated facts only --- ## LLM Response Schema -The exact Zod schema passed to `generateObject`. This is the contract between the LLM and our pipeline. - ```typescript import { z } from "zod"; @@ -453,81 +394,23 @@ export const Confidence = z.enum(["high", "medium", "low"]); export const LabelOutput = z.object({ content_category: ContentCategory - .describe("The single most applicable content category for this paragraph"), + .describe("The single most applicable content category"), specificity_level: SpecificityLevel - .describe("1=generic boilerplate, 2=sector-adapted, 3=firm-specific, 4=quantified-verifiable"), + .describe("1=generic, 2=domain-adapted, 3=firm-specific, 4=quantified-verifiable"), category_confidence: Confidence - .describe("high=clear-cut, medium=some ambiguity, low=genuinely torn between categories"), + .describe("high=clear-cut, medium=some ambiguity, low=genuinely torn"), specificity_confidence: Confidence - .describe("high=clear-cut, medium=borderline adjacent levels, low=could argue for 2+ levels"), + .describe("high=clear-cut, medium=borderline, low=could argue 2+ levels"), reasoning: z.string() - .describe("Brief 1-2 sentence justification citing specific evidence from the text"), + .describe("1-2 sentence justification citing specific evidence from the text"), }); ``` -**Output example:** -```json -{ - "content_category": "Risk Management Process", - "specificity_level": 3, - "category_confidence": "high", - "specificity_confidence": "medium", - "reasoning": "Names NIST CSF (sector-adapted) and describes quarterly tabletop exercises specific to this company's program, pushing to firm-specific. 
Specificity borderline 2/3 — tabletop exercises could be generic or firm-specific depending on interpretation." -} -``` - --- -## System Prompt +## Annotator Information Template -> **Note:** The system prompt below is the v1.0 template from the initial codebook. The production Stage 1 prompt is **v2.5** (in `ts/src/label/prompts.ts`), which incorporates the IS/NOT lists, calibration examples, validation step, and decision test from this codebook. The Stage 2 judge prompt (`buildJudgePrompt()` in the same file) adds dynamic disambiguation rules and confidence calibration. **This codebook is the source of truth; the prompt mirrors it.** - -The v1.0 template is preserved below for reference. See `ts/src/label/prompts.ts` for the current production prompt. - -``` -You are an expert annotator classifying paragraphs from SEC cybersecurity disclosures (Form 10-K Item 1C and Form 8-K Item 1.05 filings) under SEC Release 33-11216. - -For each paragraph, assign exactly two labels: - -(a) content_category — the single most applicable category: - - "Board Governance": Board/committee oversight of cyber risk, briefing cadence, board member cyber expertise. SEC basis: Item 106(c)(1). - - "Management Role": CISO/CTO/CIO identification, qualifications, reporting lines, management committees. SEC basis: Item 106(c)(2). - - "Risk Management Process": Risk assessment methods, framework adoption (NIST, ISO), vulnerability management, monitoring, incident response planning, tabletop exercises, ERM integration. SEC basis: Item 106(b). - - "Third-Party Risk": Vendor/supplier security oversight, external assessor requirements, contractual security standards, supply chain risk. SEC basis: Item 106(b). - - "Incident Disclosure": Description of actual cybersecurity incidents — nature, scope, timing, impact, remediation. SEC basis: 8-K Item 1.05. - - "Strategy Integration": Material impact on business strategy/financials, cyber insurance, investment/resource allocation. 
SEC basis: Item 106(b)(2). - - "None/Other": Forward-looking disclaimers, section headers, cross-references, non-cybersecurity content. - -If a paragraph spans multiple categories, assign the DOMINANT one — the category that best describes the paragraph's primary communicative purpose. - -(b) specificity_level — integer 1 through 4: - 1 = Generic Boilerplate: Could apply to any company unchanged. Conditional language ("may," "could"). No named entities or frameworks. - 2 = Sector-Adapted: Names frameworks/standards (NIST, ISO, SOC 2) or industry-specific terms, but nothing unique to THIS company. - 3 = Firm-Specific: Contains at least one fact unique to this company — named roles, specific committees, concrete reporting lines, named programs. - 4 = Quantified-Verifiable: Two or more verifiable facts — dollar amounts, dates, headcounts, percentages, named third-party firms, audit results. - -BOUNDARY RULES (apply when torn between adjacent levels): - 1 vs 2: "Does it name ANY framework, standard, or industry-specific term?" → Yes = 2 - 2 vs 3: "Does it mention anything unique to THIS company?" → Yes = 3 - 3 vs 4: "Does it contain TWO OR MORE independently verifiable facts?" → Yes = 4 - -SPECIAL RULES: - - None/Other paragraphs always get specificity_level = 1. - - Hypothetical incident language ("we may experience...") in a 10-K is NOT Incident Disclosure. It is Risk Management Process or Strategy Integration. - - Incident Disclosure is only for descriptions of events that actually occurred. - -CONFIDENCE RATINGS (per dimension): - - "high": Clear-cut classification with no reasonable alternative. - - "medium": Some ambiguity, but one option is clearly stronger. - - "low": Genuinely torn between two or more options. -Be honest — overconfident ratings on hard cases are worse than admitting uncertainty. - -Respond with valid JSON matching the required schema. 
The "reasoning" field should cite specific words or facts from the paragraph that justify your labels (1-2 sentences). -``` - ---- - -## User Prompt Template +Each paragraph is presented with this context: ``` Company: {company_name} ({ticker}) @@ -541,108 +424,32 @@ Paragraph: --- -## Stage 2 Judge Prompt +## Gold Set Protocol -Used when Stage 1 annotators disagree. The judge sees the paragraph plus all three prior annotations in randomized order. +### Sampling -``` -You are adjudicating a labeling disagreement among three independent annotators. Each applied the same codebook but reached different conclusions. +**Stratified by category** with specificity floor: +- ~170 paragraphs per content category × 7 = ~1,190 +- Random within each stratum (not selected for difficulty) +- Secondary constraint: minimum ~100 per specificity level across the full set +- Separate development set (~200 paragraphs) for prompt iteration — excluded from holdout -Review all three opinions below, then provide YOUR OWN independent label based on the codebook definitions above. Do not default to majority vote — use your own expert judgment. If you agree with one annotator's reasoning, explain why their interpretation is correct. +### Human Labeling -Company: {company_name} ({ticker}) -Filing type: {filing_type} -Filing date: {filing_date} -Section: {sec_item} +1. Three independent annotators label the full holdout using this codebook +2. Compute inter-rater reliability: + - Cohen's Kappa (category, pairwise) — target > 0.75 + - Krippendorff's Alpha (specificity, ordinal) — target > 0.67 +3. Gold labels = majority vote. Where all three disagree, model consensus serves as tiebreaker. 
-Paragraph: -{paragraph_text} +### AI-Labeled Extension ---- Prior annotations (randomized order) --- - -Annotator A: content_category="{cat_a}", specificity_level={spec_a} - Reasoning: "{reason_a}" - -Annotator B: content_category="{cat_b}", specificity_level={spec_b} - Reasoning: "{reason_b}" - -Annotator C: content_category="{cat_c}", specificity_level={spec_c} - Reasoning: "{reason_c}" -``` - ---- - -## Cost and Time Tracking - -### Per-Annotation Record - -Every API call produces an `Annotation` record with full provenance: - -```typescript -provenance: { - modelId: string, // OpenRouter model ID e.g. "google/gemini-3.1-flash-lite-preview" - provider: string, // Upstream provider e.g. "google", "xai", "anthropic" - generationId: string, // OpenRouter generation ID (from response id field) - stage: "stage1" | "stage2-judge" | "benchmark", - runId: string, // UUID per batch run - promptVersion: string, // "v1.0" — tracks prompt iterations - inputTokens: number, // From usage.prompt_tokens - outputTokens: number, // From usage.completion_tokens - reasoningTokens: number, // From usage.completion_tokens_details.reasoning_tokens - costUsd: number, // REAL cost from OpenRouter usage.cost (not estimated) - latencyMs: number, // Wall clock per request - requestedAt: string, // ISO datetime -} -``` - -### Cost Source - -OpenRouter returns **actual cost** in every response body under `usage.cost` (USD). No estimation needed. Each response also includes a `generationId` (the `id` field) which we store in every annotation record. 
This enables: -- Audit trail: look up any annotation on OpenRouter's dashboard -- Richer stats via `GET /api/v1/generation?id={generationId}` (latency breakdown, provider routing, native token counts) - -### Aggregation Levels - -| Level | What | Where | -|-------|------|-------| -| Per-annotation | Single API call cost + latency | In each Annotation JSONL record | -| Per-model | Sum across all annotations for that model | `bun sec label:cost` | -| Per-stage | Stage 1 total, Stage 2 total | `bun sec label:cost` | -| Per-phase | Labeling total, benchmarking total | `bun sec label:cost` | -| Project total | Everything | `bun sec label:cost` | - -### Time Tracking - -| Metric | How | -|--------|-----| -| Per-annotation latency | `Date.now()` before/after API call | -| Batch throughput | paragraphs/minute computed from batch start/end | -| Stage 1 wall clock | Logged at batch start and end | -| Stage 2 wall clock | Logged at batch start and end | -| Total labeling time | Sum of all batch durations | -| Per-model benchmark time | Tracked during benchmark runs | - -All timing is logged to `data/metadata/cost-log.jsonl` with entries like: - -```json -{ - "event": "batch_complete", - "stage": "stage1", - "modelId": "openai/gpt-oss-120b", - "paragraphsProcessed": 50000, - "wallClockSeconds": 14400, - "totalCostUsd": 38.50, - "throughputPerMinute": 208.3, - "timestamp": "2026-03-29T10:30:00Z" -} -``` +Up to 20,000 additional paragraphs labeled by model panel consensus for supplementary evaluation. Not the assignment-defined holdout. 
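The gold-label rule stated above — majority vote across the three human annotators, with model-panel consensus as tiebreaker on three-way splits — can be sketched as follows. `modelConsensus` is a stand-in for whatever label the panel produced; this is not an actual pipeline function:

```typescript
// Sketch of the gold-labeling rule: 2/3 or 3/3 human majority wins;
// a three-way split falls back to the model-panel consensus label.
function goldLabel(
  human: [string, string, string],
  modelConsensus: string,
): string {
  const counts = new Map<string, number>();
  for (const label of human) counts.set(label, (counts.get(label) ?? 0) + 1);
  for (const [label, n] of counts) {
    if (n >= 2) return label; // majority
  }
  return modelConsensus; // all three disagree → tiebreaker
}
```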
--- ## NIST CSF 2.0 Mapping -For academic grounding: - | Our Category | NIST CSF 2.0 | |-------------|-------------| | Board Governance | GOVERN (GV.OV, GV.RR) | @@ -651,221 +458,3 @@ For academic grounding: | Third-Party Risk | GOVERN (GV.SC) | | Incident Disclosure | DETECT, RESPOND, RECOVER | | Strategy Integration | GOVERN (GV.OC, GV.RM) | - ---- - -## Prompt Versioning - -Track prompt changes so we can attribute label quality to specific prompt versions: - -| Version | Date | N | Change | -|---------|------|---|--------| -| v1.0 | 2026-03-27 | 40 | Initial codebook-aligned prompt | -| v1.1 | 2026-03-28 | 40 | Added calibration examples, category decision rules. Cat 95%, Spec 68%, Both 62%. | -| v1.2 | 2026-03-28 | 40 | Expanded "what counts as unique" + materiality rule. REGRESSED (88% cat). | -| v2.0 | 2026-03-28 | 40 | Chain-of-thought schema with specific_facts array + algorithmic specificity. Gemini/Grok 5/5, GPT-OSS broken. | -| v2.1 | 2026-03-28 | 40 | Two-tier facts (organizational vs verifiable) + text enum labels. Gemini/Grok perfect but nano overrates. | -| v2.2 | 2026-03-28 | 40 | Decision-test format, simplified facts, "NOT a fact" list. Cat 95%, Spec 68%, Both 65%, Consensus 100%. | -| v2.2 | 2026-03-28 | 500 | 500-sample baseline. Cat 85.0%, Spec 60.8%, Both 51.4%, Consensus 99.6%, Spread 0.240. | -| v2.3 | 2026-03-28 | 500 | Tightened Sector-Adapted, expanded IS/NOT lists, QV boundary rules. Spec 72.0%, Both 59.2%. [1,1,2] eliminated. | -| v2.4 | 2026-03-28 | 500 | Validation step, schema constraint on specific_facts. Spec 78.6%, Both 66.8%. Nano overrating fixed. | -| v2.5 | 2026-03-28 | 500 | Improved Inc↔Strat tiebreaker, QV calibration examples. **PRODUCTION**: Cat 86.8%, Spec 81.0%, Both 70.8%, Consensus 99.4%, Spread 0.130. Inc↔Strat eliminated. | -| v2.6 | 2026-03-28 | 500 | Changed category defs to TEST: format. REGRESSED (Both 67.8%). | -| v2.7 | 2026-03-28 | 500 | Added COMMON MISTAKES section. 100% consensus but Both 67.6%. 
| -| v3.0 | 2026-03-29 | — | **Codebook overhaul.** Three rulings: (A) materiality disclaimers → Strategy Integration, (B) SPACs/no-ops → None/Other, (C) person-vs-function test for Mgmt Role vs RMP. Added full IS/NOT lists and QV-eligible list to codebook. Added Rule 2b, Rule 6, 4 new borderline cases. Prompt update pending. | -| v3.5 | 2026-04-02 | 26 | **Post-gold-analysis rulings, 6 iteration rounds on 26 regression paragraphs ($1.02).** Driven by 13-signal cross-analysis + targeted prompt iteration. (A) Rule 6 refined: materiality ASSESSMENTS → SI (backward-looking conclusions + "reasonably likely" forward-looking). Generic "could have a material adverse effect" is NOT an assessment — it stays N/O/RMP. Cross-references with materiality language also stay N/O. (B) Rule 2 expanded: purpose test for BG — governance structure descriptions are BG, but a one-sentence committee mention doesn't flip the category. (C) Rule 2b expanded: three-step MR↔RMP decision chain; Step 1 only decisive for RMP (process is subject), never short-circuits to MR. (D) N/O vs RMP clarified: actual measures implemented = RMP even in risk-factor framing. Result: +4pp on 26 hardest paragraphs vs v3.0 (18→22/26). | - -When the prompt changes (after pilot testing, rubric revision, etc.), bump the version and log what changed. Every annotation record carries `promptVersion` so we can filter/compare. - ---- - -## Iterative Prompt Tuning Protocol - -The v1.0 system prompt is built from theory and synthetic examples. Before firing the full 50K run, we iterate on real data to find and fix failure modes while it costs cents, not dollars. - -### Phase 0: Seed sample (before extraction is ready) - -Grab 20-30 real Item 1C paragraphs manually from EDGAR full-text search (`efts.sec.gov/LATEST/search-index?q="Item 1C" cybersecurity`). Paste into a JSONL by hand. This lets prompt tuning start immediately while extraction code is still being built. 
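Phase 0 can be scripted once manual pasting gets tedious. A minimal fetch sketch against the EFTS endpoint cited above (the `buildEftsUrl` helper and the `hits.hits` response shape are assumptions, not verified API details):

```typescript
// Sketch: pull candidate Item 1C paragraphs from EDGAR full-text search (EFTS).
// `buildEftsUrl` is a hypothetical helper; the `hits.hits` response shape is an
// assumption to verify against a live response before relying on it.
const EFTS_BASE = "https://efts.sec.gov/LATEST/search-index";

function buildEftsUrl(query: string, forms = "10-K"): string {
  const params = new URLSearchParams({ q: query, forms });
  return `${EFTS_BASE}?${params.toString()}`;
}

async function fetchSeedFilings(query: string): Promise<unknown[]> {
  // SEC asks for a descriptive User-Agent with contact info on automated traffic.
  const res = await fetch(buildEftsUrl(query), {
    headers: { "User-Agent": "research-project contact@example.com" },
  });
  if (!res.ok) throw new Error(`EFTS request failed: ${res.status}`);
  const body = (await res.json()) as { hits?: { hits?: unknown[] } };
  return body.hits?.hits ?? [];
}
```

Hits can then be pasted (or written) into the seed JSONL by hand, one paragraph per line.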
- -### Phase 1: Micro-pilot (30 paragraphs, all 3 Stage 1 models) - -1. Select ~30 real paragraphs covering: - - At least 2 per content category (incl. None/Other) - - At least 2 per specificity level - - Mix of industries and filing years - - 5+ deliberately tricky borderline cases - -2. Run all 3 Stage 1 models on these 30 with prompt v1.0. - -3. **You and at least one teammate independently label the same 30** using the codebook. These are your reference labels. - -4. Compare: - - Per-model accuracy vs reference - - Inter-model agreement (where do they diverge?) - - Per-category confusion (which categories do models mix up?) - - Per-specificity bias (do models systematically over/under-rate?) - - Are confidence ratings calibrated? (Do "high" labels match correct ones?) - -5. **Identify failure patterns.** Common ones: - - Models gravitating to "Risk Management Process" (largest category — pull) - - Models rating specificity too high (any named entity → firm-specific) - - Board Governance / Management Role confusion - - Missing None/Other (labeling boilerplate as Strategy Integration) - -### Phase 2: Prompt revision (v1.1) - -Based on Phase 1 failures, revise the system prompt: -- Add "common mistakes" section with explicit corrections -- Add few-shot examples for confused categories -- Sharpen boundary rules where models diverge -- Add negative examples ("This is NOT Incident Disclosure because...") - -**Do not change the Zod schema or category definitions** — only the system prompt text. Bump to v1.1. Re-run the same 30 paragraphs. Compare to v1.0. - -### Phase 3: Scale pilot (200 paragraphs) - -1. Extract 200 real paragraphs (stratified, broader set of filings). - -2. Run all 3 Stage 1 models with the best prompt version. - -3. Compute: - - **Inter-model Fleiss' Kappa** on category: target ≥ 0.65 - - **Inter-model Spearman correlation** on specificity: target ≥ 0.70 - - **Consensus rate**: % with 2/3+ agreement on both dims. Target ≥ 75%. 
- - **Confidence calibration**: are "high confidence" labels more likely agreed-upon? - -4. If targets not met: - - Analyze disagreements — genuine ambiguity or prompt failure? - - Prompt failure → revise to v1.2, re-run - - Genuine ambiguity → consider rubric adjustment (merge categories, collapse specificity) - - Repeat until targets met or documented why they can't be - -5. **Cost check**: extrapolate from 200 to 50K. Reasoning token usage reasonable? - -### Phase 4: Green light - -Once scale pilot passes: -- Lock prompt version (no changes during full run) -- Lock model configuration (reasoning effort, temperature) -- Document final prompt, configs, and pilot results -- Fire the full 50K annotation run - ---- - -## Pipeline Reliability & Observability - -### Resumability - -All API-calling scripts (annotation, judging, benchmarking) use the same pattern: - -1. Load output JSONL → parse each line → collect completed paragraph IDs into a Set -2. Lines that fail `JSON.parse` are skipped (truncated from a crash) -3. Filter input to only paragraphs NOT in the completed set -4. For each completion, append one valid JSON line + `flush()` - -JSONL line-append is atomic on Linux. Worst case on crash: one truncated line, skipped on reload. No data loss, no duplicate work, no duplicate API spend. - -### Error Handling - -| Error Type | Examples | Strategy | -|------------|----------|----------| -| Transient | 429, 500, 502, 503, ECONNRESET, timeout | Exponential backoff: 1s→2s→4s→8s→16s. Max 5 retries. | -| Permanent | 400, 422 (bad request) | Log to `{output}-errors.jsonl`, skip | -| Validation | Zod parse fail on LLM response | Retry once, then log + skip | -| Budget | 402 (out of credits) | Stop immediately, write session summary, exit | -| Consecutive | 10+ errors in a row | Stop — likely systemic (model down, prompt broken) | - -Error paragraphs get their own file. Retry later with `--retry-errors`. - -### Graceful Shutdown (SIGINT/SIGTERM) - -On Ctrl+C: -1. 
Stop dispatching new work -2. Wait for in-flight requests to complete (already paid for) -3. Write session summary -4. Print final stats, exit 0 - -### Live Dashboard (stderr) - -Updates every second: - -``` - SEC-cyBERT │ label:annotate │ google/gemini-3.1-flash-lite-preview │ v1.1 - ───────────────────────────────────────────────────────────────────────── - Progress 12,847 / 50,234 (25.6%) ETA 42m 18s - Session $1.23 │ 38m 12s elapsed │ 337.4 para/min - Totals $4.56 all-time │ 3 errors (0.02%) │ 7 retries - Latency p50: 289ms │ p95: 812ms │ p99: 1,430ms - Reasoning avg 47 tokens/para │ 12.3% of output tokens -``` - -Goes to stderr so stdout stays clean. - -### Session Log - -Every run appends to `data/metadata/sessions.jsonl`: - -```json -{ - "sessionId": "a1b2c3d4", - "command": "label:annotate", - "modelId": "google/gemini-3.1-flash-lite-preview", - "stage": "stage1", - "promptVersion": "v1.1", - "startedAt": "2026-03-29T10:00:00Z", - "endedAt": "2026-03-29T10:38:12Z", - "durationSeconds": 2292, - "paragraphsTotal": 50234, - "paragraphsProcessed": 12847, - "paragraphsSkippedResume": 37384, - "paragraphsErrored": 3, - "costUsd": 1.23, - "reasoningTokensTotal": 482000, - "avgLatencyMs": 450, - "p95LatencyMs": 812, - "throughputPerMinute": 337.4, - "concurrency": 12, - "exitReason": "complete" -} -``` - -`exitReason`: `complete` | `interrupted` (Ctrl+C) | `budget_exhausted` (402) | `error_threshold` (consecutive limit) - -### OpenRouter Generation ID - -Every annotation record includes the OpenRouter `generationId` from the response `id` field. 
This enables: -- **Audit trail**: look up any annotation on OpenRouter's dashboard -- **Rich stats**: `GET /api/v1/generation?id={generationId}` returns latency breakdown, provider routing, native token counts -- **Dispute resolution**: if a label looks wrong, inspect the exact generation that produced it - ---- - -## Gold Set Protocol - -### Sampling (1,200 paragraphs minimum) - -Stratify by: -- Content category (all 7 represented, oversample rare categories) -- Specificity level (all 4 represented) -- GICS sector (financial services, tech, healthcare, manufacturing minimum) -- Filing year (FY2023 and FY2024) - -### Human Labeling Process - -Labeling is done through a purpose-built web tool that enforces quality: -1. **Rules quiz:** Every annotator must read the codebook and pass a quiz on the rules before each labeling session. The quiz tests the three most common confusion axes: Management Role vs RMP (person-vs-function test), materiality disclaimers (Strategy Integration vs None/Other), and QV fact counting. -2. **Warm-up:** First 5 paragraphs per session are warm-up (pre-labeled, with feedback). Not counted toward gold set. -3. **Independent labeling:** Three team members independently label the full gold set using this codebook. -4. Compute inter-rater reliability: - - Cohen's Kappa (for content category — nominal, pairwise) - - Krippendorff's Alpha (for specificity level — ordinal, all annotators) - - Per-class confusion matrices - - **Target: Kappa > 0.75, Alpha > 0.67** -5. Adjudicate disagreements: third annotator tiebreaker, or discussion consensus with documented rationale -6. 
Run the full GenAI pipeline on the gold set and compare to human labels - -### If Agreement Is Poor - -- If Kappa < 0.60 on any category pair: revise that category's definition and boundary rules, re-pilot -- If Alpha < 0.50 on specificity: collapse 4-point to 3-point scale (merge 1+2 into "Non-specific" or 3+4 into "Substantive") -- Document the collapse decision and rationale in this codebook diff --git a/docs/NARRATIVE-v1.md b/docs/NARRATIVE-v1.md new file mode 100644 index 0000000..e37bde7 --- /dev/null +++ b/docs/NARRATIVE-v1.md @@ -0,0 +1,1292 @@ +# Project Narrative — SEC Cybersecurity Disclosure Quality Classifier + +This document captures the process, roadblocks, decisions, and resolutions from building the SEC cybersecurity disclosure quality classifier. It serves as the source material for the final paper and presentation. + +--- + +## Phase 1: Project Scoping and Construct Design + +### The Problem + +SEC Release 33-11216 (July 2023) created a new annual cybersecurity disclosure requirement (10-K Item 1C) and an incident disclosure requirement (8-K Item 1.05). By FY2024, ~9,000-10,000 filings exist. No validated classifier or public labeled dataset exists for assessing the quality of these disclosures. Investors, regulators, and compliance officers need scalable tools to distinguish substantive disclosures from boilerplate. + +### Methodology Decision: Ringel (2023) "Synthetic Experts" + +We adopted the Ringel (2023) "Synthetic Experts" pipeline: use frontier LLMs to generate training labels at scale, then distill into an efficient encoder model. 
This approach was chosen because: +- Manual labeling of 50,000+ paragraphs is infeasible for a 6-person team +- Multiple cheap LLMs annotating in parallel provide built-in quality control through inter-annotator agreement +- The encoder distillation step produces a model that can classify at inference time without LLM API costs + +### Construct: Two Classification Dimensions + +We defined two simultaneous classification tasks per paragraph: +1. **Content Category** (7 mutually exclusive classes) — what the paragraph is about, grounded in the SEC rule's own structure (Board Governance, Management Role, Risk Management Process, Third-Party Risk, Incident Disclosure, Strategy Integration, None/Other) +2. **Specificity Level** (4-point ordinal) — how company-specific the disclosure is, from generic boilerplate to quantified-verifiable facts + +The construct maps to NIST CSF 2.0 categories for academic grounding. + +--- + +## Phase 2: Data Acquisition and Corpus Construction + +### The Extraction Problem + +SEC filings are not structured data. They're HTML generated from PDFs, XBRL, and Word documents by dozens of different tools, each producing different artifacts. Building a reliable extraction pipeline for ~9,000 filings meant solving a series of messy, real-world data engineering problems. + +### Pipeline Architecture + +Built in TypeScript (~1,000 lines of extraction code across `parse-item1c.ts`, `segment.ts`, `fast-reparse.ts`, and pipeline orchestration): + +``` +EDGAR Master Index → enumerate 10-K filings → download HTML → extract Item 1C → segment paragraphs → JSONL +submissions.zip → scan for 8-K Item 1.05 → download HTML → extract → segment → merge with 10-K corpus +``` + +### Roadblock: HTML Variability + +Every filing's HTML is different. 
The same logical content looks completely different depending on the tool that generated the HTML: + +- **Word splitting from inline elements.** XBRL and styling tags break words mid-token: `Item 2` renders correctly in a browser but parses as "Item2" in code. The same splitting hits words like `cybersecurity`. Required detecting adjacent inline element boundaries and inserting spaces selectively. + +- **CamelCase joins from PDF converters.** PDF-to-HTML tools merge sentences across formatting boundaries: `sentence.Next sentence` instead of `sentence. Next sentence`. Required regex passes to detect missing spaces after punctuation. + +- **Page breaks mid-sentence.** Page numbers (`28`, `- 12 -`, `F-3`), running headers (`ACME CORP — ANNUAL REPORT`), and subsidiary headers (`ENTERGY ARKANSAS, LLC AND SUBSIDIARIES`) get spliced into the middle of content paragraphs. Required filtering a catalog of page artifact patterns. + +- **Table of Contents shadowing.** "Item 1C" appears at least twice in every 10-K — once in the Table of Contents and once in the actual content. Using the first match extracts the wrong section. Took several iterations to discover we needed the LAST match — this was a silent failure that produced empty or wrong extractions for hundreds of filings before we caught it. + +- **XBRL tag pollution.** Inline XBRL wraps financial facts in `ix:header`, `ix:references`, and `ix:nonFraction` tags that carry no display content but add noise. Required stripping all `ix:*` tags before text processing. + +- **Entity encoding chaos.** Non-breaking spaces, curly quotes (`“` `”`), em and en dashes (`—` `–`), and bullets (`•`) — each needs correct decoding, and different filing tools use different entity styles for the same characters. + +### Paragraph Segmentation + +After extracting clean section text, splitting into paragraphs had its own challenges: + +- **Bullet list merging.** Disclosures frequently use bullet lists ("Our program includes: • risk assessment • vulnerability scanning"). 
Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless. +- **Continuation line detection.** Sentences split across HTML block elements need rejoining. Heuristic: if the previous block lacks terminal punctuation and the next starts lowercase or with a continuation phrase (`and`, `or`, `including`, `such as`), merge. +- **Length boundaries.** Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries to keep annotation units manageable. +- **Table-based bullet lists and the cascade failure.** Some generators (notably EFiling/XDX) render bullet lists as HTML tables (one table row per bullet, padded with spacer cells), but the list structure was lost during text extraction. The data quality audit found 2,210 paragraphs with embedded bullet points across the corpus — most from this class of failure. These paragraphs are still classifiable (the models unanimously labeled one such paragraph as Incident Disclosure / Specificity 4), but the text quality is degraded. + +### 8-K Extraction + +**Roadblock: EDGAR full-text search misses filings.** The EFTS keyword search doesn't reliably return all cybersecurity 8-Ks. Post-May 2024, companies moved non-material disclosures from Item 1.05 to Items 8.01 or 7.01. + +**Resolution:** Built `scan-8k-items.py` to scan the SEC's bulk `submissions.zip` deterministically — a gap-free scan of every 8-K with cybersecurity content. Tries items in priority order (1.05 → 8.01 → 7.01), skips cross-reference stubs. Result: **207 cybersecurity incident 8-K filings** identified — a complete inventory. + +### Paragraph Deduplication + +Each paragraph gets a `textHash` (SHA-256 of normalized text). Deduplication at three levels: + +1. **Within-filing:** Parser artifacts sometimes produce duplicate blocks. Removed by textHash. +2. **Cross-year (same company):** Companies copy-paste identical paragraphs year-to-year. Detected but kept — the repetition itself is informative for disclosure quality analysis. +3. **Cross-company boilerplate:** Different companies use identical materiality disclaimers. Detected but kept — these are real Specificity 1 examples. + +**Result:** Only ~27 excess duplicates removed (0.04%). Most textual similarity is legitimate variation. + +### Performance at Scale + +Initial extraction with cheerio (DOM parser) was slow for 9,000 filings. Built `fast-reparse.ts` (regex-only HTML stripping, no DOM) and `parallel-reparse.ts` (16 bun workers in parallel). Also deduplicates amendment filings (keeps latest per CIK×FiscalYear). 
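The `textHash` dedup key can be sketched as follows. The exact normalization the pipeline applies is an assumption; lowercasing plus whitespace collapsing is shown here as one plausible choice:

```typescript
import { createHash } from "node:crypto";

// Sketch of the textHash dedup key: SHA-256 over normalized paragraph text.
// Normalization shown (lowercase + collapsed whitespace) is illustrative.
function normalizeText(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

function textHash(text: string): string {
  return createHash("sha256").update(normalizeText(text)).digest("hex");
}

// Within-filing dedup: keep the first paragraph per hash.
function dedupe(paragraphs: string[]): string[] {
  const seen = new Set<string>();
  return paragraphs.filter((p) => {
    const h = textHash(p);
    if (seen.has(h)) return false;
    seen.add(h);
    return true;
  });
}
```

Cross-year and cross-company duplicates use the same hash but are flagged rather than dropped.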
+ +### Corpus Statistics + +- **72,045 paragraphs** from ~9,000 filings (FY2023 + FY2024 + early FY2025) +- All 10-K Item 1C; 207 8-K paragraphs extracted separately +- Median ~7 paragraphs per filing +- 49,795 paragraphs annotated (after filtering to complete filing metadata) + +### Roadblock: Truncated Filings + +Discovered 72 filings (~0.8%) where section boundary detection cut off mid-sentence. A paragraph about CISSP certifications cut mid-sentence looks like vague boilerplate — this would corrupt specificity labels. + +**Resolution:** Exclude from training splits. Filings where the last paragraph doesn't match `/[.!?;")\u201d]\s*$/` are filtered before train/val/test creation. + +--- + +## Phase 3: Codebook Development + +### Initial Codebook (v1.0) + +Built a detailed labeling codebook (`docs/LABELING-CODEBOOK.md`) grounded in the SEC rule structure. Includes: +- 7 category definitions with SEC basis citations, key markers, and example texts +- 4 specificity levels with boundary rules +- 5 category decision rules for common ambiguities +- 5 borderline cases with worked reasoning +- Gold set protocol for human validation + +### Codebook Iteration (v3.0 — 2026-03-29) + +After analyzing 150,000+ Stage 1 annotations and identifying systematic disagreement patterns, we made three major codebook rulings: + +**Ruling A — Materiality Disclaimers:** Paragraphs with explicit materiality assessments ("have not materially affected our business strategy, results of operations, or financial condition") are Strategy Integration, even if boilerplate. A cross-reference to Risk Factors appended to a materiality assessment does not change the classification. Only pure cross-references with no materiality conclusion are None/Other. 
*This resolved ~1,094 disputed paragraphs.* + +**Ruling B — SPACs and Shell Companies:** Companies explicitly stating they have no operations, no cybersecurity program, or no formal processes receive None/Other regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program. *This resolved ~53 unresolved paragraphs and likely hundreds more.* + +**Ruling C — Person vs. Function Test (Management Role vs. RMP):** This was the single most impactful ruling, addressing the #1 disagreement axis (2,290 disputes). The line: if the paragraph is about the *person* (qualifications, credentials, background, tenure, career history) → Management Role. If it's about what the role/program *does* (processes, activities, tools, frameworks) → Risk Management Process, even if a CISO/CIO/CTO title appears. The test: would the paragraph still make sense if you removed the person's name, title, and credentials? If yes → the paragraph is about the function, not the person. + +--- + +## Phase 4: Stage 1 — Synthetic Expert Annotation + +### Tech Stack Decision + +Chose TypeScript + Vercel AI SDK v6 + OpenRouter over Python + LangChain/LiteLLM because: +- Vercel AI SDK provides native structured output with Zod schema validation +- OpenRouter gives single-API access to all candidate models with real cost tracking +- Bun runtime for fast script execution with native TypeScript support +- JSONL-append pattern for crash-safe resume without data loss or duplicate API spend + +### Prompt Engineering (12+ iterations, v1.0 → v2.5) + +This was one of the most time-intensive phases. Key lessons: + +**What worked:** +- Text enum labels ("Firm-Specific") over ordinals ("3") — universal improvement across all models +- Decision-test format ("ask in order, stop at first yes") for specificity — reduced ambiguity +- ✓ IS / ✗ NOT fact lists with explicit examples — the single biggest lever for specificity accuracy. 
Reduced overrating from 54 to 21 cases. +- Validation step ("review your specific_facts, remove NOT-list items") — caught model self-correction +- 13 calibration examples, each targeting a specific observed failure mode — examples outperformed rules +- Explicit Incident↔Strategy tiebreaker — completely eliminated a 20-case confusion pattern +- `specific_facts` chain-of-thought in the schema — forces the model to enumerate evidence before assigning specificity + +**What didn't work:** +- Adding more rules (v1.2) — confused models, caused regression from 95%→88% category accuracy +- Changing category definitions to structural "TEST:" format (v2.6) — regression +- "COMMON MISTAKES" section (v2.7) — improved consensus but reduced unanimity +- Attempting a Management↔RMP tiebreaker in the prompt (v2.5) — made confusion worse (this was ultimately resolved through the v3.0 codebook ruling instead) + +**Critical lesson: 40-sample pilots were misleadingly optimistic.** Results that looked good at n=40 fell apart at n=500. We standardized on 500-sample pilots for all prompt evaluation. 
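Two of those lessons, text enum labels and the `specific_facts` chain-of-thought field, shape the annotation output type. A simplified sketch (names are illustrative, not the production Zod schema):

```typescript
// Illustrative annotation output shape reflecting two prompt lessons:
// (1) text enum labels instead of ordinals, (2) a specific_facts array that
// forces fact enumeration before the specificity decision.
const CATEGORIES = [
  "Board Governance", "Management Role", "Risk Management Process",
  "Third-Party Risk", "Incident Disclosure", "Strategy Integration", "None/Other",
] as const;

const SPECIFICITY = [
  "Generic", "Sector-Adapted", "Firm-Specific", "Quantified-Verifiable",
] as const;

interface Annotation {
  specific_facts: { fact: string; type: "organizational" | "verifiable" }[];
  category: (typeof CATEGORIES)[number];
  specificity: (typeof SPECIFICITY)[number];
  confidence: "high" | "medium" | "low";
}

function isValidAnnotation(a: Annotation): boolean {
  return (CATEGORIES as readonly string[]).includes(a.category)
    && (SPECIFICITY as readonly string[]).includes(a.specificity);
}
```

Putting `specific_facts` first in the shape matters: the model must enumerate evidence before it commits to a specificity label.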
+ +### The Iteration Trajectory + +Five 40-sample pilots (v1.0, v1.1, v1.2, v2.1, v2.2-n40) followed by six 500-sample pilots (v2.2-v2.7): + +| Version | n | Both Unan | Key Change | Top Confusion Axis | +|---------|---|-----------|-----------|-------------------| +| v2.2 | 500 | 51.4% | First 500-sample baseline | Incident↔Strategy (20 cases) | +| v2.3 | 500 | 59.2% | Tightened Sector-Adapted, expanded IS/NOT lists | Inc↔Strat reduced | +| v2.4 | 500 | 66.8% | Validation step, schema constraint on specific_facts | Mgmt↔RMP emerging | +| **v2.5** | **500** | **70.8%** | Incident↔Strategy tiebreaker, QV calibration examples | **Inc↔Strat eliminated**; Mgmt↔RMP now #1 (17 cases) | +| v2.6 | 500 | 67.8% | Changed defs to "TEST:" format — **regression** | — | +| v2.7 | 500 | 67.6% | Added COMMON MISTAKES section — **regression** | — | + +The most dramatic single improvement: v2.5's Incident↔Strategy tiebreaker ("DESCRIBES what happened → Incident; ONLY discusses cost/materiality → Strategy") completely eliminated what had been the #1 confusion axis at v2.2 (20 cases → 0). This is a case where a single well-targeted rule outperformed broad prompt restructuring. + +v2.5 was locked as the production prompt. v2.6 and v2.7 demonstrated that the prompt had reached its practical ceiling — further structural changes caused regressions. The remaining disagreements (Management↔RMP, specificity boundaries) turned out to be codebook ambiguities and model-capacity issues, not prompt failures. + +### The Original Panel and the Nano Problem + +The initial Stage 1 panel was: +- `google/gemini-3.1-flash-lite-preview` +- `openai/gpt-5.4-nano` +- `x-ai/grok-4.1-fast` + +GPT-5.4-nano was chosen for its low cost and the assumption that even a small model could handle structured classification with a good enough prompt. This assumption was wrong. + +**The problem: nano wasn't thinking.** During pilot testing, we discovered nano produced **zero reasoning tokens 64% of the time**. 
When it did reason, the output was minimal (34,356 reasoning tokens in total across 500 paragraphs, vs grok's 336,993). Without reasoning, nano's classifications were essentially pattern-matching on surface features — it couldn't apply the multi-step decision logic the codebook requires (enumerate facts, filter against IS/NOT lists, count QV-eligible items, apply threshold). + +**The symptoms:** +- **Erratic specificity** — nano was simultaneously too conservative on some axes ([1,3,3] disagreements — 21 cases where nano said Generic when gemini+grok said Firm-Specific) and too liberal on others ([3,3,4] — 11 cases where nano said Quantified when the others said Firm-Specific). No prompt change fixed this because it's a model-level capacity issue: without reasoning tokens, the decision test can't execute properly. +- **Lowest pairwise agreement** — gemini×grok agreed on 95.6% of categories and 91.2% of specificity. gemini×nano: 87.4% category, 83.8% specificity. Nano was the consistent outlier. +- **Dragging down unanimity** — the gemini+grok pair was strong, but nano's disagreements broke unanimity on hundreds of paragraphs that would otherwise have been clean. + +Despite 12 prompt iterations (v1.0→v2.7) that improved overall metrics significantly, nano's behavior never stabilized. The prompt was at its practical ceiling for a model that wouldn't reason. + +### Smoke Testing: model-probe.ts + +Before running an expensive benchmark, we built `model-probe.ts` to test 9 candidate models on a single paragraph for basic structured output compliance: +- gemini-3.1-flash-lite-preview, grok-4.1-fast, gpt-4.1-mini, gpt-4.1-nano, claude-haiku-4.5, gemini-3.1-flash-preview, deepseek-chat-v3-0324:free, llama-4-maverick, qwen3-235b-a22b + +This caught schema-level incompatibilities (wrong field names, missing fields, invalid enum values) before we spent money on 500-paragraph bench runs. 
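The probe's core check is simple: validate one response's fields and enum values before paying for a batch run. A sketch, simplified relative to the real schema:

```typescript
// Minimal structured-output compliance probe: validate a single model response's
// fields and enum values before committing to a paid 500-paragraph run.
// Field names here are simplified relative to the real schema.
type Probe = { ok: boolean; problems: string[] };

function probeResponse(raw: unknown): Probe {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, problems: ["response is not an object"] };
  }
  const r = raw as Record<string, unknown>;
  const problems: string[] = [];
  if (typeof r.category !== "string") problems.push("missing/invalid category");
  if (typeof r.specificity !== "string") problems.push("missing/invalid specificity");
  if (!Array.isArray(r.specific_facts)) problems.push("specific_facts is not an array");
  if (!["high", "medium", "low"].includes(r.confidence as string)) problems.push("bad confidence enum");
  return { ok: problems.length === 0, problems };
}
```

A model that fails this on one paragraph is dropped before any 500-paragraph bench spend.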
+ +### Model Benchmark: 6 Candidates to Replace Nano + +After locking prompt v2.5, we built `model-bench.ts` to formally evaluate nano replacements. Each candidate was benchmarked against the 500-sample pilot set and compared to the existing gemini+grok annotations. + +| Model | Cost/ann | Reasoning Tokens | vs Majority (both) | Cat Outlier | Spec Outlier | Nano→X Delta | +|-------|----------|-----------------|---------------------|-------------|-------------|-------------| +| seed-2.0-lite | $0.00227 | 658 | **88.8%** | 2.2% | 3.8% | +11.6pp | +| **mimo-v2-flash** | **$0.00048** | **1,346** | **86.0%** | **5.0%** | **4.0%** | **+8.8pp** | +| glm-4.5-air | $0.00136 | 854 | 76.2% | 8.8% | 9.6% | +0.8pp | +| minimax-m2.5 | $0.00106 | 590 | 73.8% | 7.9% | 12.7% | -1.0pp | +| mistral-small-2603 | $0.00015 | **0** | 66.8% | 9.2% | 17.6% | -6.8pp | +| nemotron-3-super-120b | $0.00152 | 942 | 57.9% | **21.3%** | **20.7%** | **-16.9pp** | + +**Key findings:** + +- **Reasoning tokens are the strongest predictor of accuracy.** Mistral-small produced literally zero reasoning tokens — not a single one. Its average output was only 136 tokens (vs mimo's 1,463). It had a 17.6% specificity outlier rate. This confirmed that the nano problem wasn't prompt-specific: models that don't reason can't do this task. + +- **Price ≠ quality.** Nemotron was the most expensive candidate at $0.00152/annotation with 942 reasoning tokens (it *was* thinking), but thinking badly — 21.3% category outlier rate, worst of any candidate. Only 497/500 completed (3 failures). Replacing nano with nemotron would have been catastrophic: -16.9pp unanimity. + +- **The two mediocre options.** GLM-4.5-air (+0.8pp) and minimax-m2.5 (-1.0pp) neither helped nor hurt. Not worth the switch. + +- **Seed-2.0-lite was technically the best** at 88.8% agreement with majority, but cost 4.7x more than mimo ($0.00227 vs $0.00048) and was 2x slower (21.5s vs 11.4s latency). 
For 50K+ paragraphs at scale, this cost differential was significant. + +### The Winner: mimo-v2-flash + +Mimo won the slot on value: +1. **Cheapest viable option** — $0.00048/annotation (3x cheaper than most candidates) +2. **Most reasoning tokens** — 1,346 avg (highest of all 6, more than seed-2.0-lite) +3. **Lowest outlier rate** — 5.0% category, 4.0% specificity +4. **+8.8pp unanimity improvement** over nano +5. **93.4% category agreement with grok** — strongest pairwise alignment of any candidate + +**Roadblock: Mimo schema quirks.** Mimo produced non-standard outputs: capitalized confidence labels ("High" instead of "high"), numeric confidence values (0.9 instead of "high"), and flat string arrays instead of structured `{fact, type}` objects for specific_facts. Rather than trying to fix this with prompting (which would waste tokens and might break other behavior), we fixed it with Zod schema transforms — `.transform()` to normalize casing and map numbers to labels, `.union()` to accept both structured and flat fact formats. This took ~30 minutes to implement and handled all edge cases automatically. + +A dedicated `mimo-pilot.ts` script modeled the full "replace nano with mimo" scenario before committing to the panel change. + +**Final Stage 1 panel:** +- `google/gemini-3.1-flash-lite-preview` +- `xiaomi/mimo-v2-flash` ← replaced `openai/gpt-5.4-nano` +- `x-ai/grok-4.1-fast` + +### Production Run Results + +Completed 2026-03-28. **150,009 annotations** (50,003 paragraphs × 3 models), **$115.88 total cost**, **0 failures**. + +| Metric | Value | +|--------|-------| +| Both-unanimous | 35,204 (70.7%) | +| Majority agreement | 14,182 (28.5%) | +| Unresolved (3-way split) | 409 (0.8%) | +| Total cost | $115.88 | +| Failures | 0 | + +--- + +## Phase 5: Post-Stage 1 Analysis — Discovering Systematic Patterns + +After the production run, we conducted a deep distributional analysis of disagreement patterns. This analysis fundamentally changed our approach to Stage 2. 
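The mimo normalization described in Phase 4 amounts to a few pure functions. In the pipeline they live inside Zod `.transform()` and `.union()` calls; the sketch below shows the bare logic, and the numeric confidence thresholds are assumptions:

```typescript
// Normalizing mimo-v2-flash's non-standard outputs (illustrative logic; the
// production version wraps equivalents in Zod .transform()/.union()).
type Confidence = "high" | "medium" | "low";
type Fact = { fact: string; type: string };

function normalizeConfidence(raw: string | number): Confidence {
  if (typeof raw === "number") {
    // Map numeric confidence (e.g. 0.9) onto text labels. Thresholds are assumptions.
    return raw >= 0.75 ? "high" : raw >= 0.4 ? "medium" : "low";
  }
  const lowered = raw.toLowerCase(); // "High" -> "high"
  return (["high", "medium", "low"] as const).includes(lowered as Confidence)
    ? (lowered as Confidence)
    : "low";
}

function normalizeFacts(raw: (string | Fact)[]): Fact[] {
  // Accept both flat strings and structured {fact, type} objects.
  return raw.map((f) => (typeof f === "string" ? { fact: f, type: "unspecified" } : f));
}
```

Fixing this at the schema layer rather than in the prompt keeps tokens cheap and leaves the model's other behavior untouched.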
+ +### Model Bias Discovery + +Each model has systematic, quantifiable biases: + +| Model | Category Outlier Rate | Specificity Outlier Rate | Key Bias | +|-------|----------------------|--------------------------|----------| +| Mimo | **48.1%** | 32.5% | Over-classifies as Third-Party Risk; under-rates Spec 4 (74.3% of Spec 4 outlier cases) | +| Gemini | 30.9% | **45.7%** | Over-classifies as Management Role (81.1% in Mgmt↔RMP disputes); inflates specificity | +| Grok | 21.0% | 21.8% | Most moderate; slight RMP bias | + +These biases are not random — they're predictable by model and confusion axis. This opened the possibility of model-calibrated majority voting (using the known biases to assess when the majority is likely correct). + +### Key Distributional Findings + +1. **Management Role is the disaster category** — only 51.5% unanimous (every other category is 62-79%). Nearly half of all Management Role paragraphs need resolution. +2. **Spec 4 (Quantified-Verifiable) is the disaster specificity** — only 37.6% unanimous. Models can't agree on what counts as "quantified." +3. **Stage 1 confidence is completely useless** — 95.4% of paragraphs report all-high category confidence. Zero all-low cases. The cheap models are systematically overconfident. +4. **Specificity is effectively a 3-level scale** — Spec 2 (Sector-Adapted) is rarely disputed (82.1% unanimous). The contested boundaries are [1,3] (3,742 disputes) and [3,4] (2,898 disputes) with almost nothing at [1,2] or [2,3]. +5. **Longer paragraphs are harder** — Q5 word count (>134 words): 64.1% unanimous vs Q1 (≤51 words): 76.3%. +6. **Small companies (1-3 paragraphs) are noise-prone** — 50.0% unanimous, 10.5% unresolved. Almost all are SPACs or shell companies with non-standard disclosures. 
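The per-model outlier rates above come from counting 2-vs-1 splits across the three annotations per paragraph, roughly as follows (the real analysis also conditions on confusion axis):

```typescript
// For each paragraph labeled by three models, find 2-vs-1 splits and count
// how often each model is the odd one out. Illustrative shape only.
type Labels = Record<string, string>; // modelId -> category label

function outlierCounts(rows: Labels[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const row of rows) {
    const entries = Object.entries(row);
    if (entries.length !== 3) continue; // expect exactly three annotators
    for (const [model, label] of entries) {
      const others = entries.filter(([m]) => m !== model).map(([, l]) => l);
      // A model is the outlier iff the other two agree and it differs.
      if (others[0] === others[1] && others[0] !== label) {
        counts[model] = (counts[model] ?? 0) + 1;
      }
    }
  }
  return counts;
}
```

Dividing each count by that model's total disagreement cases yields the outlier rates in the table.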
+ +### Top Disagreement Axes + +| Axis | Disputes | Pattern | +|------|----------|---------| +| Management Role ↔ RMP | 2,290 | Paragraph describes processes but names CISO/CIO | +| RMP ↔ Third-Party Risk | 1,475 | Mimo over-classifies vendor mentions as Third-Party | +| None/Other ↔ Strategy Integration | 1,094 | Materiality disclaimers — genuinely ambiguous in codebook | +| Board Governance ↔ Management Role | 867 | Paragraphs at the board-management interface | +| Spec [1,3] boundary | 3,742 | NOT-list items counted as specific facts | +| Spec [3,4] boundary | 2,898 | Gemini counts roles as QV-eligible; Mimo downgrades | + +### Insight: Reading the Actual Paragraphs + +We sampled 20 paragraphs across the 4 hardest dispute types and read them in full. Patterns emerged: + +- **Management↔RMP:** Every example follows the same structure — a process-focused paragraph that names a CISO/CIO in the opening attribution. The paragraph's content is about what the program does, not who the person is. The v3.0 "person-vs-function" ruling directly addresses this. +- **None/Other↔Strategy:** All 5 sampled paragraphs are "no material incidents" boilerplate. Every single one. The materiality disclaimer ruling resolves this entirely. +- **Spec [3,4]:** Gemini counts "20 years of experience" + "CISO" as 2 QV facts → Spec 4. Grok/Mimo correctly exclude named roles from QV counting → Spec 3. The rule exists in the prompt but Gemini ignores it. +- **Small company unresolved:** All SPACs or blank check companies with "we have no operations" disclaimers. The SPAC ruling handles these. 
+ +--- + +## Phase 6: Stage 2 — Judge Model Evaluation + +### Gold Label Construction + +Built a 50-paragraph gold set using 3 independent Sonnet agents: +- Agent A: paragraphs 0-24 +- Agent B: paragraphs 25-49 +- Agent C: all 50 as cross-check +- Adjudicator agent resolved 11 disputes with detailed reasoning +- Inter-annotator agreement: 94% category, 84% specificity, 78% both + +**Lesson learned: majority vote ≠ ground truth.** Initially scored judges against Stage 1 majority, which made gemini-3-flash look great (86% category match). Scoring against gold labels revealed it added zero value — it was rubber-stamping the majority. Always evaluate against adjudicated gold labels. + +### Judge Model Benchmarking (8 candidates) + +| Model | Mode | n | Cat | Spec | Both | Fails | Cost/call | +|-------|------|---|-----|------|------|-------|-----------| +| Majority vote | — | 50 | 78.0% | 80.0% | 60.0% | 0% | $0 | +| gpt-5.4-mini | structured | 50 | 88.0% | 80.0% | 68.0% | 0% | $0.0046 | +| GLM-5 v2 | structured | 48 | 87.5% | 89.6% | 77.1% | 4% | $0.0078 | +| GLM-5 v4 | structured+req_params | 44 | 90.9% | 88.6% | 79.5% | 12% | $0.0083 | +| GLM-5 v3 | tool calling | 50 | 84.0% | 82.0% | 72.0% | 0% | $0.0070 | + +### Roadblock: GLM-5 Structured Output Failures + +GLM-5 had the best accuracy (77-80% both-correct) but a 6-12% structured output failure rate. The model intermittently wraps JSON in markdown code blocks. + +**Investigation:** Built diagnostic scripts (`judge-diag.ts`, `judge-diag-batch.ts`) to isolate the issue. Tested all 9 failing paragraphs × 2 attempts each. Found 72% success rate, all from the same model variant (`z-ai/glm-5-20260211`). The best OpenRouter provider (Ambient) has a 6% base error rate. This is a model-level behavior, not provider-specific. + +**Attempted fixes:** +- Bumped validation retries from 1 to 3 → reduced failures from 18% to ~4-12% +- Tool calling mode → 0% failures but accuracy dropped ~7pp (72% both). 
Enum constraints not enforced, `undefined` categories appear. +- `provider: { require_parameters: true }` in OpenRouter → no effect +- Exacto routing → no effect + +**Resolution:** Accepted as a model-level constraint. Production strategy will use the best model with retry logic and fall back to a reliable model (gpt-5.4-mini) for persistent failures. + +### Judge Prompt Iteration (v1 → v2) + +Built a dynamic judge prompt (`buildJudgePrompt()`) with: +- **Disagreement diagnosis:** Tells the judge exactly what's in dispute and the vote distribution +- **Targeted disambiguation rules:** 7 category guidance blocks + 2 specificity guidance blocks, dynamically included only when relevant to the specific dispute +- **Structured analysis steps:** Critique each annotator → enumerate IS-list facts → determine dominant purpose → decide +- **Confidence calibration:** HIGH/MEDIUM/LOW mapped to codebook clarity, used as training weights +- **Anti-bias:** Fisher-Yates shuffle of annotator order + +**Results:** Category accuracy improved +10pp over majority vote for both models. Specificity improved +9.8pp for GLM-5 but stayed flat for gpt-5.4-mini. The disambiguation rules work well for category but specificity needs the codebook v3.0 changes. + +### Key Finding: Judge Confidence Is Highly Predictive + +| Confidence | GLM-5 Both-Correct | gpt-5.4-mini Both-Correct | +|------------|--------------------|----| +| High | 82-84% | 80.6% | +| Medium | 25-50% | 35.7% | + +This enables confidence-stratified training data: high-confidence judge labels get full weight; medium/low are downweighted or excluded. + +--- + +## Phase 7: Revised Data Quality Strategy + +The post-Stage 1 analysis and judge benchmarking led to a fundamental reassessment of our approach. + +### The Key Realization + +The best judge (77% both-correct) barely beats the raw majority vote (78% category, 80% specificity). 
Judging all 14,591 disputed paragraphs at 77% accuracy doesn't meaningfully improve on the majority. The judge's real value is concentrated in two places: +1. The 409 unresolved paragraphs where no majority exists +2. Cases where we have specific reason to doubt the majority + +### The Revised Plan + +**Phase 0: Codebook rulings (completed)** — Three rulings that resolve thousands of disputes at zero inference cost: materiality disclaimers → Strategy Integration, SPACs → None/Other, person-vs-function test for Management↔RMP. + +**Phase 1: Model-calibrated majority resolution** — For the 14,182 majority-agreement paragraphs, apply calibration using known model biases. When the known-biased model is the outlier on a known axis → trust majority. Flag anomalous cases for judge resolution. Expected to auto-resolve ~10,000-12,000 paragraphs. + +**Phase 2: Human gold set (1,200 paragraphs)** — Assignment requires 1,200 human-labeled paragraphs. Building a quiz-gated labeling web tool that enforces codebook knowledge before each session. Stratified sampling to ensure all categories, specificity levels, and confusion axes are represented. This becomes the calibration metric for all further work. + +**Phase 3: Judge prompt iteration** — Update judge prompt to mirror codebook v3.0 rulings. Add worked examples from the 11 gold adjudications. Iterate against expanded gold set. Target: 85%+ both-correct. + +**Phase 4: Production judge run** — Judge only the ~3,000-5,000 genuinely hard cases (unresolved + flagged majority + "both" disputes). Two models for cross-validation on the hardest cases. + +**Phase 5: Training data assembly** — Confidence-stratified tiers: + +| Tier | Source | Est. 
Accuracy | Paragraphs | Treatment | +|------|--------|--------------|------------|-----------| +| T1 | Both-unanimous | ~97% | 35,204 | Full weight | +| T2 | Calibrated majority | ~85-90% | ~9,000-12,000 | Full weight | +| T3 | Judge high-confidence | ~84% | ~2,000-3,000 | Full weight | +| T4 | Judge medium-confidence | ~40% | ~500-1,000 | Downweight (0.5) or soft labels | +| T5 | Judge low / failure / excluded | ??? | ~500-1,000 | Exclude | + +Expected total: ~46,000-48,000 paragraphs at ~93-95% label accuracy. + +--- + +## Phase 8: Human Labeling Webapp (Labelapp) + +### Why Build a Webapp? + +The project requires 1,200 human-labeled paragraphs as a gold holdout set — the calibration metric for everything downstream. Six student annotators, three per paragraph, 600 per person. The labels need to be reliable enough to benchmark the GenAI pipeline and validate the final classifier. + +The alternative was everyone tagging in a shared JSON file or spreadsheet. That would almost certainly produce poor data quality. The failure modes are well-documented in annotation literature and we'd hit all of them: + +- **Inconsistent category names.** Free-text entry in a spreadsheet means "Risk Management Process" vs "Risk Mgmt" vs "RMP" vs "3" — all referring to the same class but requiring manual reconciliation. +- **Skipped or double-labeled paragraphs.** No enforced assignment tracking means annotators can accidentally skip paragraphs or label the same one twice without anyone noticing until export. +- **No codebook enforcement.** The labeling codebook has 7 categories, 4 specificity levels, 5 decision rules, and 3 codebook rulings (v3.0). Without quiz gating, annotators can start labeling without understanding the materiality disclaimer ruling, the person-vs-function test, or the QV counting threshold — exactly the boundaries where annotation quality lives or dies. 
+- **No feedback loop.** In a spreadsheet, an annotator who misunderstands the SPAC ruling labels 600 paragraphs before anyone catches it. A webapp with warmup feedback catches misunderstanding in the first 5 paragraphs. +- **No timing data.** For the writeup, we need per-paragraph labeling times to report annotator effort and identify paragraphs that are disproportionately hard. A spreadsheet gives you nothing; even a basic timer gives you wall-clock time corrupted by idle periods. + +A purpose-built labeling tool turns all of these failure modes into solved problems. Constrained radio buttons eliminate typos. Server-side assignment tracking prevents skips and duplicates. Quiz gating enforces codebook knowledge. Warmup paragraphs with gold feedback catch misunderstandings early. Active timing with idle detection gives clean data for the writeup. + +### The Onboarding Funnel + +Every annotation session follows the same enforced path: + +1. **Login** → annotator selects their name, enters password. Session cookie (HMAC-SHA256 signed, 8-hour expiry). +2. **Dashboard** → shows progress, links to training materials or labeling. +3. **Quiz** → 8 questions (2 per type), random draw from a bank of ~30. Four question types target the exact codebook boundaries that cause the most disagreement in the GenAI pipeline: + - **Person-vs-function** (Management Role vs RMP) — the #1 disagreement axis (2,290 disputes in Stage 1) + - **Materiality disclaimers** (Strategy Integration vs None/Other) — resolved ~1,094 disputes via codebook ruling + - **QV fact counting** (Specificity 3 vs 4) — the hardest specificity boundary + - **SPAC exception** (None/Other for shell companies) + - Pass threshold: 7/8 correct. Immediate feedback with codebook explanation after each answer. Failed → review mistakes → retry. +4. **Warmup** → 5 pre-selected paragraphs with known gold labels. Identical UI to real labeling, but after submit, the annotator sees the gold answer + explanation. 
This catches systematic misunderstandings before they contaminate 600 labels. +5. **Labeling** → the real thing. 600 assigned paragraphs per annotator. + +The quiz questions are not random trivia — they're targeted at the exact confusion axes that the GenAI pipeline struggles with. If an annotator can't reliably distinguish Management Role from RMP, their labels on that axis are noise. Better to catch that before they start than after. + +### Labeling Interface Design + +The labeling UI prioritizes speed and consistency: + +- **Paragraph display:** Full text with filing metadata badges (company, ticker, filing type, date, SEC item) in the header bar. +- **Constrained input:** Radio buttons for both category (7 options) and specificity (4 options). No free-text entry for classifications. +- **Keyboard shortcuts:** 1-7 for category, Q/W/E/R for specificity, N to focus notes, Enter to submit. An experienced annotator never touches the mouse. +- **Codebook sidebar:** Floating button opens a slide-out panel with all category definitions, IS/NOT lists, specificity levels, and decision rules. Always one click away — annotators don't need to switch to a separate document. +- **Progress bar:** Shows completed/total in the header. Annotators know where they stand. +- **Notes field:** Optional free-text for edge cases or uncertainty. Useful for adjudication — if an annotator flags "this could be either Management Role or RMP, went with RMP because the person-vs-function test says..." that reasoning helps the adjudicator. + +### Sampling Strategy + +The 1,200 paragraphs are not randomly sampled. Random sampling from 50K paragraphs would over-represent the easy cases (Board Governance at Specificity 1 is unambiguous) and under-represent the hard cases that actually test annotation quality. 

Instead, the sampling is stratified by the disagreement patterns discovered in the Stage 1 analysis (Phase 5):

| Stratum | Count | Why |
|---------|-------|-----|
| Management ↔ RMP split votes | 120 | #1 disagreement axis — validates the person-vs-function ruling |
| None/Other ↔ Strategy splits | 80 | Materiality disclaimer boundary |
| Specificity [3,4] splits | 80 | QV counting — the hardest specificity boundary |
| Board ↔ Management splits | 80 | Board/management interface |
| Rare category guarantee | 120 | ≥15 per category, extra for Incident Disclosure (sparse) |
| Proportional stratified random | 720 | Fill remaining from category × specificity cells |

This ensures the gold set is informative where it matters most: at the decision boundaries where both humans and models are most likely to disagree.

### Assignment: Balanced Incomplete Block Design (BIBD)

Each paragraph gets exactly 3 of 6 annotators. The assignment uses a balanced incomplete block design:

- C(6,3) = 20 unique triples. Assign 60 paragraphs to each triple.
- Each annotator appears in C(5,2) = 10 triples → 10 × 60 = 600 paragraphs per person.
- Every annotator pair co-occurs in C(4,1) = 4 triples → 4 × 60 = 240 shared paragraphs, so overlap is equal across all 15 pairs and pairwise Cohen's Kappa is statistically valid for each.

This is important for the writeup: we can report inter-rater reliability as a full pairwise matrix, not just an average that hides weak pairs.

### Active Timer and Idle Detection

The initial implementation tracked raw wall-clock `duration_ms` per label — `Date.now()` when the paragraph loaded, minus `Date.now()` at submit. This is corrupted by any idle time (annotator walks away, checks email, gets coffee).

We added `useActiveTimer`, a React hook that tracks active vs idle time using mouse/keyboard/scroll/focus events with a 30-second idle threshold. When no activity is detected for 30 seconds, the timer pauses and the header shows an amber "idle" indicator.
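The idle-exclusion accounting is independent of React; a stripped-down sketch of the same rule with timestamps passed in explicitly (the real `useActiveTimer` hook wires this to DOM events — the names here are illustrative):

```typescript
const IDLE_THRESHOLD_MS = 30_000;

// Accumulates active time between events. Any gap beyond the threshold is
// treated as idle — only the first 30s of a long gap counts as active,
// mirroring a timer that pauses after 30s without activity.
class ActiveTimer {
  private activeMs = 0;
  private lastEvent: number;

  constructor(start: number) {
    this.lastEvent = start;
  }

  // Call on every mouse/keyboard/scroll/focus event.
  activity(now: number): void {
    this.activeMs += Math.min(now - this.lastEvent, IDLE_THRESHOLD_MS);
    this.lastEvent = now;
  }

  // Finalize at submit; returns idle-excluded milliseconds.
  stop(now: number): number {
    this.activity(now);
    return this.activeMs;
  }
}
```

`duration_ms` stays a plain end-minus-start subtraction; `active_ms` is what this accumulator returns.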
Both `duration_ms` (wall-clock) and `active_ms` (idle-excluded) are submitted with every label. + +For the writeup, `active_ms` is the metric to report — it reflects actual cognitive effort per paragraph. `duration_ms` is retained for completeness. Pre-existing labels (before the timer change) have `active_ms = NULL` and are excluded from timing analysis. + +### Infrastructure Decisions + +**Stack:** Next.js (App Router) + Drizzle ORM + Postgres + Tailwind + shadcn/ui. Deployed via Docker with a Postgres sidecar. + +**Migrations:** Switched from `drizzle-kit push --force` (schema diffing at startup) to file-based Drizzle migrations (`drizzle-kit generate` + `drizzle-kit migrate`). A `scripts/ensure-migration-baseline.ts` script handles the transition for existing databases by seeding the migration journal with the baseline hash. + +**Monorepo:** The labelapp triggered converting the repo to a Bun workspace monorepo with shared Zod schemas (`packages/schemas/`). This ensures the labelapp's category/specificity enums are identical to the GenAI pipeline's — no possibility of a mismatch between what the models label and what the humans label. + +### Adjudication + +After all 3 annotators label a paragraph: +- **3/3 agree** on both dimensions → consensus (no intervention needed) +- **2/3 agree** on both dimensions → majority rules +- **Otherwise** → flagged for admin adjudication + +The admin page shows disputed paragraphs with all 3 labels side-by-side, annotator notes, and Stage 1 consensus for reference. The adjudicator picks a label, enters a custom one, or marks it for team discussion. Adjudications are stored separately from labels for audit trail. 
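The three-way resolution rule is mechanical; in sketch form (hypothetical types — the real logic lives in the labelapp server code):

```typescript
type Label = { category: string; specificity: number };

type Resolution =
  | { status: "consensus" | "majority"; label: Label }
  | { status: "adjudicate" };

// 3/3 agreement on both dimensions → consensus; 2/3 → majority; otherwise
// flag the paragraph for admin adjudication.
function resolve(labels: [Label, Label, Label]): Resolution {
  const key = (l: Label) => `${l.category}|${l.specificity}`;
  const counts = new Map<string, number>();
  for (const l of labels) counts.set(key(l), (counts.get(key(l)) ?? 0) + 1);
  for (const l of labels) {
    const votes = counts.get(key(l))!;
    if (votes === 3) return { status: "consensus", label: l };
    if (votes === 2) return { status: "majority", label: l };
  }
  return { status: "adjudicate" };
}
```

Note that agreement is on the (category, specificity) pair: two annotators matching on category alone is not enough to avoid adjudication.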
+ +### Key Technical Artifacts + +| Artifact | Location | +|----------|----------| +| Implementation plan | `docs/labelapp-plan.md` | +| Agent guide | `labelapp/AGENTS.md` | +| Database schema | `labelapp/db/schema.ts` | +| Active timer hook | `labelapp/hooks/use-active-timer.ts` | +| Labeling UI | `labelapp/app/label/page.tsx` | +| Quiz questions | `labelapp/lib/quiz-questions.ts` | +| Warmup paragraphs | `labelapp/lib/warmup-paragraphs.ts` | +| BIBD assignment generator | `labelapp/lib/assignment.ts` | +| IRR metrics (Kappa, Alpha) | `labelapp/lib/metrics.ts` | +| Stratified sampling | `labelapp/lib/sampling.ts` | +| Baseline migration | `labelapp/drizzle/0000_baseline.sql` | +| Migration transition script | `labelapp/scripts/ensure-migration-baseline.ts` | +| Docker entrypoint | `labelapp/entrypoint.sh` | + +### Opus Golden Labeling + +With the human gold set nearing completion, we added a parallel labeling pass using Claude Opus 4.6 as an additional expert annotator. The motivation is empirical: the GenAI pipeline's Stage 1 consensus + Stage 2 judge combination has shown strong alignment with the codebook throughout development, and Opus represents a significant capability jump over the models used in Stages 1 and 2. Having an independent Opus annotation for every gold-set paragraph gives us a third perspective alongside the human labels and the existing pipeline labels — useful for adjudication, for measuring human-vs-model agreement, and as an upper bound on what automated annotation can achieve. + +**Implementation:** Rather than routing through OpenRouter (which would cost ~$27-80 depending on the model), we used the Claude Agent SDK (`@anthropic-ai/claude-agent-sdk`) to call Opus 4.6 through the existing Claude Code subscription. 
The Agent SDK's `query()` function accepts a custom system prompt and structured output schema, so we configured it as a fully isolated classifier: no tools, no hooks, no settings, no session persistence — just a system prompt and a JSON schema response. + +**Key design decisions:** + +1. **Full codebook as system prompt.** The Stage 1/2 pipeline uses a condensed v2.5 operational prompt (~4KB). For Opus, we feed the entire labeling codebook (`docs/LABELING-CODEBOOK.md`, ~42KB) plus the operational prompt plus the JSON output schema. Opus has the context window and reasoning depth to actually use the worked examples, borderline cases, and decision rules that cheaper models would ignore. + +2. **Reasoning traces saved.** Opus's adaptive thinking produces step-by-step codebook application (e.g., "Count QV-eligible facts: specific date (2020), 24 years (quantified)... two hard verifiable facts → Quantified-Verifiable"). These are saved in the `golden.thinking` field alongside each annotation — valuable both for adjudication and for understanding where the codebook's boundaries create ambiguity. + +3. **Raw confidence preserved.** Opus returns numeric confidence (0-1) rather than the categorical high/medium/low that cheaper models produce. We save the raw values (`golden.rawCategoryConfidence`, `golden.rawSpecificityConfidence`) before coercing them through the existing `Confidence` transform. This gives a finer-grained signal for weighting or analysis. + +4. **Serial execution at 1 req/s.** The Claude Code subscription has rate limits, so the batch runs serially with a 1-second delay between requests. At ~4 paragraphs/minute (including Opus thinking time), the full 1,200-paragraph set completes in ~5 hours. Crash-safe JSONL checkpoint resume means it can be interrupted and restarted without re-running completed paragraphs. 
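The checkpoint-resume mechanics reduce to "skip paragraph IDs already present in the output JSONL." A sketch of that shape (field names and the annotate callback are illustrative stand-ins, not the script's actual API):

```typescript
// IDs already completed, parsed from the checkpoint JSONL's contents.
function completedIds(checkpointJsonl: string): Set<string> {
  const done = new Set<string>();
  for (const line of checkpointJsonl.split("\n")) {
    if (!line.trim()) continue;
    done.add((JSON.parse(line) as { paragraphId: string }).paragraphId);
  }
  return done;
}

// Paragraphs still pending, in original order.
function pending<T extends { id: string }>(all: T[], done: Set<string>): T[] {
  return all.filter((p) => !done.has(p.id));
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Serial loop: one request, append one JSONL line, wait ~1s. Appending after
// every call is what makes a crash or interrupt resumable.
async function runBatch(
  all: { id: string; text: string }[],
  checkpointJsonl: string,
  annotate: (text: string) => Promise<object>,
  appendLine: (line: string) => void,
): Promise<void> {
  for (const p of pending(all, completedIds(checkpointJsonl))) {
    const result = await annotate(p.text);
    appendLine(JSON.stringify({ paragraphId: p.id, ...result }));
    await sleep(1_000); // stay under the subscription rate limit
  }
}
```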
+ +**Output:** `data/annotations/golden/opus.jsonl` — standard `Annotation` records (compatible with the existing pipeline) plus a `golden` block containing thinking traces, raw confidence values, and the model's specific fact extractions. The `provenance.promptVersion` is tagged `v2.5+codebook` to distinguish from standard Stage 1/2 annotations. + +--- + +## Phase 9: Pre-Training Strategy — DAPT + TAPT + +### The Decision: Own Filings Over PleIAs/SEC + +For domain-adaptive pre-training (DAPT), we needed a corpus of clean SEC filing text. Two options: + +1. **PleIAs/SEC** (373K full 10-K texts on HuggingFace, going back years, CC0 license) — massive but uncleaned, and a single training pass on ~18B tokens would take weeks on a single RTX 3090. +2. **Our own ~9,000 cached filings** (FY2023-2024, HTML already downloaded during extraction) — smaller but recent, relevant, and we already have the HTML cleaning pipeline. + +We chose option 2. The reasoning: + +- **Recency > volume.** Item 1C didn't exist before FY2023. The cybersecurity disclosure vocabulary, boilerplate patterns, and regulatory framing are all new to this filing cycle. Pre-2023 filings teach the model general SEC language, which ModernBERT already knows from its general pre-training. The marginal value of historical filings is low for our specific task. +- **The scaling laws paper says stop early.** SEC filing scaling laws (arXiv:2512.12384) show the largest DAPT gains in the first 200M tokens, with diminishing returns after. Our 9,000 full filings yield ~450M tokens — already in the sweet spot. +- **We control the cleaning quality.** Our `stripHtml()` pipeline handles all the HTML artifacts we fought during extraction (XBRL tags, entity encoding, page breaks, inline element word splits). PleIAs/SEC is a black box — we'd need to audit it anyway. +- **Feasibility on a 3090.** 450M tokens: ~2-3 days. 18B tokens: weeks. Single GPU means we need to be strategic about compute allocation. 
+ +The DAPT corpus preparation is simple: run the existing `stripHtml()` on cached filing HTML (full text, skipping the Item 1C section extraction step) and output clean text as sharded JSONL. + +### Adding TAPT: "Don't Stop Pretraining" + +Gururangan et al. (2020) "Don't Stop Pretraining" demonstrated that task-adaptive pre-training (TAPT) — continued MLM on the unlabeled task data specifically — gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain. + +Item 1C is a very specific subset of SEC filings. It has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy is a common paragraph sequence), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution before we ask it to classify. + +The cost is negligible: our 72K paragraphs from `paragraphs-clean.jsonl` are already clean text (~5-10M tokens). TAPT takes 2-3 hours on a 3090 — essentially free compared to DAPT. + +### The Training Pipeline + +``` +ModernBERT-large (base, 395M params) + → DAPT on 9K full 10-K filings (~450M tokens, ~2-3 days) → SEC-ModernBERT-large + → TAPT on 72K Item 1C paragraphs (~10M tokens, ~2-3 hours) → SEC-cyBERT-large + → Fine-tune on labeled data with dual classification heads → Final classifier +``` + +This gives us clean ablation rows: base → +DAPT → +TAPT → +SCL, isolating the contribution of each step. + +--- + +## Phase 10: Data Quality Audit and Corpus Remediation + +### The Discovery + +While preparing the DAPT corpus, we discovered that the paragraph data was less clean than we assumed. The extraction pipeline had been built to handle the worst HTML artifacts (word splits, XBRL tags, page breaks), but two systematic issues had been silently corrupting the training data: + +1. **Orphan words.** HTML source wraps text at fixed column width. 
When an HTML tag's markup consumes most of a source line, only the first word of the text fits before the source newline. `stripHtml()` preserved that newline, and the paragraph segmenter dropped the single-word fragment. Result: paragraphs like "sole executive officer and director is responsible for..." instead of "Our sole executive officer..." — 4.7% of all paragraphs.

2. **Inlined section headings.** The paragraph segmenter didn't strip sub-section headings ("Risk Management and Strategy", "Board Oversight") from paragraph body text. These headings became the first "sentence" of the paragraph. Result: 22% of paragraphs had section titles prepended to body text — a near-perfect predictor of `content_category` that creates shortcut learning risk.

### The Generator Investigation

Initial quality metrics showed 45% of filings in an "UNKNOWN" generator bucket. This felt wrong — SEC HTML comes from identifiable tools. We investigated and identified **14 distinct filing generators** covering 99.99% of 14,759 HTML files using meta tags, comments, namespace declarations, CSS patterns, and CIK-based filing agent lookup.

The investigation revealed that the worst-quality generator, **EFiling/EDGAR Agent (GoFiler/Novaworks XDX)**, had been hidden in the UNKNOWN bucket. It accounts for 13.5% of all filings but produces a 36.8% orphan word rate (8x the corpus average), the lowest paragraphs-per-filing (5.7 vs 7.7 avg), and a 5.9% fragment rate. The second worst, **CompSci Transform** (6% of filings), had a 14.8% orphan word rate.

By contrast, the clean generators — Workiva (24.3%), Donnelley (15.8%), and Inline XBRL (16.4%) — all had <1% orphan word rates. Over 70% of paragraphs came from clean generators. The problem was concentrated, not uniform.

Full generator reference: `docs/EDGAR-FILING-GENERATORS.md`. Full audit findings: `docs/DATA-QUALITY-AUDIT.md`.

### Six Surgical Patches

All fixes follow the same principle: `paragraphs-clean.jsonl` is **frozen** — never modified.
All fixes go through separate `.patched.jsonl` files. Annotations link by paragraph UUID, which never changes. Every patch is documented with scope, method, and validation. + +| Patch | Method | Paragraphs | Annotated | +|-------|--------|-----------|-----------| +| 1-2. Orphan word restoration | HTML lookback: find paragraph text in stripped HTML, extract preceding word | 2,233 | 1,537 | +| 3. Heading strip (space separator) | Pattern match against 71 known Item 1C sub-headings | 7,514 | 5,013 | +| 4. Heading strip (colon separator) | "Heading Text: Sentence..." patterns | 370 | 227 | +| 5. Heading strip (period/dash/caps) | Extended separator detection | 184 | 133 | +| 6. HTML-confirmed headings | Bold/underline/h-tag extraction from source HTML, validated against paragraph starts | 343 | 270 | +| **Total** | | **8,411 headings + 2,233 orphans** | **~7,100 of 49,795 (14.3%)** | + +The heading detection required five progressive passes because no single heuristic caught all separator styles. The HTML-confirmed pass (Patch 6) used a 32-worker parallel extraction script to scan 6,341 filings in 1.7 seconds, caching styled headings per filing for reuse. + +### Orphan Word Re-Annotation + +The orphan word patches weren't just cosmetic. Analysis revealed **label bias** in orphan-word paragraphs: +- Strategy Integration 1.55x over-represented (16.1% vs 10.4% baseline) +- Management Role 0.49x under-represented +- Board Governance 0.60x under-represented + +Missing subject words like "Our", "We", "The" strip governance context that models rely on for classification. This suggested the original annotations on these paragraphs might be systematically wrong. + +**Decision: re-run Stage 1 on patched text.** Cost: $3.30 for 4,611 annotations (1,537 paragraphs × 3 models), completed in ~9 minutes at 60 concurrency with zero failures. 
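Mechanically, the Patch 1-2 lookback is: find the paragraph's opening characters in the stripped full-filing text, then recover the single word immediately before the match. A simplified sketch (the production patch also validates against the HTML structure; names here are illustrative):

```typescript
// Attempt to recover an orphaned leading word ("Our", "We", "The", ...) for a
// paragraph by locating its opening text inside the stripped document text.
function recoverOrphanWord(fullText: string, paragraph: string): string | null {
  const probe = paragraph.slice(0, 60); // opening chars as a near-unique anchor
  const idx = fullText.indexOf(probe);
  if (idx <= 0) return null; // not found, or nothing precedes it
  const before = fullText.slice(0, idx).trimEnd();
  const match = before.match(/(\S+)$/); // whitespace-delimited word before the hit
  if (!match) return null;
  // Accept only short capitalized words that look sentence-initial.
  return /^[A-Z][a-z]{0,9}$/.test(match[1]) ? match[1] : null;
}
```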
+ +**Results:** +- **119 paragraphs (7.7%)** changed consensus category — confirming the bias was real +- **37 paragraphs (2.4%)** changed consensus specificity +- **152 total (9.9%)** changed on at least one dimension +- mimo-v2-flash was most sensitive (14.6% category changes); gemini least affected (6.0%) +- 18 original conflicts resolved, 22 new conflicts introduced — roughly a wash on Stage 2 savings +- Top transitions: Management Role ↔ Risk Management Process (55/51 each direction), Strategy Integration → None/Other (46), Third-Party Risk → Risk Management Process (34) + +The re-run annotations are stored separately in `data/annotations/stage1-orphan-rerun.jsonl` — the original `stage1.jsonl` is untouched. For training, the re-run annotations replace the originals for the affected 1,537 paragraphs. + +### No-Cyber-Keyword Paragraphs: A False Alarm + +The quality audit flagged 528 paragraphs (348 annotated) with no cybersecurity keywords at all — suspicious for Item 1C content. Initial expectation: these are section bleed from adjacent filing sections, probably labeled None/Other. + +**Actual finding:** 65.2% (227 paragraphs) were labeled as real categories — mostly Risk Management Process (44.8%) and Management Role (10.6%). And the labels were **correct.** The paragraphs discuss security topics using synonymous terms: "risk assessment", "access to systems", "theft of intellectual property", "safeguards", "internal notifications" — all legitimate cybersecurity content that doesn't use the literal word "cybersecurity." The keyword filter was too narrow, not the paragraphs. All 348 are kept. + +### Heading-Stripped Paragraphs: Labels Still Valid + +For the ~5,643 annotated paragraphs where headings were stripped, existing labels are retained without re-annotation. The heading was a shortcut learning signal (near-perfect predictor of category), but annotators classified the body text, not the heading. 
Stripping the heading from training data removes a leaky feature without invalidating the label.

### Embedded Bullet Lists: The Cascade Failure

A spot-check of a Bancorp 34, Inc. paragraph revealed a class of structural corruption we hadn't detected. The paragraph read as a 114-word run-on:

> establishing and maintaining a comprehensive program to oversee and manager external connections and third-party relationships with access to the institution's technology assets maintaining an incident response program intended to enable us to mitigate the impact of, and recover from, any cyberattacks, and facilitate communication to internal and external experienced a single cybersecurity event in June of 2023...

The source HTML (filed via EFiling/XDX) had three clearly separate elements: two bullet-list table rows and a third row disclosing a $25,000 cybersecurity incident. The HTML structure was unambiguous — separate table rows with spacers between them.

**Root cause: a three-part cascade failure in the extraction pipeline.**

1. **Bullet character not recognized.** The HTML used `·` (middle dot in Symbol font) instead of `•` (standard bullet). `stripHtml()` doesn't decode it, so the bullet-aware merge logic in the segmenter never fires.
2. **Lowercase continuation merge.** Each bullet starts lowercase ("establishing...", "maintaining..."), so the segmenter treats them as continuation fragments of the previous block.
3. **Short-block append.** Individual bullets fall below the 20-word minimum, so they get appended to the previous paragraph.

The result: two process-description bullet items and an incident disclosure fused into one incoherent paragraph. Despite this, all 3 Stage 1 models unanimously labeled it Incident Disclosure / Specificity 4 — the $25K incident detail dominated the merged text.

We identified two classes of this failure:

1. **Semicolon-separated merges (1,941 paragraphs):** The semicolons from the original list survived, but the bullet characters were stripped. Detectable by heuristic (3+ semicolons, lowercase after each, no bullet markers).
2. **Invisible merges (222 paragraphs):** Even the semicolons were stripped, leaving text that simply runs together with no trace of the original list structure. The Bancorp 34 example falls in this category — "to internal and external experienced a single cybersecurity event" is an impossible English sentence that a regex cannot distinguish from legitimate prose. These were detected by a secondary heuristic (lowercase-start, not orphan-patched, 60+ words), but this is an undercount — some invisible merges start with uppercase text.

All 2,163 were reclassified to the "degraded" tier.
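The class-1 detector is a pure text heuristic; roughly (thresholds match the description above, not necessarily the exact production regex):

```typescript
// Flag probable semicolon-separated bullet merges: no surviving bullet
// characters, 3+ semicolons, each followed by a lowercase continuation.
function looksLikeSemicolonMerge(text: string): boolean {
  if (/[•·]/.test(text)) return false; // a bullet marker survived → different class
  const continuations = [...text.matchAll(/;\s+(\S)/g)];
  if (continuations.length < 3) return false;
  return continuations.every((m) => /^[a-z]$/.test(m[1]));
}
```

The invisible-merge heuristic (lowercase start, not orphan-patched, 60+ words) is weaker by construction — it can only undercount, as noted above.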
These aren't worth patching — splitting merged bullets requires per-paragraph HTML structure analysis and re-annotation of every resulting fragment. Instead, they'll be downweighted (0.5x) during fine-tuning to reduce overfitting to degraded text patterns while preserving their content signal. + +### Sample Weighting for Fine-Tuning + +The quality tier system maps directly to training sample weights: + +| Tier | Weight | Rationale | +|------|--------|-----------| +| clean | 1.0 | No issues | +| headed | 1.0 | Heading removed, body text intact | +| minor | 1.0 | Orphan word restored | +| degraded | 0.5 | Labels likely correct, but text structure doesn't match clean inference-time inputs | + +This is implemented via a `sample_weight` column in the training dataset. The HuggingFace Trainer supports per-sample loss weighting — each sample's cross-entropy loss is multiplied by its tier weight before backpropagation. Degraded paragraphs still contribute to learning, but their influence is halved relative to clean data. + +### Data Integrity Framework + +The audit produced a formal data integrity framework: + +1. `paragraphs-clean.jsonl` is frozen — the reproducibility anchor +2. All fixes go through `.patched.jsonl` — same schema, same IDs, updated text and hash +3. Annotations link by UUID — stable across patches +4. Never re-run extraction from HTML — cascade effects from merge logic cause thousands of ripple-effect changes +5. Every patch is documented with scope, method, validation, and annotation impact +6. 
Quality metadata is separate from text data — per-paragraph quality scores in a separate file + +### Quality Tier System + +Each paragraph gets a quality tier based on detected issues: + +| Tier | Criteria | Count | % | +|------|----------|-------|---| +| clean | No detected issues | 58,165 | 80.7% | +| headed | Had inlined heading (now stripped) | 7,402 | 10.3% | +| degraded | Embedded bullets, invisible merges, fragments, truncations | 4,331 | 6.0% | +| minor | Had orphan word (now fixed) | 2,147 | 3.0% | + +All "headed" and "minor" paragraphs have been patched — the tier records what *was* wrong for traceability. "Degraded" paragraphs are downweighted (0.5x) during fine-tuning. + +--- + +## Phase 11: DAPT Corpus Preparation + +### Corpus Cleaning + +The DAPT corpus is built from 14,759 cached 10-K HTML filings processed through `stripHtml()` + `cleanForDapt()`. Three rounds of cleaning were required: + +**Round 1** revealed XBRL data blobs (8.7% of docs, up to 33% of document text), page number artifacts, and exhibit listing boilerplate. Added targeted stripping for `iso4217:`, `xbrli:`, CIK-number sequences, and `F-N` page markers. + +**Round 2** removed URLs (39% of docs → 0.3%) and XBRL exhibit listing lines ("Inline XBRL Taxonomy Extension Calculation Linkbase Document" — present in 85% of filings). Initial investigation claimed these were "legitimate prose mentions of XBRL." Spot-checking showed every single remaining match was exhibit index boilerplate. Stripped any line containing "XBRL" unless it also contained cybersecurity/risk/governance terms. + +**Round 3** was a verification pass confirming the remaining 7.4% of docs with "XBRL" traces are legitimate prose co-occurrences with security terms. + +The page number regex initially had a branch matching `[- ]\d{1,3}[- ]` that produced 100% false positives — it was matching negative financial figures (`-1%`) in sensitivity analysis tables. Only the `F-\d+` pattern was genuine. 
The false-positive branch was removed. + +### Corpus Statistics (Final) + +| Metric | Value | +|--------|-------| +| Full corpus | 14,568 docs, ~1.056B tokens | +| Training subset | ~7,200 docs (newest 500M tokens, FY2024-2025) | +| Training sequences (seq_len=8192) | ~60K | +| Steps per epoch (eff. batch=32) | ~1,950 | +| Actual training time | ~13.5 hours (RTX 3090, 27s/step) | + +### Sequence Length Decision + +ModernBERT was pre-trained at 8192 tokens (Warner et al., 2024). We match this during DAPT to ensure all positional embedding and attention weights — including ModernBERT's alternating local/global attention pattern — receive gradient updates. At seq_len=2048, positions 2048-8191 would get no updates, and the global attention layers (every 3rd layer, RoPE theta 160K) would never see long-range context during DAPT. + +### Epoch Decision + +We train for 1 epoch (single pass), following the empirical consensus: + +- **Gururangan et al. (2020), "Don't Stop Pretraining" (ACL 2020):** Trained DAPT for "12.5K steps, which amounts to a single pass on each domain dataset" across 2-8B token corpora. Sufficient for consistent downstream gains across all four domains tested. +- **Ponnock (2025), arXiv:2512.12384:** Found SEC-specific DAPT shows "diminishing marginal returns beyond roughly 250M tokens" within a single epoch. Our 1B token corpus is well past the diminishing-returns threshold. + +### Hyperparameters Aligned with Prior ModernBERT DAPT Work + +We aligned hyperparameters with the ModernBERT paper and two published DAPT efforts: + +- **MLM probability (30%):** Matches ModernBERT pre-training (Warner et al., 2024). +- **Weight decay (1e-5):** Matches ModernBERT pre-training and both BioClinical-ModernBERT (Sounack et al., 2025) and Patent-ModernBERT (Luo et al., 2025). The commonly-cited 0.01 is a BERT/RoBERTa default that doesn't apply to ModernBERT. +- **Learning rate (5e-5):** Conservative because we start from the published post-decay checkpoint. 
BioClinical and Patent-ModernBERT used 3e-4 but started from pre-decay stable-phase checkpoints that the ModernBERT authors released specifically for continued pre-training. + +### Training Optimizations + +Initial training ran at ~47s/step (projected ~56 hours for 1B tokens). Through iterative optimization we brought this down to ~13.5 hours: + +1. **Flash Attention 2** (Dao, 2024) — installed via precompiled wheel after upgrading to PyTorch 2.11+cu130 (CUDA 13.0 to match the driver). Without FA2, ModernBERT fell back to O(n²) eager attention at 8192 seq_len. This cut s/step from ~47s to ~27s. + +2. **torch.compile** — JIT-compiles non-attention ops into fused CUDA kernels. With external FA2, Dynamo hits graph breaks at every attention layer, so there was **no compute speedup**. However, fusing the surrounding ops (FFN, layer norms, residuals) unexpectedly **halved activation memory** (18.2GB → 11.9GB at batch=2) by eliminating intermediate tensor allocations. + +3. **Batch size increase** — torch.compile's memory savings freed enough VRAM to increase from batch=2 to batch=4. At seq_len=8192 the GPU is already compute-saturated, so larger batches didn't meaningfully improve s/step (~27s in all configurations). The benefit was marginal reduction in gradient accumulation overhead. + +4. **Corpus subsampling** — the single biggest wall-time reduction. Ponnock (2025) showed diminishing returns past 250M tokens for SEC DAPT. Subsampling from 1.06B to 500M tokens (newest filings) halved training from ~29h to ~13.5h. + +5. **Fused AdamW + non-reentrant gradient checkpointing + tf32** — minor optimizations (~1-2% combined). Fused optimizer merges parameter updates into a single kernel. Non-reentrant checkpointing enables torch.compile compatibility. + +**What didn't work:** Increasing batch size beyond 2 provided no s/step improvement because the 3090 is compute-saturated at seq_len=8192 (attention is O(n²) FLOPs even with FA2). 
SDPA (PyTorch's native attention) couldn't replace external FA2 without OOMing due to different memory allocation patterns. torch.compile couldn't accelerate the attention bottleneck because FA2's custom CUDA kernels are opaque to Dynamo's graph tracer. + +**The fundamental constraint** is hardware: the RTX 3090's 35.6 bf16 TFLOPS sets a hard ceiling on throughput at 8192 seq_len. An AWS g7e.2xlarge (RTX PRO 6000 Blackwell, 236 bf16 TFLOPS, 96GB VRAM) could complete the same run in ~3.7 hours for ~$5 on spot pricing — the 96GB VRAM allows dropping gradient checkpointing entirely (eliminating activation recomputation) and running batch=16. + +Full procedure, optimization journey, and cloud cost analysis in `docs/DAPT-PROCEDURE.md`. + +### Early Training Results + +| Step | Loss | grad_norm | LR | Epoch | Note | +|------|------|-----------|-----|-------|------| +| 54 | 0.7991 | 0.066 | 2.66e-5 | 0.03 | Warmup phase | +| 1280 | 0.7233 | 0.068 | 1.57e-5 | 0.70 | Steady decline | +| 1800 | 0.7253 | 0.073 | 1.48e-6 | 0.97 | LR near zero, loss plateaued | +| **Final** | **0.7250** | **0.043** | **5.7e-8** | **1.00** | **Eval loss: 0.7250, perplexity: 1.65** | + +The loss dropped from 0.80 → 0.72 — a gentle 10% decline over one epoch. For comparison, a randomly initialized model would start at ~10.8 (ln(50280 vocab size)). Starting at 0.80 reflects that ModernBERT already knows English; DAPT taught it SEC-specific token co-occurrence patterns ("NIST CSF", "materiality assessment", "tabletop exercise"), not language fundamentals. grad_norm remained stable at 0.04-0.07 throughout with zero instability. Total training time: ~14 hours across two sessions on an RTX 3090 (resumed from checkpoint-1280). + +The DAPT checkpoint is saved at `checkpoints/dapt/modernbert-large/final/` and is ready for TAPT. + +### TAPT Configuration + +The TAPT corpus is 72K Item 1C paragraphs (~10M tokens) — 50x smaller than the DAPT corpus. This changes several training decisions vs. DAPT. 
Config file: `python/configs/tapt/modernbert.yaml`. + +| Parameter | DAPT | TAPT | Rationale for change | +|-----------|------|------|---------------------| +| `max_seq_length` | 8192 | 512 | Data-driven: paragraphs average 127 tokens (P99=386, 99.6% fit in 512). Using 8192 would mean 98.5% padding — pure waste. See seq_len discussion below. | +| `num_train_epochs` | 1 | 5 | Gururangan et al. (2020) ran 100 epochs on 50-500K token TAPT corpora. We match total token exposure: 5 × 10M = 50M tokens ≈ upper bound of their TAPT exposure. | +| `whole_word_mask` | false | true | Masks entire words instead of subword pieces. Prevents trivially solvable masking patterns (e.g., masked `cyber` next to unmasked `security`). The model already knows subword composition from DAPT — TAPT should focus on domain-specific whole words ("CISO", "materiality", "tabletop"). | +| `per_device_train_batch_size` | 4 | 32 | Short sequences free VRAM. Tested: batch=32 uses 22.7 GB with torch.compile (vs. OOM at batch=48). | +| `gradient_accumulation_steps` | 8 | 1 | Effective batch = 32 in both cases. No accumulation needed since batch=32 fits directly. | +| `gradient_checkpointing` | true | false | Not needed at seq_len=512 — activations are small. Gradient checkpointing would slow training 30-40% for no memory benefit. | +| `save_strategy` / `eval_strategy` | steps (256) | epoch | 5 epochs; checkpoint and evaluate after each one. | +| `validation_split` | 0.02 | 0.05 | Larger val split for a 50x smaller dataset — need enough samples for stable eval loss. | + +**Sequence length (512 vs. 8192):** The concern with a shorter seq_len is degrading the model's long-range attention capabilities. Three factors make this a non-issue for TAPT: + +1. **The data is short.** Paragraphs average 127 tokens. There is no long-range structure to learn — the information simply isn't there. +2. **Scale of exposure.** TAPT is 50M token-exposures (5 epochs × 10M). 
ModernBERT was pre-trained on ~2T tokens; DAPT added 500M. 50M is 0.0025% of original pre-training — far too small to cause catastrophic forgetting of patterns established over trillions of tokens. +3. **RoPE positions are independent.** ModernBERT uses rotary position embeddings. Positions 0-511 compute identically whether max_length is 512 or 8192. Training at 512 updates the same parameters; positions 512-8191 remain as-is from DAPT, not degraded. + +**Whole-word masking and tokenization:** Whole-word masking requires `offset_mapping` from the tokenizer to determine word boundaries. This is incompatible with DAPT's concatenate-and-chunk approach (which destroys offset_mapping by merging documents). TAPT tokenizes each paragraph individually with truncation, preserving offset_mapping. The data collator handles dynamic padding per batch. This is a different code path from DAPT's concatenation, but the data justifies it: paragraphs are natural self-contained units, unlike DAPT's long filings that must be chunked. + +**Training time:** ~2,139 steps/epoch × 5 epochs = ~10,695 total steps. 50 minutes on the RTX 3090 at ~3.56 steps/s (averaged over full run including torch.compile warmup). + +### TAPT Results + +| Metric | Value | +|--------|-------| +| Epochs | 5 | +| Total steps | 10,695 | +| Training time | 50 minutes | +| Initial loss | 1.46 | +| Final train loss (avg) | 0.6428 | +| Final eval loss | 1.0754 | +| Final perplexity | 2.11 | +| Throughput | 114 samples/s, 3.56 steps/s | + +Loss dropped from 1.46 → 1.08 over 5 epochs. For comparison, DAPT ended at eval loss 0.72 with standard subword masking at the same 30% rate — the gap reflects the harder whole-word masking objective (no subword hints), not a weaker model. The model learns to predict masked domain terms ("CISO", "materiality", "tabletop") from surrounding paragraph context alone, which is exactly the inductive bias TAPT is designed to create. 
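The seq_len arithmetic above, and the per-batch dynamic padding the collator performs, can be sanity-checked in a few lines (a sketch — the figures are the corpus averages quoted above, and the padding helper is illustrative rather than the production collator):

```python
# Sanity-check the seq_len decision: average Item 1C paragraph is ~127 tokens
# (corpus stat quoted above), so padding to 8192 wastes almost the whole batch.
AVG_TOKENS = 127

def padding_fraction(avg_len: int, max_len: int) -> float:
    """Fraction of a padded sequence that is pure padding, on average."""
    return 1 - avg_len / max_len

# Dynamic padding: pad each batch only to its own longest member,
# not to a global max_length. Illustrative, not the production collator.
def dynamic_pad(batch: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

print(f"padding at 8192: {padding_fraction(AVG_TOKENS, 8192):.1%}")  # ~98.4%
print(f"padding at 512:  {padding_fraction(AVG_TOKENS, 512):.1%}")   # ~75.2%
print(dynamic_pad([[5, 6, 7], [8, 9], [10]]))  # [[5, 6, 7], [8, 9, 0], [10, 0, 0]]
```

Even at 512, averaged padding looks high — which is why the collator pads per batch rather than to the max length, so typical batches carry far less dead compute than the worst case.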
+
+The TAPT checkpoint is saved at `checkpoints/tapt/modernbert-large/final/` and is ready for fine-tuning.
+
+### TAPT Launch — Whole-Word Masking Bugs
+
+Launching TAPT required fighting through three bugs in `transformers`' `DataCollatorForLanguageModeling` when `whole_word_mask=True`, plus a Python 3.14 incompatibility that forced a version rollback.
+
+**Bug 1: `offset_mapping` stripped before reaching the collator.** The Trainer's default `remove_unused_columns=True` drops any dataset column not in the model's `forward()` signature. Since `offset_mapping` is a collator input (not a model input), it was silently removed, causing the collator to receive a 0-dimensional array and crash with `IndexError: too many indices for array`. Fix: set `remove_unused_columns=False` when whole-word masking is enabled.
+
+**Bug 2: `offset_mapping` can't survive `tokenizer.pad()`.** Even with the column present, the collator's `torch_call()` passes all features — including `offset_mapping` — through `tokenizer.pad()`, which tries to tensorize the variable-length nested lists and crashes with `ValueError`. The collator pops `offset_mapping` *after* padding, but padding already failed. Fix: subclass `DataCollatorForLanguageModeling` to strip `offset_mapping` before padding.
+
+**Bug 3: `offset_mapping` word boundary detection is broken for BPE tokenizers.** This was the most insidious bug — training ran but loss was ~6-8 (near-random, vs. expected ~1.5-2.0). The upstream `_calc_word_ids_and_prob_mask` detects word boundaries by checking if `token_start != prev_token_end` in the offset mapping. But BPE tokenizers (like ModernBERT's) absorb leading spaces into tokens, making ALL offsets contiguous: `"The" → (0,3), " company" → (3,11)`. Since 3 == 3, the algorithm treats the entire sequence as one giant "word." When 30% masking is applied to these mega-groups, it masks enormous contiguous spans, making prediction nearly impossible. 
+ +**Fix:** Replaced `offset_mapping` entirely with the tokenizer's `word_ids()` method, which correctly identifies word boundaries for any tokenizer type (BPE, WordPiece, SentencePiece). The `WholeWordMaskCollator` in `python/src/dapt/train.py` implements whole-word masking from scratch: extracts `word_ids` before padding, selects `mlm_probability` fraction of unique word IDs per sequence, and masks all tokens belonging to selected words. + +**Python 3.14 incompatibility.** Two separate issues forced a rollback to Python 3.13: +1. Python 3.14 changed the multiprocessing start method from `fork` to `forkserver`, requiring picklable dataloader collators (closures crash with `PicklingError`). +2. Python 3.14 changed `pickle.Pickler._batch_setitems` to take 3 arguments, breaking `dill` (used by `datasets` for config hashing). This was unfixable — even `dill` 0.4.1 and `datasets` 4.8.4 crashed. The breakage is deep in the `datasets` builder machinery and hit every codepath (`load_dataset`, `Dataset.from_list`, `dataset.map`). + +Rolled `pyproject.toml` from `requires-python = ">=3.14"` to `">=3.13,<3.14"` and updated the flash-attn wheel URL from cp314 to cp313. + +--- + +## Cost and Time Ledger + +### Tooling + +All code was written collaboratively with **Claude Code** (Anthropic's agentic coding CLI). Claude Code was used throughout the project for pipeline development, prompt engineering, data analysis, script writing, documentation, and strategic planning. The tool dramatically accelerated iteration speed — writing analysis scripts, debugging extraction edge cases, and exploring the annotation data interactively — but all decisions were made by the team with Claude Code as an implementation partner. 
+ +### API Cost Ledger + +| Phase | Cost | Annotations | Notes | +|-------|------|-------------|-------| +| Stage 1 prompt iteration (pilots) | $7.03 | 9,597 | 12+ versions: 5 × 40-sample + 6 × 500-sample | +| Stage 1 model bench (6 candidates) | $3.41 | 2,993 | seed, mimo, glm-4.5-air, minimax, mistral, nemotron | +| Mimo pilot (dedicated comparison) | $0.24 | 500 | `mimo-pilot.ts` — replace-nano scenario modeling | +| Stage 1 run #1 (with nano) | $112.42 | 150,009 | Full production run with gpt-5.4-nano. Completed, but nano's quality was unacceptable (0 reasoning tokens 64% of the time). Gemini+grok annotations ($91.18) preserved in `stage1-gemini-grok.jsonl`; only nano's annotations ($21.24) were discarded. Full original in `stage1.jsonl.bak`. | +| Stage 1 run #2 (mimo only) | $24.69 | 50,003 | Ran only mimo to replace nano. Merged with preserved gemini+grok annotations to form final `stage1.jsonl` ($115.88 total value, $24.69 new spend). | +| Judge model bench (8 candidates) | $5.97 | 505 | GLM-5 (4 configs), gpt-5.4-mini, gpt-5.4, sonnet-4.6, gemini-3-flash, grok-4.20, mimo-v2-pro, kimi-k2.5 | +| Orphan word re-annotation | $3.30 | 4,611 | Re-ran Stage 1 on 1,537 patched paragraphs × 3 models. 7.7% changed consensus category. | +| **Total API spend** | **$159** | **~218K unique** | Nano waste: $21.24 | + +Only nano's portion ($21.24) of the first run was wasted — the gemini and grok annotations were preserved and merged with the new mimo annotations. Still, $21.24 thrown away on a model that wasn't thinking. The lesson: benchmark model candidates rigorously *before* committing to a production run. The 40-sample pilots showed nano was the weakest link but were misleadingly optimistic about the magnitude of the problem. + +### Time Ledger + +| Phase | Hours | Notes | +|-------|-------|-------| +| Data acquisition + HTML cleaning | ~6h | Extraction pipeline, HTML artifact handling, dedup, 8-K discovery. 
The messiest phase — SEC filing HTML variability required extensive regex heuristics and iteration. | +| Stage 1 annotation run #1 (nano) | ~5h | Production run wall clock (~300 min). Completed but results were below quality bar. | +| Stage 1 annotation run #2 (mimo) | ~1h | Only needed mimo annotations at higher concurrency (gemini+grok reused). | +| Prompt iteration + model benchmarking | ~4h | 12+ prompt versions, 6 model candidates, pilot analysis | +| Post-Stage 1 analysis + Stage 2 planning | ~5h | Distributional analysis, model bias discovery, codebook v3.0 rulings, judge benchmarking, strategy revision | +| Data quality audit + remediation | ~4h | Generator investigation, 6 patches, orphan re-annotation, quality tier system, docs | +| Documentation + narrative | ~2h | Codebook updates, narrative writing, technical guide updates | +| Labelapp build + infrastructure | ~8h | Monorepo restructure, Next.js app, quiz/warmup/labeling flows, BIBD assignment, sampling, Docker deployment, timer + migration infrastructure | +| DAPT pre-training | ~14.5h GPU | 1 epoch on 500M tokens, RTX 3090. Two sessions (resumed from checkpoint-1280). | +| TAPT debugging + pre-training | ~2h dev + ~50min GPU | 4 bugs in transformers whole-word masking + Python 3.14 rollback. Training: 5 epochs on 72K paragraphs, 50 min. | +| Human labeling (1,200 paragraphs, 6 annotators) | 21.5h active | $0 (team labor) | +| Post-labeling analysis + gold set tooling | ~3h | $0 | +| **Total to date** | **~76.5h** | Includes ~15.3h GPU + 21.5h human labeling | + +### Remaining Work (estimated) + +| Phase | Est. Hours | Est. 
Cost | +|-------|-----------|-----------| +| GenAI holdout benchmark (6 models × 1,200) | ~1h | ~$15-43 | +| Opus golden re-run (1,200 paragraphs) | ~1h | $0 (subscription) | +| Gold set adjudication (13+ signals/paragraph) | ~4h | $0 | +| Training data assembly | ~2h | $0 | +| Fine-tuning + ablations (7 experiments) | ~12-20h GPU | $0 | +| Evaluation + comparison + write-up | ~6-8h | $0 | + +--- + +## Model Census — Every Model We Tried + +Over the course of the project, we evaluated **18 distinct models** across three phases: initial panel selection, Stage 1 replacement bench, and Stage 2 judge selection. Each decision narrowed the field based on empirical evidence. + +### Phase 0: Smoke Test (model-probe.ts) — 9 candidates + +Tested basic structured output compliance on a single paragraph before committing to expensive benchmarks. + +| Model | Provider | Result | +|-------|----------|--------| +| google/gemini-3.1-flash-lite-preview | Google | ✅ Pass — selected for panel | +| x-ai/grok-4.1-fast | xAI | ✅ Pass — selected for panel | +| openai/gpt-4.1-mini | OpenAI | ✅ Pass — not selected (cost) | +| openai/gpt-4.1-nano | OpenAI | ✅ Pass — later replaced by gpt-5.4-nano | +| anthropic/claude-haiku-4.5 | Anthropic | ✅ Pass — not selected (cost tier) | +| google/gemini-3.1-flash-preview | Google | ✅ Pass — too expensive for Stage 1 | +| deepseek/deepseek-chat-v3-0324:free | DeepSeek | Tested — free tier limitations | +| meta-llama/llama-4-maverick | Meta | Tested | +| qwen/qwen3-235b-a22b | Alibaba | Tested | + +### Phase 1: Early Pilots (v1.0-v1.2) — Original panel + +The very first panel used **gpt-oss-120b** (OpenAI's open-source 120B model), not nano: +- `google/gemini-3.1-flash-lite-preview` +- `openai/gpt-oss-120b` (also tested with `:exacto` routing suffix) +- `x-ai/grok-4.1-fast` + +gpt-oss-120b was replaced by gpt-5.4-nano between v1.2 and v2.1 — nano was cheaper and appeared to perform comparably on the small (n=40) pilot samples. 
+
+### Phase 2: 500-Sample Pilots (v2.2-v2.7) — Nano era
+
+Panel during the main prompt iteration:
+- `google/gemini-3.1-flash-lite-preview`
+- `openai/gpt-5.4-nano` ← the problem model
+- `x-ai/grok-4.1-fast`
+
+Nano's issues (0 reasoning tokens 64% of the time, erratic specificity) were persistent but masked: quality shifts across the 40→500 sample transition were attributed to prompt changes rather than model inadequacy.
+
+### Phase 3: Stage 1 Replacement Bench (model-bench.ts) — 6 candidates
+
+After locking prompt v2.5, we formally benchmarked replacements for nano:
+
+| Model | Provider | Reasoning Tokens | Cost/ann | Outcome |
+|-------|----------|-----------------|----------|---------|
+| xiaomi/mimo-v2-flash | Xiaomi | 1,346 | $0.00048 | **✅ Winner** — best value, lowest outlier rate |
+| bytedance-seed/seed-2.0-lite | ByteDance | 658 | $0.00227 | Runner-up — highest accuracy but 4.7x more expensive |
+| z-ai/glm-4.5-air | Zhipu AI | 854 | $0.00136 | Mediocre — barely moved the needle (+0.8pp) |
+| minimax/minimax-m2.5 | MiniMax | 590 | $0.00106 | Mediocre — slightly worse than nano (-1.0pp) |
+| mistralai/mistral-small-2603 | Mistral | **0** | $0.00015 | ❌ Zero reasoning tokens. Cheapest but useless. |
+| nvidia/nemotron-3-super-120b-a12b | NVIDIA | 942 | $0.00152 | ❌ Worst performer despite being expensive. 21% outlier rate. |
+
+### Phase 4: Production Stage 1 — Final panel
+
+- `google/gemini-3.1-flash-lite-preview` (Google)
+- `xiaomi/mimo-v2-flash` (Xiaomi) ← replaced nano
+- `x-ai/grok-4.1-fast` (xAI)
+
+Three models from three providers — minimizes correlated errors. 
+
+### Phase 5: Stage 2 Judge Bench (judge-bench.ts) — 8 candidates
+
+| Model | Provider | Mode | Both vs Gold | Fails | Outcome |
+|-------|----------|------|-------------|-------|---------|
+| z-ai/glm-5 | Zhipu AI | structured | 77-80% | 4-12% | Best accuracy but unreliable structured output |
+| z-ai/glm-5 | Zhipu AI | tool calling | 72% | 0% | Reliable but -7pp accuracy |
+| openai/gpt-5.4-mini | OpenAI | structured | 68% | 0% | Reliable, weaker on specificity |
+| openai/gpt-5.4 | OpenAI | structured | Tested | 0% | Expensive, diminishing returns over mini |
+| anthropic/claude-sonnet-4.6 | Anthropic | structured | Used for gold | 0% | Gold label creation, too expensive for production judge |
+| google/gemini-3-flash-preview | Google | structured | Tested | — | Rubber-stamped majority — added zero value |
+| x-ai/grok-4.20-beta | xAI | structured | Tested | — | Benchmarked |
+| xiaomi/mimo-v2-pro | Xiaomi | structured | Tested | — | Benchmarked |
+| moonshotai/kimi-k2.5 | Moonshot AI | structured | Tested | — | Only 26/50 completed — high failure rate |
+
+### Phase 6: Holdout Benchmark — 6 models from 6 suppliers
+
+After human labeling was complete, six models were benchmarked against the 1,200 holdout with the v3.0 prompt:
+
+| Model | Provider | Cost/call | Latency | Notes |
+|-------|----------|-----------|---------|-------|
+| openai/gpt-5.4 | OpenAI | $0.009 | 5s | |
+| moonshotai/kimi-k2.5 | Moonshot | $0.006 | 33s | |
+| google/gemini-3.1-pro-preview | Google | $0.006 | 3s | |
+| z-ai/glm-5 | Zhipu | $0.006 | ~40s | exacto routing |
+| minimax/minimax-m2.7 | MiniMax | $0.002 | 11s | Raw text mode (markdown fences) |
+| xiaomi/mimo-v2-pro | Xiaomi | $0.006 | 32s | exacto routing |
+
+Plus Opus 4.6 (Anthropic) via Agent SDK on all 1,200 holdout paragraphs. 
+
+### Summary: 28 Models, 14 Providers
+
+| Provider | Models Tested | Role |
+|----------|--------------|------|
+| Google | gemini-3.1-flash-lite, gemini-3.1-flash, gemini-3-flash, gemini-3.1-pro | Stage 1 panel + benchmark |
+| OpenAI | gpt-oss-120b, gpt-5.4-nano, gpt-4.1-mini, gpt-4.1-nano, gpt-5.4-mini, gpt-5.4 | Benchmark |
+| xAI | grok-4.1-fast, grok-4.20-beta | Stage 1 panel |
+| Xiaomi | mimo-v2-flash, mimo-v2-pro | Stage 1 panel + benchmark |
+| Anthropic | claude-haiku-4.5, claude-sonnet-4.6, claude-opus-4.6 | Gold labels (Opus), judge |
+| Zhipu AI | glm-4.5-air, glm-5 | Benchmark |
+| MiniMax | minimax-m2.5, minimax-m2.7 | Benchmark |
+| Moonshot AI | kimi-k2.5 | Benchmark |
+| ByteDance | seed-2.0-lite | — (too expensive for scale) |
+| NVIDIA | nemotron-3-super-120b | — (worst performer) |
+| Mistral | mistral-small-2603 | — (zero reasoning) |
+| Meta | llama-4-maverick | — (smoke test only) |
+| Alibaba | qwen3-235b-a22b | — (smoke test only) |
+| DeepSeek | deepseek-chat-v3-0324 | — (smoke test only) |
+
+---
+
+## Phase 13: GenAI Holdout Benchmark
+
+### Benchmark Panel
+
+With human labeling complete, the next step is running 6+ GenAI models from 3+ suppliers on the same 1,200 holdout paragraphs — both as an assignment requirement and to generate the 13+ annotation signals needed for gold set adjudication. 
+ +The benchmark panel uses the v3.0 prompt (with codebook rulings) and runs via OpenRouter: + +| Model | Supplier | Cost/call | Latency | Structured Output | +|-------|----------|-----------|---------|-------------------| +| openai/gpt-5.4 | OpenAI | $0.009 | 5s | Native | +| moonshotai/kimi-k2.5 | Moonshot | $0.006 | 33s | Native | +| google/gemini-3.1-pro-preview | Google | $0.006 | 3s | Native | +| z-ai/glm-5 | Zhipu | $0.006 | ~40s | Native (exacto routing) | +| minimax/minimax-m2.7 | MiniMax | $0.002 | 11s | Raw text + fence stripping | +| xiaomi/mimo-v2-pro | Xiaomi | $0.006 | 32s | Native (exacto routing) | + +Plus Claude Opus 4.6 via Agent SDK (subscription, no per-call cost) with full codebook as system prompt. + +Combined with the 3 Stage 1 models already on file: **10 models from 8 suppliers**. + +**Minimax structured output workaround:** MiniMax m2.7 wraps JSON responses in markdown code fences (` ```json ... ``` `), which the Vercel AI SDK's `Output.object()` parser cannot handle. Rather than using tool calling (which drops accuracy ~7pp based on GLM-5 testing) or a fallback retry (2x cost), minimax models skip structured output entirely and use raw text generation with regex fence stripping before Zod validation. The enum values are correct with the full v3.0 prompt; only the fences are the issue. + +### Opus Golden Re-Run + +The Opus golden labeling was re-run on the correct 1,200 holdout paragraphs. A previous run had annotated a different set of 1,200 paragraphs due to `.sampled-ids.json` being overwritten (previous labels preserved at `data/annotations/golden/opus.wrong-sample.jsonl`). The re-run uses parallelized Agent SDK workers (configurable concurrency) with serialized file writes for crash safety. 
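The resume-and-serialize pattern described above can be sketched like this (illustrative Python with a stand-in `annotate()`; the actual runner is the TypeScript/Agent SDK implementation):

```python
import json
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

write_lock = threading.Lock()

def load_done_ids(path: str) -> set[str]:
    """Resume support: collect paragraph IDs already written to the JSONL."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def annotate(paragraph_id: str) -> dict:
    # Stand-in for the model call (category/specificity values are dummies).
    return {"id": paragraph_id, "category": "RMP", "specificity": 2}

def run(paragraph_ids: list[str], out_path: str, concurrency: int = 4) -> None:
    done = load_done_ids(out_path)
    todo = [p for p in paragraph_ids if p not in done]

    def worker(pid: str) -> None:
        record = annotate(pid)
        with write_lock:  # serialize appends so a crash never tears a line
            with open(out_path, "a") as f:
                f.write(json.dumps(record) + "\n")

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(worker, todo))

out = os.path.join(tempfile.mkdtemp(), "opus.jsonl")
run(["p1", "p2", "p3"], out)
run(["p1", "p2", "p3", "p4"], out)  # resume: only p4 is annotated
print(len(load_done_ids(out)))  # 4
```

Because every completed record hits disk inside the lock, a crash loses at most the in-flight calls, and the next invocation picks up exactly where the file left off.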
+ +--- + +## Key Technical Artifacts + +| Artifact | Location | Description | +|----------|----------|-------------| +| Labeling codebook | `docs/LABELING-CODEBOOK.md` | Authoritative reference, v3.0 with codebook rulings | +| Stage 1 annotations | `data/annotations/stage1.jsonl` | 150,009 annotations (120 MB) | +| Paragraphs | `data/paragraphs/paragraphs-clean.jsonl` | 72,045 paragraphs with filing metadata | +| Gold labels | `data/bench/judges/gold-final.json` | 50 adjudicated gold labels | +| Gold adjudications | `data/bench/judges/gold-adjudicated.json` | 11 detailed adjudication decisions with reasoning | +| Human labels (raw) | `data/gold/human-labels-raw.jsonl` | 3,600 labels with timing, notes, session IDs | +| Human label metrics | `data/gold/metrics.json` | Full IRR: per-dimension alpha, pairwise kappa matrices, per-category/stratum rates | +| Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` | 1,200 holdout paragraphs with Stage 1 consensus metadata | +| Diagnostic charts | `data/gold/charts/` | 16 analysis charts (kappa heatmaps, confusion matrices, distributions, etc.) 
| +| Analysis script | `scripts/analyze-gold.py` | Comprehensive cross-source analysis (human × Stage 1 × Opus) | +| Annotation prompt | `ts/src/label/prompts.ts` | SYSTEM_PROMPT (v3.0) + buildJudgePrompt() | +| Annotation runner | `ts/scripts/stage1-run.ts` | Resume-safe, configurable concurrency | +| Orphan re-annotation | `ts/scripts/rerun-orphan-stage1.ts` | Re-ran 1,537 patched paragraphs, $3.30 | +| Re-annotation diff | `ts/scripts/diff-orphan-annotations.ts` | Category/specificity change analysis | +| No-cyber analysis | `ts/scripts/analyze-no-cyber.ts` | Label distribution on 348 flagged paragraphs | +| Data quality audit | `docs/DATA-QUALITY-AUDIT.md` | Full audit: generators, patches, quality tiers | +| Generator reference | `docs/EDGAR-FILING-GENERATORS.md` | 14 vendors with signatures and quality profiles | +| Analysis scripts | `ts/scripts/stage1-analyze.ts`, `segment-analysis.ts`, `model-bias-analysis.ts`, `dispute-crosstab.ts`, `sample-disputes.ts` | Deep analytics on annotation data | +| Judge benchmarking | `ts/scripts/judge-bench.ts` | Supports structured/tool modes, gold label comparison | +| Judge diagnostics | `ts/scripts/judge-diag.ts`, `judge-diag-batch.ts` | GLM-5 failure investigation | +| Model benchmarking | `ts/scripts/model-bench.ts` | Stage 1 candidate evaluation | +| Golden annotation (Opus) | `ts/src/label/golden.ts` | Agent SDK runner for gold set, saves reasoning traces | +| Golden annotations | `data/annotations/golden/opus.jsonl` | Opus 4.6 labels + thinking + raw confidence (re-run on correct holdout) | +| Benchmark annotations | `data/annotations/bench-holdout/{model}.jsonl` | 6 models × 1,200 paragraphs, v3.0 prompt | +| Stale golden (wrong sample) | `data/annotations/golden/opus.wrong-sample.jsonl` | Original Opus run on wrong 1,200 paragraphs (preserved) | + +--- + +## Phase 14: 13-Signal Analysis & F1 Strategy + +### Benchmark Complete + +All 6 benchmark models + Opus completed 1,200 annotations each. 
Total benchmark cost: $45.63. Every paragraph in the holdout now has exactly 13 independent annotations: 3 human + 3 Stage 1 + 1 Opus + 6 benchmark. + +Model performance sorted by leave-one-out "both" accuracy (each source vs majority of other 12): Opus 4.6 (84.0%), Kimi K2.5 (83.3%), Gemini Pro (82.3%), GPT-5.4 (82.1%), GLM-5 (81.4%), MIMO Pro (81.4%), Grok Fast (80.0%). Best human: Xander at 76.9%. Worst: Aaryan at 15.8%. + +### The "Is Opus Special?" Question + +We tested whether Opus's apparent dominance was an artifact of using it as the reference. Answer: no. In leave-one-out analysis, Opus has the lowest "odd one out" rate at 7.4% — it disagrees with the remaining 12 sources less than any other source. But the top 6 GenAI models are within 3pp of each other — any could serve as reference with similar results. The 13-signal majority is 99.5% identical to the 10-GenAI majority; adding 3 human votes barely shifts consensus because 10 outvotes 3. + +### Adjudication Tiers + +The 13-signal consensus enables tiered adjudication: +- **Tier 1 (63.0%):** 756 paragraphs where 10+/13 agree on both dimensions. Auto-gold, zero human work. +- **Tier 2 (18.0%):** 216 paragraphs where human majority and GenAI majority agree. Cross-validated. +- **Tier 3 (2.2%):** 26 paragraphs where humans split but GenAI converges. +- **Tier 4 (16.8%):** 202 paragraphs with universal disagreement. Expert adjudication needed. + +81% of the holdout can be adjudicated automatically. The 202 Tier 4 paragraphs are dominated by MR↔RMP confusion (the #1 axis everywhere) and are the natural error analysis corpus. + +### Specificity: GenAI Is More Consistent Than Humans + +GenAI spec unanimity is 60.1% vs human spec unanimity of 42.2% (+18pp). Specificity calibration plots show that GPT-5.4, Gemini Pro, and Kimi K2.5 closely track Opus across all 4 specificity levels. MiniMax M2.7 is the only model with systematic specificity bias (−0.26 vs Opus). 
Among humans, Aaryan's +1.30 bias dwarfs all other sources. + +### F1 Strategy + +The assignment requires macro F1 > 0.80 on category. Based on the data: +- The best GenAI models agree with human majority ~83-87% on category +- Training on 35K+ unanimous Stage 1 labels with DAPT+TAPT should approach this ceiling +- The swing categories for macro F1 are MR (~65-80%), TPR (~70-90%), N/O (~60-85%) +- Focal loss for class imbalance + SCL for boundary separation + ensemble for robustness + +Key risk: the stratified holdout over-samples hard cases, depressing F1 vs a random sample. Mitigation: report F1 on both the full holdout and a proportional subsample. The delta quantifies model degradation at decision boundaries. + +### Cost Ledger Update + +| Phase | Cost | Time | +|-------|------|------| +| Stage 1 (150K annotations) | $115.88 | ~30 min | +| Orphan re-annotation | $3.30 | ~9 min | +| Benchmark (6 models × 1,200) | $45.63 | ~1h | +| Opus golden (1,200) | $0 (subscription) | ~30 min | +| Human labeling | $0 (class assignment) | 21.5h active | +| Post-labeling analysis | ~3h | | +| **Total API** | **$164.81** | | + +--- + +## Phase 15: Codebook v3.5 — The Prompt Drift Discovery + +### The Problem + +Cross-analysis of human vs GenAI labels on the holdout revealed a systematic, directional disagreement on three axes: + +1. **SI↔N/O (23:0 asymmetry):** When humans and GenAI disagreed on this axis, humans ALWAYS called it SI and GenAI called it N/O. Never the reverse. Root cause: the labelapp trained humans that any language connecting cybersecurity to business materiality — even forward-looking ("could materially affect") — is SI at Specificity 1. Stage 1 models (v2.5 prompt) lacked this rule entirely. Even v3.0 benchmark models, which had the backward-looking materiality rule, were conservative about forward-looking variants. + +2. **MR↔RMP (253 paragraphs, 38:13 asymmetry):** GenAI systematically calls MR paragraphs RMP. 
The v3.0 "person-vs-function test" helps but leaves genuinely mixed paragraphs (both person and process as grammatical subjects) unresolved. These near-even splits need a deterministic tiebreaker chain. + +3. **BG↔MR (149 paragraphs, 33:6 asymmetry):** GenAI systematically under-calls BG. The problem is governance chain paragraphs that describe the board receiving reports from management — is this about the board's oversight function or the officer's reporting duty? + +### The Audit + +A Stage 1 audit found ~1,076 paragraphs (649 unanimous + 383 majority N/O) with materiality language that should be SI under the broadened rule. 1.3% of the corpus overall — but potentially concentrated on exactly the boundary cases the holdout over-samples. On the holdout, mimo-v2-flash was actually the most accurate Stage 1 model on this axis, dissenting toward SI 263 times when the other two said N/O. + +The MR↔RMP and BG↔MR axes are cleaner in Stage 1 unanimity — only 0.2% of unanimous BG labels are problematic, and the MR/RMP tiebreaker mainly affects disputed labels (already going to Stage 2). The v2.5→v3.5 gap is primarily an SI↔N/O problem. + +### Initial v3.5 Rulings (Round 1) + +Three rulings, all driven by the 13-signal cross-analysis: + +**Rule 6 broadened (SI↔N/O):** ALL materiality language → SI, not just backward-looking disclaimers. Forward-looking ("could materially affect"), conditional ("reasonably likely to"), and negative assertions ("have not experienced material incidents") are all Strategy Integration at Specificity 1. + +**Rule 2 expanded (BG↔MR):** Added the board-line test with governance hierarchy layers and a dominant-subject test for cross-layer paragraphs. + +**Rule 2b expanded (MR↔RMP):** Three-step decision chain: subject test → person-removal test → qualifications tiebreaker. 
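As a sketch, the Rule 2b chain reduces to ordered checks. The snippet below is illustrative only: the real tests are applied by an LLM or human reading the paragraph, and the keyword lists are naive stand-ins for that judgment, not the project's classifier.

```python
# Illustrative sketch of the v3.5 MR<->RMP tiebreaker chain.
# Keyword heuristics below are naive stand-ins for LLM/human judgment.

PERSON_MARKERS = ("ciso", "chief information security officer", "vp of", "director of")
QUALIFICATION_MARKERS = ("cissp", "years of experience", "previously served", "holds a")
PROCESS_MARKERS = ("program", "process", "assess", "monitor", "framework")

def mr_vs_rmp(paragraph: str) -> str:
    text = paragraph.lower()
    # Step 1: subject test. A person as grammatical subject is a signal,
    # not a conclusion, so we always continue to Step 2.
    person_subject = any(m in text for m in PERSON_MARKERS)
    # Step 2: person-removal test. Would the paragraph still make sense with
    # the name/title/credentials removed? If yes, it is about the function.
    has_quals = any(m in text for m in QUALIFICATION_MARKERS)
    describes_function = any(m in text for m in PROCESS_MARKERS)
    if describes_function and not has_quals:
        return "RMP"
    # Step 3: qualifications tiebreaker. Credentials/background dominate.
    if person_subject or has_quals:
        return "MR"
    return "RMP"
```

The important property is the ordering: the subject test alone is never decisive, and the person-removal test runs regardless of what Step 1 observed.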
+ +These rulings were tested by re-running all 7 benchmark models (6 OpenRouter + Opus) on 359 confusion-axis holdout paragraphs with the v3.5 prompt ($18, stored separately from v3.0 data). + +### The Prompt Drift Lesson + +Running Stage 1 (150K annotations) before human labeling created a subtle but significant problem: the codebook evolved through v2.5 → v3.0 → v3.5, but the training data is frozen at v2.5. Each codebook revision was driven by empirical analysis of disagreement patterns — which required the Stage 1 data AND human labels to exist first. The dependency is circular: you can't know what rules are needed until you see where annotators disagree, but you can't undo the labels already collected. + +### Iteration: 6 Rounds on 26 Regression Paragraphs ($1.02) + +The initial v3.5 re-run revealed that the rulings over-corrected. We identified 26 "regression" paragraphs — cases where v3.0 matched human majority but v3.5 did not — and iterated the prompt using GPT-5.4 on these 26 paragraphs ($0.17/round) to diagnose and fix each over-correction. + +**Round 1 (v3.5a) — 5/26.** Catastrophic. All three rulings over-fired simultaneously. SI was called on every paragraph with the word "material." BG was called whenever a committee was named. MR was called whenever a person was a grammatical subject. The rulings were correct in intent but models interpreted them too aggressively. + +**Round 2 (v3.5b) — 13/25.** Three fixes: (A) Replaced the BG "dominant-subject test" with a "purpose test" — if the paragraph describes oversight structure, it's BG; mere committee mentions don't flip the category. (B) Made MR↔RMP Step 1 non-decisive — a person being the grammatical subject is a signal, not a conclusion; always proceed to Step 2 (person-removal test). (C) Added cross-reference exception for SI. Improvement: +8. 
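The per-round loop behind these scores is simple: run the candidate prompt on the frozen 26-paragraph regression set, compare each label to the human majority, and collect the misses for diagnosis. A minimal harness sketch (the `label_fn` callback stands in for the GPT-5.4 call; field names and helpers are hypothetical, not the project's actual scripts):

```python
from collections import Counter

def human_majority(votes: list[str]) -> str:
    """Majority category among the human annotators (ties: first-seen wins)."""
    return Counter(votes).most_common(1)[0][0]

def score_round(regression_set: list[dict], label_fn) -> tuple[int, list[str]]:
    """Score one prompt round against the frozen regression set.

    regression_set items look like {"id": ..., "text": ..., "human_votes": [...]}.
    Returns (match count, ids of missed paragraphs).
    """
    matches, misses = 0, []
    for para in regression_set:
        gold = human_majority(para["human_votes"])
        if label_fn(para["text"]) == gold:
            matches += 1
        else:
            misses.append(para["id"])
    return matches, misses
```

Keeping the regression set frozen across rounds is what makes the 5/26 to 22/26 trajectory comparable round to round.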
+ +**Round 3 (v3.5c) — 20/26.** The cross-reference exception eliminated the 5 most egregious SI over-predictions — paragraphs like "For a description of risks that may materially affect us, see Item 1A" that v3.5a called SI but are obviously N/O. These were pure pointers with materiality language embedded in the cross-reference text, not materiality assessments. +7. + +**Round 4 (v3.5d) — 22/26.** The critical insight: not all materiality language is a materiality *assessment*. Reading the 6 remaining errors revealed a spectrum: + +- "Cybersecurity risks have not materially affected our business strategy" → **Assessment** (conclusion about actual impact) → SI ✓ +- "Risks are reasonably likely to materially affect us" → **Assessment** (SEC Item 106(b)(2) standard) → SI ✓ +- "Cybersecurity threats could have a material adverse effect on our business" → **Speculation** (generic risk warning in every 10-K) → NOT SI ✗ +- "Managing material risks associated with cybersecurity" → **Adjective** ("material" means "significant") → NOT SI ✗ +- "...which could result in material adverse effects" at the end of an RMP paragraph → **Consequence clause** (doesn't override primary purpose) → NOT SI ✗ + +The tightened rule: only backward-looking conclusions and SEC-qualified forward-looking ("reasonably likely to") trigger SI. Generic "could have a material adverse effect" does not. This distinction — assessment vs. speculation — resolved 3 errors without breaking any correct calls. +2. + +We also verified each error against human annotator votes. All 6 remaining errors had the human majority correct (checked by reading the actual paragraph text and codebook rules). Interestingly, on 3 of the 6, the project lead's own label was the dissenting human vote — he had been the one calling these SI, validating that the over-calling pattern was a real and consistent interpretation difference, not random noise. + +**Round 5 (v3.5e) — 19/25.** Regression. 
We attempted to add an explicit BG↔RMP example ("CISO assists the ERMC in monitoring... → RMP") to the disambiguation guidance. This caused 3 previously correct paragraphs to flip to BG — the example made models hyper-aware of committee mentions and triggered BG more broadly. Lesson: **targeted examples can backfire when the pattern is too specific.** The model generalizes from the example in unpredictable ways.

**Round 6 (v3.5f) — 21/26.** Reverted the Round 5 BG↔RMP example. Kept the N/O↔RMP "actual measures" clarification from Round 5 (if a paragraph describes specific security measures the company implemented, it's RMP even in risk-factor framing). This stabilized at 21-22/26, with the 2-paragraph swing attributable to LLM non-determinism at temperature=0.

### The 4 Irreducible Errors

The remaining errors after Round 4/6 fall into two patterns:

**BG over-call on process paragraphs (2 errors):** A paragraph describing monitoring methods (threat intelligence, security tools, detection capabilities) where a management committee (ERMC) is woven throughout as the entity being assisted. Content is clearly RMP but the committee mention triggers BG. These are genuinely dual-coded — the monitoring IS part of the committee's function. Human majority says RMP (2-1 in both cases).

**N/O over-call on borderline RMP paragraphs (2 errors):** Paragraphs that describe risk management activities ("assessing, identifying, and managing material risks") but are framed as risk-factor discussions with threat enumeration. The SI tightening correctly stopped the models from calling these SI, but they overcorrected to N/O instead of RMP. The N/O↔RMP boundary depends on whether the paragraph describes what the company DOES (→ RMP) vs. what risks it faces (→ N/O). These paragraphs do both.

All 4 have human 2-1 splits — reasonable annotators disagree on these.
Further prompt iteration risks over-fitting to these 4 specific paragraphs at the cost of breaking the other 355 correctly-classified ones. + +### The SI Rule: Assessment vs. Speculation + +The most important finding from the iteration is the distinction between materiality *assessments* and materiality *language*: + +| Pattern | Classification | Reasoning | +|---------|---------------|-----------| +| "have not materially affected our business strategy" | **SI** | Backward-looking conclusion — the company is reporting on actual impact | +| "reasonably likely to materially affect" | **SI** | Forward-looking with SEC qualifier — Item 106(b)(2) disclosure | +| "have not experienced material cybersecurity incidents" | **SI** | Negative assertion — materiality conclusion about past events | +| "could have a material adverse effect" | **NOT SI** | Generic speculation — appears in every 10-K, not an assessment | +| "managing material risks" | **NOT SI** | Adjective — "material" means "significant," not a materiality assessment | +| "For risks that may materially affect us, see Item 1A" | **NOT SI** | Cross-reference — pointing elsewhere, not making a conclusion | +| "...which could result in material losses" (at end of RMP paragraph) | **NOT SI** | Consequence clause — doesn't override the paragraph's primary purpose | + +This distinction reduced the Stage 1 correction set from ~1,014 to 308 paragraphs. The original broad flag ("any paragraph with the word 'material'") caught ~700 paragraphs that were correctly labeled N/O by Stage 1 — they contained generic "could have a material adverse effect" boilerplate that is NOT a materiality assessment. Only 180 paragraphs contain actual backward-looking or SEC-qualified assessments that v2.5 miscoded. + +### Final v3.5 Gold Re-Run + +After locking the prompt at v3.5f, all 7 models (Opus + 6 benchmark) were re-run on the 359 confusion-axis holdout paragraphs with the final prompt (~$18). 
v3.0 data preserved in original paths (`bench-holdout/`, `golden/`). v3.5f results stored separately (`bench-holdout-v35/`, `golden-v35/`). The v3.0→v3.5 comparison — per model, per axis — is itself a publishable finding about how prompt engineering systematically shifts classification boundaries in frontier LLMs. + +### The SI↔N/O Paradox — Resolved + +The v3.5f re-run showed a troubling result: SI↔N/O accuracy *dropped* 6pp vs v3.0 (60% vs 66%), with the H=SI/M=N/O asymmetry worsening from 20 to 25 cases. The initial hypothesis was that models became globally conservative when told to distinguish assessment from speculation. + +A paragraph-by-paragraph investigation of all 27 SI↔N/O errors revealed the opposite: **the models are correct, and the humans are systematically wrong.** + +Of the 25 H=SI / M=N/O cases: +- ~20 are pure "could have a material adverse effect" speculation, cross-references to Item 1A, or generic threat enumeration — none containing actual materiality assessments. All 6 models unanimously call N/O. +- ~3 are genuinely ambiguous (SPACs with assessment language, past disruption without explicit materiality language). +- ~2 are edge cases (negative assertions embedded at end of BG paragraphs). + +Of the 2 H=N/O / M=SI cases: +- Both contain clear negative assertions ("not aware of having experienced any prior material data breaches", "did not experience any cybersecurity incident during 2024") — textbook SI. All 6 models unanimously call SI. + +**Root cause of human error:** Annotators systematically treat ANY mention of "material" + "business strategy" + "financial condition" as SI — even when wrapped in pure speculation ("could," "if," "may"). The codebook's assessment-vs-speculation distinction is correct; humans weren't consistently applying it. 
+ +**Codebook Case 9 contradiction fixed:** The investigation also discovered that Case 9 ("could potentially have a material impact" → SI) directly contradicted Rule 6 ("could = speculation, not assessment"). Case 9 has been corrected: the "could" example is now N/O, with explanation of why "reasonably likely to materially affect" (SEC qualifier) ≠ "could potentially" (speculation). + +Two minor prompt clarifications were added (consequence clause refinement for negative assertions, investment/resource SI signal) and tested on 83 SI↔N/O paragraphs ($0.55). Net effect: within stochastic noise — confirming the prompt was already correct. + +### Implications for Training + +- **Gold adjudication on SI↔N/O:** Trust model consensus over human majority. When 6/6 models unanimously agree and the paragraph contains only speculative language → use model label. Apply SI deterministically via regex for backward-looking assessments and SEC qualifiers. Expected impact: SI↔N/O accuracy rises from ~60% to ~95%+ against corrected gold labels. +- **Stage 2 judge** must use v3.5 prompt. This is where the codebook evolution actually matters for training data quality. +- **Stage 1 corrections re-flagged:** Tightened criteria reduced flagged paragraphs from 1,014 to 308 (180 materiality assessments + 128 SPACs). The 706 excluded paragraphs contained generic "could" boilerplate that was correctly labeled N/O by v2.5. +- **Gold adjudication on other axes:** On MR↔RMP and BG↔MR, v3.5 improves alignment with humans by ~4pp on hard cases but the improvement is more modest on easy cases. +- **MiniMax exclusion:** MiniMax M2.7 is a statistical outlier (z=−2.07 in inter-model agreement) and the most volatile model across prompt versions (40.7% category change rate). Data retained per assignment requirements but excluded from gold scoring majority. 
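The deterministic SI flag suggested above can be sketched as a small regex pass encoding the assessment-vs-speculation distinction. The patterns below paraphrase the rule for illustration; they are not the project's production regex:

```python
import re

# Assessments trigger SI: backward-looking conclusions and the SEC
# "reasonably likely" qualifier. Patterns are illustrative paraphrases.
ASSESSMENT_PATTERNS = [
    r"have\s+not\s+materially\s+affected",
    r"reasonably\s+likely\s+to\s+materially\s+affect",
    r"(?:have|did)\s+not\s+experienced?\s+(?:any\s+)?material\s+cybersecurity\s+incidents?",
]

def is_materiality_assessment(paragraph: str) -> bool:
    """True only for materiality assessments.

    Speculation ("could have a material adverse effect"), the bare adjective
    ("material risks"), and cross-references ("see Item 1A") never match.
    """
    text = paragraph.lower()
    return any(re.search(p, text) for p in ASSESSMENT_PATTERNS)
```

The determinism comes from the whitelist shape: speculative and adjectival uses of "material" simply never match an assessment pattern, so no exception logic is needed for them.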
+ +### Cost Ledger Update + +| Phase | Cost | Time | +|-------|------|------| +| v3.5 initial re-run (7 × 359) | ~$18 | ~10 min | +| v3.5 iteration (6 × 26 × GPT-5.4) | $1.02 | ~15 min | +| v3.5f final re-run (7 × 359) | ~$18 | ~10 min | +| SI↔N/O investigation (37 + 83 × GPT-5.4) | $0.55 | ~1 min | +| **v3.5 subtotal** | **~$37.57** | | +| **Running total API** | **~$202.57** | | + +--- + +## Lessons Learned + +### On Prompt Engineering +- Calibration examples beat rules. Each example targets a specific observed failure mode. +- Pilots must be large enough (500+). 40-sample pilots were misleadingly optimistic. +- More rules ≠ better. After the core structure is right, additional rules cause regression. +- The `specific_facts` chain-of-thought schema (forcing models to enumerate evidence before deciding) was the single most impactful structural change. +- **Rules over-correct before they converge.** The v3.5 iteration showed a consistent pattern: a new rule fixes the target problem but creates 2-3 new errors on adjacent cases. Each fix required a counter-fix. "Materiality language → SI" fixed the 23:0 asymmetry but created cross-reference false positives and speculation false positives that each required their own exception. Six rounds of test-fix-test were needed to reach equilibrium. +- **Targeted examples backfire.** Adding a specific example to a disambiguation rule ("CISO assists the ERMC in monitoring → RMP") caused regression elsewhere — models generalize from examples in unpredictable ways. General principles ("content matters more than names") are safer than specific examples in disambiguation guidance. +- **Assessment vs. language is a fundamental distinction.** The word "material" appears in thousands of SEC paragraphs but carries different force in different grammatical contexts. "Have not materially affected" (conclusion) vs. "could have a material adverse effect" (speculation) vs. "material risks" (adjective) are three different speech acts. 
Models don't naturally distinguish these without explicit guidance. +- **Check the humans — they can be systematically wrong.** On SI↔N/O, human annotators systematically over-called SI on any paragraph mentioning "material" + "business strategy," even when the language was pure speculation. The 25:2 asymmetry initially looked like model failure but was actually human failure to apply the assessment-vs-speculation distinction. When all 6 frontier models unanimously disagree with a 2/3 human majority, investigate before assuming the humans are right. The models' consistency (unanimous agreement across architectures and providers) is itself strong evidence. + +### On Model Selection +- Reasoning tokens are the strongest predictor of accuracy, not price or model size. +- Schema compliance varies — fix with Zod transforms, not prompt changes. +- Test both structured output AND tool calling for any candidate. They are not equivalent. + +### On Evaluation +- **Never evaluate against majority vote.** Build gold labels. Majority vote as ground truth makes models that rubber-stamp the majority look good. +- **Judge confidence is highly predictive** of accuracy. Use it to weight training samples. +- **Stage 1 confidence is useless** — cheap models are systematically overconfident (95%+ all-high). + +### On Data Quality at Scale +- The biggest wins come from understanding *where* and *why* models disagree, not from blanket improvements. +- Systematic model biases are quantifiable and predictable. Use them as signal, not noise. +- Codebook ambiguity causes more disagreement than model limitations. Three codebook rulings resolved more disputes than any prompt change. +- Not all labels need the same treatment. Confidence-stratified assembly beats uniform labeling. +- **Freeze originals, patch separately.** The single best data integrity decision was never modifying `paragraphs-clean.jsonl`. All fixes go through `.patched.jsonl` with the same UUIDs. 
This makes every change auditable, reversible, and safe to apply incrementally. Without this, the 6-patch iteration would have been terrifying. +- **Tag everything you can.** Generator metadata, quality tiers, and anomaly flags cost almost nothing to compute but make targeted remediation possible. Without generator tags, the 36.8% orphan rate in EFiling/XDX would have been invisible — diluted into a 4.7% corpus average. +- **Re-annotation is cheap and validating.** Re-running Stage 1 on 1,537 patched paragraphs cost $3.30 and took 9 minutes. It confirmed that 7.7% of consensus labels were wrong due to the data issue — an empirical validation that the patch was necessary, not just cosmetic. + +### On Training Infrastructure +- **Whole-word masking in `transformers` is broken for BPE tokenizers.** The upstream `DataCollatorForLanguageModeling(whole_word_mask=True)` uses `offset_mapping` to detect word boundaries by checking for gaps in character offsets. This fails silently for BPE tokenizers that absorb leading spaces — all offsets are contiguous, so the entire sequence becomes one "word." Loss appears to train but sits at ~6-8 (near-random). The fix is to use the tokenizer's `word_ids()` method, which correctly identifies word boundaries for any tokenizer type, and implement masking yourself. +- **Python 3.14 is not ready for ML.** Both `dill` (via `datasets`) and PyTorch's multiprocessing (`fork` → `forkserver`) have breaking incompatibilities. Rolling back to 3.13 was the only viable path. +- **Flash Attention is mandatory for long sequences.** Without FA2, ModernBERT at seq_len=8192 ran at ~47s/step on an RTX 3090. With FA2, the same configuration ran at ~25s/step — and enabled further optimizations (batch size increase, torch.compile) that pushed it further. +- **Align hyperparameters with the base model's pre-training config.** ModernBERT was trained with weight_decay=1e-5 and 30% MLM probability. 
Using the BERT/RoBERTa default of 0.01 weight decay would have been wrong. Both published ModernBERT DAPT papers (BioClinical, Patent) independently validated these values. +- **torch.compile + gradient_checkpointing together is more than the sum of its parts.** On ModernBERT, this combination resolves a memory anomaly specific to FA2 during MLM training (AnswerDotAI/ModernBERT#172), freeing VRAM for larger batch sizes. +- **Precompiled wheels save hours.** Building flash-attn from source requires matching CUDA toolkit versions, which is fragile. Precompiled wheels for the exact {python, torch, CUDA} combination avoid this entirely. +- **torch.compile's value can be memory, not speed.** When the bottleneck is opaque custom CUDA kernels (like FA2), torch.compile can't accelerate them. But it can still fuse the *surrounding* ops, dramatically reducing activation memory. In our case, compile provided 0% speedup but 35% memory reduction — enough to double the batch size. +- **Corpus subsampling is the biggest lever on consumer hardware.** When you're compute-bound, no software optimization can beat "process less data." The scaling laws literature (Ponnock 2025) provides empirical justification for stopping early. +- **At long sequence lengths, the GPU saturates at small batches.** Increasing batch from 2→4 at seq_len=8192 provided no s/step improvement on an RTX 3090 — the matmul dimensions are already large enough to fill all 82 SMs. This is the opposite of short-sequence fine-tuning where batch size scaling is the primary throughput lever. + +--- + +## References + +- Warner, B., Clavié, B., Soldaini, L., et al. (2024). "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Fine-tuning and Inference." arXiv:2412.13663. +- Gururangan, S., Marasovic, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N.A. (2020). "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." 
*Proceedings of ACL 2020*, pp. 8342-8360. +- Ponnock, J. (2025). "The Data Efficiency Frontier of Financial Foundation Models: Scaling Laws from Continued Pretraining." arXiv:2512.12384. +- Sounack, T., et al. (2025). "BioClinical ModernBERT: A Domain-Adapted Encoder for Biomedical and Clinical NLP." arXiv:2506.10896. +- Luo, Z., et al. (2025). "Patent ModernBERT: A Pretrained Language Model for Intellectual Property." arXiv:2509.14926. +- Dao, T. (2024). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." *Proceedings of ICLR 2024*. +- Ringel, D.M. (2023). "Creating Synthetic Experts with Generative Artificial Intelligence." arXiv:2310.15560. diff --git a/docs/NARRATIVE.md b/docs/NARRATIVE.md index e37bde7..39cc0ff 100644 --- a/docs/NARRATIVE.md +++ b/docs/NARRATIVE.md @@ -46,46 +46,27 @@ submissions.zip → scan for 8-K Item 1.05 → download HTML → extract → seg Every filing's HTML is different. The same logical content looks completely different depending on the tool that generated the HTML: -- **Word splitting from inline elements.** XBRL and styling tags break words mid-token: `Item 2` renders correctly in a browser but parses as "Item2" in code. Same with `cybersecurity`. Required detecting adjacent inline element boundaries and inserting spaces selectively. - +- **Word splitting from inline elements.** XBRL and styling tags break words mid-token: `Item 2` renders correctly in a browser but parses as "Item2" in code. Required detecting adjacent inline element boundaries and inserting spaces selectively. - **CamelCase joins from PDF converters.** PDF-to-HTML tools merge sentences across formatting boundaries: `sentence.Next sentence` instead of `sentence. Next sentence`. Required regex passes to detect missing spaces after punctuation. 
- -- **Page breaks mid-sentence.** Page numbers (`28`, `- 12 -`, `F-3`), running headers (`ACME CORP — ANNUAL REPORT`), and subsidiary headers (`ENTERGY ARKANSAS, LLC AND SUBSIDIARIES`) get spliced into the middle of content paragraphs. Required filtering a catalog of page artifact patterns. - -- **Table of Contents shadowing.** "Item 1C" appears at least twice in every 10-K — once in the Table of Contents and once in the actual content. Using the first match extracts the wrong section. Took several iterations to discover we needed the LAST match — this was a silent failure that produced empty or wrong extractions for hundreds of filings before we caught it. - -- **XBRL tag pollution.** Inline XBRL wraps financial facts in `ix:header`, `ix:references`, and `ix:nonFraction` tags that carry no display content but add noise. Required stripping all `ix:*` tags before text processing. - +- **Page breaks mid-sentence.** Page numbers, running headers, and subsidiary headers get spliced into the middle of content paragraphs. Required filtering a catalog of page artifact patterns. +- **Table of Contents shadowing.** "Item 1C" appears at least twice in every 10-K — once in the Table of Contents and once in the actual content. Using the first match extracts the wrong section. Required the LAST match — a silent failure that produced empty or wrong extractions for hundreds of filings before we caught it. +- **XBRL tag pollution.** Inline XBRL wraps financial facts in `ix:header`, `ix:references`, and `ix:nonFraction` tags that carry no display content but add noise. - **Entity encoding chaos.** ` `, ` `, `“`, `”`, `—`, `–`, `•` — each needs correct decoding, and different filing tools use different entity styles for the same characters. 
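A condensed sketch of the entity-decoding and missing-space repairs described above (standard library only; the patterns are simplified stand-ins for the actual pipeline):

```python
import html
import re

def clean_extracted_text(raw: str) -> str:
    # Decode named and numeric character references (e.g. &nbsp;, &#160;, &ldquo;)
    text = html.unescape(raw)
    # Normalize non-breaking spaces to plain spaces
    text = text.replace("\u00a0", " ")
    # CamelCase joins from PDF converters: "sentence.Next" -> "sentence. Next"
    text = re.sub(r"([.!?])([A-Z])", r"\1 \2", text)
    # Collapse whitespace runs left behind by inline-element splits
    return re.sub(r"\s+", " ", text).strip()
```

The real pipeline applies these passes per generator family; this sketch only shows the order (decode first, then regex repairs) that keeps the heuristics from firing on raw entity text.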
### Paragraph Segmentation After extracting clean section text, splitting into paragraphs had its own challenges: -- **Bullet list merging.** Disclosures frequently use bullet lists ("Our program includes: • risk assessment • vulnerability scanning"). Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless. -- **Continuation line detection.** Sentences split across HTML block elements need rejoining. Heuristic: if the previous block lacks terminal punctuation and the next starts lowercase or with a continuation phrase (`and`, `or`, `including`, `such as`), merge. -- **Length boundaries.** Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries to keep annotation units manageable. -- **Table-based bullet lists and the cascade failure.** Some generators (notably EFiling/XDX) render bullet lists as HTML tables with one `
` elements with spacers), but the information was lost during text extraction. The data quality audit found 2,210 paragraphs with embedded bullet points across the corpus — most from this class of failure. These paragraphs are still classifiable (the models unanimously labeled this example as Incident Disclosure / Specificity 4), but the text quality is degraded. +- **Bullet list merging.** Disclosures frequently use bullet lists. Bullets need to be merged with their intro sentence; a standalone "• vulnerability scanning" is meaningless. +- **Continuation line detection.** Sentences split across HTML block elements need rejoining. +- **Length boundaries.** Under 20 words → likely a header (filtered). Over 500 words → split at sentence boundaries. +- **Table-based bullet lists and the cascade failure.** Some generators render bullet lists as HTML tables with non-standard bullet characters. Since `stripHtml()` doesn't recognize `·` as a bullet marker, the merge logic never fires, causing multi-element run-on paragraphs. Found 2,210 paragraphs affected. ### 8-K Extraction **Roadblock: EDGAR full-text search misses filings.** The EFTS keyword search doesn't reliably return all cybersecurity 8-Ks. Post-May 2024, companies moved non-material disclosures from Item 1.05 to Items 8.01 or 7.01. -**Resolution:** Built `scan-8k-items.py` to scan the SEC's bulk `submissions.zip` deterministically — a gap-free scan of every 8-K with cybersecurity content. Tries items in priority order (1.05 → 8.01 → 7.01), skips cross-reference stubs. Result: **207 cybersecurity incident 8-K filings** identified — a complete inventory. - -### Paragraph Deduplication - -Each paragraph gets a `textHash` (SHA-256 of normalized text). Deduplication at three levels: - -1. **Within-filing:** Parser artifacts sometimes produce duplicate blocks. Removed by textHash. -2. **Cross-year (same company):** Companies copy-paste identical paragraphs year-to-year. 
Detected but kept — the repetition itself is informative for disclosure quality analysis. -3. **Cross-company boilerplate:** Different companies use identical materiality disclaimers. Detected but kept — these are real Specificity 1 examples. - -**Result:** Only ~27 excess duplicates removed (0.04%). Most textual similarity is legitimate variation. - -### Performance at Scale - -Initial extraction with cheerio (DOM parser) was slow for 9,000 filings. Built `fast-reparse.ts` (regex-only HTML stripping, no DOM) and `parallel-reparse.ts` (16 bun workers in parallel). Also deduplicates amendment filings (keeps latest per CIK×FiscalYear). +**Resolution:** Built `scan-8k-items.py` to scan the SEC's bulk `submissions.zip` deterministically — a gap-free scan of every 8-K with cybersecurity content. Result: **207 cybersecurity incident 8-K filings** identified. ### Corpus Statistics @@ -94,597 +75,32 @@ Initial extraction with cheerio (DOM parser) was slow for 9,000 filings. Built ` - Median ~7 paragraphs per filing - 49,795 paragraphs annotated (after filtering to complete filing metadata) -### Roadblock: Truncated Filings - -Discovered 72 filings (~0.8%) where section boundary detection cut off mid-sentence. A paragraph about CISSP certifications cut mid-sentence looks like vague boilerplate — this would corrupt specificity labels. - -**Resolution:** Exclude from training splits. Filings where the last paragraph doesn't match `/[.!?;")\u201d]\s*$/` are filtered before train/val/test creation. - --- -## Phase 3: Codebook Development - -### Initial Codebook (v1.0) - -Built a detailed labeling codebook (`docs/LABELING-CODEBOOK.md`) grounded in the SEC rule structure. 
Includes: -- 7 category definitions with SEC basis citations, key markers, and example texts -- 4 specificity levels with boundary rules -- 5 category decision rules for common ambiguities -- 5 borderline cases with worked reasoning -- Gold set protocol for human validation - -### Codebook Iteration (v3.0 — 2026-03-29) - -After analyzing 150,000+ Stage 1 annotations and identifying systematic disagreement patterns, we made three major codebook rulings: - -**Ruling A — Materiality Disclaimers:** Paragraphs with explicit materiality assessments ("have not materially affected our business strategy, results of operations, or financial condition") are Strategy Integration, even if boilerplate. A cross-reference to Risk Factors appended to a materiality assessment does not change the classification. Only pure cross-references with no materiality conclusion are None/Other. *This resolved ~1,094 disputed paragraphs.* - -**Ruling B — SPACs and Shell Companies:** Companies explicitly stating they have no operations, no cybersecurity program, or no formal processes receive None/Other regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program. *This resolved ~53 unresolved paragraphs and likely hundreds more.* - -**Ruling C — Person vs. Function Test (Management Role vs. RMP):** This was the single most impactful ruling, addressing the #1 disagreement axis (2,290 disputes). The line: if the paragraph is about the *person* (qualifications, credentials, background, tenure, career history) → Management Role. If it's about what the role/program *does* (processes, activities, tools, frameworks) → Risk Management Process, even if a CISO/CIO/CTO title appears. The test: would the paragraph still make sense if you removed the person's name, title, and credentials? If yes → the paragraph is about the function, not the person. 
- ---- - -## Phase 4: Stage 1 — Synthetic Expert Annotation - -### Tech Stack Decision - -Chose TypeScript + Vercel AI SDK v6 + OpenRouter over Python + LangChain/LiteLLM because: -- Vercel AI SDK provides native structured output with Zod schema validation -- OpenRouter gives single-API access to all candidate models with real cost tracking -- Bun runtime for fast script execution with native TypeScript support -- JSONL-append pattern for crash-safe resume without data loss or duplicate API spend - -### Prompt Engineering (12+ iterations, v1.0 → v2.5) - -This was one of the most time-intensive phases. Key lessons: - -**What worked:** -- Text enum labels ("Firm-Specific") over ordinals ("3") — universal improvement across all models -- Decision-test format ("ask in order, stop at first yes") for specificity — reduced ambiguity -- ✓ IS / ✗ NOT fact lists with explicit examples — the single biggest lever for specificity accuracy. Reduced overrating from 54 to 21 cases. -- Validation step ("review your specific_facts, remove NOT-list items") — caught model self-correction -- 13 calibration examples, each targeting a specific observed failure mode — examples outperformed rules -- Explicit Incident↔Strategy tiebreaker — completely eliminated a 20-case confusion pattern -- `specific_facts` chain-of-thought in the schema — forces the model to enumerate evidence before assigning specificity - -**What didn't work:** -- Adding more rules (v1.2) — confused models, caused regression from 95%→88% category accuracy -- Changing category definitions to structural "TEST:" format (v2.6) — regression -- "COMMON MISTAKES" section (v2.7) — improved consensus but reduced unanimity -- Attempting a Management↔RMP tiebreaker in the prompt (v2.5) — made confusion worse (this was ultimately resolved through the v3.0 codebook ruling instead) - -**Critical lesson: 40-sample pilots were misleadingly optimistic.** Results that looked good at n=40 fell apart at n=500. 
We standardized on 500-sample pilots for all prompt evaluation. - -### The Iteration Trajectory - -Five 40-sample pilots (v1.0, v1.1, v1.2, v2.1, v2.2-n40) followed by six 500-sample pilots (v2.2-v2.7): - -| Version | n | Both Unan | Key Change | Top Confusion Axis | -|---------|---|-----------|-----------|-------------------| -| v2.2 | 500 | 51.4% | First 500-sample baseline | Incident↔Strategy (20 cases) | -| v2.3 | 500 | 59.2% | Tightened Sector-Adapted, expanded IS/NOT lists | Inc↔Strat reduced | -| v2.4 | 500 | 66.8% | Validation step, schema constraint on specific_facts | Mgmt↔RMP emerging | -| **v2.5** | **500** | **70.8%** | Incident↔Strategy tiebreaker, QV calibration examples | **Inc↔Strat eliminated**; Mgmt↔RMP now #1 (17 cases) | -| v2.6 | 500 | 67.8% | Changed defs to "TEST:" format — **regression** | — | -| v2.7 | 500 | 67.6% | Added COMMON MISTAKES section — **regression** | — | - -The most dramatic single improvement: v2.5's Incident↔Strategy tiebreaker ("DESCRIBES what happened → Incident; ONLY discusses cost/materiality → Strategy") completely eliminated what had been the #1 confusion axis at v2.2 (20 cases → 0). This is a case where a single well-targeted rule outperformed broad prompt restructuring. - -v2.5 was locked as the production prompt. v2.6 and v2.7 demonstrated that the prompt had reached its practical ceiling — further structural changes caused regressions. The remaining disagreements (Management↔RMP, specificity boundaries) turned out to be codebook ambiguities and model-capacity issues, not prompt failures. - -### The Original Panel and the Nano Problem - -The initial Stage 1 panel was: -- `google/gemini-3.1-flash-lite-preview` -- `openai/gpt-5.4-nano` -- `x-ai/grok-4.1-fast` - -GPT-5.4-nano was chosen for its low cost and the assumption that even a small model could handle structured classification with a good enough prompt. This assumption was wrong. 
-
-**The problem: nano wasn't thinking.** During pilot testing, we discovered nano produced **zero reasoning tokens 64% of the time**. When it did reason, the output was minimal (34,356 reasoning tokens in total across 500 paragraphs, vs grok's 336,993). Without reasoning, nano's classifications were essentially pattern-matching on surface features — it couldn't apply the multi-step decision logic the codebook requires (enumerate facts, filter against IS/NOT lists, count QV-eligible items, apply threshold).
-
-**The symptoms:**
-- **Erratic specificity** — nano was simultaneously too conservative on some axes ([1,3,3] disagreements — 21 cases where nano said Generic when gemini+grok said Firm-Specific) and too liberal on others ([3,3,4] — 11 cases where nano said Quantified when the others said Firm-Specific). No prompt change fixed this because it's a model-level capacity issue: without reasoning tokens, the decision test can't execute properly.
-- **Lowest pairwise agreement** — gemini×grok agreed on 95.6% of categories and 91.2% of specificity. gemini×nano: 87.4% category, 83.8% specificity. Nano was the consistent outlier.
-- **Dragging down unanimity** — the gemini+grok pair was strong, but nano's disagreements broke unanimity on hundreds of paragraphs that would otherwise have been clean.
-
-Despite 12 prompt iterations (v1.0→v2.7) that improved overall metrics significantly, nano's behavior never stabilized. The prompt was at its practical ceiling for a model that wouldn't reason.
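For reference, the multi-step decision logic nano couldn't execute reduces to a short "stop at first yes" procedure. This is a sketch only — the names (`decideSpecificity`, `onIsList`, `qvEligible`) are illustrative, and the two-fact threshold mirrors the "two hard verifiable facts → Quantified-Verifiable" rule:

```typescript
// Sketch of the specificity decision test (illustrative names, not
// pipeline code). Ask in order, stop at the first yes.
type ExtractedFact = { text: string; onIsList: boolean; qvEligible: boolean };

function decideSpecificity(facts: ExtractedFact[], sectorAdapted: boolean): 1 | 2 | 3 | 4 {
  const kept = facts.filter((f) => f.onIsList); // filter against IS/NOT lists
  const qvCount = kept.filter((f) => f.qvEligible).length;
  if (qvCount >= 2) return 4;  // Quantified-Verifiable: two hard verifiable facts
  if (kept.length > 0) return 3; // Firm-Specific: any IS-list fact survives
  if (sectorAdapted) return 2; // Sector-Adapted: industry practices, no firm detail
  return 1;                    // Generic
}
```

A model that emits no reasoning tokens never enumerates `facts` at all, which is why no prompt wording could make nano run this procedure.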
- -### Smoke Testing: model-probe.ts - -Before running an expensive benchmark, we built `model-probe.ts` to test 9 candidate models on a single paragraph for basic structured output compliance: -- gemini-3.1-flash-lite-preview, grok-4.1-fast, gpt-4.1-mini, gpt-4.1-nano, claude-haiku-4.5, gemini-3.1-flash-preview, deepseek-chat-v3-0324:free, llama-4-maverick, qwen3-235b-a22b - -This caught schema-level incompatibilities (wrong field names, missing fields, invalid enum values) before we spent money on 500-paragraph bench runs. - -### Model Benchmark: 6 Candidates to Replace Nano - -After locking prompt v2.5, we built `model-bench.ts` to formally evaluate nano replacements. Each candidate was benchmarked against the 500-sample pilot set and compared to the existing gemini+grok annotations. - -| Model | Cost/ann | Reasoning Tokens | vs Majority (both) | Cat Outlier | Spec Outlier | Nano→X Delta | -|-------|----------|-----------------|---------------------|-------------|-------------|-------------| -| seed-2.0-lite | $0.00227 | 658 | **88.8%** | 2.2% | 3.8% | +11.6pp | -| **mimo-v2-flash** | **$0.00048** | **1,346** | **86.0%** | **5.0%** | **4.0%** | **+8.8pp** | -| glm-4.5-air | $0.00136 | 854 | 76.2% | 8.8% | 9.6% | +0.8pp | -| minimax-m2.5 | $0.00106 | 590 | 73.8% | 7.9% | 12.7% | -1.0pp | -| mistral-small-2603 | $0.00015 | **0** | 66.8% | 9.2% | 17.6% | -6.8pp | -| nemotron-3-super-120b | $0.00152 | 942 | 57.9% | **21.3%** | **20.7%** | **-16.9pp** | - -**Key findings:** - -- **Reasoning tokens are the strongest predictor of accuracy.** Mistral-small produced literally zero reasoning tokens — not a single one. Its average output was only 136 tokens (vs mimo's 1,463). It had a 17.6% specificity outlier rate. This confirmed that the nano problem wasn't prompt-specific: models that don't reason can't do this task. 
- -- **Price ≠ quality.** Nemotron was the most expensive candidate at $0.00152/annotation with 942 reasoning tokens (it *was* thinking), but thinking badly — 21.3% category outlier rate, worst of any candidate. Only 497/500 completed (3 failures). Replacing nano with nemotron would have been catastrophic: -16.9pp unanimity. - -- **The two mediocre options.** GLM-4.5-air (+0.8pp) and minimax-m2.5 (-1.0pp) neither helped nor hurt. Not worth the switch. - -- **Seed-2.0-lite was technically the best** at 88.8% agreement with majority, but cost 4.7x more than mimo ($0.00227 vs $0.00048) and was 2x slower (21.5s vs 11.4s latency). For 50K+ paragraphs at scale, this cost differential was significant. - -### The Winner: mimo-v2-flash - -Mimo won the slot on value: -1. **Cheapest viable option** — $0.00048/annotation (3x cheaper than most candidates) -2. **Most reasoning tokens** — 1,346 avg (highest of all 6, more than seed-2.0-lite) -3. **Lowest outlier rate** — 5.0% category, 4.0% specificity -4. **+8.8pp unanimity improvement** over nano -5. **93.4% category agreement with grok** — strongest pairwise alignment of any candidate - -**Roadblock: Mimo schema quirks.** Mimo produced non-standard outputs: capitalized confidence labels ("High" instead of "high"), numeric confidence values (0.9 instead of "high"), and flat string arrays instead of structured `{fact, type}` objects for specific_facts. Rather than trying to fix this with prompting (which would waste tokens and might break other behavior), we fixed it with Zod schema transforms — `.transform()` to normalize casing and map numbers to labels, `.union()` to accept both structured and flat fact formats. This took ~30 minutes to implement and handled all edge cases automatically. - -A dedicated `mimo-pilot.ts` script modeled the full "replace nano with mimo" scenario before committing to the panel change. 
- -**Final Stage 1 panel:** -- `google/gemini-3.1-flash-lite-preview` -- `xiaomi/mimo-v2-flash` ← replaced `openai/gpt-5.4-nano` -- `x-ai/grok-4.1-fast` - -### Production Run Results - -Completed 2026-03-28. **150,009 annotations** (50,003 paragraphs × 3 models), **$115.88 total cost**, **0 failures**. - -| Metric | Value | -|--------|-------| -| Both-unanimous | 35,204 (70.7%) | -| Majority agreement | 14,182 (28.5%) | -| Unresolved (3-way split) | 409 (0.8%) | -| Total cost | $115.88 | -| Failures | 0 | - ---- - -## Phase 5: Post-Stage 1 Analysis — Discovering Systematic Patterns - -After the production run, we conducted a deep distributional analysis of disagreement patterns. This analysis fundamentally changed our approach to Stage 2. - -### Model Bias Discovery - -Each model has systematic, quantifiable biases: - -| Model | Category Outlier Rate | Specificity Outlier Rate | Key Bias | -|-------|----------------------|--------------------------|----------| -| Mimo | **48.1%** | 32.5% | Over-classifies as Third-Party Risk; under-rates Spec 4 (74.3% of Spec 4 outlier cases) | -| Gemini | 30.9% | **45.7%** | Over-classifies as Management Role (81.1% in Mgmt↔RMP disputes); inflates specificity | -| Grok | 21.0% | 21.8% | Most moderate; slight RMP bias | - -These biases are not random — they're predictable by model and confusion axis. This opened the possibility of model-calibrated majority voting (using the known biases to assess when the majority is likely correct). - -### Key Distributional Findings - -1. **Management Role is the disaster category** — only 51.5% unanimous (every other category is 62-79%). Nearly half of all Management Role paragraphs need resolution. -2. **Spec 4 (Quantified-Verifiable) is the disaster specificity** — only 37.6% unanimous. Models can't agree on what counts as "quantified." -3. **Stage 1 confidence is completely useless** — 95.4% of paragraphs report all-high category confidence. Zero all-low cases. 
The cheap models are systematically overconfident. -4. **Specificity is effectively a 3-level scale** — Spec 2 (Sector-Adapted) is rarely disputed (82.1% unanimous). The contested boundaries are [1,3] (3,742 disputes) and [3,4] (2,898 disputes) with almost nothing at [1,2] or [2,3]. -5. **Longer paragraphs are harder** — Q5 word count (>134 words): 64.1% unanimous vs Q1 (≤51 words): 76.3%. -6. **Small companies (1-3 paragraphs) are noise-prone** — 50.0% unanimous, 10.5% unresolved. Almost all are SPACs or shell companies with non-standard disclosures. - -### Top Disagreement Axes - -| Axis | Disputes | Pattern | -|------|----------|---------| -| Management Role ↔ RMP | 2,290 | Paragraph describes processes but names CISO/CIO | -| RMP ↔ Third-Party Risk | 1,475 | Mimo over-classifies vendor mentions as Third-Party | -| None/Other ↔ Strategy Integration | 1,094 | Materiality disclaimers — genuinely ambiguous in codebook | -| Board Governance ↔ Management Role | 867 | Paragraphs at the board-management interface | -| Spec [1,3] boundary | 3,742 | NOT-list items counted as specific facts | -| Spec [3,4] boundary | 2,898 | Gemini counts roles as QV-eligible; Mimo downgrades | - -### Insight: Reading the Actual Paragraphs - -We sampled 20 paragraphs across the 4 hardest dispute types and read them in full. Patterns emerged: - -- **Management↔RMP:** Every example follows the same structure — a process-focused paragraph that names a CISO/CIO in the opening attribution. The paragraph's content is about what the program does, not who the person is. The v3.0 "person-vs-function" ruling directly addresses this. -- **None/Other↔Strategy:** All 5 sampled paragraphs are "no material incidents" boilerplate. Every single one. The materiality disclaimer ruling resolves this entirely. -- **Spec [3,4]:** Gemini counts "20 years of experience" + "CISO" as 2 QV facts → Spec 4. Grok/Mimo correctly exclude named roles from QV counting → Spec 3. 
The rule exists in the prompt but Gemini ignores it. -- **Small company unresolved:** All SPACs or blank check companies with "we have no operations" disclaimers. The SPAC ruling handles these. - ---- - -## Phase 6: Stage 2 — Judge Model Evaluation - -### Gold Label Construction - -Built a 50-paragraph gold set using 3 independent Sonnet agents: -- Agent A: paragraphs 0-24 -- Agent B: paragraphs 25-49 -- Agent C: all 50 as cross-check -- Adjudicator agent resolved 11 disputes with detailed reasoning -- Inter-annotator agreement: 94% category, 84% specificity, 78% both - -**Lesson learned: majority vote ≠ ground truth.** Initially scored judges against Stage 1 majority, which made gemini-3-flash look great (86% category match). Scoring against gold labels revealed it added zero value — it was rubber-stamping the majority. Always evaluate against adjudicated gold labels. - -### Judge Model Benchmarking (8 candidates) - -| Model | Mode | n | Cat | Spec | Both | Fails | Cost/call | -|-------|------|---|-----|------|------|-------|-----------| -| Majority vote | — | 50 | 78.0% | 80.0% | 60.0% | 0% | $0 | -| gpt-5.4-mini | structured | 50 | 88.0% | 80.0% | 68.0% | 0% | $0.0046 | -| GLM-5 v2 | structured | 48 | 87.5% | 89.6% | 77.1% | 4% | $0.0078 | -| GLM-5 v4 | structured+req_params | 44 | 90.9% | 88.6% | 79.5% | 12% | $0.0083 | -| GLM-5 v3 | tool calling | 50 | 84.0% | 82.0% | 72.0% | 0% | $0.0070 | - -### Roadblock: GLM-5 Structured Output Failures - -GLM-5 had the best accuracy (77-80% both-correct) but a 6-12% structured output failure rate. The model intermittently wraps JSON in markdown code blocks. - -**Investigation:** Built diagnostic scripts (`judge-diag.ts`, `judge-diag-batch.ts`) to isolate the issue. Tested all 9 failing paragraphs × 2 attempts each. Found 72% success rate, all from the same model variant (`z-ai/glm-5-20260211`). The best OpenRouter provider (Ambient) has a 6% base error rate. This is a model-level behavior, not provider-specific. 
- -**Attempted fixes:** -- Bumped validation retries from 1 to 3 → reduced failures from 18% to ~4-12% -- Tool calling mode → 0% failures but accuracy dropped ~7pp (72% both). Enum constraints not enforced, `undefined` categories appear. -- `provider: { require_parameters: true }` in OpenRouter → no effect -- Exacto routing → no effect - -**Resolution:** Accepted as a model-level constraint. Production strategy will use the best model with retry logic and fall back to a reliable model (gpt-5.4-mini) for persistent failures. - -### Judge Prompt Iteration (v1 → v2) - -Built a dynamic judge prompt (`buildJudgePrompt()`) with: -- **Disagreement diagnosis:** Tells the judge exactly what's in dispute and the vote distribution -- **Targeted disambiguation rules:** 7 category guidance blocks + 2 specificity guidance blocks, dynamically included only when relevant to the specific dispute -- **Structured analysis steps:** Critique each annotator → enumerate IS-list facts → determine dominant purpose → decide -- **Confidence calibration:** HIGH/MEDIUM/LOW mapped to codebook clarity, used as training weights -- **Anti-bias:** Fisher-Yates shuffle of annotator order - -**Results:** Category accuracy improved +10pp over majority vote for both models. Specificity improved +9.8pp for GLM-5 but stayed flat for gpt-5.4-mini. The disambiguation rules work well for category but specificity needs the codebook v3.0 changes. - -### Key Finding: Judge Confidence Is Highly Predictive - -| Confidence | GLM-5 Both-Correct | gpt-5.4-mini Both-Correct | -|------------|--------------------|----| -| High | 82-84% | 80.6% | -| Medium | 25-50% | 35.7% | - -This enables confidence-stratified training data: high-confidence judge labels get full weight; medium/low are downweighted or excluded. - ---- - -## Phase 7: Revised Data Quality Strategy - -The post-Stage 1 analysis and judge benchmarking led to a fundamental reassessment of our approach. 
- -### The Key Realization - -The best judge (77% both-correct) barely beats the raw majority vote (78% category, 80% specificity). Judging all 14,591 disputed paragraphs at 77% accuracy doesn't meaningfully improve on the majority. The judge's real value is concentrated in two places: -1. The 409 unresolved paragraphs where no majority exists -2. Cases where we have specific reason to doubt the majority - -### The Revised Plan - -**Phase 0: Codebook rulings (completed)** — Three rulings that resolve thousands of disputes at zero inference cost: materiality disclaimers → Strategy Integration, SPACs → None/Other, person-vs-function test for Management↔RMP. - -**Phase 1: Model-calibrated majority resolution** — For the 14,182 majority-agreement paragraphs, apply calibration using known model biases. When the known-biased model is the outlier on a known axis → trust majority. Flag anomalous cases for judge resolution. Expected to auto-resolve ~10,000-12,000 paragraphs. - -**Phase 2: Human gold set (1,200 paragraphs)** — Assignment requires 1,200 human-labeled paragraphs. Building a quiz-gated labeling web tool that enforces codebook knowledge before each session. Stratified sampling to ensure all categories, specificity levels, and confusion axes are represented. This becomes the calibration metric for all further work. - -**Phase 3: Judge prompt iteration** — Update judge prompt to mirror codebook v3.0 rulings. Add worked examples from the 11 gold adjudications. Iterate against expanded gold set. Target: 85%+ both-correct. - -**Phase 4: Production judge run** — Judge only the ~3,000-5,000 genuinely hard cases (unresolved + flagged majority + "both" disputes). Two models for cross-validation on the hardest cases. - -**Phase 5: Training data assembly** — Confidence-stratified tiers: - -| Tier | Source | Est. 
Accuracy | Paragraphs | Treatment | -|------|--------|--------------|------------|-----------| -| T1 | Both-unanimous | ~97% | 35,204 | Full weight | -| T2 | Calibrated majority | ~85-90% | ~9,000-12,000 | Full weight | -| T3 | Judge high-confidence | ~84% | ~2,000-3,000 | Full weight | -| T4 | Judge medium-confidence | ~40% | ~500-1,000 | Downweight (0.5) or soft labels | -| T5 | Judge low / failure / excluded | ??? | ~500-1,000 | Exclude | - -Expected total: ~46,000-48,000 paragraphs at ~93-95% label accuracy. - ---- - -## Phase 8: Human Labeling Webapp (Labelapp) - -### Why Build a Webapp? - -The project requires 1,200 human-labeled paragraphs as a gold holdout set — the calibration metric for everything downstream. Six student annotators, three per paragraph, 600 per person. The labels need to be reliable enough to benchmark the GenAI pipeline and validate the final classifier. - -The alternative was everyone tagging in a shared JSON file or spreadsheet. That would almost certainly produce poor data quality. The failure modes are well-documented in annotation literature and we'd hit all of them: - -- **Inconsistent category names.** Free-text entry in a spreadsheet means "Risk Management Process" vs "Risk Mgmt" vs "RMP" vs "3" — all referring to the same class but requiring manual reconciliation. -- **Skipped or double-labeled paragraphs.** No enforced assignment tracking means annotators can accidentally skip paragraphs or label the same one twice without anyone noticing until export. -- **No codebook enforcement.** The labeling codebook has 7 categories, 4 specificity levels, 5 decision rules, and 3 codebook rulings (v3.0). Without quiz gating, annotators can start labeling without understanding the materiality disclaimer ruling, the person-vs-function test, or the QV counting threshold — exactly the boundaries where annotation quality lives or dies. 
-- **No feedback loop.** In a spreadsheet, an annotator who misunderstands the SPAC ruling labels 600 paragraphs before anyone catches it. A webapp with warmup feedback catches misunderstanding in the first 5 paragraphs. -- **No timing data.** For the writeup, we need per-paragraph labeling times to report annotator effort and identify paragraphs that are disproportionately hard. A spreadsheet gives you nothing; even a basic timer gives you wall-clock time corrupted by idle periods. - -A purpose-built labeling tool turns all of these failure modes into solved problems. Constrained radio buttons eliminate typos. Server-side assignment tracking prevents skips and duplicates. Quiz gating enforces codebook knowledge. Warmup paragraphs with gold feedback catch misunderstandings early. Active timing with idle detection gives clean data for the writeup. - -### The Onboarding Funnel - -Every annotation session follows the same enforced path: - -1. **Login** → annotator selects their name, enters password. Session cookie (HMAC-SHA256 signed, 8-hour expiry). -2. **Dashboard** → shows progress, links to training materials or labeling. -3. **Quiz** → 8 questions (2 per type), random draw from a bank of ~30. Four question types target the exact codebook boundaries that cause the most disagreement in the GenAI pipeline: - - **Person-vs-function** (Management Role vs RMP) — the #1 disagreement axis (2,290 disputes in Stage 1) - - **Materiality disclaimers** (Strategy Integration vs None/Other) — resolved ~1,094 disputes via codebook ruling - - **QV fact counting** (Specificity 3 vs 4) — the hardest specificity boundary - - **SPAC exception** (None/Other for shell companies) - - Pass threshold: 7/8 correct. Immediate feedback with codebook explanation after each answer. Failed → review mistakes → retry. -4. **Warmup** → 5 pre-selected paragraphs with known gold labels. Identical UI to real labeling, but after submit, the annotator sees the gold answer + explanation. 
This catches systematic misunderstandings before they contaminate 600 labels. -5. **Labeling** → the real thing. 600 assigned paragraphs per annotator. - -The quiz questions are not random trivia — they're targeted at the exact confusion axes that the GenAI pipeline struggles with. If an annotator can't reliably distinguish Management Role from RMP, their labels on that axis are noise. Better to catch that before they start than after. - -### Labeling Interface Design - -The labeling UI prioritizes speed and consistency: - -- **Paragraph display:** Full text with filing metadata badges (company, ticker, filing type, date, SEC item) in the header bar. -- **Constrained input:** Radio buttons for both category (7 options) and specificity (4 options). No free-text entry for classifications. -- **Keyboard shortcuts:** 1-7 for category, Q/W/E/R for specificity, N to focus notes, Enter to submit. An experienced annotator never touches the mouse. -- **Codebook sidebar:** Floating button opens a slide-out panel with all category definitions, IS/NOT lists, specificity levels, and decision rules. Always one click away — annotators don't need to switch to a separate document. -- **Progress bar:** Shows completed/total in the header. Annotators know where they stand. -- **Notes field:** Optional free-text for edge cases or uncertainty. Useful for adjudication — if an annotator flags "this could be either Management Role or RMP, went with RMP because the person-vs-function test says..." that reasoning helps the adjudicator. - -### Sampling Strategy - -The 1,200 paragraphs are not randomly sampled. Random sampling from 50K paragraphs would over-represent the easy cases (Board Governance at Specificity 1 is unambiguous) and under-represent the hard cases that actually test annotation quality. 
- -Instead, the sampling is stratified by the disagreement patterns discovered in the Stage 1 analysis (Phase 5): - -| Stratum | Count | Why | -|---------|-------|-----| -| Management ↔ RMP split votes | 120 | #1 disagreement axis — validates the person-vs-function ruling | -| None/Other ↔ Strategy splits | 80 | Materiality disclaimer boundary | -| Specificity [3,4] splits | 80 | QV counting — the hardest specificity boundary | -| Board ↔ Management splits | 80 | Board/management interface | -| Rare category guarantee | 120 | ≥15 per category, extra for Incident Disclosure (sparse) | -| Proportional stratified random | 720 | Fill remaining from category × specificity cells | - -This ensures the gold set is informative where it matters most: at the decision boundaries where both humans and models are most likely to disagree. - -### Assignment: Balanced Incomplete Block Design (BIBD) - -Each paragraph gets exactly 3 of 6 annotators. The assignment uses a balanced incomplete block design: - -- C(6,3) = 20 unique triples. Assign 60 paragraphs to each triple. -- Each annotator appears in C(5,2) = 10 triples → 10 × 60 = 600 paragraphs per person. -- Every annotator pair shares equal paragraph overlap → pairwise Cohen's Kappa is statistically valid across all 15 pairs. - -This is important for the writeup: we can report inter-rater reliability as a full pairwise matrix, not just an average that hides weak pairs. - -### Active Timer and Idle Detection - -The initial implementation tracked raw wall-clock `duration_ms` per label — `Date.now()` when the paragraph loaded, minus `Date.now()` at submit. This is corrupted by any idle time (annotator walks away, checks email, gets coffee). - -We added `useActiveTimer`, a React hook that tracks active vs idle time using mouse/keyboard/scroll/focus events with a 30-second idle threshold. When no activity is detected for 30 seconds, the timer pauses and the header shows an amber "idle" indicator. 
Both `duration_ms` (wall-clock) and `active_ms` (idle-excluded) are submitted with every label. - -For the writeup, `active_ms` is the metric to report — it reflects actual cognitive effort per paragraph. `duration_ms` is retained for completeness. Pre-existing labels (before the timer change) have `active_ms = NULL` and are excluded from timing analysis. - -### Infrastructure Decisions - -**Stack:** Next.js (App Router) + Drizzle ORM + Postgres + Tailwind + shadcn/ui. Deployed via Docker with a Postgres sidecar. - -**Migrations:** Switched from `drizzle-kit push --force` (schema diffing at startup) to file-based Drizzle migrations (`drizzle-kit generate` + `drizzle-kit migrate`). A `scripts/ensure-migration-baseline.ts` script handles the transition for existing databases by seeding the migration journal with the baseline hash. - -**Monorepo:** The labelapp triggered converting the repo to a Bun workspace monorepo with shared Zod schemas (`packages/schemas/`). This ensures the labelapp's category/specificity enums are identical to the GenAI pipeline's — no possibility of a mismatch between what the models label and what the humans label. - -### Adjudication - -After all 3 annotators label a paragraph: -- **3/3 agree** on both dimensions → consensus (no intervention needed) -- **2/3 agree** on both dimensions → majority rules -- **Otherwise** → flagged for admin adjudication - -The admin page shows disputed paragraphs with all 3 labels side-by-side, annotator notes, and Stage 1 consensus for reference. The adjudicator picks a label, enters a custom one, or marks it for team discussion. Adjudications are stored separately from labels for audit trail. 
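The consensus/majority/adjudicate split amounts to counting joint agreement on both dimensions. A sketch (an illustrative helper, not the actual labelapp code):

```typescript
type HumanLabel = { category: string; specificity: number };

// Sketch of the 3-annotator resolution rule: agreement is counted on
// category AND specificity jointly, not per-dimension.
function resolveLabels(labels: HumanLabel[]): "consensus" | "majority" | "adjudicate" {
  const counts = new Map<string, number>();
  for (const l of labels) {
    const key = `${l.category}|${l.specificity}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const top = Math.max(...counts.values());
  if (top === 3) return "consensus";  // 3/3 agree on both dimensions
  if (top === 2) return "majority";   // 2/3 agree -> majority rules
  return "adjudicate";                // no pair agrees -> admin adjudication
}
```

Note the joint key: two annotators agreeing on category but splitting on specificity do not form a majority, which is exactly when adjudication is needed.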
- -### Key Technical Artifacts - -| Artifact | Location | -|----------|----------| -| Implementation plan | `docs/labelapp-plan.md` | -| Agent guide | `labelapp/AGENTS.md` | -| Database schema | `labelapp/db/schema.ts` | -| Active timer hook | `labelapp/hooks/use-active-timer.ts` | -| Labeling UI | `labelapp/app/label/page.tsx` | -| Quiz questions | `labelapp/lib/quiz-questions.ts` | -| Warmup paragraphs | `labelapp/lib/warmup-paragraphs.ts` | -| BIBD assignment generator | `labelapp/lib/assignment.ts` | -| IRR metrics (Kappa, Alpha) | `labelapp/lib/metrics.ts` | -| Stratified sampling | `labelapp/lib/sampling.ts` | -| Baseline migration | `labelapp/drizzle/0000_baseline.sql` | -| Migration transition script | `labelapp/scripts/ensure-migration-baseline.ts` | -| Docker entrypoint | `labelapp/entrypoint.sh` | - -### Opus Golden Labeling - -With the human gold set nearing completion, we added a parallel labeling pass using Claude Opus 4.6 as an additional expert annotator. The motivation is empirical: the GenAI pipeline's Stage 1 consensus + Stage 2 judge combination has shown strong alignment with the codebook throughout development, and Opus represents a significant capability jump over the models used in Stages 1 and 2. Having an independent Opus annotation for every gold-set paragraph gives us a third perspective alongside the human labels and the existing pipeline labels — useful for adjudication, for measuring human-vs-model agreement, and as an upper bound on what automated annotation can achieve. - -**Implementation:** Rather than routing through OpenRouter (which would cost ~$27-80 depending on the model), we used the Claude Agent SDK (`@anthropic-ai/claude-agent-sdk`) to call Opus 4.6 through the existing Claude Code subscription. 
The Agent SDK's `query()` function accepts a custom system prompt and structured output schema, so we configured it as a fully isolated classifier: no tools, no hooks, no settings, no session persistence — just a system prompt and a JSON schema response. - -**Key design decisions:** - -1. **Full codebook as system prompt.** The Stage 1/2 pipeline uses a condensed v2.5 operational prompt (~4KB). For Opus, we feed the entire labeling codebook (`docs/LABELING-CODEBOOK.md`, ~42KB) plus the operational prompt plus the JSON output schema. Opus has the context window and reasoning depth to actually use the worked examples, borderline cases, and decision rules that cheaper models would ignore. - -2. **Reasoning traces saved.** Opus's adaptive thinking produces step-by-step codebook application (e.g., "Count QV-eligible facts: specific date (2020), 24 years (quantified)... two hard verifiable facts → Quantified-Verifiable"). These are saved in the `golden.thinking` field alongside each annotation — valuable both for adjudication and for understanding where the codebook's boundaries create ambiguity. - -3. **Raw confidence preserved.** Opus returns numeric confidence (0-1) rather than the categorical high/medium/low that cheaper models produce. We save the raw values (`golden.rawCategoryConfidence`, `golden.rawSpecificityConfidence`) before coercing them through the existing `Confidence` transform. This gives a finer-grained signal for weighting or analysis. - -4. **Serial execution at 1 req/s.** The Claude Code subscription has rate limits, so the batch runs serially with a 1-second delay between requests. At ~4 paragraphs/minute (including Opus thinking time), the full 1,200-paragraph set completes in ~5 hours. Crash-safe JSONL checkpoint resume means it can be interrupted and restarted without re-running completed paragraphs. 
- -**Output:** `data/annotations/golden/opus.jsonl` — standard `Annotation` records (compatible with the existing pipeline) plus a `golden` block containing thinking traces, raw confidence values, and the model's specific fact extractions. The `provenance.promptVersion` is tagged `v2.5+codebook` to distinguish from standard Stage 1/2 annotations. - ---- - -## Phase 9: Pre-Training Strategy — DAPT + TAPT - -### The Decision: Own Filings Over PleIAs/SEC - -For domain-adaptive pre-training (DAPT), we needed a corpus of clean SEC filing text. Two options: - -1. **PleIAs/SEC** (373K full 10-K texts on HuggingFace, going back years, CC0 license) — massive but uncleaned, and a single training pass on ~18B tokens would take weeks on a single RTX 3090. -2. **Our own ~9,000 cached filings** (FY2023-2024, HTML already downloaded during extraction) — smaller but recent, relevant, and we already have the HTML cleaning pipeline. - -We chose option 2. The reasoning: - -- **Recency > volume.** Item 1C didn't exist before FY2023. The cybersecurity disclosure vocabulary, boilerplate patterns, and regulatory framing are all new to this filing cycle. Pre-2023 filings teach the model general SEC language, which ModernBERT already knows from its general pre-training. The marginal value of historical filings is low for our specific task. -- **The scaling laws paper says stop early.** SEC filing scaling laws (arXiv:2512.12384) show the largest DAPT gains in the first 200M tokens, with diminishing returns after. Our 9,000 full filings yield ~450M tokens — already in the sweet spot. -- **We control the cleaning quality.** Our `stripHtml()` pipeline handles all the HTML artifacts we fought during extraction (XBRL tags, entity encoding, page breaks, inline element word splits). PleIAs/SEC is a black box — we'd need to audit it anyway. -- **Feasibility on a 3090.** 450M tokens: ~2-3 days. 18B tokens: weeks. Single GPU means we need to be strategic about compute allocation. 
- -The DAPT corpus preparation is simple: run the existing `stripHtml()` on cached filing HTML (full text, skipping the Item 1C section extraction step) and output clean text as sharded JSONL. - -### Adding TAPT: "Don't Stop Pretraining" - -Gururangan et al. (2020) "Don't Stop Pretraining" demonstrated that task-adaptive pre-training (TAPT) — continued MLM on the unlabeled task data specifically — gives consistent gains on top of DAPT, especially when the task distribution differs from the broader domain. - -Item 1C is a very specific subset of SEC filings. It has its own vocabulary (CISO, NIST CSF, tabletop exercises, materiality assessments), structure (governance → management → process → strategy is a common paragraph sequence), and boilerplate patterns that differ substantially from the rest of a 10-K. TAPT teaches the model this specific distribution before we ask it to classify. - -The cost is negligible: our 72K paragraphs from `paragraphs-clean.jsonl` are already clean text (~5-10M tokens). TAPT takes 2-3 hours on a 3090 — essentially free compared to DAPT. - -### The Training Pipeline - -``` -ModernBERT-large (base, 395M params) - → DAPT on 9K full 10-K filings (~450M tokens, ~2-3 days) → SEC-ModernBERT-large - → TAPT on 72K Item 1C paragraphs (~10M tokens, ~2-3 hours) → SEC-cyBERT-large - → Fine-tune on labeled data with dual classification heads → Final classifier -``` - -This gives us clean ablation rows: base → +DAPT → +TAPT → +SCL, isolating the contribution of each step. - ---- - -## Phase 10: Data Quality Audit and Corpus Remediation +## Phase 3: Data Quality Audit and Corpus Remediation ### The Discovery -While preparing the DAPT corpus, we discovered that the paragraph data was less clean than we assumed. 
The extraction pipeline had been built to handle the worst HTML artifacts (word splits, XBRL tags, page breaks), but two systematic issues had been silently corrupting the training data: +While preparing the DAPT corpus, we discovered two systematic issues silently corrupting the data: -1. **Orphan words.** HTML source wraps text at fixed column width. When a `` tag consumes most of a line, only the first word fits before the source newline. `stripHtml()` preserved that newline, and the paragraph segmenter dropped the single-word fragment. Result: paragraphs like "sole executive officer and director is responsible for..." instead of "Our sole executive officer..." — 4.7% of all paragraphs. +1. **Orphan words.** HTML source wraps text at fixed column width. When a `` tag consumes most of a line, only the first word fits before the source newline. 4.7% of all paragraphs affected. +2. **Inlined section headings.** 22% of paragraphs had section titles prepended to body text — a near-perfect predictor of `content_category` that creates shortcut learning risk. -2. **Inlined section headings.** The paragraph segmenter didn't strip sub-section headings ("Risk Management and Strategy", "Board Oversight") from paragraph body text. These headings became the first "sentence" of the paragraph. Result: 22% of paragraphs had section titles prepended to body text — a near-perfect predictor of `content_category` that creates shortcut learning risk. +### Generator Investigation -### The Generator Investigation - -Initial quality metrics showed 45% of filings in an "UNKNOWN" generator bucket. This felt wrong — SEC HTML comes from identifiable tools. We investigated and identified **14 distinct filing generators** covering 99.99% of 14,759 HTML files using meta tags, comments, namespace declarations, CSS patterns, and CIK-based filing agent lookup. 
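A fingerprinting pass of this kind can be sketched as follows. The patterns here are simplified placeholders, not the production pattern set (the real detector also uses namespace declarations, CSS patterns, and CIK-based filing-agent lookup to cover all 14 generators):

```python
import re

# Illustrative fingerprints only; real signatures are richer and more numerous.
GENERATOR_PATTERNS: list[tuple[str, re.Pattern]] = [
    ("Workiva", re.compile(r"\bwdesk\b|workiva", re.I)),
    ("GoFiler/Novaworks XDX", re.compile(r"gofiler|novaworks", re.I)),
    ("Donnelley", re.compile(r"donnelley", re.I)),
]


def detect_generator(html: str) -> str:
    """Classify a filing's HTML generator from signatures near the top of the file."""
    head = html[:8192]  # generator meta tags and comments appear early in the document
    for name, pattern in GENERATOR_PATTERNS:
        if pattern.search(head):
            return name
    return "UNKNOWN"
```

The design point is that each match is cheap and the UNKNOWN bucket shrinks monotonically as fingerprints are added, which is how the hidden worst-quality generator surfaced.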
- -The investigation revealed that the worst-quality generator, **EFiling/EDGAR Agent (GoFiler/Novaworks XDX)**, had been hidden in the UNKNOWN bucket. It accounts for 13.5% of all filings but produces 36.8% orphan word rate (8x corpus average), the lowest paragraphs-per-filing (5.7 vs 7.7 avg), and 5.9% fragment rate. The second worst, **CompSci Transform** (6% of filings), had a 14.8% orphan word rate. - -By contrast, the clean generators — Workiva (24.3%), Donnelley (15.8%), and Inline XBRL (16.4%) — all had <1% orphan word rates. Over 70% of paragraphs came from clean generators. The problem was concentrated, not uniform. - -Full generator reference: `docs/EDGAR-FILING-GENERATORS.md`. Full audit findings: `docs/DATA-QUALITY-AUDIT.md`. +Identified **14 distinct filing generators** covering 99.99% of 14,759 HTML files. The worst generator (EFiling/EDGAR Agent) accounted for 13.5% of filings but produced a 36.8% orphan word rate (8x corpus average). Clean generators (Workiva, Donnelley, Inline XBRL) all had <1% rates. Full reference: `docs/EDGAR-FILING-GENERATORS.md`. ### Six Surgical Patches -All fixes follow the same principle: `paragraphs-clean.jsonl` is **frozen** — never modified. All fixes go through separate `.patched.jsonl` files. Annotations link by paragraph UUID, which never changes. Every patch is documented with scope, method, and validation. +All fixes follow the principle: `paragraphs-clean.jsonl` is **frozen**. All fixes go through `.patched.jsonl` files linked by paragraph UUID. -| Patch | Method | Paragraphs | Annotated | -|-------|--------|-----------|-----------| -| 1-2. Orphan word restoration | HTML lookback: find paragraph text in stripped HTML, extract preceding word | 2,233 | 1,537 | -| 3. Heading strip (space separator) | Pattern match against 71 known Item 1C sub-headings | 7,514 | 5,013 | -| 4. Heading strip (colon separator) | "Heading Text: Sentence..." patterns | 370 | 227 | -| 5. Heading strip (period/dash/caps) | Extended separator detection | 184 | 133 | -| 6. HTML-confirmed headings | Bold/underline/h-tag extraction from source HTML, validated against paragraph starts | 343 | 270 | -| **Total** | | **8,411 headings + 2,233 orphans** | **~7,100 of 49,795 (14.3%)** | - -The heading detection required five progressive passes because no single heuristic caught all separator styles. The HTML-confirmed pass (Patch 6) used a 32-worker parallel extraction script to scan 6,341 filings in 1.7 seconds, caching styled headings per filing for reuse. - -### Orphan Word Re-Annotation - -The orphan word patches weren't just cosmetic. Analysis revealed **label bias** in orphan-word paragraphs: -- Strategy Integration 1.55x over-represented (16.1% vs 10.4% baseline) -- Management Role 0.49x under-represented -- Board Governance 0.60x under-represented - -Missing subject words like "Our", "We", "The" strip governance context that models rely on for classification. This suggested the original annotations on these paragraphs might be systematically wrong. - -**Decision: re-run Stage 1 on patched text.** Cost: $3.30 for 4,611 annotations (1,537 paragraphs × 3 models), completed in ~9 minutes at 60 concurrency with zero failures. - -**Results:** -- **119 paragraphs (7.7%)** changed consensus category — confirming the bias was real -- **37 paragraphs (2.4%)** changed consensus specificity -- **152 total (9.9%)** changed on at least one dimension -- mimo-v2-flash was most sensitive (14.6% category changes); gemini least affected (6.0%) -- 18 original conflicts resolved, 22 new conflicts introduced — roughly a wash on Stage 2 savings -- Top transitions: Management Role ↔ Risk Management Process (55/51 each direction), Strategy Integration → None/Other (46), Third-Party Risk → Risk Management Process (34) - -The re-run annotations are stored separately in `data/annotations/stage1-orphan-rerun.jsonl` — the original `stage1.jsonl` is untouched.
For training, the re-run annotations replace the originals for the affected 1,537 paragraphs. - -### No-Cyber-Keyword Paragraphs: A False Alarm - -The quality audit flagged 528 paragraphs (348 annotated) with no cybersecurity keywords at all — suspicious for Item 1C content. Initial expectation: these are section bleed from adjacent filing sections, probably labeled None/Other. - -**Actual finding:** 65.2% (227 paragraphs) were labeled as real categories — mostly Risk Management Process (44.8%) and Management Role (10.6%). And the labels were **correct.** The paragraphs discuss security topics using synonymous terms: "risk assessment", "access to systems", "theft of intellectual property", "safeguards", "internal notifications" — all legitimate cybersecurity content that doesn't use the literal word "cybersecurity." The keyword filter was too narrow, not the paragraphs. All 348 are kept. - -### Heading-Stripped Paragraphs: Labels Still Valid - -For the ~5,643 annotated paragraphs where headings were stripped, existing labels are retained without re-annotation. The heading was a shortcut learning signal (near-perfect predictor of category), but annotators classified the body text, not the heading. Stripping the heading from training data removes a leaky feature without invalidating the label. - -### Embedded Bullet Lists: The Cascade Failure - -A spot-check of a Bancorp 34, Inc. paragraph revealed a class of structural corruption we hadn't detected. The paragraph read as a 114-word run-on: - -> establishing and maintaining a comprehensive program to oversee and manager external connections and third-party relationships with access to the institution's technology assets maintaining an incident response program intended to enable us to mitigate the impact of, and recover from, any cyberattacks, and facilitate communication to internal and external experienced a single cybersecurity event in June of 2023... 
- -The source HTML (filed via EFiling/XDX) had three clearly separate elements: two `
` disclosing a $25,000 cybersecurity incident. The HTML structure was unambiguous — separate table rows with spacers between them. - -**Root cause: a three-part cascade failure in the extraction pipeline.** - -1. **Bullet character not recognized.** The HTML used `·` (middle dot in Symbol font) instead of `•` (standard bullet). `stripHtml()` doesn't decode it, so the bullet-aware merge logic in the segmenter never fires. -2. **Lowercase continuation merge.** Each bullet starts lowercase ("establishing...", "maintaining..."), so the segmenter treats them as continuation fragments of the previous block. -3. **Short-block append.** Individual bullets fall below the 20-word minimum, so they get appended to the previous paragraph. - -The result: two process-description bullet items and an incident disclosure fused into one incoherent paragraph. Despite this, all 3 Stage 1 models unanimously labeled it Incident Disclosure / Specificity 4 — the $25K incident detail dominated the merged text. - -We identified two classes of this failure: - -1. **Semicolon-separated merges (1,941 paragraphs):** The semicolons from the original list survived, but the bullet characters were stripped. Detectable by heuristic (3+ semicolons, lowercase after each, no bullet markers). -2. **Invisible merges (222 paragraphs):** Even the semicolons were stripped, leaving text that simply runs together with no trace of the original list structure. The Bancorp 34 example falls in this category — "to internal and external experienced a single cybersecurity event" is an impossible English sentence that a regex cannot distinguish from legitimate prose. These were detected by a secondary heuristic (lowercase-start, not orphan-patched, 60+ words), but this is an undercount — some invisible merges start with uppercase text. - -All 2,163 were reclassified to "degraded" tier. 
These aren't worth patching — splitting merged bullets requires per-paragraph HTML structure analysis and re-annotation of every resulting fragment. Instead, they'll be downweighted (0.5x) during fine-tuning to reduce overfitting to degraded text patterns while preserving their content signal. - -### Sample Weighting for Fine-Tuning - -The quality tier system maps directly to training sample weights: - -| Tier | Weight | Rationale | -|------|--------|-----------| -| clean | 1.0 | No issues | -| headed | 1.0 | Heading removed, body text intact | -| minor | 1.0 | Orphan word restored | -| degraded | 0.5 | Labels likely correct, but text structure doesn't match clean inference-time inputs | - -This is implemented via a `sample_weight` column in the training dataset. The HuggingFace Trainer supports per-sample loss weighting — each sample's cross-entropy loss is multiplied by its tier weight before backpropagation. Degraded paragraphs still contribute to learning, but their influence is halved relative to clean data. - -### Data Integrity Framework - -The audit produced a formal data integrity framework: - -1. `paragraphs-clean.jsonl` is frozen — the reproducibility anchor -2. All fixes go through `.patched.jsonl` — same schema, same IDs, updated text and hash -3. Annotations link by UUID — stable across patches -4. Never re-run extraction from HTML — cascade effects from merge logic cause thousands of ripple-effect changes -5. Every patch is documented with scope, method, validation, and annotation impact -6. Quality metadata is separate from text data — per-paragraph quality scores in a separate file +| Patch | Method | Paragraphs | +|-------|--------|-----------| +| 1-2. Orphan word restoration | HTML lookback extraction | 2,233 | +| 3-6. Heading strip (4 passes) | Pattern match + HTML-confirmed | 8,411 | ### Quality Tier System -Each paragraph gets a quality tier based on detected issues: - | Tier | Criteria | Count | % | |------|----------|-------|---| | clean | No detected issues | 58,165 | 80.7% | @@ -692,592 +108,140 @@ Each paragraph gets a quality tier based on detected issues: | degraded | Embedded bullets, invisible merges, fragments, truncations | 4,331 | 6.0% | | minor | Had orphan word (now fixed) | 2,147 | 3.0% | -All "headed" and "minor" paragraphs have been patched — the tier records what *was* wrong for traceability. "Degraded" paragraphs are downweighted (0.5x) during fine-tuning. +Degraded paragraphs downweighted 0.5x during fine-tuning. --- -## Phase 11: DAPT Corpus Preparation +## Phase 4: Pre-Training — DAPT + TAPT -### Corpus Cleaning +### DAPT: Domain-Adaptive Pre-Training -The DAPT corpus is built from 14,759 cached 10-K HTML filings processed through `stripHtml()` + `cleanForDapt()`. Three rounds of cleaning were required: +Chose our own ~9,000 cached filings over PleIAs/SEC (373K on HuggingFace): +- Recency > volume — Item 1C didn't exist before FY2023 +- Diminishing returns past 250M tokens (Ponnock 2025) +- We control cleaning quality +- Feasible on a single RTX 3090 -**Round 1** revealed XBRL data blobs (8.7% of docs, up to 33% of document text), page number artifacts, and exhibit listing boilerplate. Added targeted stripping for `iso4217:`, `xbrli:`, CIK-number sequences, and `F-N` page markers. +**Corpus:** 14,568 docs, ~1.056B tokens. Subsampled to newest 500M tokens. -**Round 2** removed URLs (39% of docs → 0.3%) and XBRL exhibit listing lines ("Inline XBRL Taxonomy Extension Calculation Linkbase Document" — present in 85% of filings). Initial investigation claimed these were "legitimate prose mentions of XBRL." Spot-checking showed every single remaining match was exhibit index boilerplate.
Stripped any line containing "XBRL" unless it also contained cybersecurity/risk/governance terms. +**Key optimizations:** Flash Attention 2 (47s→27s/step), torch.compile (halved activation memory), corpus subsampling (29h→13.5h). -**Round 3** was a verification pass confirming the remaining 7.4% of docs with "XBRL" traces are legitimate prose co-occurrences with security terms. +**Results:** Eval loss 0.7250, perplexity 1.65. 1 epoch, ~14.5h on RTX 3090. Checkpoint: `checkpoints/dapt/modernbert-large/final/`. -The page number regex initially had a branch matching `[- ]\d{1,3}[- ]` that produced 100% false positives — it was matching negative financial figures (`-1%`) in sensitivity analysis tables. Only the `F-\d+` pattern was genuine. The false-positive branch was removed. +### TAPT: Task-Adaptive Pre-Training -### Corpus Statistics (Final) +72K Item 1C paragraphs (~10M tokens). 5 epochs with whole-word masking at seq_len=512. -| Metric | Value | -|--------|-------| -| Full corpus | 14,568 docs, ~1.056B tokens | -| Training subset | ~7,200 docs (newest 500M tokens, FY2024-2025) | -| Training sequences (seq_len=8192) | ~60K | -| Steps per epoch (eff. batch=32) | ~1,950 | -| Actual training time | ~13.5 hours (RTX 3090, 27s/step) | +**Bugs fought:** 4 bugs in `transformers` whole-word masking for BPE tokenizers, Python 3.14 incompatibility. Custom `WholeWordMaskCollator` built from scratch. -### Sequence Length Decision +**Results:** Loss 1.46→1.08, eval loss 1.0754, perplexity 2.11. 50 minutes on RTX 3090. Checkpoint: `checkpoints/tapt/modernbert-large/final/`. -ModernBERT was pre-trained at 8192 tokens (Warner et al., 2024). We match this during DAPT to ensure all positional embedding and attention weights — including ModernBERT's alternating local/global attention pattern — receive gradient updates. 
At seq_len=2048, positions 2048-8191 would get no updates, and the global attention layers (every 3rd layer, RoPE theta 160K) would never see long-range context during DAPT. +### Training Pipeline -### Epoch Decision - -We train for 1 epoch (single pass), following the empirical consensus: - -- **Gururangan et al. (2020), "Don't Stop Pretraining" (ACL 2020):** Trained DAPT for "12.5K steps, which amounts to a single pass on each domain dataset" across 2-8B token corpora. Sufficient for consistent downstream gains across all four domains tested. -- **Ponnock (2025), arXiv:2512.12384:** Found SEC-specific DAPT shows "diminishing marginal returns beyond roughly 250M tokens" within a single epoch. Our 1B token corpus is well past the diminishing-returns threshold. - -### Hyperparameters Aligned with Prior ModernBERT DAPT Work - -We aligned hyperparameters with the ModernBERT paper and two published DAPT efforts: - -- **MLM probability (30%):** Matches ModernBERT pre-training (Warner et al., 2024). -- **Weight decay (1e-5):** Matches ModernBERT pre-training and both BioClinical-ModernBERT (Sounack et al., 2025) and Patent-ModernBERT (Luo et al., 2025). The commonly-cited 0.01 is a BERT/RoBERTa default that doesn't apply to ModernBERT. -- **Learning rate (5e-5):** Conservative because we start from the published post-decay checkpoint. BioClinical and Patent-ModernBERT used 3e-4 but started from pre-decay stable-phase checkpoints that the ModernBERT authors released specifically for continued pre-training. - -### Training Optimizations - -Initial training ran at ~47s/step (projected ~56 hours for 1B tokens). Through iterative optimization we brought this down to ~13.5 hours: - -1. **Flash Attention 2** (Dao, 2024) — installed via precompiled wheel after upgrading to PyTorch 2.11+cu130 (CUDA 13.0 to match the driver). Without FA2, ModernBERT fell back to O(n²) eager attention at 8192 seq_len. This cut s/step from ~47s to ~27s. - -2. 
**torch.compile** — JIT-compiles non-attention ops into fused CUDA kernels. With external FA2, Dynamo hits graph breaks at every attention layer, so there was **no compute speedup**. However, fusing the surrounding ops (FFN, layer norms, residuals) unexpectedly **halved activation memory** (18.2GB → 11.9GB at batch=2) by eliminating intermediate tensor allocations. - -3. **Batch size increase** — torch.compile's memory savings freed enough VRAM to increase from batch=2 to batch=4. At seq_len=8192 the GPU is already compute-saturated, so larger batches didn't meaningfully improve s/step (~27s in all configurations). The benefit was marginal reduction in gradient accumulation overhead. - -4. **Corpus subsampling** — the single biggest wall-time reduction. Ponnock (2025) showed diminishing returns past 250M tokens for SEC DAPT. Subsampling from 1.06B to 500M tokens (newest filings) halved training from ~29h to ~13.5h. - -5. **Fused AdamW + non-reentrant gradient checkpointing + tf32** — minor optimizations (~1-2% combined). Fused optimizer merges parameter updates into a single kernel. Non-reentrant checkpointing enables torch.compile compatibility. - -**What didn't work:** Increasing batch size beyond 2 provided no s/step improvement because the 3090 is compute-saturated at seq_len=8192 (attention is O(n²) FLOPs even with FA2). SDPA (PyTorch's native attention) couldn't replace external FA2 without OOMing due to different memory allocation patterns. torch.compile couldn't accelerate the attention bottleneck because FA2's custom CUDA kernels are opaque to Dynamo's graph tracer. - -**The fundamental constraint** is hardware: the RTX 3090's 35.6 bf16 TFLOPS sets a hard ceiling on throughput at 8192 seq_len. An AWS g7e.2xlarge (RTX PRO 6000 Blackwell, 236 bf16 TFLOPS, 96GB VRAM) could complete the same run in ~3.7 hours for ~$5 on spot pricing — the 96GB VRAM allows dropping gradient checkpointing entirely (eliminating activation recomputation) and running batch=16. 
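The wall-clock arithmetic behind the subsampling decision is simple enough to sanity-check. A back-of-envelope sketch (ignoring warmup, eval passes, and the final partial batch):

```python
def steps_per_epoch(tokens: int, seq_len: int, effective_batch: int) -> int:
    """Optimizer steps for one pass over the corpus: tokens per step is
    seq_len * effective_batch, so steps = total tokens / tokens-per-step."""
    return tokens // (seq_len * effective_batch)


# 500M-token subsample at seq_len=8192, effective batch 32, ~27 s/step on the 3090
steps = steps_per_epoch(500_000_000, 8192, 32)  # ~1,900 steps
hours = steps * 27 / 3600                       # roughly 14 hours of compute
```

The same function with 1.06B tokens gives roughly double the steps, which is where the ~29h-to-~13.5h halving from subsampling comes from.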
- -Full procedure, optimization journey, and cloud cost analysis in `docs/DAPT-PROCEDURE.md`. - -### Early Training Results - -| Step | Loss | grad_norm | LR | Epoch | Note | -|------|------|-----------|-----|-------|------| -| 54 | 0.7991 | 0.066 | 2.66e-5 | 0.03 | Warmup phase | -| 1280 | 0.7233 | 0.068 | 1.57e-5 | 0.70 | Steady decline | -| 1800 | 0.7253 | 0.073 | 1.48e-6 | 0.97 | LR near zero, loss plateaued | -| **Final** | **0.7250** | **0.043** | **5.7e-8** | **1.00** | **Eval loss: 0.7250, perplexity: 1.65** | - -The loss dropped from 0.80 → 0.72 — a gentle 10% decline over one epoch. For comparison, a randomly initialized model would start at ~10.8 (ln(50280 vocab size)). Starting at 0.80 reflects that ModernBERT already knows English; DAPT taught it SEC-specific token co-occurrence patterns ("NIST CSF", "materiality assessment", "tabletop exercise"), not language fundamentals. grad_norm remained stable at 0.04-0.07 throughout with zero instability. Total training time: ~14 hours across two sessions on an RTX 3090 (resumed from checkpoint-1280). - -The DAPT checkpoint is saved at `checkpoints/dapt/modernbert-large/final/` and is ready for TAPT. - -### TAPT Configuration - -The TAPT corpus is 72K Item 1C paragraphs (~10M tokens) — 50x smaller than the DAPT corpus. This changes several training decisions vs. DAPT. Config file: `python/configs/tapt/modernbert.yaml`. - -| Parameter | DAPT | TAPT | Rationale for change | -|-----------|------|------|---------------------| -| `max_seq_length` | 8192 | 512 | Data-driven: paragraphs average 127 tokens (P99=386, 99.6% fit in 512). Using 8192 would mean 98.5% padding — pure waste. See seq_len discussion below. | -| `num_train_epochs` | 1 | 5 | Gururangan et al. (2020) ran 100 epochs on 50-500K token TAPT corpora. We match total token exposure: 5 × 10M = 50M tokens ≈ upper bound of their TAPT exposure. | -| `whole_word_mask` | false | true | Masks entire words instead of subword pieces. 
Prevents trivially solvable masking patterns (e.g., masked `cyber` next to unmasked `security`). The model already knows subword composition from DAPT — TAPT should focus on domain-specific whole words ("CISO", "materiality", "tabletop"). | -| `per_device_train_batch_size` | 4 | 32 | Short sequences free VRAM. Tested: batch=32 uses 22.7 GB with torch.compile (vs. OOM at batch=48). | -| `gradient_accumulation_steps` | 8 | 1 | Effective batch = 32 in both cases. No accumulation needed since batch=32 fits directly. | -| `gradient_checkpointing` | true | false | Not needed at seq_len=512 — activations are small. Gradient checkpointing would slow training 30-40% for no memory benefit. | -| `save_strategy` / `eval_strategy` | steps (256) | epoch | 5 epochs; checkpoint and evaluate after each one. | -| `validation_split` | 0.02 | 0.05 | Larger val split for a 50x smaller dataset — need enough samples for stable eval loss. | - -**Sequence length (512 vs. 8192):** The concern with a shorter seq_len is degrading the model's long-range attention capabilities. Three factors make this a non-issue for TAPT: - -1. **The data is short.** Paragraphs average 127 tokens. There is no long-range structure to learn — the information simply isn't there. -2. **Scale of exposure.** TAPT is 50M token-exposures (5 epochs × 10M). ModernBERT was pre-trained on ~2T tokens; DAPT added 500M. 50M is 0.0025% of original pre-training — far too small to cause catastrophic forgetting of patterns established over trillions of tokens. -3. **RoPE positions are independent.** ModernBERT uses rotary position embeddings. Positions 0-511 compute identically whether max_length is 512 or 8192. Training at 512 updates the same parameters; positions 512-8191 remain as-is from DAPT, not degraded. - -**Whole-word masking and tokenization:** Whole-word masking requires `offset_mapping` from the tokenizer to determine word boundaries. 
This is incompatible with DAPT's concatenate-and-chunk approach (which destroys offset_mapping by merging documents). TAPT tokenizes each paragraph individually with truncation, preserving offset_mapping. The data collator handles dynamic padding per batch. This is a different code path from DAPT's concatenation, but the data justifies it: paragraphs are natural self-contained units, unlike DAPT's long filings that must be chunked. - -**Training time:** ~2,139 steps/epoch × 5 epochs = ~10,695 total steps. 50 minutes on the RTX 3090 at ~3.56 steps/s (averaged over full run including torch.compile warmup). - -### TAPT Results - -| Metric | Value | -|--------|-------| -| Epochs | 5 | -| Total steps | 10,695 | -| Training time | 50 minutes | -| Initial loss | 1.46 | -| Final train loss (avg) | 0.6428 | -| Final eval loss | 1.0754 | -| Final perplexity | 2.11 | -| Throughput | 114 samples/s, 3.56 steps/s | - -Loss dropped from 1.46 → 1.08 over 5 epochs. For comparison, DAPT ended at eval loss 0.72 with standard subword masking at the same 30% rate — the gap reflects the harder whole-word masking objective (no subword hints), not a weaker model. The model learns to predict masked domain terms ("CISO", "materiality", "tabletop") from surrounding paragraph context alone, which is exactly the inductive bias TAPT is designed to create. - -The TAPT checkpoint is saved at `checkpoints/tapt/modernbert-large/final/` and is ready for fine-tuning. - -### TAPT Launch — Whole-Word Masking Bugs - -Launching TAPT required fighting through four bugs in `transformers`' `DataCollatorForLanguageModeling` when `whole_word_mask=True`, plus a Python 3.14 incompatibility that forced a version rollback. - -**Bug 1: `offset_mapping` stripped before reaching the collator.** The Trainer's default `remove_unused_columns=True` drops any dataset column not in the model's `forward()` signature. 
Since `offset_mapping` is a collator input (not a model input), it was silently removed, causing the collator to receive a 0-dimensional array and crash with `IndexError: too many indices for array`. Fix: set `remove_unused_columns=False` when whole-word masking is enabled. - -**Bug 2: `offset_mapping` can't survive `tokenizer.pad()`.** Even with the column present, the collator's `torch_call()` passes all features — including `offset_mapping` — through `tokenizer.pad()`, which tries to tensorize the variable-length nested lists and crashes with `ValueError`. The collator pops `offset_mapping` *after* padding, but padding already failed. Fix: subclass `DataCollatorForLanguageModeling` to strip `offset_mapping` before padding. - -**Bug 3: `offset_mapping` word boundary detection is broken for BPE tokenizers.** This was the most insidious bug — training ran but loss was ~6-8 (near-random, vs. expected ~1.5-2.0). The upstream `_calc_word_ids_and_prob_mask` detects word boundaries by checking if `token_start != prev_token_end` in the offset mapping. But BPE tokenizers (like ModernBERT's) absorb leading spaces into tokens, making ALL offsets contiguous: `"The" → (0,3), " company" → (3,11)`. Since 3 == 3, the algorithm treats the entire sequence as one giant "word." When 30% masking is applied to these mega-groups, it masks enormous contiguous spans, making prediction nearly impossible. - -**Fix:** Replaced `offset_mapping` entirely with the tokenizer's `word_ids()` method, which correctly identifies word boundaries for any tokenizer type (BPE, WordPiece, SentencePiece). The `WholeWordMaskCollator` in `python/src/dapt/train.py` implements whole-word masking from scratch: extracts `word_ids` before padding, selects `mlm_probability` fraction of unique word IDs per sequence, and masks all tokens belonging to selected words. - -**Python 3.14 incompatibility.** Two separate issues forced a rollback to Python 3.13: -1. 
Python 3.14 changed the multiprocessing start method from `fork` to `forkserver`, requiring picklable dataloader collators (closures crash with `PicklingError`). -2. Python 3.14 changed `pickle.Pickler._batch_setitems` to take 3 arguments, breaking `dill` (used by `datasets` for config hashing). This was unfixable — even `dill` 0.4.1 and `datasets` 4.8.4 crashed. The breakage is deep in the `datasets` builder machinery and hit every codepath (`load_dataset`, `Dataset.from_list`, `dataset.map`). - -Rolled `pyproject.toml` from `requires-python = ">=3.14"` to `">=3.13,<3.14"` and updated the flash-attn wheel URL from cp314 to cp313. +``` +ModernBERT-large (base, 395M params) + → DAPT on 9K full 10-K filings (~500M tokens, ~14.5h) → SEC-ModernBERT-large + → TAPT on 72K Item 1C paragraphs (~10M tokens, ~50min) → SEC-cyBERT-large + → Fine-tune on labeled data with dual classification heads → Final classifier +``` --- -## Cost and Time Ledger +## Phase 5: Truncated Filing Exclusion -### Tooling - -All code was written collaboratively with **Claude Code** (Anthropic's agentic coding CLI). Claude Code was used throughout the project for pipeline development, prompt engineering, data analysis, script writing, documentation, and strategic planning. The tool dramatically accelerated iteration speed — writing analysis scripts, debugging extraction edge cases, and exploring the annotation data interactively — but all decisions were made by the team with Claude Code as an implementation partner. 
- -### API Cost Ledger - -| Phase | Cost | Annotations | Notes | -|-------|------|-------------|-------| -| Stage 1 prompt iteration (pilots) | $7.03 | 9,597 | 12+ versions: 5 × 40-sample + 6 × 500-sample | -| Stage 1 model bench (6 candidates) | $3.41 | 2,993 | seed, mimo, glm-4.5-air, minimax, mistral, nemotron | -| Mimo pilot (dedicated comparison) | $0.24 | 500 | `mimo-pilot.ts` — replace-nano scenario modeling | -| Stage 1 run #1 (with nano) | $112.42 | 150,009 | Full production run with gpt-5.4-nano. Completed, but nano's quality was unacceptable (0 reasoning tokens 64% of the time). Gemini+grok annotations ($91.18) preserved in `stage1-gemini-grok.jsonl`; only nano's annotations ($21.24) were discarded. Full original in `stage1.jsonl.bak`. | -| Stage 1 run #2 (mimo only) | $24.69 | 50,003 | Ran only mimo to replace nano. Merged with preserved gemini+grok annotations to form final `stage1.jsonl` ($115.88 total value, $24.69 new spend). | -| Judge model bench (8 candidates) | $5.97 | 505 | GLM-5 (4 configs), gpt-5.4-mini, gpt-5.4, sonnet-4.6, gemini-3-flash, grok-4.20, mimo-v2-pro, kimi-k2.5 | -| Orphan word re-annotation | $3.30 | 4,611 | Re-ran Stage 1 on 1,537 patched paragraphs × 3 models. 7.7% changed consensus category. | -| **Total API spend** | **$159** | **~218K unique** | Nano waste: $21.24 | - -Only nano's portion ($21.24) of the first run was wasted — the gemini and grok annotations were preserved and merged with the new mimo annotations. Still, $21.24 thrown away on a model that wasn't thinking. The lesson: benchmark model candidates rigorously *before* committing to a production run. The 40-sample pilots showed nano was the weakest link but were misleadingly optimistic about the magnitude of the problem. - -### Time Ledger - -| Phase | Hours | Notes | -|-------|-------|-------| -| Data acquisition + HTML cleaning | ~6h | Extraction pipeline, HTML artifact handling, dedup, 8-K discovery. 
The messiest phase — SEC filing HTML variability required extensive regex heuristics and iteration. | -| Stage 1 annotation run #1 (nano) | ~5h | Production run wall clock (~300 min). Completed but results were below quality bar. | -| Stage 1 annotation run #2 (mimo) | ~1h | Only needed mimo annotations at higher concurrency (gemini+grok reused). | -| Prompt iteration + model benchmarking | ~4h | 12+ prompt versions, 6 model candidates, pilot analysis | -| Post-Stage 1 analysis + Stage 2 planning | ~5h | Distributional analysis, model bias discovery, codebook v3.0 rulings, judge benchmarking, strategy revision | -| Data quality audit + remediation | ~4h | Generator investigation, 6 patches, orphan re-annotation, quality tier system, docs | -| Documentation + narrative | ~2h | Codebook updates, narrative writing, technical guide updates | -| Labelapp build + infrastructure | ~8h | Monorepo restructure, Next.js app, quiz/warmup/labeling flows, BIBD assignment, sampling, Docker deployment, timer + migration infrastructure | -| DAPT pre-training | ~14.5h GPU | 1 epoch on 500M tokens, RTX 3090. Two sessions (resumed from checkpoint-1280). | -| TAPT debugging + pre-training | ~2h dev + ~50min GPU | 4 bugs in transformers whole-word masking + Python 3.14 rollback. Training: 5 epochs on 72K paragraphs, 50 min. | -| Human labeling (1,200 paragraphs, 6 annotators) | 21.5h active | $0 (team labor) | -| Post-labeling analysis + gold set tooling | ~3h | $0 | -| **Total to date** | **~76.5h** | Includes ~15.3h GPU + 21.5h human labeling | - -### Remaining Work (estimated) - -| Phase | Est. Hours | Est. 
Cost | -|-------|-----------|-----------| -| GenAI holdout benchmark (6 models × 1,200) | ~1h | ~$15-43 | -| Opus golden re-run (1,200 paragraphs) | ~1h | $0 (subscription) | -| Gold set adjudication (13+ signals/paragraph) | ~4h | $0 | -| Training data assembly | ~2h | $0 | -| Fine-tuning + ablations (7 experiments) | ~12-20h GPU | $0 | -| Evaluation + comparison + write-up | ~6-8h | $0 | +72 filings (~0.8%) had their section boundaries detected mid-sentence, truncating the filing text. These are excluded from training splits: any filing whose last paragraph doesn't end in terminal punctuation is filtered out. --- -## Model Census — Every Model We Tried + --- -## Phase 13: GenAI Holdout Benchmark +## Phase 6: The v2 Reboot — Why We Started Over -### Benchmark Panel +### What v1 Taught Us -With human labeling complete, the next step is running 6+ GenAI models from 3+ suppliers on the same 1,200 holdout paragraphs — both as an assignment requirement and to generate the 13+ annotation signals needed for gold set adjudication. +The v1 pipeline produced 150K Stage 1 annotations, a 10-model benchmark, human labels from 6 annotators, and extensive gold adjudication. It worked — but evaluation revealed structural problems that no amount of prompt iteration could fix: -The benchmark panel uses the v3.0 prompt (with codebook rulings) and runs via OpenRouter: +1. **Specificity Level 2 was too narrow.** Our codebook defined Level 2 as "names a recognized standard" — but the professor's construct says "references industry." Domain-specific practices (penetration testing, vulnerability scanning, SIEM) were classified as Level 1. Level 2 ended up at 3.9% of the holdout (47 samples) — too few for reliable per-class F1.
-| Model | Supplier | Cost/call | Latency | Structured Output | -|-------|----------|-----------|---------|-------------------| -| openai/gpt-5.4 | OpenAI | $0.009 | 5s | Native | -| moonshotai/kimi-k2.5 | Moonshot | $0.006 | 33s | Native | -| google/gemini-3.1-pro-preview | Google | $0.006 | 3s | Native | -| z-ai/glm-5 | Zhipu | $0.006 | ~40s | Native (exacto routing) | -| minimax/minimax-m2.7 | MiniMax | $0.002 | 11s | Raw text + fence stripping | -| xiaomi/mimo-v2-pro | Xiaomi | $0.006 | 32s | Native (exacto routing) | +2. **Level 4 required 2+ QV facts.** The construct lists types of qualifying facts, not a minimum count. The artificial threshold created a narrow class and forced annotators into a counting exercise. -Plus Claude Opus 4.6 via Agent SDK (subscription, no per-call cost) with full codebook as system prompt. +3. **The BG/MR/RMP triangle was patched, not fixed.** Six decision rules and ten borderline cases accumulated as patches on unchanged definitions. Models processed increasingly complex instructions with diminishing returns. -Combined with the 3 Stage 1 models already on file: **10 models from 8 suppliers**. +4. **The holdout was adversarial by design.** Stratified to over-sample confusion-axis paragraphs — great for stress-testing the codebook, terrible for evaluation. Combined with narrow Level 2, this structurally depressed F1. -**Minimax structured output workaround:** MiniMax m2.7 wraps JSON responses in markdown code fences (` ```json ... ``` `), which the Vercel AI SDK's `Output.object()` parser cannot handle. Rather than using tool calling (which drops accuracy ~7pp based on GLM-5 testing) or a fallback retry (2x cost), minimax models skip structured output entirely and use raw text generation with regex fence stripping before Zod validation. The enum values are correct with the full v3.0 prompt; only the fences are the issue. +5. **Human specificity agreement was poor.** Krippendorff's α = 0.546 on specificity (target: 0.67). 
The narrow Level 2 definition made it hard for anyone to agree. -### Opus Golden Re-Run +### The Decision -The Opus golden labeling was re-run on the correct 1,200 holdout paragraphs. A previous run had annotated a different set of 1,200 paragraphs due to `.sampled-ids.json` being overwritten (previous labels preserved at `data/annotations/golden/opus.wrong-sample.jsonl`). The re-run uses parallelized Agent SDK workers (configurable concurrency) with serialized file writes for crash safety. +Rather than continue patching, we decided to: +- Revise the codebook with systemic changes (broaden Level 2, loosen Level 4, reframe category rules) +- Take a new random stratified holdout (equal per category class, not overindexed on hard cases) +- Re-run Stage 1 with the improved codebook/prompt +- Have humans re-label the new holdout +- Re-run the benchmark panel +- Then train + +The v1 data pipeline, corpus, DAPT checkpoint, and TAPT checkpoint are all unchanged and carried forward. Only the labeling and evaluation are redone. + +### What Changed in v2 + +**Codebook (LABELING-CODEBOOK.md):** +- Level 2 broadened from "names a standard" to "uses cybersecurity domain terminology" (the ERM test) +- Level 4 threshold lowered from 2+ to 1+ QV-eligible fact (the external verifiability test) +- Category primary test changed to "What question does this paragraph answer?" +- MR headline changed from "who a specific person is" to "how management is organized to handle cybersecurity" +- Person-removal test reframed as confirmation tool, not primary rule +- Materiality rules cleaned up (assessment vs. 
speculation distinction became a clean rule, not a ruling) +- IS/NOT lists restructured for new Level 2 boundary +- Codebook + Ethos split: rules in LABELING-CODEBOOK.md, reasoning in CODEBOOK-ETHOS.md + +**Holdout:** +- Random stratified sample: ~170 per category class × 7 ≈ 1,190 +- Secondary constraint: minimum ~100 per specificity level +- NOT overindexed on confusion-axis cases +- Separate ~200-paragraph dev set for prompt iteration (excluded from holdout) + +### Cost of the Reboot + +| Item | Estimated Cost | +|------|---------------| +| Stage 1 re-run (full corpus) | ~$120 | +| Benchmark re-run (holdout) | ~$45 | +| Prompt iteration | ~$10 | +| Human re-labeling | $0 (team labor) | +| **Total additional API** | **~$175** | + +Against the ~$200 already spent on v1 API calls. The DAPT/TAPT compute (~15h GPU) is not re-done. --- -## Key Technical Artifacts +## v1 Reference -| Artifact | Location | Description | -|----------|----------|-------------| -| Labeling codebook | `docs/LABELING-CODEBOOK.md` | Authoritative reference, v3.0 with codebook rulings | -| Stage 1 annotations | `data/annotations/stage1.jsonl` | 150,009 annotations (120 MB) | -| Paragraphs | `data/paragraphs/paragraphs-clean.jsonl` | 72,045 paragraphs with filing metadata | -| Gold labels | `data/bench/judges/gold-final.json` | 50 adjudicated gold labels | -| Gold adjudications | `data/bench/judges/gold-adjudicated.json` | 11 detailed adjudication decisions with reasoning | -| Human labels (raw) | `data/gold/human-labels-raw.jsonl` | 3,600 labels with timing, notes, session IDs | -| Human label metrics | `data/gold/metrics.json` | Full IRR: per-dimension alpha, pairwise kappa matrices, per-category/stratum rates | -| Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` | 1,200 holdout paragraphs with Stage 1 consensus metadata | -| Diagnostic charts | `data/gold/charts/` | 16 analysis charts (kappa heatmaps, confusion matrices, distributions, etc.) 
| -| Analysis script | `scripts/analyze-gold.py` | Comprehensive cross-source analysis (human × Stage 1 × Opus) | -| Annotation prompt | `ts/src/label/prompts.ts` | SYSTEM_PROMPT (v3.0) + buildJudgePrompt() | -| Annotation runner | `ts/scripts/stage1-run.ts` | Resume-safe, configurable concurrency | -| Orphan re-annotation | `ts/scripts/rerun-orphan-stage1.ts` | Re-ran 1,537 patched paragraphs, $3.30 | -| Re-annotation diff | `ts/scripts/diff-orphan-annotations.ts` | Category/specificity change analysis | -| No-cyber analysis | `ts/scripts/analyze-no-cyber.ts` | Label distribution on 348 flagged paragraphs | -| Data quality audit | `docs/DATA-QUALITY-AUDIT.md` | Full audit: generators, patches, quality tiers | -| Generator reference | `docs/EDGAR-FILING-GENERATORS.md` | 14 vendors with signatures and quality profiles | -| Analysis scripts | `ts/scripts/stage1-analyze.ts`, `segment-analysis.ts`, `model-bias-analysis.ts`, `dispute-crosstab.ts`, `sample-disputes.ts` | Deep analytics on annotation data | -| Judge benchmarking | `ts/scripts/judge-bench.ts` | Supports structured/tool modes, gold label comparison | -| Judge diagnostics | `ts/scripts/judge-diag.ts`, `judge-diag-batch.ts` | GLM-5 failure investigation | -| Model benchmarking | `ts/scripts/model-bench.ts` | Stage 1 candidate evaluation | -| Golden annotation (Opus) | `ts/src/label/golden.ts` | Agent SDK runner for gold set, saves reasoning traces | -| Golden annotations | `data/annotations/golden/opus.jsonl` | Opus 4.6 labels + thinking + raw confidence (re-run on correct holdout) | -| Benchmark annotations | `data/annotations/bench-holdout/{model}.jsonl` | 6 models × 1,200 paragraphs, v3.0 prompt | -| Stale golden (wrong sample) | `data/annotations/golden/opus.wrong-sample.jsonl` | Original Opus run on wrong 1,200 paragraphs (preserved) | +The complete v1 narrative — Stage 1 prompt engineering (12+ iterations), model benchmarking (21+ models, 12 providers), human labeling webapp, gold set adjudication 
(13-signal cross-analysis), codebook iterations v1.0–v3.5 — is preserved at `docs/NARRATIVE-v1.md`. ---- - -## Phase 14: 13-Signal Analysis & F1 Strategy - -### Benchmark Complete - -All 6 benchmark models + Opus completed 1,200 annotations each. Total benchmark cost: $45.63. Every paragraph in the holdout now has exactly 13 independent annotations: 3 human + 3 Stage 1 + 1 Opus + 6 benchmark. - -Model performance sorted by leave-one-out "both" accuracy (each source vs majority of other 12): Opus 4.6 (84.0%), Kimi K2.5 (83.3%), Gemini Pro (82.3%), GPT-5.4 (82.1%), GLM-5 (81.4%), MIMO Pro (81.4%), Grok Fast (80.0%). Best human: Xander at 76.9%. Worst: Aaryan at 15.8%. - -### The "Is Opus Special?" Question - -We tested whether Opus's apparent dominance was an artifact of using it as the reference. Answer: no. In leave-one-out analysis, Opus has the lowest "odd one out" rate at 7.4% — it disagrees with the remaining 12 sources less than any other source. But the top 6 GenAI models are within 3pp of each other — any could serve as reference with similar results. The 13-signal majority is 99.5% identical to the 10-GenAI majority; adding 3 human votes barely shifts consensus because 10 outvotes 3. - -### Adjudication Tiers - -The 13-signal consensus enables tiered adjudication: -- **Tier 1 (63.0%):** 756 paragraphs where 10+/13 agree on both dimensions. Auto-gold, zero human work. -- **Tier 2 (18.0%):** 216 paragraphs where human majority and GenAI majority agree. Cross-validated. -- **Tier 3 (2.2%):** 26 paragraphs where humans split but GenAI converges. -- **Tier 4 (16.8%):** 202 paragraphs with universal disagreement. Expert adjudication needed. - -81% of the holdout can be adjudicated automatically. The 202 Tier 4 paragraphs are dominated by MR↔RMP confusion (the #1 axis everywhere) and are the natural error analysis corpus. - -### Specificity: GenAI Is More Consistent Than Humans - -GenAI spec unanimity is 60.1% vs human spec unanimity of 42.2% (+18pp). 
Specificity calibration plots show that GPT-5.4, Gemini Pro, and Kimi K2.5 closely track Opus across all 4 specificity levels. MiniMax M2.7 is the only model with systematic specificity bias (−0.26 vs Opus). Among humans, Aaryan's +1.30 bias dwarfs all other sources. - -### F1 Strategy - -The assignment requires macro F1 > 0.80 on category. Based on the data: -- The best GenAI models agree with human majority ~83-87% on category -- Training on 35K+ unanimous Stage 1 labels with DAPT+TAPT should approach this ceiling -- The swing categories for macro F1 are MR (~65-80%), TPR (~70-90%), N/O (~60-85%) -- Focal loss for class imbalance + SCL for boundary separation + ensemble for robustness - -Key risk: the stratified holdout over-samples hard cases, depressing F1 vs a random sample. Mitigation: report F1 on both the full holdout and a proportional subsample. The delta quantifies model degradation at decision boundaries. - -### Cost Ledger Update - -| Phase | Cost | Time | -|-------|------|------| -| Stage 1 (150K annotations) | $115.88 | ~30 min | -| Orphan re-annotation | $3.30 | ~9 min | -| Benchmark (6 models × 1,200) | $45.63 | ~1h | -| Opus golden (1,200) | $0 (subscription) | ~30 min | -| Human labeling | $0 (class assignment) | 21.5h active | -| Post-labeling analysis | ~3h | | -| **Total API** | **$164.81** | | - ---- - -## Phase 15: Codebook v3.5 — The Prompt Drift Discovery - -### The Problem - -Cross-analysis of human vs GenAI labels on the holdout revealed a systematic, directional disagreement on three axes: - -1. **SI↔N/O (23:0 asymmetry):** When humans and GenAI disagreed on this axis, humans ALWAYS called it SI and GenAI called it N/O. Never the reverse. Root cause: the labelapp trained humans that any language connecting cybersecurity to business materiality — even forward-looking ("could materially affect") — is SI at Specificity 1. Stage 1 models (v2.5 prompt) lacked this rule entirely. 
Even v3.0 benchmark models, which had the backward-looking materiality rule, were conservative about forward-looking variants. - -2. **MR↔RMP (253 paragraphs, 38:13 asymmetry):** GenAI systematically calls MR paragraphs RMP. The v3.0 "person-vs-function test" helps but leaves genuinely mixed paragraphs (both person and process as grammatical subjects) unresolved. These near-even splits need a deterministic tiebreaker chain. - -3. **BG↔MR (149 paragraphs, 33:6 asymmetry):** GenAI systematically under-calls BG. The problem is governance chain paragraphs that describe the board receiving reports from management — is this about the board's oversight function or the officer's reporting duty? - -### The Audit - -A Stage 1 audit found ~1,076 paragraphs (649 unanimous + 383 majority N/O) with materiality language that should be SI under the broadened rule. 1.3% of the corpus overall — but potentially concentrated on exactly the boundary cases the holdout over-samples. On the holdout, mimo-v2-flash was actually the most accurate Stage 1 model on this axis, dissenting toward SI 263 times when the other two said N/O. - -The MR↔RMP and BG↔MR axes are cleaner in Stage 1 unanimity — only 0.2% of unanimous BG labels are problematic, and the MR/RMP tiebreaker mainly affects disputed labels (already going to Stage 2). The v2.5→v3.5 gap is primarily an SI↔N/O problem. - -### Initial v3.5 Rulings (Round 1) - -Three rulings, all driven by the 13-signal cross-analysis: - -**Rule 6 broadened (SI↔N/O):** ALL materiality language → SI, not just backward-looking disclaimers. Forward-looking ("could materially affect"), conditional ("reasonably likely to"), and negative assertions ("have not experienced material incidents") are all Strategy Integration at Specificity 1. - -**Rule 2 expanded (BG↔MR):** Added the board-line test with governance hierarchy layers and a dominant-subject test for cross-layer paragraphs. 
- -**Rule 2b expanded (MR↔RMP):** Three-step decision chain: subject test → person-removal test → qualifications tiebreaker. - -These rulings were tested by re-running all 7 benchmark models (6 OpenRouter + Opus) on 359 confusion-axis holdout paragraphs with the v3.5 prompt ($18, stored separately from v3.0 data). - -### The Prompt Drift Lesson - -Running Stage 1 (150K annotations) before human labeling created a subtle but significant problem: the codebook evolved through v2.5 → v3.0 → v3.5, but the training data is frozen at v2.5. Each codebook revision was driven by empirical analysis of disagreement patterns — which required the Stage 1 data AND human labels to exist first. The dependency is circular: you can't know what rules are needed until you see where annotators disagree, but you can't undo the labels already collected. - -### Iteration: 6 Rounds on 26 Regression Paragraphs ($1.02) - -The initial v3.5 re-run revealed that the rulings over-corrected. We identified 26 "regression" paragraphs — cases where v3.0 matched human majority but v3.5 did not — and iterated the prompt using GPT-5.4 on these 26 paragraphs ($0.17/round) to diagnose and fix each over-correction. - -**Round 1 (v3.5a) — 5/26.** Catastrophic. All three rulings over-fired simultaneously. SI was called on every paragraph with the word "material." BG was called whenever a committee was named. MR was called whenever a person was a grammatical subject. The rulings were correct in intent but models interpreted them too aggressively. - -**Round 2 (v3.5b) — 13/25.** Three fixes: (A) Replaced the BG "dominant-subject test" with a "purpose test" — if the paragraph describes oversight structure, it's BG; mere committee mentions don't flip the category. (B) Made MR↔RMP Step 1 non-decisive — a person being the grammatical subject is a signal, not a conclusion; always proceed to Step 2 (person-removal test). (C) Added cross-reference exception for SI. Improvement: +8. 
- -**Round 3 (v3.5c) — 20/26.** The cross-reference exception eliminated the 5 most egregious SI over-predictions — paragraphs like "For a description of risks that may materially affect us, see Item 1A" that v3.5a called SI but are obviously N/O. These were pure pointers with materiality language embedded in the cross-reference text, not materiality assessments. +7. - -**Round 4 (v3.5d) — 22/26.** The critical insight: not all materiality language is a materiality *assessment*. Reading the 6 remaining errors revealed a spectrum: - -- "Cybersecurity risks have not materially affected our business strategy" → **Assessment** (conclusion about actual impact) → SI ✓ -- "Risks are reasonably likely to materially affect us" → **Assessment** (SEC Item 106(b)(2) standard) → SI ✓ -- "Cybersecurity threats could have a material adverse effect on our business" → **Speculation** (generic risk warning in every 10-K) → NOT SI ✗ -- "Managing material risks associated with cybersecurity" → **Adjective** ("material" means "significant") → NOT SI ✗ -- "...which could result in material adverse effects" at the end of an RMP paragraph → **Consequence clause** (doesn't override primary purpose) → NOT SI ✗ - -The tightened rule: only backward-looking conclusions and SEC-qualified forward-looking ("reasonably likely to") trigger SI. Generic "could have a material adverse effect" does not. This distinction — assessment vs. speculation — resolved 3 errors without breaking any correct calls. +2. - -We also verified each error against human annotator votes. All 6 remaining errors had the human majority correct (checked by reading the actual paragraph text and codebook rules). Interestingly, on 3 of the 6, the project lead's own label was the dissenting human vote — he had been the one calling these SI, validating that the over-calling pattern was a real and consistent interpretation difference, not random noise. - -**Round 5 (v3.5e) — 19/25.** Regression. 
We attempted to add an explicit BG↔RMP example ("CISO assists the ERMC in monitoring...→ RMP") to the disambiguation guidance. This caused 3 previously-correct paragraphs to flip to BG — the example made models hyper-aware of committee mentions and triggered BG more broadly. Lesson: **targeted examples can backfire when the pattern is too specific.** The model generalizes from the example in unpredictable ways. - -**Round 6 (v3.5f) — 21/26.** Reverted the Round 5 BG↔RMP example. Kept the N/O↔RMP "actual measures" clarification from Round 5 (if a paragraph describes specific security measures the company implemented, it's RMP even in risk-factor framing). This stabilized at 21-22/26, with the 2-paragraph swing attributable to LLM non-determinism at temperature=0. - -### The 4 Irreducible Errors - -The remaining errors after Round 4/6 fall into two patterns: - -**BG over-call on process paragraphs (2 errors):** A paragraph describing monitoring methods (threat intelligence, security tools, detection capabilities) where a management committee (ERMC) is woven throughout as the entity being assisted. Content is clearly RMP but the committee mention triggers BG. These are genuinely dual-coded — the monitoring IS part of the committee's function. Human majority says RMP (2-1 in both cases). - -**N/O over-call on borderline RMP paragraphs (2 errors):** Paragraphs that describe risk management activities ("assessing, identifying, and managing material risks") but are framed as risk-factor discussions with threat enumeration. The SI tightening correctly stopped calling them SI, but they overcorrected to N/O instead of RMP. The N/O↔RMP boundary depends on whether the paragraph describes what the company DOES (→ RMP) vs. what risks it faces (→ N/O). These paragraphs do both. - -All 4 have human 2-1 splits — reasonable annotators disagree on these. 
Further prompt iteration risks over-fitting to these 4 specific paragraphs at the cost of breaking the other 355 correctly-classified ones. - -### The SI Rule: Assessment vs. Speculation - -The most important finding from the iteration is the distinction between materiality *assessments* and materiality *language*: - -| Pattern | Classification | Reasoning | -|---------|---------------|-----------| -| "have not materially affected our business strategy" | **SI** | Backward-looking conclusion — the company is reporting on actual impact | -| "reasonably likely to materially affect" | **SI** | Forward-looking with SEC qualifier — Item 106(b)(2) disclosure | -| "have not experienced material cybersecurity incidents" | **SI** | Negative assertion — materiality conclusion about past events | -| "could have a material adverse effect" | **NOT SI** | Generic speculation — appears in every 10-K, not an assessment | -| "managing material risks" | **NOT SI** | Adjective — "material" means "significant," not a materiality assessment | -| "For risks that may materially affect us, see Item 1A" | **NOT SI** | Cross-reference — pointing elsewhere, not making a conclusion | -| "...which could result in material losses" (at end of RMP paragraph) | **NOT SI** | Consequence clause — doesn't override the paragraph's primary purpose | - -This distinction reduced the Stage 1 correction set from ~1,014 to 308 paragraphs. The original broad flag ("any paragraph with the word 'material'") caught ~700 paragraphs that were correctly labeled N/O by Stage 1 — they contained generic "could have a material adverse effect" boilerplate that is NOT a materiality assessment. Only 180 paragraphs contain actual backward-looking or SEC-qualified assessments that v2.5 miscoded. - -### Final v3.5 Gold Re-Run - -After locking the prompt at v3.5f, all 7 models (Opus + 6 benchmark) were re-run on the 359 confusion-axis holdout paragraphs with the final prompt (~$18). 
v3.0 data preserved in original paths (`bench-holdout/`, `golden/`). v3.5f results stored separately (`bench-holdout-v35/`, `golden-v35/`). The v3.0→v3.5 comparison — per model, per axis — is itself a publishable finding about how prompt engineering systematically shifts classification boundaries in frontier LLMs. - -### The SI↔N/O Paradox — Resolved - -The v3.5f re-run showed a troubling result: SI↔N/O accuracy *dropped* 6pp vs v3.0 (60% vs 66%), with the H=SI/M=N/O asymmetry worsening from 20 to 25 cases. The initial hypothesis was that models became globally conservative when told to distinguish assessment from speculation. - -A paragraph-by-paragraph investigation of all 27 SI↔N/O errors revealed the opposite: **the models are correct, and the humans are systematically wrong.** - -Of the 25 H=SI / M=N/O cases: -- ~20 are pure "could have a material adverse effect" speculation, cross-references to Item 1A, or generic threat enumeration — none containing actual materiality assessments. All 6 models unanimously call N/O. -- ~3 are genuinely ambiguous (SPACs with assessment language, past disruption without explicit materiality language). -- ~2 are edge cases (negative assertions embedded at end of BG paragraphs). - -Of the 2 H=N/O / M=SI cases: -- Both contain clear negative assertions ("not aware of having experienced any prior material data breaches", "did not experience any cybersecurity incident during 2024") — textbook SI. All 6 models unanimously call SI. - -**Root cause of human error:** Annotators systematically treat ANY mention of "material" + "business strategy" + "financial condition" as SI — even when wrapped in pure speculation ("could," "if," "may"). The codebook's assessment-vs-speculation distinction is correct; humans weren't consistently applying it. 
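The "trust unanimous model consensus" finding can be operationalized as a mechanical adjudication rule. A hypothetical sketch (the function name and label strings are illustrative, not from the project's tooling):

```python
from collections import Counter

def adjudicate(human_labels, model_labels):
    """Prefer the human majority unless every model unanimously disagrees with it."""
    human_majority, _ = Counter(human_labels).most_common(1)[0]
    if len(set(model_labels)) == 1 and model_labels[0] != human_majority:
        return model_labels[0]  # e.g. 6/6 models call N/O on pure speculation
    return human_majority

print(adjudicate(["SI", "SI", "N/O"], ["N/O"] * 6))           # unanimous models override: N/O
print(adjudicate(["SI", "SI", "N/O"], ["SI"] * 5 + ["N/O"]))  # no unanimity: human majority SI stands
```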
- -**Codebook Case 9 contradiction fixed:** The investigation also discovered that Case 9 ("could potentially have a material impact" → SI) directly contradicted Rule 6 ("could = speculation, not assessment"). Case 9 has been corrected: the "could" example is now N/O, with explanation of why "reasonably likely to materially affect" (SEC qualifier) ≠ "could potentially" (speculation). - -Two minor prompt clarifications were added (consequence clause refinement for negative assertions, investment/resource SI signal) and tested on 83 SI↔N/O paragraphs ($0.55). Net effect: within stochastic noise — confirming the prompt was already correct. - -### Implications for Training - -- **Gold adjudication on SI↔N/O:** Trust model consensus over human majority. When 6/6 models unanimously agree and the paragraph contains only speculative language → use model label. Apply SI deterministically via regex for backward-looking assessments and SEC qualifiers. Expected impact: SI↔N/O accuracy rises from ~60% to ~95%+ against corrected gold labels. -- **Stage 2 judge** must use v3.5 prompt. This is where the codebook evolution actually matters for training data quality. -- **Stage 1 corrections re-flagged:** Tightened criteria reduced flagged paragraphs from 1,014 to 308 (180 materiality assessments + 128 SPACs). The 706 excluded paragraphs contained generic "could" boilerplate that was correctly labeled N/O by v2.5. -- **Gold adjudication on other axes:** On MR↔RMP and BG↔MR, v3.5 improves alignment with humans by ~4pp on hard cases but the improvement is more modest on easy cases. -- **MiniMax exclusion:** MiniMax M2.7 is a statistical outlier (z=−2.07 in inter-model agreement) and the most volatile model across prompt versions (40.7% category change rate). Data retained per assignment requirements but excluded from gold scoring majority. 
- -### Cost Ledger Update - -| Phase | Cost | Time | -|-------|------|------| -| v3.5 initial re-run (7 × 359) | ~$18 | ~10 min | -| v3.5 iteration (6 × 26 × GPT-5.4) | $1.02 | ~15 min | -| v3.5f final re-run (7 × 359) | ~$18 | ~10 min | -| SI↔N/O investigation (37 + 83 × GPT-5.4) | $0.55 | ~1 min | -| **v3.5 subtotal** | **~$37.57** | | -| **Running total API** | **~$202.57** | | - ---- - -## Lessons Learned - -### On Prompt Engineering -- Calibration examples beat rules. Each example targets a specific observed failure mode. -- Pilots must be large enough (500+). 40-sample pilots were misleadingly optimistic. -- More rules ≠ better. After the core structure is right, additional rules cause regression. -- The `specific_facts` chain-of-thought schema (forcing models to enumerate evidence before deciding) was the single most impactful structural change. -- **Rules over-correct before they converge.** The v3.5 iteration showed a consistent pattern: a new rule fixes the target problem but creates 2-3 new errors on adjacent cases. Each fix required a counter-fix. "Materiality language → SI" fixed the 23:0 asymmetry but created cross-reference false positives and speculation false positives that each required their own exception. Six rounds of test-fix-test were needed to reach equilibrium. -- **Targeted examples backfire.** Adding a specific example to a disambiguation rule ("CISO assists the ERMC in monitoring → RMP") caused regression elsewhere — models generalize from examples in unpredictable ways. General principles ("content matters more than names") are safer than specific examples in disambiguation guidance. -- **Assessment vs. language is a fundamental distinction.** The word "material" appears in thousands of SEC paragraphs but carries different force in different grammatical contexts. "Have not materially affected" (conclusion) vs. "could have a material adverse effect" (speculation) vs. "material risks" (adjective) are three different speech acts. 
Models don't naturally distinguish these without explicit guidance. -- **Check the humans — they can be systematically wrong.** On SI↔N/O, human annotators systematically over-called SI on any paragraph mentioning "material" + "business strategy," even when the language was pure speculation. The 25:2 asymmetry initially looked like model failure but was actually human failure to apply the assessment-vs-speculation distinction. When all 6 frontier models unanimously disagree with a 2/3 human majority, investigate before assuming the humans are right. The models' consistency (unanimous agreement across architectures and providers) is itself strong evidence. - -### On Model Selection -- Reasoning tokens are the strongest predictor of accuracy, not price or model size. -- Schema compliance varies — fix with Zod transforms, not prompt changes. -- Test both structured output AND tool calling for any candidate. They are not equivalent. - -### On Evaluation -- **Never evaluate against majority vote.** Build gold labels. Majority vote as ground truth makes models that rubber-stamp the majority look good. -- **Judge confidence is highly predictive** of accuracy. Use it to weight training samples. -- **Stage 1 confidence is useless** — cheap models are systematically overconfident (95%+ all-high). - -### On Data Quality at Scale -- The biggest wins come from understanding *where* and *why* models disagree, not from blanket improvements. -- Systematic model biases are quantifiable and predictable. Use them as signal, not noise. -- Codebook ambiguity causes more disagreement than model limitations. Three codebook rulings resolved more disputes than any prompt change. -- Not all labels need the same treatment. Confidence-stratified assembly beats uniform labeling. -- **Freeze originals, patch separately.** The single best data integrity decision was never modifying `paragraphs-clean.jsonl`. All fixes go through `.patched.jsonl` with the same UUIDs. 
This makes every change auditable, reversible, and safe to apply incrementally. Without this, the 6-patch iteration would have been terrifying. -- **Tag everything you can.** Generator metadata, quality tiers, and anomaly flags cost almost nothing to compute but make targeted remediation possible. Without generator tags, the 36.8% orphan rate in EFiling/XDX would have been invisible — diluted into a 4.7% corpus average. -- **Re-annotation is cheap and validating.** Re-running Stage 1 on 1,537 patched paragraphs cost $3.30 and took 9 minutes. It confirmed that 7.7% of consensus labels were wrong due to the data issue — an empirical validation that the patch was necessary, not just cosmetic. - -### On Training Infrastructure -- **Whole-word masking in `transformers` is broken for BPE tokenizers.** The upstream `DataCollatorForLanguageModeling(whole_word_mask=True)` uses `offset_mapping` to detect word boundaries by checking for gaps in character offsets. This fails silently for BPE tokenizers that absorb leading spaces — all offsets are contiguous, so the entire sequence becomes one "word." Loss appears to train but sits at ~6-8 (near-random). The fix is to use the tokenizer's `word_ids()` method, which correctly identifies word boundaries for any tokenizer type, and implement masking yourself. -- **Python 3.14 is not ready for ML.** Both `dill` (via `datasets`) and PyTorch's multiprocessing (`fork` → `forkserver`) have breaking incompatibilities. Rolling back to 3.13 was the only viable path. -- **Flash Attention is mandatory for long sequences.** Without FA2, ModernBERT at seq_len=8192 ran at ~47s/step on an RTX 3090. With FA2, the same configuration ran at ~25s/step — and enabled further optimizations (batch size increase, torch.compile) that pushed it further. -- **Align hyperparameters with the base model's pre-training config.** ModernBERT was trained with weight_decay=1e-5 and 30% MLM probability. 
Using the BERT/RoBERTa default of 0.01 weight decay would have been wrong. Both published ModernBERT DAPT papers (BioClinical, Patent) independently validated these values. -- **torch.compile + gradient_checkpointing together is more than the sum of its parts.** On ModernBERT, this combination resolves a memory anomaly specific to FA2 during MLM training (AnswerDotAI/ModernBERT#172), freeing VRAM for larger batch sizes. -- **Precompiled wheels save hours.** Building flash-attn from source requires matching CUDA toolkit versions, which is fragile. Precompiled wheels for the exact {python, torch, CUDA} combination avoid this entirely. -- **torch.compile's value can be memory, not speed.** When the bottleneck is opaque custom CUDA kernels (like FA2), torch.compile can't accelerate them. But it can still fuse the *surrounding* ops, dramatically reducing activation memory. In our case, compile provided 0% speedup but 35% memory reduction — enough to double the batch size. -- **Corpus subsampling is the biggest lever on consumer hardware.** When you're compute-bound, no software optimization can beat "process less data." The scaling laws literature (Ponnock 2025) provides empirical justification for stopping early. -- **At long sequence lengths, the GPU saturates at small batches.** Increasing batch from 2→4 at seq_len=8192 provided no s/step improvement on an RTX 3090 — the matmul dimensions are already large enough to fill all 82 SMs. This is the opposite of short-sequence fine-tuning where batch size scaling is the primary throughput lever. 
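The `word_ids()` fix described under "Whole-word masking in `transformers` is broken for BPE tokenizers" can be sketched as follows. This is a minimal illustration of the approach, not the project's actual `WholeWordMaskCollator`; the function name and interface are hypothetical, and it assumes `word_ids` comes from a fast tokenizer's `BatchEncoding.word_ids()`, where `None` marks special tokens:

```python
import random

def whole_word_mask(word_ids, mlm_probability=0.30, rng=None):
    """Boolean MLM mask over token positions that always masks entire words.

    word_ids: per-token word index from a fast tokenizer's .word_ids();
    None marks special tokens, which are never masked.
    """
    rng = rng or random.Random()
    # Group token positions by the word they belong to (skip specials).
    word_to_positions = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            word_to_positions.setdefault(wid, []).append(pos)
    mask = [False] * len(word_ids)
    for positions in word_to_positions.values():
        # Decide once per WORD, then apply to all of its sub-tokens,
        # so the BPE pieces of one word are always masked together.
        if rng.random() < mlm_probability:
            for pos in positions:
                mask[pos] = True
    return mask
```

Unlike the offset-gap heuristic in the upstream collator, this works whether or not the tokenizer absorbs leading spaces, because word boundaries come from the tokenizer itself rather than from character offsets.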
+Key v1 deliverables carried forward: +- 72,045-paragraph corpus with quality tiers +- DAPT checkpoint (eval loss 0.7250, perplexity 1.65) +- TAPT checkpoint (eval loss 1.0754, perplexity 2.11) +- Model census: 21+ models evaluated across 12 providers +- Human labeling webapp (labelapp) — will be updated for v2 codebook +- Empirical evidence for every v2 codebook decision --- diff --git a/docs/STATUS.md b/docs/STATUS.md index af27f5b..0c845b8 100644 --- a/docs/STATUS.md +++ b/docs/STATUS.md @@ -1,221 +1,195 @@ -# Project Status — 2026-04-02 (evening) +# Project Status — 2026-04-03 (v2 Reboot) -## What's Done +**Deadline:** 2026-04-24 (21 days) + +## What's Done (Carried Forward from v1) ### Data Pipeline -- [x] 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings -- [x] 14 filing generators identified, quality metrics per generator -- [x] 6 surgical patches applied (orphan words + heading stripping) +- [x] 72,045 paragraphs extracted from ~9,000 10-K + 207 8-K filings +- [x] 14 filing generators identified, 6 surgical patches applied - [x] Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%) -- [x] Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight) +- [x] 72 truncated filings identified and excluded - [x] All data integrity rules formalized (frozen originals, UUID-linked patches) -### GenAI Labeling (Stage 1) -- [x] Prompt v2.5 locked after 12+ iterations -- [x] 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast -- [x] 150,009 annotations completed ($115.88, 0 failures) -- [x] Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into `stage1.patched.jsonl` -- [x] Codebook v3.0 with 3 major rulings +### Pre-Training +- [x] DAPT: 1 epoch on 500M tokens, eval loss 0.7250, ~14.5h on RTX 3090 +- [x] TAPT: 5 epochs on 72K paragraphs, eval loss 1.0754, ~50 min on RTX 3090 +- [x] Custom `WholeWordMaskCollator` (upstream broken for BPE) +- [x] Checkpoints: 
`checkpoints/dapt/` and `checkpoints/tapt/` -### DAPT + TAPT Pre-Training -- [x] DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped) -- [x] DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090. -- [x] DAPT checkpoint at `checkpoints/dapt/modernbert-large/final/` -- [x] TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08. -- [x] TAPT checkpoint at `checkpoints/tapt/modernbert-large/final/` -- [x] Custom `WholeWordMaskCollator` (upstream `transformers` collator broken for BPE tokenizers) -- [x] Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility) -- [x] Procedure documented in `docs/DAPT-PROCEDURE.md` +### v1 Labeling (preserved, not used for v2 training) +- [x] 150K Stage 1 annotations (v2.5 prompt, $115.88) +- [x] 10-model benchmark (8 suppliers, $45.63) +- [x] Human labeling: 6 annotators × 600 paragraphs, category α=0.801, specificity α=0.546 +- [x] Gold adjudication: 13-signal cross-analysis, 5-tier adjudication +- [x] Codebook v1.0→v3.5 iteration (12+ prompt versions, 6 v3.5 rounds) +- [x] All v1 data preserved at original paths + `docs/NARRATIVE-v1.md` -### Human Labeling — Complete -- [x] All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3) -- [x] BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators -- [x] Full data export: raw labels, timing, quiz sessions, metrics → `data/gold/` -- [x] Comprehensive IRR analysis → `data/gold/charts/` +### v2 Codebook (this session) +- [x] LABELING-CODEBOOK.md v2: broadened Level 2, 1+ QV, "what question?" 
test +- [x] CODEBOOK-ETHOS.md: full reasoning, worked edge cases +- [x] NARRATIVE.md: data/pretraining carried forward, pivot divider, v2 section started +- [x] STATUS.md: this document -| Metric | Category | Specificity | Both | -|--------|----------|-------------|------| -| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% | -| Krippendorff's α | 0.801 | 0.546 | — | -| Avg Cohen's κ | 0.612 | 0.440 | — | +--- -### Prompt v3.0 -- [x] Codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP -- [x] Prompt version bumped from v2.5 → v3.0 +## What's Next (v2 Pipeline) -### GenAI Holdout Benchmark — Complete -- [x] 6 benchmark models + Opus 4.6 on the 1,200 holdout paragraphs -- [x] All 1,200 annotations per model (0 failures after minimax/kimi fence-stripping fix) -- [x] Total benchmark cost: $45.63 +### Step 1: Codebook Finalization ← CURRENT +- [x] Draft v2 codebook with systemic changes +- [x] Draft codebook ethos with full reasoning +- [ ] Get group approval on v2 codebook (share both docs) +- [ ] Incorporate any group feedback -| Model | Supplier | Cost | Cat % vs Opus | Both % vs Opus | -|-------|----------|------|---------------|----------------| -| openai/gpt-5.4 | OpenAI | $6.79 | 88.2% | 79.8% | -| google/gemini-3.1-pro-preview | Google | $16.09 | 87.4% | 80.0% | -| moonshotai/kimi-k2.5 | Moonshot | $7.70 | 85.1% | 76.8% | -| z-ai/glm-5:exacto | Zhipu | $6.86 | 86.2% | 76.5% | -| xiaomi/mimo-v2-pro:exacto | Xiaomi | $6.59 | 85.7% | 76.3% | -| minimax/minimax-m2.7:exacto | MiniMax | $1.61 | 82.8% | 63.6% | -| anthropic/claude-opus-4.6 | Anthropic | $0 | — | — | +### Step 2: Prompt Iteration (dev set) +- [ ] Draw ~200 paragraph dev set from existing Stage 1 labels (stratified, separate from holdout) +- [ ] Update Stage 1 prompt to match v2 codebook +- [ ] Run 2-3 models on dev set, analyze results +- [ ] Iterate prompt against judge panel until reasonable consensus +- [ ] Update codebook with any rulings needed (should 
be minimal if rules are clean) +- [ ] Re-approval if codebook changed materially +- **Estimated cost:** ~$5-10 +- **Estimated time:** 1-2 sessions -Plus Stage 1 panel already on file = **10 models, 8 suppliers**. +### Step 3: Stage 1 Re-Run +- [ ] Lock v2 prompt +- [ ] Re-run Stage 1 on full corpus (~50K paragraphs × 3 models) +- [ ] Distribution check: verify Level 2 grew to ~20%, category distribution healthy +- [ ] If distribution is off → iterate codebook/prompt before proceeding +- **Estimated cost:** ~$120 +- **Estimated time:** ~30 min execution -### 13-Signal Cross-Source Analysis — Complete -- [x] 30 diagnostic charts generated → `data/gold/charts/` -- [x] Leave-one-out analysis (no model privileged as reference) -- [x] Adjudication tier breakdown computed +### Step 4: Holdout Selection +- [ ] Draw stratified holdout from new Stage 1 labels + - ~170 per category class × 7 ≈ 1,190 + - Random within each stratum (NOT difficulty-weighted) + - Secondary constraint: minimum ~100 per specificity level + - Exclude dev set paragraphs +- [ ] Draw separate AI-labeled extension set (up to 20K) if desired +- **Depends on:** Step 3 complete + distribution check passed -**Adjudication tiers (13 signals per paragraph):** +### Step 5: Labelapp Update +- [ ] Update quiz questions for v2 codebook (new Level 2 definition, 1+ QV, "what question?" 
test) +- [ ] Update warmup paragraphs with v2 examples +- [ ] Update codebook sidebar content +- [ ] Load new holdout paragraphs into labelapp +- [ ] Generate new BIBD assignments (3 of 6 annotators per paragraph) +- [ ] Test the full flow (quiz → warmup → labeling) +- **Depends on:** Step 4 complete -| Tier | Count | % | Rule | -|------|-------|---|------| -| 1 | 756 | 63.0% | 10+/13 agree on both dimensions → auto gold | -| 2 | 216 | 18.0% | Human + GenAI majorities agree → cross-validated | -| 3 | 26 | 2.2% | Humans split, GenAI converges → expert review | -| 4 | 202 | 16.8% | Universal disagreement → expert review | +### Step 6: Parallel Labeling +- [ ] **Humans:** Tell annotators to start labeling v2 holdout +- [ ] **Models:** Run full benchmark panel on holdout (10+ models, 8+ suppliers) + - Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast) + - Benchmark panel (gpt-5.4, gemini-pro, kimi-k2.5, glm-5, mimo-v2-pro, minimax-m2.7) + - Opus 4.6 via Anthropic SDK (new addition, treated as another benchmark model) +- **Estimated model cost:** ~$45 +- **Estimated human time:** 2-3 days (600 paragraphs per annotator) +- **Depends on:** Step 5 complete -**Leave-one-out ranking (each source vs majority of other 12):** +### Step 7: Gold Set Assembly +- [ ] Compute human IRR (target: category α > 0.75, specificity α > 0.67) +- [ ] Gold = majority vote (where all 3 disagree, model consensus tiebreaker) +- [ ] Validate gold against model panel — check for systematic human errors (learned from v1 SI↔N/O) +- **Depends on:** Step 6 complete (both humans and models) -| Rank | Source | Cat % | Spec % | Both % | -|------|--------|-------|--------|--------| -| 1 | Opus 4.6 | 92.6 | 90.8 | 84.0 | -| 2 | Kimi K2.5 | 91.6 | 91.1 | 83.3 | -| 3 | Gemini Pro | 91.1 | 90.1 | 82.3 | -| 4 | GPT-5.4 | 91.4 | 88.8 | 82.1 | -| 8 | H:Xander (best human) | 91.3 | 83.9 | 76.9 | -| 16 | H:Aaryan (outlier) | 59.1 | 24.7 | 15.8 | +### Step 8: Stage 2 (if needed) +- [ ] Bench Stage 2 
adjudication accuracy against gold +- [ ] If Stage 2 adds value → iterate prompt, run on disputed Stage 1 paragraphs +- [ ] If Stage 2 adds minimal value → document finding, skip production run +- **Estimated cost:** ~$20-40 if run +- **Depends on:** Step 7 complete -**Key finding:** Opus earns the #1 spot through leave-one-out — it's not special because we designated it as gold; it genuinely disagrees with the crowd least (7.4% odd-one-out rate). +### Step 9: Training Data Assembly +- [ ] Unanimous Stage 1 labels → full weight +- [ ] Calibrated majority labels → full weight +- [ ] Judge high-confidence (if Stage 2 run) → full weight +- [ ] Quality tier weights: clean/headed/minor = 1.0, degraded = 0.5 +- [ ] Nuke 72 truncated filings +- **Depends on:** Step 8 complete -### Codebook v3.5 & Prompt Iteration — Complete -- [x] Cross-analysis: GenAI vs human systematic errors identified (SI↔N/O 23:0, MR↔RMP 38:13, BG↔MR 33:6) -- [x] v3.5 rulings: SI materiality assessment test, BG purpose test, MR↔RMP 3-step chain -- [x] v3.5 gold re-run: 7 models × 359 confusion-axis holdout paragraphs ($18) -- [x] 6 rounds prompt iteration on 26 regression paragraphs ($1.02): v3.0=18/26 → v3.5=22/26 -- [x] SI rule tightened: "could have material adverse effect" = NOT SI (speculation, not assessment) -- [x] Cross-reference exception: materiality language in cross-refs = N/O -- [x] BG threshold: one-sentence committee mention doesn't flip to BG -- [x] Stage 1 corrections flagged: 308 paragraphs (180 materiality + 128 SPACs) -- [x] Prompt locked at v3.5, codebook updated, version history documented -- [x] SI↔N/O paradox investigated and resolved: models correct, humans systematically over-call SI on speculation -- [x] Codebook Case 9 contradiction with Rule 6 fixed ("could" example → N/O) -- [x] Gold adjudication strategy for SI↔N/O defined: trust model consensus, apply SI via regex for assessments +### Step 10: Fine-Tuning +- [ ] Ablation matrix: {base, +DAPT, +DAPT+TAPT} × {±class 
weighting} × {CE vs focal loss} +- [ ] Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal) +- [ ] Ordinal regression (CORAL) for specificity +- [ ] SCL for boundary separation (optional, if time permits) +- **Estimated time:** 12-20h GPU +- **Depends on:** Step 9 complete -| Data asset | Location | -|-----------|----------| -| v3.5 bench annotations | `data/annotations/bench-holdout-v35/*.jsonl` (7 models × 359) | -| v3.5 Opus annotations | `data/annotations/golden-v35/opus.jsonl` (359) | -| Stage 1 correction flags | `data/annotations/stage1-corrections.jsonl` (308) | -| Holdout re-run IDs | `data/gold/holdout-rerun-v35.jsonl` (359) | +### Step 11: Evaluation & Paper +- [ ] Macro F1 on holdout (target: > 0.80 for both heads) +- [ ] Per-class F1 breakdown +- [ ] Full GenAI benchmark table (10+ models × holdout) +- [ ] Cost/time/reproducibility comparison +- [ ] Error analysis on hardest cases +- [ ] IGNITE slides (20 slides, 15s each) +- [ ] Python notebooks for replication (assignment requirement) +- **Depends on:** Step 10 complete -### Gold Set Adjudication v1 — Complete -- [x] Aaryan redo integrated: 50.3% of labels changed, α 0.801→0.825 (cat), 0.546→0.661 (spec) -- [x] Old Aaryan labels preserved in `data/gold/human-labels-aaryan-v1.jsonl` -- [x] Cross-axis systematic error analysis: models correct ~85% on MR↔RMP, MR↔BG, RMP↔BG, TP↔RMP, SI↔N/O -- [x] 5-tier adjudication: T1 super-consensus (911), T2 cross-validated (108), T3 rule-based (30), T4 model-unanimous (59), T5 plurality (92) -- [x] 30 rule-based overrides (27 SI↔N/O + 3 T5 codebook resolutions) +--- -### Gold Set Adjudication v2 — Complete (T5 deep analysis) -- [x] Full model disagreement analysis: 6-model vote vectors on all 1,200 paragraphs -- [x] Gemini identified as systematic MR outlier (z≈+2.3, 302 MR vs ~192 avg, drives 45% MR↔RMP confusion) -- [x] Gemini exclusion experiment: NULL RESULT at T5 (human MR bias makes it redundant; 
tiering already neutralizes at T4) -- [x] v3.5 prompt impact: unanimity 25%→60%, but created new BG↔RMP hotspot (+171%) -- [x] **Text-based BG vote removal**: automated, verifiable — if "board" absent from text, BG model votes removed. 13 labels corrected, source accuracy UP for 10/12 sources -- [x] **10 new codebook tiebreaker overrides**: ID↔SI (negative assertions), SPAC rule, board-removal test, committee-level test -- [x] **Specificity hybrid**: human unanimous → human label, human split → model majority. 195 specificity labels updated -- [x] All changes validated experimentally (one variable at a time, acceptance criteria checked) -- [x] T5: 92 → 85, gold≠human: 151 → 144 +## Timeline Estimate -| Source | Accuracy vs Gold (v1) | Accuracy vs Gold (v2) | Δ | -|--------|----------------------|----------------------|---| -| Xander | 91.0% | 91.5% | +0.5% | -| Opus | 88.6% | 89.1% | +0.5% | -| GPT-5.4 | 87.4% | 88.5% | +1.1% | -| GLM-5 | 86.0% | 86.5% | +0.5% | -| Elisabeth | 85.8% | 86.5% | +0.7% | -| MIMO | 85.8% | 86.2% | +0.5% | -| Meghan | 85.3% | 86.0% | +0.7% | -| Kimi | 84.5% | 84.9% | +0.4% | -| Gemini | 84.0% | 84.6% | +0.6% | -| Joey | 80.7% | 80.2% | -0.5% | -| Aaryan | 75.2% | 74.2% | -1.0% | -| Anuj | 69.3% | 69.7% | +0.3% | +| Step | Days | Cumulative | +|------|------|-----------| +| 1. Codebook approval | 1 | 1 | +| 2. Prompt iteration | 2 | 3 | +| 3. Stage 1 re-run | 0.5 | 3.5 | +| 4. Holdout selection | 0.5 | 4 | +| 5. Labelapp update | 1 | 5 | +| 6. Parallel labeling | 3 | 8 | +| 7. Gold assembly | 1 | 9 | +| 8. Stage 2 (if needed) | 1 | 10 | +| 9. Training data assembly | 0.5 | 10.5 | +| 10. Fine-tuning | 3-5 | 13.5-15.5 | +| 11. 
Evaluation + paper | 3-5 | 16.5-20.5 | -| Data asset | Location | -|-----------|----------| -| Adjudicated gold labels | `data/gold/gold-adjudicated.jsonl` (1,200) | -| Old Aaryan labels | `data/gold/human-labels-aaryan-v1.jsonl` (600) | -| Adjudication charts | `data/gold/charts/gold-*.png` (4 charts) | -| Adjudication script | `scripts/adjudicate-gold.py` (v2) | -| Experiment harness | `scripts/adjudicate-gold-experiment.py` | -| T5 analysis docs | `docs/T5-ANALYSIS.md` | +**Buffer:** 0.5-4.5 days. Tight but feasible if Steps 1-5 execute cleanly. -## What's Next (in dependency order) +--- -### 1. (Optional) Manual review of remaining 85 T5-plurality paragraphs -- 85 paragraphs resolved by signal plurality — lowest confidence tier -- 71% on the BG↔MR↔RMP triangle (irreducible ambiguity) -- 62 have weak plurality (4-5/9) -- Review could improve the gold set by ~1-3%, but with diminishing returns +## Rubric Checklist (Assignment) -### 2. Stage 2 re-eval on training data -- Pilot gpt-5.4-mini vs gpt-5.4 on holdout validation sample -- Run on 308 flagged Stage 1 corrections (180 materiality + 128 SPACs) -- Also run standard Stage 2 judge on existing disagreements with v3.5 prompt +### C (F1 > .80): the goal +- [ ] Fine-tuned model with F1 > .80 — category likely, specificity needs v2 broadening +- [x] Performance comparison GenAI vs fine-tuned — 10 models benchmarked (will re-run on v2 holdout) +- [x] Labeled datasets — 150K Stage 1 + 1,200 gold (v1; will redo for v2) +- [x] Documentation — extensive +- [ ] Python notebooks for replication -### 3. 
Training data assembly -- Unanimous Stage 1 labels (35,204 paragraphs) → full weight -- Calibrated majority labels (~9-12K) → full weight -- Judge high-confidence labels (~2-3K) → full weight -- Quality tier weights: clean/headed/minor=1.0, degraded=0.5 +### B (3+ of 4): already have all 4 +- [x] Cost, time, reproducibility — dollar amounts for every API call +- [x] 6+ models, 3+ suppliers — 10 models, 8 suppliers (+ Opus in v2) +- [x] Contemporary self-collected data — 72K paragraphs from SEC EDGAR +- [x] Compelling use case — SEC cyber disclosure quality assessment -### 4. Fine-tuning + ablations -- 8+ experiments: {base, +DAPT, +DAPT+TAPT} × {±SCL} × {±class weighting} -- Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal) -- Focal loss / class-weighted CE for category imbalance -- Ordinal regression (CORAL) for specificity +### A (3+ of 4): have 3, working on 4th +- [x] Error analysis — T5 deep-dive, confusion axis analysis, model reasoning examination +- [x] Mitigation strategy — v1→v2 codebook evolution, experimental validation +- [ ] Additional baselines — dictionary/keyword approach (specificity IS/NOT lists as baseline) +- [x] Comparison to amateur labels — annotator before/after, human vs model agreement analysis -### 5. Evaluation + paper -- Macro F1 + per-class F1 on holdout (must exceed 0.80 for category) -- Full GenAI benchmark table (10 models × 1,200 holdout) -- Cost/time/reproducibility comparison -- Error analysis on Tier 4 paragraphs (A-grade criterion) -- IGNITE slides (20 slides, 15s each) - -## Parallel Tracks - -``` -Track A (GPU): DAPT ✓ → TAPT ✓ ─────────────────────────────→ Fine-tuning → Eval - ↑ -Track B (API): Opus re-run ✓─┐ │ - ├→ v3.5 re-run ✓ → SI paradox ✓ ───┐ │ -Track C (API): 6-model bench ✓┘ │ │ - Gold adjud. 
✓ ┤ │ -Track E (API): v3.5 prompt ✓ → S1 flags ✓ → Stage 2 re-eval ───┘───┘ - -Track D (Human): Labeling ✓ → IRR ✓ → 13-signal ✓ → Aaryan redo ✓ -``` +--- ## Key File Locations | What | Where | |------|-------| +| v2 codebook | `docs/LABELING-CODEBOOK.md` | +| v2 codebook ethos | `docs/CODEBOOK-ETHOS.md` | +| v2 narrative | `docs/NARRATIVE.md` | +| v1 codebook (preserved) | `docs/LABELING-CODEBOOK-v1.md` | +| v1 narrative (preserved) | `docs/NARRATIVE-v1.md` | +| Strategy notes | `docs/STRATEGY-NOTES.md` | +| Paragraphs | `data/paragraphs/paragraphs-clean.jsonl` (72,045) | | Patched paragraphs | `data/paragraphs/paragraphs-clean.patched.jsonl` (49,795) | -| Patched annotations | `data/annotations/stage1.patched.jsonl` (150,009) | -| Quality scores | `data/paragraphs/quality/quality-scores.jsonl` (72,045) | -| Human labels (raw) | `data/gold/human-labels-raw.jsonl` (3,600 labels) | -| Human label metrics | `data/gold/metrics.json` | -| Holdout paragraphs | `data/gold/paragraphs-holdout.jsonl` (1,200) | -| Diagnostic charts | `data/gold/charts/*.png` (30 charts) | -| Opus golden labels | `data/annotations/golden/opus.jsonl` (1,200) | -| Benchmark annotations | `data/annotations/bench-holdout/{model}.jsonl` (6 × 1,200) | -| Original sampled IDs | `labelapp/.sampled-ids.original.json` (1,200 holdout PIDs) | -| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` (14,756 docs) | +| v1 Stage 1 annotations | `data/annotations/stage1.patched.jsonl` (150,009) | +| v1 gold labels | `data/gold/gold-adjudicated.jsonl` (1,200) | +| v1 human labels | `data/gold/human-labels-raw.jsonl` (3,600) | +| v1 benchmark annotations | `data/annotations/bench-holdout/*.jsonl` | | DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` | | TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` | -| v3.5 bench annotations | `data/annotations/bench-holdout-v35/*.jsonl` (7 × 359) | -| v3.5 Opus golden | `data/annotations/golden-v35/opus.jsonl` (359) | -| Stage 1 correction flags | 
`data/annotations/stage1-corrections.jsonl` (1,014) | -| Holdout re-run IDs | `data/gold/holdout-rerun-v35.jsonl` (359) | -| Analysis script | `scripts/analyze-gold.py` (30-chart, 13-signal analysis) | -| Data dump script | `labelapp/scripts/dump-all.ts` | +| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` | +| Stage 1 prompt | `ts/src/label/prompts.ts` | +| Annotation runner | `ts/src/label/annotate.ts` | +| Labelapp | `labelapp/` |
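Step 10 plans CORAL ordinal regression for the 4-level specificity head. At inference time the standard CORAL formulation reduces to counting passed thresholds: the head emits K-1 binary logits for P(level > k), and the predicted level is the number of those probabilities above 0.5. A minimal sketch of that decoding step, assuming the standard CORAL formulation rather than any project code (`coral_decode` is an illustrative name; the dual-head model itself is not shown):

```python
import math

def coral_decode(logits):
    """Map K-1 CORAL binary logits P(level > k) to an ordinal level in 0..K-1.

    With 4 specificity levels, the head emits 3 logits; the predicted
    level is the count of thresholds whose sigmoid exceeds 0.5.
    """
    return sum(1 for z in logits if 1.0 / (1.0 + math.exp(-z)) > 0.5)
```

Because the K-1 binary classifiers share weights and differ only in bias, CORAL guarantees the predicted probabilities are rank-consistent, which is why a simple threshold count is a valid decoder.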