new codebook and ethos

Joey Eamigh 2026-04-04 15:01:20 -04:00
parent d653ed9a20
commit 1f2d748a1d
7 changed files with 3038 additions and 2025 deletions


@@ -12,7 +12,11 @@ Bun workspace monorepo. Three packages:
|------|-------|
| Shared schemas (Zod) | `packages/schemas/src/` |
| Labeling codebook (source of truth for all category/specificity definitions) | `docs/LABELING-CODEBOOK.md` |
| Codebook ethos (reasoning behind every codebook decision) | `docs/CODEBOOK-ETHOS.md` |
| Project narrative (decisions, roadblocks, lessons) | `docs/NARRATIVE.md` |
| Project status & todo list | `docs/STATUS.md` |
| v1 codebook (preserved) | `docs/LABELING-CODEBOOK-v1.md` |
| v1 narrative (preserved) | `docs/NARRATIVE-v1.md` |
| Implementation plan for labelapp | `docs/labelapp-plan.md` |
| Labelapp-specific agent guide | `labelapp/AGENTS.md` |
| Docker compose (Postgres) | `docker-compose.yaml` (root) |

docs/CODEBOOK-ETHOS.md Normal file

@@ -0,0 +1,319 @@
# Codebook Ethos — Design Reasoning & Edge Case Analysis
This document explains the reasoning behind every decision in the labeling codebook. It is the training companion for human annotators and the design record for the project. If you read this document and disagree with anything in the codebook, flag it — we want to resolve disagreements here, not at labeling time.
---
## Why This Document Exists
The codebook (LABELING-CODEBOOK.md) tells you WHAT to do. This document tells you WHY. The distinction matters because:
1. **Models need clean instructions.** The codebook is designed to go directly into an LLM system prompt. Extra explanation creates context pollution and can cause models to overfit to edge case reasoning rather than applying general rules.
2. **Humans need understanding.** A human annotator who understands the reasoning behind a rule will correctly handle novel edge cases that the rule doesn't explicitly cover. A human who only knows the rule will freeze on ambiguity or make inconsistent judgment calls.
3. **Decisions need documentation.** Every bright line in the codebook represents a deliberate choice. Documenting the reasoning makes those choices auditable, revisable, and defensible in the final paper.
---
## Why v2? What Changed from v1
The v1 codebook (preserved at `docs/LABELING-CODEBOOK-v1.md`) was built over 12+ prompt iterations and served through 150K Stage 1 annotations, a 6-person human labeling round, and a 10-model benchmark. It worked — but it had structural problems that became visible only at evaluation time:
### Problem 1: Specificity Level 2 was too narrow
The professor's construct defines Level 2 as "Sector-adapted — references industry but no firm-specific details." Our v1 codebook interpreted this as "names a specific recognized standard (NIST, ISO 27001, SOC 2, etc.)." That interpretation was too literal. Things like penetration testing, vulnerability scanning, SIEM, phishing simulations — these are all cybersecurity industry practices that a security professional instantly recognizes as domain-specific. Our codebook classified them as Level 1 (generic boilerplate), which squeezed Level 2 down to 3.9% of the holdout (47 samples).
At 47 samples, a swing of just ±3 correct predictions moves F1 by ~0.06 (≈3/47). The measurement is far too noisy for reliable per-class evaluation.
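A rough sanity check on that noise claim. The confusion counts below are illustrative, not the real holdout numbers; they assume each flipped borderline paragraph costs the class a true positive and adds a false positive:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Per-class F1 from true positives, false positives, false negatives."""
    return 2 * tp / (2 * tp + fp + fn)

# Illustrative baseline for a 47-sample class: 35/47 recalled,
# with 12 false positives leaking in from other classes.
base = f1(tp=35, fp=12, fn=12)

# Flip 3 borderline paragraphs: each costs a TP and adds an FP.
worse = f1(tp=32, fp=15, fn=15)

print(round(base - worse, 3))  # 0.064 — the ~3/47 swing
```

With larger class mass the same three flips barely register, which is the whole argument for giving Level 2 real mass in v2.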
**v2 fix:** Level 2 is now "Domain-Adapted" — uses cybersecurity domain terminology recognizable to a security professional, not just named standards. The projected distribution shifts from ~44/4/37/14 to ~25/20/37/18. Every class has real mass.
### Problem 2: Level 4 required 2+ QV facts (counting problem)
The professor's construct says: "(4) Quantified and verifiable — includes specific metrics, dollar amounts, incident timelines, or third-party audit references." That's a list of qualifying facts, not a "count two" rule. Our v1 codebook added the 2-fact threshold, which created a narrow Level 4 (14.1%) and forced annotators into a counting exercise that was error-prone and contentious.
**v2 fix:** 1+ QV-eligible fact → Level 4. No counting. The bright line is: "Can an external party independently verify this claim?" One verifiable dollar amount, one named third-party firm, one specific date — any of these is already more informative than a paragraph without them.
### Problem 3: The BG/MR/RMP triangle was patched, not fixed
v1 accumulated six decision rules and ten borderline cases — many were patches for systemic ambiguity rather than clean rules. The v3.0 person-vs-function test and v3.5 three-step decision chain were good ideas, but they were bolted on as rulings to an unchanged set of definitions. Models had to process increasingly complex instructions with diminishing returns.
**v2 fix:** The "What question does this paragraph answer?" test replaces the patchwork. MR's headline is now "How is management organized to handle cybersecurity?" — broader than "who a specific person is" (which missed paragraphs about management structure without named individuals) and clearer than a multi-step mechanical test. The person-removal test survives as a confirmation tool, not the primary rule.
### Problem 4: The holdout was adversarial by design
v1's holdout was stratified to OVER-SAMPLE confusion-axis paragraphs. This was great for codebook development (stress-testing rules on hard cases) but terrible for evaluation (inflating error rates and depressing F1). Combined with the narrow Level 2, this created a structurally unfavorable evaluation set.
**v2 fix:** Random stratified sample — equal per category class, random within each stratum. Hard cases are represented at their natural frequency, not overweighted.
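The v2 sampling scheme reduces to "group by category, draw uniformly within each group." A minimal sketch (field and function names are assumptions, not the project's actual pipeline):

```python
import random
from collections import defaultdict

def stratified_sample(paragraphs, key, per_stratum, seed=0):
    """Equal-size random sample per stratum: group by `key`, then draw
    `per_stratum` items uniformly at random within each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for p in paragraphs:
        strata[key(p)].append(p)
    sample = []
    for label, items in sorted(strata.items()):
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample

# Toy usage: 3 categories, 5 per category.
pool = [{"id": i, "category": c} for i, c in enumerate(["BG", "MR", "RMP"] * 50)]
holdout = stratified_sample(pool, key=lambda p: p["category"], per_stratum=5)
print(len(holdout))  # 15
```

Because the draw is uniform within each stratum, confusion-axis paragraphs land in the holdout at their natural frequency rather than being deliberately overweighted.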
---
## Category Reasoning
### Why "What Question Does This Paragraph Answer?"
Previous approaches tried to classify based on surface features: grammatical subjects, keyword presence, mechanical tests. These worked for clear-cut cases but failed on the governance chain (Board → Committee → Officer → Program) that appears in thousands of SEC filings.
The "what question?" test works because it asks about communicative PURPOSE, not surface features. A paragraph that chains "The Audit Committee oversees... our CISO reports quarterly... the program includes penetration testing" has keywords from all three of BG, MR, and RMP. The question test cuts through: what is this paragraph TRYING TO TELL YOU? It's trying to tell you how oversight works. → BG.
This is also the test that humans naturally apply. When you read a paragraph and "just know" it's about governance vs. process, you're implicitly asking what the paragraph's purpose is. The codebook now makes that implicit test explicit.
### The Board Governance / Management Role Boundary
**The core issue:** SEC Item 106(c) has two parts — (c)(1) covers board oversight and (c)(2) covers management's role. Many filings interleave them in a single paragraph.
**The rule:** Governance-chain paragraphs default to BG. They become MR only when management's organizational role is the primary content.
**Why this default?** Because the governance chain exists TO DESCRIBE OVERSIGHT. When a paragraph says "The Audit Committee oversees our cybersecurity program. Our CISO reports quarterly to the Committee on threat landscape and program effectiveness," the paragraph is explaining how oversight works. The CISO is the mechanism through which the board gets information — the paragraph is about the board's oversight structure, not about the CISO as a person or management's organizational role.
MR captures something different: it answers "how is management organized?" This includes:
- Who holds cybersecurity responsibilities and how those responsibilities are divided
- What qualifies those people (credentials, experience, background)
- How management-level structures work (steering committees, reporting lines between officers)
- The identity and background of specific individuals
A paragraph about the CISO's 20 years of experience, CISSP certification, and team of 12 → MR. A paragraph about the board receiving quarterly reports from the CISO → BG. Same person mentioned, different purpose.
**The directionality heuristic (confirmation tool, not primary rule):**
- Board → Management (describing governance structure flowing down) → BG
- Management → Board (describing reporting relationship flowing up) → usually BG (the board is still the focus as the recipient)
- Management → Management (how roles are divided, who reports to whom in management) → MR
- Either mentioned, but most content is about actual processes → RMP
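As a confirmation tool, the heuristic is simple enough to state as a lookup. A toy sketch — extracting the direction and judging process dominance is the human step, not shown here:

```python
# Maps (who is described acting, toward whom) -> default category.
DIRECTION_DEFAULT = {
    ("board", "management"): "BG",       # governance structure flowing down
    ("management", "board"): "BG",       # reporting up: board is still the focus
    ("management", "management"): "MR",  # role division within management
}

def directionality_hint(source: str, target: str, process_dominant: bool) -> str:
    """Confirmation heuristic only; the primary rule is the 'what question?' test."""
    if process_dominant:
        return "RMP"  # most content describes actual processes
    return DIRECTION_DEFAULT.get((source, target), "RMP")

print(directionality_hint("management", "board", process_dominant=False))  # BG
```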
### The Management Role / Risk Management Process Boundary
**The core issue:** This was the #1 disagreement axis in v1 (2,290 disputes). The pattern is always the same: a paragraph names a CISO/CIO/CTO in the opening clause, then describes what the cybersecurity program does. Is it about the person or the program?
**The person-removal test:** Remove all person-specific content. If a substantive description remains → RMP. This works because:
- If the paragraph is ABOUT the program, removing the person who oversees it leaves the program description intact
- If the paragraph is ABOUT the person, removing their details leaves nothing meaningful
**Why this test and not a noun count or keyword list:** We tried mechanical approaches in v1 (step-by-step decision chains, grammatical subject tests). They worked for easy cases but made hard cases harder — annotators had to run through a mental flowchart instead of reading the paragraph naturally. The person-removal test is a single thought experiment that maps to what humans already do intuitively.
**The remaining hard case — management committee with process details:**
> "Our Cybersecurity Steering Committee, comprising the CISO, CIO, CFO, and General Counsel, meets monthly to review cybersecurity risks, assess emerging threats, and oversee our vulnerability management and incident response programs."
Person-removal test: remove committee membership → "monthly to review cybersecurity risks, assess emerging threats, and oversee vulnerability management and incident response programs." Still has content, but it's thin — the committee structure IS the primary content. → MR.
If the paragraph instead spent three more sentences describing how the vulnerability management program works → RMP (process becomes dominant). The test scales with paragraph length naturally.
### The Strategy Integration / None/Other Boundary
**The core issue:** v1 had 1,094 disputes on this axis, almost all from materiality disclaimers. The sentence "risks have not materially affected our business strategy, results of operations, or financial condition" appears in thousands of filings. Is it SI (a materiality assessment) or N/O (boilerplate)?
**The rule:** It's SI. Even though the language is generic, the company IS fulfilling its SEC Item 106(b)(2) obligation to assess whether cyber risks affect business strategy. Category captures WHAT the paragraph discloses (a materiality assessment). Specificity captures HOW specific it is (generic boilerplate = Level 1). These are independent dimensions.
**The "could" vs. "have not" distinction:** This is a linguistic bright line, not a judgment call.
- "Have not materially affected" → past tense, definitive statement → assessment → SI
- "Are reasonably likely to materially affect" → SEC's required forward-looking language → assessment → SI
- "Could have a material adverse effect" → conditional, hypothetical → speculation → N/O (or classify by other content)
The keyword is "reasonably likely" — that's the SEC's Item 106(b)(2) threshold. "Could" is the generic risk-factor language that appears in every 10-K regardless of actual risk level.
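The bright line is mechanical enough to sketch as ordered pattern checks. The patterns are illustrative only — real filings vary in wording, and this is a hint, not the annotation rule itself:

```python
import re

def materiality_hint(text: str) -> str:
    """SI (assessment) vs. N/O (speculation) per the 'could' vs. 'have not' line."""
    t = text.lower()
    if re.search(r"\bhave not materially affected\b", t):
        return "SI"   # past-tense, definitive assessment
    if re.search(r"\breasonably likely to materially affect\b", t):
        return "SI"   # SEC Item 106(b)(2) forward-looking threshold
    if re.search(r"\bcould\b.*\bmaterial\b", t):
        return "N/O"  # conditional, hypothetical speculation
    return "N/O"

print(materiality_hint("Cybersecurity risks could have a material adverse effect."))  # N/O
```

Note the negated form "are not reasonably likely to materially affect" still contains the threshold phrase and still classifies as SI — a negative conclusion is still a conclusion.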
**Cross-references with materiality language:** "For risks that may materially affect us, see Item 1A" is N/O. The paragraph's purpose is pointing elsewhere. The word "materially" describes what Item 1A discusses, not the company's own conclusion. But: "Risks have not materially affected us. See Item 1A" is SI — the first sentence IS an assessment, and the cross-reference is subordinate.
---
## Specificity Reasoning
### Why Broaden Level 2: The ERM Test
The v1 definition of Level 2 ("names a specific recognized standard") was too narrow because it conflated "domain-specific" with "names a formal standard." A paragraph that says "we conduct penetration testing and vulnerability assessments" is clearly more informative than "we have processes to manage cybersecurity risks" — the first uses domain vocabulary, the second uses generic business language. But v1 classified both as Level 1.
The v2 test: **"Would this term appear naturally in a generic enterprise risk management document?"** This captures the construct's intent — "references industry" means using the industry's vocabulary, not just citing its standards.
**Why "incident response plan" stays at Level 1:** IRP is used across all risk management domains — cybersecurity, physical security, natural disasters, supply chain disruptions. A non-security ERM professional would use this term naturally. By contrast, "penetration testing" is uniquely cybersecurity — you don't penetration-test a supply chain or a natural disaster response.
**Why "security awareness training" is Level 2:** This is borderline. A businessperson might say "we train employees on security." But the specific phrase "security awareness training" is a recognized cybersecurity program type. The term itself references a domain-specific practice, even though it's become common. A non-security person would say "we train our employees" (Level 1), not "we provide security awareness training" (Level 2). The difference IS the domain vocabulary.
**Why "tabletop exercises" stays at Level 1:** Tabletop exercises are used in emergency management, business continuity, and general risk management — not just cybersecurity. "Cybersecurity tabletop exercises simulating ransomware scenarios" → Level 2 (the qualifier makes it domain-specific). But bare "tabletop exercises" could refer to any risk domain.
### Why 1+ QV Fact: The External Verifiability Test
The v1 rule was 2+ QV facts. This created problems:
1. **Counting is error-prone.** Annotators and models disagree on what counts. Is "CISO" a QV fact? Is "quarterly" a fact? The counting itself became a source of disagreement.
2. **The construct doesn't require counting.** The professor's Level 4 definition lists types of qualifying facts, not a minimum count.
3. **One verifiable fact IS quantified and verifiable.** A paragraph that says "We maintain $100M in cyber insurance coverage" is genuinely more informative and verifiable than one without dollar amounts. The 2-fact threshold was artificial.
The v2 test asks: **Can an external party independently verify at least one claim in this paragraph?** One specific number, one named third-party firm, one named certification held by an individual — any of these crosses the threshold.
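Operationally, the 1+ threshold means "does at least one externally verifiable pattern appear anywhere?" A sketch — the pattern list is illustrative and far from exhaustive:

```python
import re

# Each pattern is one *type* of QV-eligible fact; a single hit is enough for Level 4.
QV_PATTERNS = [
    r"\$[\d,.]+\s*(?:million|billion)?",   # dollar amounts
    r"\b\d+\s+years?\b",                   # quantified experience / tenure
    r"\b(?:CISSP|CISM|CRISC)\b",           # named, registry-checkable certifications
    r"\b(?:January|February|March|April|May|June|July|August|September|October"
    r"|November|December)\s+\d{1,2},\s+\d{4}\b",  # specific dates
    r"\bteam of \d+\b",                    # headcounts
]

def has_qv_fact(text: str) -> bool:
    """True if at least one quantified-and-verifiable fact appears (v2: 1+ -> Level 4)."""
    return any(re.search(p, text) for p in QV_PATTERNS)

print(has_qv_fact("We maintain $100 million in cyber insurance coverage."))  # True
```

No counting, no threshold disputes: the annotator (or model) only has to find one qualifying fact, not tally them.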
**Why named roles (CISO) are NOT QV:** A role title tells you something about the company's structure (firm-specific, Level 3) but is not a quantified claim an outsider can verify. "Our CISO" is identification. "Our CISO holds CISSP certification" adds a verifiable claim (CISSP holders are in a public registry). The role gets you to Level 3; the certification pushes to Level 4.
**Why named individuals alone are NOT QV:** "Our CISO, Jane Smith" is firm-specific (Level 3). You could look her up, but the NAME itself isn't a quantified claim about cybersecurity posture. "Jane Smith, who has 20 years of cybersecurity experience" adds a verifiable quantity. The name identifies; the experience quantifies.
**The certification trilogy — a critical distinction:**
1. "Our program is aligned with ISO 27001" → **Level 2** (references a standard, no firm-specific claim)
2. "We are working toward ISO 27001 certification" → **Level 3** (firm-specific intent, but no verifiable achievement)
3. "We maintain ISO 27001 certification" → **Level 4** (verifiable claim — you can check if a company holds this certification)
The difference between "aligned with" and "maintain certification" is the difference between aspiration and audited fact.
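The trilogy is a clean phrase-to-level mapping. A sketch with illustrative trigger phrases (check order matters — the verifiable claim must win over the mere presence of the word "certification"):

```python
def certification_level(text: str) -> int:
    """Map ISO-27001-style certification language to a specificity level,
    per the aligned-with / working-toward / maintain trilogy."""
    t = text.lower()
    if "maintain" in t and "certification" in t:
        return 4  # verifiable, audited fact
    if "working toward" in t:
        return 3  # firm-specific intent, nothing verifiable yet
    if "aligned with" in t:
        return 2  # references a standard, no firm-specific claim
    raise ValueError("no certification language recognized")

print(certification_level("We maintain ISO 27001 certification"))  # 4
```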
---
## Worked Edge Cases
### Case 1: The Governance Chain
> "The Board of Directors, through its Audit Committee, oversees the Company's cybersecurity risk management program. The Audit Committee receives regular updates from the CISO on the results of penetration testing and vulnerability assessments."
**"What question?" test:** "How does the board oversee cybersecurity?" → **BG**
**Specificity:** "penetration testing," "vulnerability assessments" = domain terminology → **Level 2**
**Why not RMP?** The process details (pen testing, vuln assessments) are subordinate to the reporting structure. The paragraph exists to tell you that the Audit Committee oversees things and receives reports — the program details are examples of WHAT is reported.
### Case 2: CISO Attribution + Program Description
> "Our CISO oversees our cybersecurity program, which includes regular risk assessments, penetration testing, vulnerability scanning, and incident response planning aligned with the NIST CSF framework."
**Person-removal test:** "cybersecurity program, which includes regular risk assessments, penetration testing, vulnerability scanning, and incident response planning aligned with the NIST CSF framework" → complete program description → **RMP**
**Specificity:** Domain terms (pen testing, vuln scanning) + named standard (NIST CSF) → **Level 2**
**Why not MR?** The paragraph tells you nothing about the CISO as a person — no qualifications, no experience, no reporting line, no team. The CISO is an attribution tag, like a byline on a news article. The content is the program.
### Case 3: CISO Qualifications
> "Our Vice President of Information Security, who holds CISSP and CISM certifications and has over 20 years of experience in cybersecurity, reports directly to our Chief Information Officer. She leads a team of 12 dedicated cybersecurity professionals."
**"What question?" test:** "How is management organized / who is this person?" → **MR**
**Specificity:** CISSP/CISM (named certifications, QV), 20 years (specific number, QV), 12 professionals (headcount, QV) — any one of these → **Level 4**
**Why not RMP?** Every sentence is about the person: their title, credentials, experience, reporting line, team. Remove the person-specific content and nothing remains.
### Case 4: CFO/VP Role Allocation (No Named Individuals)
> "Our CFO and VP of IT jointly oversee our cybersecurity program. The CFO is responsible for risk governance and insurance, while the VP of IT manages technical operations. They report to the board quarterly on cybersecurity matters."
**"What question?" test:** "How is management organized?" → **MR**
**Person-removal test:** Remove all role content → "report to the board quarterly on cybersecurity matters" → barely anything → **MR confirmed**
**Specificity:** VP of IT = cybersecurity-specific title → **Level 3** (firm-specific)
**Why this is MR without named individuals:** MR isn't "who a specific person is" — it's "how management is organized." This paragraph describes role allocation and reporting structure. The roles are named, the responsibilities are divided, the governance chain is defined. This is organizational disclosure.
### Case 5: Management Committee with Process Details
> "Our Cybersecurity Steering Committee, comprising the CISO, CIO, CFO, and General Counsel, meets monthly to review cybersecurity risks, assess emerging threats, and oversee our vulnerability management and incident response programs."
**"What question?" test:** "How is management organized?" → **MR**
**Person-removal test:** Remove committee membership → thin but the activities remain → borderline
**Tiebreak:** The paragraph's FRAME is the committee — it introduces the committee and describes what it does. The activities listed (review, assess, oversee) are verbs of management oversight, not operational descriptions of HOW those programs work. → **MR, Specificity 3** (named committee + composition = firm-specific)
**When this flips to RMP:** If the paragraph spent most of its length describing how the vulnerability management program works (tools, methodology, frequency, findings), with the committee mentioned only as context → RMP.
### Case 6: Materiality Assessment (Backward-Looking)
> "Risks from cybersecurity threats have not materially affected, and are not reasonably likely to materially affect, our business strategy, results of operations, or financial condition."
**Materiality test:** Company stating a conclusion → **SI**
**Specificity:** Boilerplate language (every company says this) → **Level 1**
**Why this is SI and not N/O:** The company is fulfilling its SEC obligation to assess materiality. The fact that the language is generic makes it low-specificity, but the CATEGORY is about what the paragraph discloses (a materiality assessment), not how specific it is.
### Case 7: Materiality Speculation
> "Cybersecurity risks could have a material adverse effect on our business, financial condition, and results of operations."
**Materiality test:** "Could" = speculation, not a conclusion → **N/O**
**Specificity:** N/O always gets **Level 1**
**Why this is N/O and not SI:** This is generic risk-factor language that appears in virtually every 10-K, regardless of whether the company has ever experienced a cybersecurity incident. The company is not stating a conclusion about its cybersecurity posture — it's acknowledging that cybersecurity risks exist. This carries zero informational content about THIS company's cybersecurity situation.
### Case 8: Forward-Looking Assessment (SEC Qualifier)
> "We face risks from cybersecurity threats that, if realized and material, are reasonably likely to materially affect us, including our operations, business strategy, results of operations, or financial condition."
**Materiality test:** "Reasonably likely to materially affect" = SEC's Item 106(b)(2) threshold → **SI**
**Specificity:** Boilerplate → **Level 1**
**Why "reasonably likely" is different from "could":** "Reasonably likely" is the SEC's required assessment language. A company using this phrase is making a forward-looking materiality assessment, not idly speculating. It's still boilerplate (Spec 1), but it IS an assessment (SI).
### Case 9: Cross-Reference with vs. without Assessment
> **N/O:** "For a description of the risks from cybersecurity threats that may materially affect the Company, see Item 1A, 'Risk Factors.'"
> → The paragraph points elsewhere. "May materially affect" describes what Item 1A discusses. → **N/O, Level 1**
> **SI:** "We have not identified any cybersecurity incidents or threats that have materially affected us. For more information, see Item 1A, Risk Factors."
> → The first sentence IS an assessment. The cross-reference is subordinate. → **SI, Level 1**
The test: does the paragraph MAKE a materiality conclusion, or only REFERENCE one that exists elsewhere?
### Case 10: SPAC / No-Operations Company
> "We are a special purpose acquisition company with no business operations. We have not adopted any cybersecurity risk management program. Our Board of Directors is generally responsible for oversight of cybersecurity risks, if any."
**→ N/O, Level 1.** The board mention is perfunctory ("generally responsible... if any"). The company explicitly has no program. The absence of a program is not a disclosure of a program, and an incidental governance mention in the context of "we have nothing" does not constitute substantive board governance disclosure.
### Case 11: Named Tool as QV Fact
> "We utilize CrowdStrike Falcon for endpoint detection and response across our enterprise."
**Category:** "What does the program do?" → **RMP**
**Specificity:** CrowdStrike Falcon = named product = QV-eligible fact → **Level 4**
**Why this is Level 4:** A company naming its specific EDR tool is genuinely more transparent and verifiable than "we use endpoint detection tools." You could confirm this claim. This is exactly what the construct means by "quantified and verifiable."
### Case 12: Single Named Tool (v1 was Level 3, v2 is Level 4)
Under v1's 2-fact rule, a paragraph with only one named product was Level 3. Under v2's 1-fact rule, it's Level 4. This is intentional — the 2-fact threshold was artificial. One verifiable external reference IS "quantified and verifiable."
### Case 13: Insurance with Dollar Amount
> "We maintain cybersecurity insurance coverage with $100 million in aggregate coverage and a $5 million deductible per incident."
**"What question?" test:** "How does cybersecurity affect the business?" → **SI** (insurance is a financial/business-impact response)
**Specificity:** $100M and $5M = dollar amounts (QV) → **Level 4**
### Case 14: Regulatory Compliance — Three Variants
> **N/O:** "The Company is subject to various regulatory requirements related to cybersecurity, data protection, and privacy."
> → A truism. No disclosure of what the company DOES. → **N/O, Level 1**
> **RMP, Level 2:** "We maintain compliance with PCI DSS, HIPAA, and GDPR through regular audits and monitoring of our security controls."
> → Names specific standards + describes compliance activities → **RMP, Level 2**
> **RMP, Level 4:** "We passed our PCI DSS Level 1 audit in March 2024, conducted by Trustwave."
> → Names standard + specific date + named third-party auditor → **RMP, Level 4**
### Case 15: "Under the Direction of" Attribution
> "Under the direction of our CISO, the Company has implemented a comprehensive cybersecurity program including penetration testing, vulnerability assessments, and 24/7 security monitoring."
**Person-removal test:** "The Company has implemented a comprehensive cybersecurity program including penetration testing, vulnerability assessments, and 24/7 security monitoring." → Complete program description → **RMP, Level 2**
### Case 16: ERM Integration
> "Our cybersecurity risk management program is integrated into our overall enterprise risk management framework."
**Category:** This describes a program characteristic → **RMP**
**Specificity:** "Enterprise risk management" and "integrated" are generic business language → **Level 1**
**Why not Level 2:** "Enterprise risk management" is a general business concept, not cybersecurity domain terminology. The ERM test: would this sentence appear in a generic ERM document? Yes, it could describe integrating ANY risk program into ERM. → Level 1.
### Case 17: "Dedicated Cybersecurity Team"
> "We have a dedicated cybersecurity team that is responsible for managing our cybersecurity risks."
**Category:** RMP (what the team does — manages cyber risks)
**Specificity:** "Dedicated cybersecurity team" = domain-adapted organizational approach → **Level 2**
**Why Level 2 and not Level 3:** Many companies claim "dedicated" teams. The term describes a general organizational approach (having people dedicated to cybersecurity), not a fact unique to THIS company. Compare: "a dedicated team of 12 cybersecurity professionals" → Level 4 (the headcount is QV). The word "dedicated" itself doesn't differentiate.
### Case 18: Multiple Category Paragraph — Incident + Cost
> "On January 15, 2024, we detected unauthorized access to our customer support portal. We estimate the total cost of remediation at approximately $8.5 million."
**Both ID and SI content.** Which dominates? The incident (what happened) is the frame; the cost is a detail within the incident narrative. → **ID, Level 4** (January 15, 2024 + $8.5M = QV facts)
If the paragraph were primarily financial analysis with one sentence mentioning an incident → SI.
### Case 19: Negative Incident Assertion
> "We have not experienced any material cybersecurity incidents during the reporting period."
**Materiality test:** Negative assertion with materiality framing → **SI, Level 1**
**Why SI and not N/O:** The company is STATING A CONCLUSION about the absence of material incidents. This is a materiality assessment even though it's negative.
**Why not ID:** No incident is described. The paragraph assesses business impact (no material incidents), not incident details.
---
## What We Preserved from v1
Not everything changed. The following were validated through 150K annotations, 10-model benchmarks, and human labeling:
1. **7 content categories** mapped to SEC rule structure — the construct is sound
2. **4 specificity levels** as an ordinal scale — the graduated concept works
3. **IS/NOT list pattern** — the single most effective prompt engineering technique from v1. Lists beat rules for specificity.
4. **Validation step** — a closing "review your facts, remove NOT-list items" instruction reliably elicits model self-correction
5. **Materiality assessment vs. speculation** — linguistic bright line, well-calibrated in v3.5
6. **SPAC/no-operations rule** — resolved cleanly
7. **TP vs RMP distinction** — "who is being assessed?" test works
8. **ID for actual incidents only** — hypothetical language doesn't trigger ID
These are proven components. v2 changes the boundaries and definitions around them, not the components themselves.


docs/LABELING-CODEBOOK.md Normal file
@@ -0,0 +1,871 @@
# Labeling Codebook — SEC Cybersecurity Disclosure Quality
This codebook is the authoritative reference for all human and GenAI labeling. Every annotator (human or model) must follow these definitions exactly. The LLM system prompt is generated directly from this document.
---
## Classification Design
**Unit of analysis:** One paragraph from an SEC filing (Item 1C of 10-K, or Item 1.05/8.01/7.01 of 8-K).
**Classification type:** Multi-class (single-label), NOT multi-label. Each paragraph receives exactly one content category.
**Each paragraph receives two labels:**
1. **Content Category** — single-label, one of 7 mutually exclusive classes
2. **Specificity Level** — ordinal integer 1-4
**None/Other policy:** Required. Since this is multi-class (not multi-label), we need a catch-all for paragraphs that don't fit the 6 substantive categories. A paragraph receives None/Other when it contains no cybersecurity-specific disclosure content (e.g., forward-looking statement disclaimers, section headers, general business language).
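The two-label contract above can be sketched as a small validated record. This is illustrative Python using the codebook's own abbreviations (BG, MR, RMP, SI, ID, TP, N/O); the project's real schemas live in `packages/schemas/src/` as Zod:

```python
from dataclasses import dataclass

# Abbreviated category names as used in the ethos document.
CATEGORIES = {"BG", "MR", "RMP", "SI", "ID", "TP", "N/O"}

@dataclass(frozen=True)
class ParagraphLabel:
    category: str     # exactly one of the 7 classes (multi-class, single-label)
    specificity: int  # ordinal 1-4

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if not 1 <= self.specificity <= 4:
            raise ValueError("specificity must be an integer 1-4")
        if self.category == "N/O" and self.specificity != 1:
            raise ValueError("None/Other always gets specificity 1")

label = ParagraphLabel("BG", 3)
```

Encoding the N/O-implies-Level-1 constraint in the schema keeps annotators and models from ever producing the contradictory combination.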
---
## Dimension 1: Content Category
Each paragraph is assigned exactly **one** content category. If a paragraph spans multiple categories, assign the **dominant** category — the one that best describes the paragraph's primary communicative purpose.
### Board Governance
- **SEC basis:** Item 106(c)(1)
- **Covers:** Board or committee oversight of cybersecurity risks, briefing frequency, board member cybersecurity expertise
- **Key markers:** "Audit Committee," "Board of Directors oversees," "quarterly briefings," "board-level expertise," "board committee"
- **Assign when:** The grammatical subject performing the primary action is the board or a board committee
**Example texts:**
> *"The Board of Directors oversees the Company's management of cybersecurity risks. The Board has delegated oversight of cybersecurity and data privacy matters to the Audit Committee."*
> → Board Governance, Specificity 3 (names Audit Committee — firm-specific delegation)
> *"Our Board of Directors recognizes the critical importance of maintaining the trust and confidence of our customers and stakeholders, and cybersecurity risk is an area of increasing focus for our Board."*
> → Board Governance, Specificity 1 (could apply to any company — generic statement of intent)
> *"The Audit Committee, which includes two members with significant technology and cybersecurity expertise, receives quarterly reports from the CISO and conducts an annual deep-dive review of the Company's cybersecurity program, threat landscape, and incident response readiness."*
> → Board Governance, Specificity 3 (names specific committee, describes specific briefing cadence and scope)
### Management Role
- **SEC basis:** Item 106(c)(2)
- **Covers:** The specific *person* filling a cybersecurity leadership position: their name, qualifications, career history, credentials, tenure, reporting lines, management committees responsible for cybersecurity
- **Key markers:** "Chief Information Security Officer," "reports to," "years of experience," "management committee," "CISSP," "CISM," named individuals, career background
- **Assign when:** The paragraph tells you something about *who the person is* — their background, credentials, experience, or reporting structure. A paragraph that names a CISO/CIO/CTO and then describes what the cybersecurity *program* does is NOT Management Role — it is Risk Management Process with an incidental role attribution. The test is whether the paragraph is about the **person** or about the **function**.
**The person-vs-function test:** If you removed the role holder's name, title, qualifications, and background from the paragraph and the remaining content still describes substantive cybersecurity activities, processes, or oversight → the paragraph is about the function (Risk Management Process), not the person (Management Role). Management Role requires the person's identity or credentials to be the primary content, not just a brief attribution of who runs the program.
**Example texts:**
> *"Our Vice President of Information Security, who holds CISSP and CISM certifications and has over 20 years of experience in cybersecurity, reports directly to our Chief Information Officer and is responsible for leading our cybersecurity program."*
> → Management Role, Specificity 3 — The paragraph is about the person: their credentials, experience, and reporting line. (named role, certifications, reporting line — all firm-specific)
> *"Management is responsible for assessing and managing cybersecurity risks within the organization."*
> → Management Role, Specificity 1 (generic, no named roles or structure)
> *"Our CISO, Sarah Chen, leads a dedicated cybersecurity team of 35 professionals and presents monthly threat briefings to the executive leadership team. Ms. Chen joined the Company in 2019 after serving as Deputy CISO at a Fortune 100 financial services firm."*
> → Management Role, Specificity 4 — The paragraph is about the person: their name, team size, background, prior role. (named individual, team size, specific frequency, prior employer — multiple verifiable facts)
> *"Our CISO oversees the Company's cybersecurity program, which includes risk assessments, vulnerability scanning, penetration testing, and incident response planning aligned with the NIST CSF framework."*
> → **Risk Management Process**, NOT Management Role — The CISO is mentioned once as attribution, but the paragraph is about what the program does. Remove "Our CISO oversees" and the paragraph still makes complete sense as a process description.
### Risk Management Process
- **SEC basis:** Item 106(b)
- **Covers:** Risk assessment methodology, framework adoption (NIST, ISO, etc.), vulnerability management, monitoring, incident response planning, tabletop exercises, ERM integration
- **Key markers:** "NIST CSF," "ISO 27001," "risk assessment," "vulnerability management," "tabletop exercises," "incident response plan," "SOC," "SIEM"
- **Assign when:** The paragraph primarily describes the company's internal cybersecurity processes, tools, or methodologies
**Example texts:**
> *"We maintain a cybersecurity risk management program that is integrated into our overall enterprise risk management framework. Our program is designed to identify, assess, and manage material cybersecurity risks to our business."*
> → Risk Management Process, Specificity 1 (generic, could apply to any company)
> *"Our cybersecurity program is aligned with the NIST Cybersecurity Framework and incorporates elements of ISO 27001. We conduct regular risk assessments, vulnerability scanning, and penetration testing as part of our continuous monitoring approach."*
> → Risk Management Process, Specificity 2 (names frameworks but no firm-specific detail)
> *"We operate a 24/7 Security Operations Center that uses Splunk SIEM and CrowdStrike Falcon endpoint detection. Our incident response team conducts quarterly tabletop exercises simulating ransomware, supply chain compromise, and insider threat scenarios."*
> → Risk Management Process, Specificity 4 (named tools, named vendor, specific exercise frequency and scenarios — verifiable)
### Third-Party Risk
- **SEC basis:** Item 106(b)
- **Covers:** Vendor/supplier risk oversight, external assessor engagement, contractual security requirements, supply chain risk management
- **Key markers:** "third-party," "service providers," "vendor risk," "external auditors," "supply chain," "SOC 2 report," "contractual requirements"
- **Assign when:** The central topic is oversight of external parties' cybersecurity, not the company's own internal processes
**Example texts:**
> *"We face cybersecurity risks associated with our use of third-party service providers who may have access to our systems and data."*
> → Third-Party Risk, Specificity 1 (generic risk statement)
> *"Our vendor risk management program requires all third-party service providers with access to sensitive data to meet minimum security standards, including SOC 2 Type II certification or equivalent third-party attestation."*
> → Third-Party Risk, Specificity 2 (names SOC 2 standard but no firm-specific detail about which vendors or how many)
> *"We assessed 312 vendors in fiscal 2024 through our Third-Party Risk Management program. All Tier 1 vendors (those with access to customer PII or financial data) are required to provide annual SOC 2 Type II reports. In fiscal 2024, 14 vendors were placed on remediation plans and 3 vendor relationships were terminated for non-compliance."*
> → Third-Party Risk, Specificity 4 (specific numbers, specific actions, specific criteria — all verifiable)
### Incident Disclosure
- **SEC basis:** 8-K Item 1.05 (and 8.01/7.01 post-May 2024)
- **Covers:** Description of cybersecurity incidents — nature, scope, timing, impact assessment, remediation actions, ongoing investigation
- **Key markers:** "unauthorized access," "detected," "incident," "remediation," "impacted," "forensic investigation," "breach," "compromised"
- **Assign when:** The paragraph primarily describes what happened in a cybersecurity incident
**Example texts:**
> *"We have experienced, and may in the future experience, cybersecurity incidents that could have a material adverse effect on our business, results of operations, and financial condition."*
> → Incident Disclosure, Specificity 1 (hypothetical — no actual incident described. Note: if this appears in Item 1C rather than an 8-K, consider None/Other instead since it's generic risk language)
> *"On January 15, 2024, we detected unauthorized access to our customer support portal. The threat actor exploited a known vulnerability in a third-party software component. Upon detection, we activated our incident response plan, contained the intrusion, and engaged Mandiant for forensic investigation."*
> → Incident Disclosure, Specificity 4 (specific date, specific system, named forensic firm, specific attack vector — all verifiable)
> *"In December 2023, the Company experienced a cybersecurity incident involving unauthorized access to certain internal systems. The Company promptly took steps to contain and remediate the incident, including engaging third-party cybersecurity experts."*
> → Incident Disclosure, Specificity 3 (specific month, specific action — but no named firms or quantified impact)
### Strategy Integration
- **SEC basis:** Item 106(b)(2)
- **Covers:** Material impact (or lack thereof) on business strategy or financials, cybersecurity insurance, investment/resource allocation, cost of incidents
- **Key markers:** "business strategy," "insurance," "investment," "material," "financial condition," "budget," "not materially affected," "results of operations"
- **Assign when:** The paragraph primarily discusses business/financial consequences or strategic response to cyber risk, not the risk management activities themselves
- **Includes materiality ASSESSMENTS:** A materiality assessment is the company stating a conclusion about whether cybersecurity has or will affect business outcomes. Backward-looking ("have not materially affected"), forward-looking with SEC qualifier ("reasonably likely to materially affect"), and negative assertions ("have not experienced material incidents") are all assessments → SI. Generic risk warnings ("could have a material adverse effect") are NOT assessments — they are boilerplate speculation that appears in every 10-K → classify by primary content. "Material" as an adjective ("managing material risks") is also not an assessment.
**Example texts:**
> *"Cybersecurity risks, including those described above, have not materially affected, and are not reasonably likely to materially affect, our business strategy, results of operations, or financial condition."*
> → Strategy Integration, Specificity 1 (boilerplate materiality statement — nearly identical language appears across thousands of filings, but it IS a materiality assessment)
> *"We have not identified any cybersecurity incidents or threats that have materially affected us. For more information, see Item 1A, Risk Factors."*
> → Strategy Integration, Specificity 1 — The materiality assessment is the substantive content. The cross-reference is noise and does not pull the paragraph to None/Other.
> *"We maintain cybersecurity insurance coverage as part of our overall risk management strategy to help mitigate potential financial losses from cybersecurity incidents."*
> → Strategy Integration, Specificity 2 (mentions insurance but no specifics)
> *"We increased our cybersecurity budget by 32% to $45M in fiscal 2024, representing 0.8% of revenue. We maintain cyber liability insurance with $100M in aggregate coverage through AIG and Chubb, with a $5M deductible per incident."*
> → Strategy Integration, Specificity 4 (dollar amounts, percentages, named insurers, specific deductible — all verifiable)
### None/Other
- **Covers:** Forward-looking statement disclaimers, section headers, cross-references to other filing sections, general business language that mentions cybersecurity incidentally, text erroneously extracted from outside Item 1C/1.05
- **No specificity scoring needed:** Always assign Specificity 1 for None/Other paragraphs (since there is no cybersecurity disclosure to rate)
- **SPACs and shell companies:** Companies that explicitly state they have no operations, no cybersecurity program, or no formal processes receive None/Other regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program. Paragraphs like "We have not adopted any cybersecurity risk management program. Our board is generally responsible for oversight" are None/Other — the board mention is perfunctory, not substantive governance disclosure.
- **Distinguishing from Strategy Integration:** A pure cross-reference ("See Item 1A, Risk Factors") with no materiality assessment is None/Other. But if the paragraph includes an explicit materiality conclusion ("have not materially affected our business strategy"), it becomes Strategy Integration even if a cross-reference is also present. The test: does the paragraph make a substantive claim about cybersecurity's impact on the business? If yes → Strategy Integration. If it only points elsewhere → None/Other.
**Example texts:**
> *"This Annual Report on Form 10-K contains forward-looking statements within the meaning of Section 27A of the Securities Act of 1933, as amended, and Section 21E of the Securities Exchange Act of 1934, as amended."*
> → None/Other, Specificity 1
> *"Item 1C. Cybersecurity"*
> → None/Other, Specificity 1 (section header only)
> *"For additional information about risks related to our information technology systems, see Part I, Item 1A, 'Risk Factors.'"*
> → None/Other, Specificity 1 (cross-reference, no disclosure content)
> *"We are a special purpose acquisition company with no business operations. We have not adopted any cybersecurity risk management program. Our board of directors is generally responsible for oversight of cybersecurity risks, if any."*
> → None/Other, Specificity 1 — No substantive disclosure. The board mention is incidental; the company explicitly has no program to disclose.
> *"We do not consider that we face significant cybersecurity risk and have not adopted any formal processes for assessing cybersecurity risk."*
> → None/Other, Specificity 1 — Absence of a program is not a program description.
---
## Category Decision Rules
### Rule 1: Dominant Category
If a paragraph spans multiple categories, assign the one whose topic occupies the most text or is the paragraph's primary communicative purpose.
### Rule 2: Board vs. Management (the board-line test)
**Core principle:** The governance hierarchy has distinct layers — board/committee oversight at the top, management execution below. The paragraph's category depends on which layer is the primary focus.
| Layer | Category | Key signals |
|-------|----------|-------------|
| Board/committee directing, receiving reports, or overseeing | Board Governance | "Board oversees," "Committee reviews," "reports to the Board" (board is recipient) |
| Named officer's qualifications, responsibilities, reporting lines | Management Role | "CISO has 20 years experience," "responsible for," credentials |
| Program/framework/controls described | Risk Management Process | "program is designed to," "framework includes," "controls aligned with" |
**When a paragraph spans layers** (governance chain paragraphs): apply the **purpose test** — what is the paragraph's communicative purpose?
- **Purpose = describing oversight/reporting structure** (who reports to whom, briefing cadence, committee responsibilities, how information flows to the board) → **Board Governance**, even if officers appear as grammatical subjects. The officers are intermediaries in the governance chain, not the focus.
- **Purpose = describing who a person is** (qualifications, credentials, experience, career history) → **Management Role**.
- **Governance-chain paragraphs are almost always Board Governance.** They become Management Role ONLY when the officer's personal qualifications/credentials are the dominant content.
| Signal | Category |
|--------|----------|
| Board/committee is the grammatical subject | Board Governance |
| Board delegates responsibility to management | Board Governance |
| Management role reports TO the board (describing reporting structure) | Board Governance (the purpose is describing how oversight works) |
| Management role's qualifications, experience, credentials described | Management Role |
| "Board oversees... CISO reports to Board quarterly" | Board Governance (oversight structure) |
| "CISO reports quarterly to the Board on..." | Board Governance (reporting structure, not about who the CISO is) |
| "The CISO has 20 years of experience and reports to the CIO" | Management Role (person's qualifications are the content) |
| Governance overview spanning board → committee → officer → program | **Board Governance** (purpose is describing the structure) |
### Rule 2b: Management Role vs. Risk Management Process (three-step decision chain)
This is the single most common source of annotator disagreement. Apply the following tests in order — stop at the first decisive result.
**Step 1 — Subject test:** What is the paragraph's grammatical subject?
- Clear process/framework/program as subject with no person detail → **Risk Management Process**. Stop.
- Person/role as subject → this is a **signal**, not decisive. Always continue to Step 2. Many SEC disclosures name an officer then describe the program — Step 2 determines which is the actual content.
**Step 2 — Person-removal test:** Could you delete all named roles, titles, qualifications, experience descriptions, and credentials from the paragraph and still have a coherent cybersecurity disclosure?
- **YES****Risk Management Process** (the process stands on its own; people are incidental)
- **NO****Management Role** (the paragraph is fundamentally about who these people are)
- Borderline → continue to Step 3
**Step 3 — Qualifications tiebreaker:** Does the paragraph include experience (years), certifications (CISSP, CISM), education, team size, or career history for named individuals?
- **YES****Management Role** (qualifications are MR-specific content; the SEC requires management role disclosure specifically because investors want to know WHO is responsible)
- **NO****Risk Management Process** (no person-specific content beyond a title attribution)
| Signal | Category |
|--------|----------|
| The person's background, credentials, tenure, experience, education, career history | Management Role |
| The person's name is given | Management Role (strong signal) |
| Reporting lines as primary content (who reports to whom, management committee structure) | Management Role |
| Role title mentioned as attribution ("Our CISO oversees...") followed by process description | **Risk Management Process** |
| Activities, tools, methodologies, frameworks as the primary content | **Risk Management Process** |
| The paragraph would still make sense if you removed the role title and replaced it with "the Company" | **Risk Management Process** |
**Key principle:** Naming a cybersecurity leadership title (CISO, CIO, CTO, VP of Security) does not make a paragraph Management Role. The title is often an incidental attribution — the paragraph names who is responsible then describes what the program does. If the paragraph's substantive content is about processes, activities, or tools, it is Risk Management Process regardless of how many times a role title appears. Management Role requires the paragraph's content to be about the *person* — who they are, what makes them qualified, how long they've served, what their background is.
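The three-step chain above can be written as an ordered decision function. The boolean inputs stand in for the human judgments each step requires — this sketch only encodes the *order* and stop conditions, not the judgments themselves, and all names are illustrative.

```typescript
// Sketch of Rule 2b as an ordered decision chain (illustrative only).
// Inputs encode the annotator's answers to each step's question.
interface Rule2bSignals {
  // Step 1: clear process/framework subject with no person detail?
  processIsSubjectWithNoPersonDetail: boolean;
  // Step 2: coherent disclosure after deleting all person content?
  coherentAfterRemovingPersonContent: boolean | "borderline";
  // Step 3: years of experience, certs, education, team size, career history?
  hasQualifications: boolean;
}

function rule2b(s: Rule2bSignals): "management_role" | "risk_management_process" {
  // Step 1 — subject test: decisive only for a clear process subject.
  if (s.processIsSubjectWithNoPersonDetail) return "risk_management_process";
  // Step 2 — person-removal test.
  if (s.coherentAfterRemovingPersonContent === true) return "risk_management_process";
  if (s.coherentAfterRemovingPersonContent === false) return "management_role";
  // Step 3 — qualifications tiebreaker for borderline paragraphs.
  return s.hasQualifications ? "management_role" : "risk_management_process";
}
```

Note that a person/role subject never short-circuits the chain: only the person-removal test or the qualifications tiebreaker can produce Management Role.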
### Rule 3: Risk Management vs. Third-Party
| Signal | Category |
|--------|----------|
| Company's own internal processes, tools, teams | Risk Management Process |
| Third parties mentioned as ONE component of internal program | Risk Management Process |
| Vendor oversight is the CENTRAL topic | Third-Party Risk |
| External assessor hired to test the company | Risk Management Process (they serve the company) |
| Requirements imposed ON vendors | Third-Party Risk |
### Rule 4: Incident vs. Strategy
| Signal | Category |
|--------|----------|
| Describes what happened (timeline, scope, response) | Incident Disclosure |
| Describes business impact of an incident (costs, revenue, insurance claim) | Strategy Integration |
| Mixed: "We detected X... at a cost of $Y" | Assign based on which is dominant — if cost is one sentence in a paragraph about the incident → Incident Disclosure |
### Rule 5: None/Other Threshold
Assign None/Other ONLY when the paragraph contains no substantive cybersecurity disclosure content. If a paragraph mentions cybersecurity even briefly in service of a disclosure obligation, assign the relevant content category.
**Exception — SPACs and no-operations companies:** A paragraph that explicitly states the company has no cybersecurity program, no operations, or no formal processes is None/Other even if it perfunctorily mentions board oversight or risk acknowledgment. The absence of a program is not substantive disclosure.
### Rule 6: Materiality Language → Strategy Integration
Any paragraph that explicitly connects cybersecurity to business materiality is **Strategy Integration** — regardless of tense, mood, or how generic the language is. This includes:
- **Backward-looking assessments:** "have not materially affected our business strategy, results of operations, or financial condition"
- **Forward-looking assessments with SEC qualifier:** "are reasonably likely to materially affect," "if realized, are reasonably likely to materially affect"
- **Negative assertions with materiality framing:** "we have not experienced any material cybersecurity incidents"
**The test:** Is the company STATING A CONCLUSION about materiality?
- "Risks have not materially affected our business strategy" → YES, conclusion → SI
- "Risks are reasonably likely to materially affect us" → YES, forward-looking conclusion → SI
- "Risks could have a material adverse effect on our business" → NO, speculation → not SI (classify by primary content)
- "Managing material risks associated with cybersecurity" → NO, adjective → not SI
The key phrase is "reasonably likely" — that's the SEC's Item 106(b)(2) threshold for forward-looking materiality. Bare "could" is speculation, not an assessment.
**Why this is SI and not N/O:** The company is fulfilling its SEC Item 106(b)(2) obligation to assess whether cyber risks affect business strategy. The fact that the language is generic makes it Specificity 1, not None/Other. Category captures WHAT the paragraph discloses (a materiality assessment); specificity captures HOW specific that disclosure is (generic boilerplate = Spec 1).
**What remains N/O:** A cross-reference is N/O even if it contains materiality language — "For a description of the risks from cybersecurity threats that may materially affect the Company, see Item 1A" is N/O because the paragraph's purpose is pointing the reader elsewhere, not making an assessment. The word "materially" here describes what Item 1A discusses, not the company's own conclusion. Also N/O: generic IT-dependence language ("our IT systems are important to operations") with no materiality claim, and forward-looking boilerplate about risks generally without invoking materiality ("we face various risks").
**The distinction:** "Risks that may materially affect us — see Item 1A" = N/O (cross-reference). "Risks have not materially affected us. See Item 1A" = SI (the first sentence IS an assessment). The test is whether the company is MAKING a materiality conclusion vs DESCRIBING what another section covers.
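A rough keyword heuristic can illustrate the assessment-vs-speculation line from Rule 6. This is not a validated classifier — labeling is human judgment, and the regex patterns below are assumptions chosen to match the examples in this section, nothing more.

```typescript
// Illustrative heuristic for Rule 6: does the text STATE a materiality
// conclusion (assessment) rather than merely speculate? Patterns are
// assumptions based on the examples in this codebook, not a validated model.
function isMaterialityAssessment(text: string): boolean {
  const t = text.toLowerCase();
  // Backward-looking conclusion: "have/has not materially affected"
  const backward = /ha(ve|s) not materially affected/.test(t);
  // SEC Item 106(b)(2) forward-looking qualifier
  const forward = /reasonably likely to materially affect/.test(t);
  // Negative assertion with materiality framing
  const negative = /not experienced any material cybersecurity incident/.test(t);
  // Bare "could have a material adverse effect" matches none of the above,
  // so speculation correctly falls through to false.
  return backward || forward || negative;
}
```

Speculation needs no explicit exclusion here: "could have a material adverse effect" simply lacks a conclusion marker, so it returns false by default.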
---
## Borderline Cases
### Case 1: Framework mention + firm-specific fact
> *"We follow NIST CSF and our CISO oversees the program."*
The NIST mention → Level 2 anchor. The CISO reference → firm-specific. **Apply boundary rule 2→3: "Does it mention anything unique to THIS company?" Yes (CISO role exists at this company) → Level 3.**
### Case 2: Named role but generic description
> *"Our Chief Information Security Officer is responsible for managing cybersecurity risks."*
Names a role (CISO) → potentially Level 3. But the description is completely generic. **Apply judgment: the mere existence of a CISO title is firm-specific (not all companies have one). → Level 3.** If the paragraph said "a senior executive is responsible" without naming the role → Level 1.
### Case 3: Specificity-rich None/Other
> *"On March 15, 2025, we filed a Current Report on Form 8-K disclosing a cybersecurity incident. For details, see our Form 8-K filed March 15, 2025, accession number 0001193125-25-012345."*
Contains specific dates and filing numbers, but the paragraph itself contains no disclosure content — it's a cross-reference. → **None/Other, Specificity 1.** Specificity only applies to disclosure substance, not to metadata.
### Case 4: Hypothetical incident language in 10-K
> *"We may experience cybersecurity incidents that could disrupt our operations."*
This appears in Item 1C, not an 8-K. It describes no actual incident. → **Risk Management Process or Strategy Integration (depending on context), NOT Incident Disclosure.** Incident Disclosure is reserved for descriptions of events that actually occurred.
### Case 5: Dual-category paragraph
> *"The Audit Committee oversees our cybersecurity program, which is led by our CISO who holds CISSP certification and reports quarterly to the Committee."*
Board (Audit Committee oversees) + Management (CISO qualifications, reporting). The opening clause sets the frame: this is about the Audit Committee's oversight, and the CISO detail is subordinate. → **Board Governance, Specificity 3.**
### Case 6: Management Role vs. Risk Management Process — the person-vs-function test
> *"Our CISO oversees the Company's cybersecurity program, which includes risk assessments, vulnerability scanning, and incident response planning. The program is aligned with the NIST CSF framework and integrated into our enterprise risk management process."*
The CISO is named as attribution, but the paragraph is about what the program does — assessments, scanning, response planning, framework alignment, ERM integration. Remove "Our CISO oversees" and it still makes complete sense as a process description. → **Risk Management Process, Specificity 2** (NIST CSF framework, no firm-specific facts beyond that).
> *"Our CISO has over 20 years of experience in cybersecurity and holds CISSP and CISM certifications. She reports directly to the CIO and oversees a team of 12 security professionals. Prior to joining the Company in 2019, she served as VP of Security at a Fortune 500 technology firm."*
The entire paragraph is about the person: experience, certifications, reporting line, team size, tenure, prior role. → **Management Role, Specificity 4** (years of experience + team headcount + named certifications = multiple QV-eligible facts).
### Case 7: Materiality disclaimer — Strategy Integration vs. None/Other
> *"We have not identified any cybersecurity incidents or threats that have materially affected our business strategy, results of operations, or financial condition. However, like other companies, we have experienced threats from time to time. For more information, see Item 1A, Risk Factors."*
Contains an explicit materiality assessment ("materially affected... business strategy, results of operations, or financial condition"). The cross-reference and generic threat mention are noise. → **Strategy Integration, Specificity 1.**
> *"For additional information about risks related to our information technology systems, see Part I, Item 1A, 'Risk Factors.'"*
No materiality assessment. Pure cross-reference. → **None/Other, Specificity 1.**
### Case 8: SPAC / no-operations company
> *"We are a special purpose acquisition company with no business operations. We have not adopted any cybersecurity risk management program or formal processes. Our Board of Directors is generally responsible for oversight of cybersecurity risks, if any. We have not encountered any cybersecurity incidents since our IPO."*
Despite touching RMP (no program), Board Governance (board is responsible), and Strategy Integration (no incidents), the paragraph contains no substantive disclosure. The company explicitly has no program, and the board mention is perfunctory ("generally responsible... if any"). The absence of a program is not a program description. → **None/Other, Specificity 1.**
### Case 9: Materiality language — assessment vs. speculation (v3.5 revision)
> *"We face risks from cybersecurity threats that, if realized and material, are reasonably likely to materially affect us, including our operations, business strategy, results of operations, or financial condition."*
The phrase "reasonably likely to materially affect" is the SEC's Item 106(b)(2) qualifier — this is a forward-looking materiality **assessment**, not speculation. → **Strategy Integration, Specificity 1.**
> *"We have not identified any risks from cybersecurity threats that have materially affected or are reasonably likely to materially affect the Company."*
Backward-looking negative assertion + SEC-qualified forward-looking assessment. → **Strategy Integration, Specificity 1.**
> *"Information systems can be vulnerable to a range of cybersecurity threats that could potentially have a material impact on our business strategy, results of operations and financial condition."*
Despite mentioning "material impact" and "business strategy," the operative verb is "could" — this is boilerplate **speculation** present in virtually every 10-K risk factor section. The company is not stating a conclusion about whether cybersecurity HAS or IS REASONABLY LIKELY TO affect them; it is describing a hypothetical. → **None/Other, Specificity 1.** (Per Rule 6: "could have a material adverse effect" = speculation, not assessment.)
> *"We face various risks related to our IT systems."*
No materiality language, no connection to business strategy/financial condition. This is generic IT-dependence language. → **None/Other, Specificity 1.**
**The distinction:** "reasonably likely to materially affect" (SEC qualifier, forward-looking assessment) ≠ "could potentially have a material impact" (speculation). The former uses the SEC's required assessment language; the latter uses conditional language that every company uses regardless of actual risk.
### Case 10: Generic regulatory compliance language
> *"Regulatory Compliance: The Company is subject to various regulatory requirements related to cybersecurity, data protection, and privacy. Non-compliance with these regulations could result in financial penalties, legal liabilities, and reputational damage."*
This acknowledges that regulations exist and non-compliance would be bad — a truism for every public company. It does not describe any process, program, or framework the company uses to comply. It does not make a materiality assessment. It names no specific regulation. → **None/Other, Specificity 1.**
The key distinctions:
- If the paragraph names a specific regulation (GDPR, HIPAA, PCI DSS, CCPA) but still describes no company-specific program → **Risk Management Process, Specificity 2** (named standard triggers Sector-Adapted)
- If the paragraph assesses whether regulatory non-compliance has "materially affected" the business → **Strategy Integration** (materiality assessment per Rule 6)
- If the paragraph describes what the company *does* to comply (audits, controls, certifications) → **Risk Management Process** at appropriate specificity
---
## Dimension 2: Specificity Level
Each paragraph receives a specificity level (1-4) indicating how company-specific the disclosure is. Apply the decision test in order — stop at the first "yes."
### Decision Test
1. **Count hard verifiable facts ONLY** (specific dates, dollar amounts, headcounts/percentages, named third-party firms, named products/tools, named certifications). TWO or more? → **Quantified-Verifiable (4)**
2. **Does it contain at least one fact from the IS list below?** → **Firm-Specific (3)**
3. **Does it name a recognized standard** (NIST, ISO 27001, SOC 2, CIS, GDPR, PCI DSS, HIPAA)? → **Sector-Adapted (2)**
4. **None of the above?** → **Generic Boilerplate (1)**
None/Other paragraphs always receive Specificity 1.
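The stop-at-first-yes cascade can be encoded directly. As with the other sketches, the inputs are human-counted judgments (fact counts, IS-list membership); the function only fixes the evaluation order. All names are illustrative.

```typescript
// Sketch of the specificity decision test as an ordered cascade.
// Inputs are the annotator's counts/judgments; names are hypothetical.
interface SpecificityInputs {
  isNoneOther: boolean;            // None/Other always scores 1
  hardVerifiableFacts: number;     // dates, $ amounts, counts/%, named firms/tools/certs
  hasFirmSpecificFact: boolean;    // any one fact from the IS list
  namesRecognizedStandard: boolean; // NIST, ISO 27001, SOC 2, CIS, GDPR, PCI DSS, HIPAA
}

function specificity(x: SpecificityInputs): 1 | 2 | 3 | 4 {
  if (x.isNoneOther) return 1;             // no disclosure to rate
  if (x.hardVerifiableFacts >= 2) return 4; // Quantified-Verifiable
  if (x.hasFirmSpecificFact) return 3;      // Firm-Specific
  if (x.namesRecognizedStandard) return 2;  // Sector-Adapted
  return 1;                                 // Generic Boilerplate
}
```

The order matters: a paragraph with two hard facts that also names NIST CSF scores 4, not 2, because the cascade stops at the first "yes".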
### Level Definitions
| Level | Name | Description |
|-------|------|-------------|
| 1 | Generic Boilerplate | Could paste into any company's filing unchanged. No named entities, frameworks, roles, dates, or specific details. |
| 2 | Sector-Adapted | Names a specific recognized standard (NIST, ISO 27001, SOC 2, etc.) but contains nothing unique to THIS company. General practices (pen testing, vulnerability scanning, tabletop exercises) do NOT qualify — only named standards. |
| 3 | Firm-Specific | Contains at least one fact from the IS list that identifies something unique to THIS company's disclosure. |
| 4 | Quantified-Verifiable | Contains TWO or more hard verifiable facts (see QV-eligible list). One fact = Firm-Specific, not QV. |
### ✓ IS a Specific Fact (any ONE → at least Firm-Specific)
- **Cybersecurity-specific titles:** CISO, CTO, CIO, VP of IT/Security, Information Security Officer, Director of IT Security, HSE Director overseeing cybersecurity, Chief Digital Officer (when overseeing cyber), Cybersecurity Director
- **Named non-generic committees:** Technology Committee, Cybersecurity Committee, Risk Committee, ERM Committee (NOT "Audit Committee" — that exists at every public company)
- **Specific team/department compositions:** "Legal, Compliance, and Finance" (but NOT just "a cross-functional team")
- **Specific dates:** "In December 2023", "On May 6, 2024", "fiscal 2025"
- **Named internal programs with unique identifiers:** "Cyber Incident Response Plan (CIRP)" (must have a distinguishing name/abbreviation — generic "incident response plan" does not qualify)
- **Named products, systems, tools:** Splunk, CrowdStrike Falcon, Azure Sentinel, ServiceNow
- **Named third-party firms:** Mandiant, Deloitte, CrowdStrike, PwC
- **Specific numbers:** headcounts, dollar amounts, percentages, exact durations ("17 years", "12 professionals")
- **Certification claims:** "We maintain ISO 27001 certification" (holding a certification is more than naming a standard)
- **Named universities in credential context:** "Ph.D. from Princeton University" (independently verifiable)
### ✗ IS NOT a Specific Fact (do NOT use to justify Firm-Specific)
- **Generic governance:** "the Board", "Board of Directors", "management", "Audit Committee", "the Committee"
- **Generic C-suite:** CEO, CFO, COO, President, General Counsel — these exist at every company and are not cybersecurity-specific
- **Generic IT leadership (NOT cybersecurity-specific):** "Head of IT", "IT Manager", "Director of IT", "Chief Compliance Officer", "Associate Vice President of IT" — these are general corporate/IT titles, not cybersecurity roles per the IS list
- **Unnamed entities:** "third-party experts", "external consultants", "cybersecurity firms", "managed service provider"
- **Generic cadences:** "quarterly", "annual", "periodic", "regular" — without exact dates
- **Boilerplate phrases:** "cybersecurity risks", "material adverse effect", "business operations", "financial condition"
- **Standard incident language:** "forensic investigation", "law enforcement", "regulatory obligations", "incident response protocols"
- **Vague quantifiers:** "certain systems", "some employees", "a number of", "a portion of"
- **Common practices:** "penetration testing", "vulnerability scanning", "tabletop exercises", "phishing simulations", "security awareness training"
- **Generic program names:** "incident response plan", "business continuity plan", "cybersecurity program", "Third-Party Risk Management Program", "Company-wide training" — no unique identifier or distinguishing abbreviation
- **Company self-references:** the company's own name, "the Company", "the Bank", subsidiary names, filing form types
- **Company milestones:** "since our IPO", "since inception" — not cybersecurity facts
### QV-Eligible Facts (count toward the 2-fact threshold for Quantified-Verifiable)
✓ Specific dates (month+year or exact date)
✓ Dollar amounts, headcounts, percentages
✓ Named third-party firms (Mandiant, CrowdStrike, Deloitte)
✓ Named products/tools (Splunk, Azure Sentinel)
✓ Named certifications held by individuals (CISSP, CISM, CEH)
✓ Years of experience as a specific number ("17 years", "over 20 years")
✓ Named universities in credential context
**Do NOT count toward QV** (these trigger Firm-Specific but not QV):
✗ Named roles (CISO, CIO)
✗ Named committees
✗ Named frameworks (NIST, ISO 27001) — these trigger Sector-Adapted
✗ Team compositions, reporting structures
✗ Named internal programs
✗ Generic degrees without named university ("BS in Management")
### Validation Step
Before finalizing specificity, review the extracted facts. Remove any that appear on the NOT list. If no facts remain after filtering → Generic Boilerplate (or Sector-Adapted if a named standard is present). Do not let NOT-list items inflate the specificity rating.
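One way to implement the validation step is a blocklist filter over the extracted fact strings, run before the decision test. The patterns below are a small illustrative subset of the NOT list, not the production set:

```typescript
// Sketch of the validation step: drop extracted "facts" that appear on
// the NOT list so they cannot inflate the specificity rating.
// Patterns are an illustrative subset of the codebook's NOT list.
const NOT_LIST_PATTERNS: RegExp[] = [
  /\b(the )?board( of directors)?\b/i,
  /\baudit committee\b/i,
  /\b(third-party experts|external consultants)\b/i,
  /\b(quarterly|annual|periodic)\b/i, // generic cadences without exact dates
  /\b(penetration testing|vulnerability scanning|tabletop exercises)\b/i,
  /\bincident response plan\b/i, // generic, no unique identifier
];

function filterFacts(extracted: string[]): string[] {
  return extracted.filter(
    (fact) => !NOT_LIST_PATTERNS.some((pattern) => pattern.test(fact))
  );
}
```

If `filterFacts` returns an empty array, the paragraph falls back to Generic Boilerplate (or Sector-Adapted when a named standard is present).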
---
## LLM Response Schema
The exact Zod schema passed to `generateObject`. This is the contract between the LLM and our pipeline.
```typescript
import { z } from "zod";
export const ContentCategory = z.enum([
"Board Governance",
"Management Role",
"Risk Management Process",
"Third-Party Risk",
"Incident Disclosure",
"Strategy Integration",
"None/Other",
]);
export const SpecificityLevel = z.union([
z.literal(1),
z.literal(2),
z.literal(3),
z.literal(4),
]);
export const Confidence = z.enum(["high", "medium", "low"]);
export const LabelOutput = z.object({
content_category: ContentCategory
.describe("The single most applicable content category for this paragraph"),
specificity_level: SpecificityLevel
.describe("1=generic boilerplate, 2=sector-adapted, 3=firm-specific, 4=quantified-verifiable"),
category_confidence: Confidence
.describe("high=clear-cut, medium=some ambiguity, low=genuinely torn between categories"),
specificity_confidence: Confidence
.describe("high=clear-cut, medium=borderline adjacent levels, low=could argue for 2+ levels"),
reasoning: z.string()
.describe("Brief 1-2 sentence justification citing specific evidence from the text"),
});
```
**Output example:**
```json
{
"content_category": "Risk Management Process",
"specificity_level": 3,
"category_confidence": "high",
"specificity_confidence": "medium",
"reasoning": "Names NIST CSF (sector-adapted) and describes quarterly tabletop exercises specific to this company's program, pushing to firm-specific. Specificity borderline 2/3 — tabletop exercises could be generic or firm-specific depending on interpretation."
}
```
---
## System Prompt
> **Note:** The system prompt below is the v1.0 template from the initial codebook. The production Stage 1 prompt is **v2.5** (in `ts/src/label/prompts.ts`), which incorporates the IS/NOT lists, calibration examples, validation step, and decision test from this codebook. The Stage 2 judge prompt (`buildJudgePrompt()` in the same file) adds dynamic disambiguation rules and confidence calibration. **This codebook is the source of truth; the prompt mirrors it.**
The v1.0 template is preserved below for reference. See `ts/src/label/prompts.ts` for the current production prompt.
```
You are an expert annotator classifying paragraphs from SEC cybersecurity disclosures (Form 10-K Item 1C and Form 8-K Item 1.05 filings) under SEC Release 33-11216.
For each paragraph, assign exactly two labels:
(a) content_category — the single most applicable category:
- "Board Governance": Board/committee oversight of cyber risk, briefing cadence, board member cyber expertise. SEC basis: Item 106(c)(1).
- "Management Role": CISO/CTO/CIO identification, qualifications, reporting lines, management committees. SEC basis: Item 106(c)(2).
- "Risk Management Process": Risk assessment methods, framework adoption (NIST, ISO), vulnerability management, monitoring, incident response planning, tabletop exercises, ERM integration. SEC basis: Item 106(b).
- "Third-Party Risk": Vendor/supplier security oversight, external assessor requirements, contractual security standards, supply chain risk. SEC basis: Item 106(b).
- "Incident Disclosure": Description of actual cybersecurity incidents — nature, scope, timing, impact, remediation. SEC basis: 8-K Item 1.05.
- "Strategy Integration": Material impact on business strategy/financials, cyber insurance, investment/resource allocation. SEC basis: Item 106(b)(2).
- "None/Other": Forward-looking disclaimers, section headers, cross-references, non-cybersecurity content.
If a paragraph spans multiple categories, assign the DOMINANT one — the category that best describes the paragraph's primary communicative purpose.
(b) specificity_level — integer 1 through 4:
1 = Generic Boilerplate: Could apply to any company unchanged. Conditional language ("may," "could"). No named entities or frameworks.
2 = Sector-Adapted: Names frameworks/standards (NIST, ISO, SOC 2) or industry-specific terms, but nothing unique to THIS company.
3 = Firm-Specific: Contains at least one fact unique to this company — named roles, specific committees, concrete reporting lines, named programs.
4 = Quantified-Verifiable: Two or more verifiable facts — dollar amounts, dates, headcounts, percentages, named third-party firms, audit results.
BOUNDARY RULES (apply when torn between adjacent levels):
1 vs 2: "Does it name ANY framework, standard, or industry-specific term?" → Yes = 2
2 vs 3: "Does it mention anything unique to THIS company?" → Yes = 3
3 vs 4: "Does it contain TWO OR MORE independently verifiable facts?" → Yes = 4
SPECIAL RULES:
- None/Other paragraphs always get specificity_level = 1.
- Hypothetical incident language ("we may experience...") in a 10-K is NOT Incident Disclosure. It is Risk Management Process or Strategy Integration.
- Incident Disclosure is only for descriptions of events that actually occurred.
CONFIDENCE RATINGS (per dimension):
- "high": Clear-cut classification with no reasonable alternative.
- "medium": Some ambiguity, but one option is clearly stronger.
- "low": Genuinely torn between two or more options.
Be honest — overconfident ratings on hard cases are worse than admitting uncertainty.
Respond with valid JSON matching the required schema. The "reasoning" field should cite specific words or facts from the paragraph that justify your labels (1-2 sentences).
```
---
## User Prompt Template
```
Company: {company_name} ({ticker})
Filing type: {filing_type}
Filing date: {filing_date}
Section: {sec_item}
Paragraph:
{paragraph_text}
```
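Filling this template is plain string interpolation; a minimal sketch (the `ParagraphRecord` shape is an assumption mirroring the placeholders, not the production type):

```typescript
// Sketch: build the user prompt from a paragraph record whose fields
// mirror the template placeholders. Illustrative shape only.
interface ParagraphRecord {
  company_name: string;
  ticker: string;
  filing_type: string;
  filing_date: string;
  sec_item: string;
  paragraph_text: string;
}

function buildUserPrompt(p: ParagraphRecord): string {
  return [
    `Company: ${p.company_name} (${p.ticker})`,
    `Filing type: ${p.filing_type}`,
    `Filing date: ${p.filing_date}`,
    `Section: ${p.sec_item}`,
    ``,
    `Paragraph:`,
    p.paragraph_text,
  ].join("\n");
}
```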
---
## Stage 2 Judge Prompt
Used when Stage 1 annotators disagree. The judge sees the paragraph plus all three prior annotations in randomized order.
```
You are adjudicating a labeling disagreement among three independent annotators. Each applied the same codebook but reached different conclusions.
Review all three opinions below, then provide YOUR OWN independent label based on the codebook definitions above. Do not default to majority vote — use your own expert judgment. If you agree with one annotator's reasoning, explain why their interpretation is correct.
Company: {company_name} ({ticker})
Filing type: {filing_type}
Filing date: {filing_date}
Section: {sec_item}
Paragraph:
{paragraph_text}
--- Prior annotations (randomized order) ---
Annotator A: content_category="{cat_a}", specificity_level={spec_a}
Reasoning: "{reason_a}"
Annotator B: content_category="{cat_b}", specificity_level={spec_b}
Reasoning: "{reason_b}"
Annotator C: content_category="{cat_c}", specificity_level={spec_c}
Reasoning: "{reason_c}"
```
---
## Cost and Time Tracking
### Per-Annotation Record
Every API call produces an `Annotation` record with full provenance:
```typescript
provenance: {
modelId: string, // OpenRouter model ID e.g. "google/gemini-3.1-flash-lite-preview"
provider: string, // Upstream provider e.g. "google", "xai", "anthropic"
generationId: string, // OpenRouter generation ID (from response id field)
stage: "stage1" | "stage2-judge" | "benchmark",
runId: string, // UUID per batch run
promptVersion: string, // "v1.0" — tracks prompt iterations
inputTokens: number, // From usage.prompt_tokens
outputTokens: number, // From usage.completion_tokens
reasoningTokens: number, // From usage.completion_tokens_details.reasoning_tokens
costUsd: number, // REAL cost from OpenRouter usage.cost (not estimated)
latencyMs: number, // Wall clock per request
requestedAt: string, // ISO datetime
}
```
### Cost Source
OpenRouter returns **actual cost** in every response body under `usage.cost` (USD), so no estimation is needed. Each response also includes a `generationId` (the response `id` field), which we store in every annotation record; see "OpenRouter Generation ID" under Pipeline Reliability for the audit and stats capabilities this enables.
### Aggregation Levels
| Level | What | Where |
|-------|------|-------|
| Per-annotation | Single API call cost + latency | In each Annotation JSONL record |
| Per-model | Sum across all annotations for that model | `bun sec label:cost` |
| Per-stage | Stage 1 total, Stage 2 total | `bun sec label:cost` |
| Per-phase | Labeling total, benchmarking total | `bun sec label:cost` |
| Project total | Everything | `bun sec label:cost` |
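The per-model and per-stage rollups are a single group-by over provenance records. A sketch operating on in-memory records (the real `bun sec label:cost` command reads them from JSONL; the `CostRecord` shape is a trimmed illustration of the provenance fields above):

```typescript
// Sketch of cost aggregation: sum costUsd grouped by model or stage.
interface CostRecord {
  modelId: string;
  stage: "stage1" | "stage2-judge" | "benchmark";
  costUsd: number;
}

function totalsBy(
  records: CostRecord[],
  key: "modelId" | "stage"
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    const k = r[key];
    totals.set(k, (totals.get(k) ?? 0) + r.costUsd);
  }
  return totals;
}
```

The project total is simply the sum over either grouping.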
### Time Tracking
| Metric | How |
|--------|-----|
| Per-annotation latency | `Date.now()` before/after API call |
| Batch throughput | paragraphs/minute computed from batch start/end |
| Stage 1 wall clock | Logged at batch start and end |
| Stage 2 wall clock | Logged at batch start and end |
| Total labeling time | Sum of all batch durations |
| Per-model benchmark time | Tracked during benchmark runs |
All timing is logged to `data/metadata/cost-log.jsonl` with entries like:
```json
{
"event": "batch_complete",
"stage": "stage1",
"modelId": "openai/gpt-oss-120b",
"paragraphsProcessed": 50000,
"wallClockSeconds": 14400,
"totalCostUsd": 38.50,
"throughputPerMinute": 208.3,
"timestamp": "2026-03-29T10:30:00Z"
}
```
---
## NIST CSF 2.0 Mapping
For academic grounding:
| Our Category | NIST CSF 2.0 |
|-------------|-------------|
| Board Governance | GOVERN (GV.OV, GV.RR) |
| Management Role | GOVERN (GV.RR, GV.RM) |
| Risk Management Process | IDENTIFY (ID.RA), GOVERN (GV.RM), PROTECT (all) |
| Third-Party Risk | GOVERN (GV.SC) |
| Incident Disclosure | DETECT, RESPOND, RECOVER |
| Strategy Integration | GOVERN (GV.OC, GV.RM) |
---
## Prompt Versioning
Track prompt changes so we can attribute label quality to specific prompt versions:
| Version | Date | N | Change |
|---------|------|---|--------|
| v1.0 | 2026-03-27 | 40 | Initial codebook-aligned prompt |
| v1.1 | 2026-03-28 | 40 | Added calibration examples, category decision rules. Cat 95%, Spec 68%, Both 62%. |
| v1.2 | 2026-03-28 | 40 | Expanded "what counts as unique" + materiality rule. REGRESSED (88% cat). |
| v2.0 | 2026-03-28 | 40 | Chain-of-thought schema with specific_facts array + algorithmic specificity. Gemini/Grok 5/5, GPT-OSS broken. |
| v2.1 | 2026-03-28 | 40 | Two-tier facts (organizational vs verifiable) + text enum labels. Gemini/Grok perfect but nano overrates. |
| v2.2 | 2026-03-28 | 40 | Decision-test format, simplified facts, "NOT a fact" list. Cat 95%, Spec 68%, Both 65%, Consensus 100%. |
| v2.2 | 2026-03-28 | 500 | 500-sample baseline. Cat 85.0%, Spec 60.8%, Both 51.4%, Consensus 99.6%, Spread 0.240. |
| v2.3 | 2026-03-28 | 500 | Tightened Sector-Adapted, expanded IS/NOT lists, QV boundary rules. Spec 72.0%, Both 59.2%. [1,1,2] eliminated. |
| v2.4 | 2026-03-28 | 500 | Validation step, schema constraint on specific_facts. Spec 78.6%, Both 66.8%. Nano overrating fixed. |
| v2.5 | 2026-03-28 | 500 | Improved Inc↔Strat tiebreaker, QV calibration examples. **PRODUCTION**: Cat 86.8%, Spec 81.0%, Both 70.8%, Consensus 99.4%, Spread 0.130. Inc↔Strat eliminated. |
| v2.6 | 2026-03-28 | 500 | Changed category defs to TEST: format. REGRESSED (Both 67.8%). |
| v2.7 | 2026-03-28 | 500 | Added COMMON MISTAKES section. 100% consensus but Both 67.6%. |
| v3.0 | 2026-03-29 | — | **Codebook overhaul.** Three rulings: (A) materiality disclaimers → Strategy Integration, (B) SPACs/no-ops → None/Other, (C) person-vs-function test for Mgmt Role vs RMP. Added full IS/NOT lists and QV-eligible list to codebook. Added Rule 2b, Rule 6, 4 new borderline cases. Prompt update pending. |
| v3.5 | 2026-04-02 | 26 | **Post-gold-analysis rulings, 6 iteration rounds on 26 regression paragraphs ($1.02).** Driven by 13-signal cross-analysis + targeted prompt iteration. (A) Rule 6 refined: materiality ASSESSMENTS → SI (backward-looking conclusions + "reasonably likely" forward-looking). Generic "could have a material adverse effect" is NOT an assessment — it stays N/O/RMP. Cross-references with materiality language also stay N/O. (B) Rule 2 expanded: purpose test for BG — governance structure descriptions are BG, but a one-sentence committee mention doesn't flip the category. (C) Rule 2b expanded: three-step MR↔RMP decision chain; Step 1 only decisive for RMP (process is subject), never short-circuits to MR. (D) N/O vs RMP clarified: actual measures implemented = RMP even in risk-factor framing. Result: +4pp on 26 hardest paragraphs vs v3.0 (18→22/26). |
When the prompt changes (after pilot testing, rubric revision, etc.), bump the version and log what changed. Every annotation record carries `promptVersion` so we can filter/compare.
---
## Iterative Prompt Tuning Protocol
The v1.0 system prompt is built from theory and synthetic examples. Before firing the full 50K run, we iterate on real data to find and fix failure modes while it costs cents, not dollars.
### Phase 0: Seed sample (before extraction is ready)
Grab 20-30 real Item 1C paragraphs manually from EDGAR full-text search (`efts.sec.gov/LATEST/search-index?q="Item 1C" cybersecurity`). Paste into a JSONL by hand. This lets prompt tuning start immediately while extraction code is still being built.
### Phase 1: Micro-pilot (30 paragraphs, all 3 Stage 1 models)
1. Select ~30 real paragraphs covering:
- At least 2 per content category (incl. None/Other)
- At least 2 per specificity level
- Mix of industries and filing years
- 5+ deliberately tricky borderline cases
2. Run all 3 Stage 1 models on these 30 with prompt v1.0.
3. **You and at least one teammate independently label the same 30** using the codebook. These are your reference labels.
4. Compare:
- Per-model accuracy vs reference
- Inter-model agreement (where do they diverge?)
- Per-category confusion (which categories do models mix up?)
- Per-specificity bias (do models systematically over/under-rate?)
- Are confidence ratings calibrated? (Do "high" labels match correct ones?)
5. **Identify failure patterns.** Common ones:
- Models gravitating to "Risk Management Process" (largest category — pull)
- Models rating specificity too high (any named entity → firm-specific)
- Board Governance / Management Role confusion
- Missing None/Other (labeling boilerplate as Strategy Integration)
### Phase 2: Prompt revision (v1.1)
Based on Phase 1 failures, revise the system prompt:
- Add "common mistakes" section with explicit corrections
- Add few-shot examples for confused categories
- Sharpen boundary rules where models diverge
- Add negative examples ("This is NOT Incident Disclosure because...")
**Do not change the Zod schema or category definitions** — only the system prompt text. Bump to v1.1. Re-run the same 30 paragraphs. Compare to v1.0.
### Phase 3: Scale pilot (200 paragraphs)
1. Extract 200 real paragraphs (stratified, broader set of filings).
2. Run all 3 Stage 1 models with the best prompt version.
3. Compute:
- **Inter-model Fleiss' Kappa** on category: target ≥ 0.65
- **Inter-model Spearman correlation** on specificity: target ≥ 0.70
- **Consensus rate**: % with 2/3+ agreement on both dims. Target ≥ 75%.
- **Confidence calibration**: are "high confidence" labels more likely agreed-upon?
4. If targets not met:
- Analyze disagreements — genuine ambiguity or prompt failure?
- Prompt failure → revise to v1.2, re-run
- Genuine ambiguity → consider rubric adjustment (merge categories, collapse specificity)
- Repeat until targets met or documented why they can't be
5. **Cost check**: extrapolate from 200 to 50K. Reasoning token usage reasonable?
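The consensus-rate and Fleiss' Kappa computations in step 3 can be sketched directly for the category dimension. Here `labels[i]` holds all raters' labels for paragraph i, and every paragraph is assumed to have the same number of raters:

```typescript
// Consensus rate: fraction of items where at least `needed` raters agree.
function consensusRate(labels: string[][], needed = 2): number {
  const hits = labels.filter((row) => {
    const counts = new Map<string, number>();
    for (const l of row) counts.set(l, (counts.get(l) ?? 0) + 1);
    return Math.max(...counts.values()) >= needed;
  }).length;
  return hits / labels.length;
}

// Fleiss' kappa for N items, n raters per item, nominal categories.
function fleissKappa(labels: string[][]): number {
  const N = labels.length;
  const n = labels[0].length;
  const catTotals = new Map<string, number>();
  let sumPi = 0;
  for (const row of labels) {
    const counts = new Map<string, number>();
    for (const l of row) {
      counts.set(l, (counts.get(l) ?? 0) + 1);
      catTotals.set(l, (catTotals.get(l) ?? 0) + 1);
    }
    let sumSq = 0;
    for (const c of counts.values()) sumSq += c * c;
    sumPi += (sumSq - n) / (n * (n - 1)); // per-item agreement P_i
  }
  const pBar = sumPi / N;
  let pe = 0; // chance agreement from marginal category proportions
  for (const t of catTotals.values()) pe += (t / (N * n)) ** 2;
  return (pBar - pe) / (1 - pe);
}
```

Perfect agreement yields kappa = 1; kappa near 0 means agreement no better than chance, which is why the ≥ 0.65 target is meaningful even when raw consensus looks high.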
### Phase 4: Green light
Once scale pilot passes:
- Lock prompt version (no changes during full run)
- Lock model configuration (reasoning effort, temperature)
- Document final prompt, configs, and pilot results
- Fire the full 50K annotation run
---
## Pipeline Reliability & Observability
### Resumability
All API-calling scripts (annotation, judging, benchmarking) use the same pattern:
1. Load output JSONL → parse each line → collect completed paragraph IDs into a Set
2. Lines that fail `JSON.parse` are skipped (truncated from a crash)
3. Filter input to only paragraphs NOT in the completed set
4. For each completion, append one valid JSON line + `flush()`
JSONL line-append is atomic on Linux. Worst case on crash: one truncated line, skipped on reload. No data loss, no duplicate work, no duplicate API spend.
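Steps 1-3 of the resume pattern reduce to a tolerant JSONL scan. A sketch (the `paragraphId` field name is an assumption for illustration; the real script reads the output file from disk):

```typescript
// Sketch of the resume pattern: parse output JSONL, skip any line that
// fails JSON.parse (e.g. truncated by a crash), collect completed IDs.
function completedIds(jsonl: string): Set<string> {
  const done = new Set<string>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    try {
      const record = JSON.parse(line);
      if (typeof record.paragraphId === "string") done.add(record.paragraphId);
    } catch {
      // truncated/corrupt line from a crash — skipped, will be redone
    }
  }
  return done;
}
```

The input set is then filtered to paragraphs not in this set, which is what guarantees no duplicate API spend on restart.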
### Error Handling
| Error Type | Examples | Strategy |
|------------|----------|----------|
| Transient | 429, 500, 502, 503, ECONNRESET, timeout | Exponential backoff: 1s→2s→4s→8s→16s. Max 5 retries. |
| Permanent | 400, 422 (bad request) | Log to `{output}-errors.jsonl`, skip |
| Validation | Zod parse fail on LLM response | Retry once, then log + skip |
| Budget | 402 (out of credits) | Stop immediately, write session summary, exit |
| Consecutive | 10+ errors in a row | Stop — likely systemic (model down, prompt broken) |
Error paragraphs get their own file. Retry later with `--retry-errors`.
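The transient-error schedule is easiest to keep testable as pure functions that the retry loop consults. A sketch under the table's parameters (1s base, doubling, max 5 retries):

```typescript
// Backoff schedule for transient errors: 1s -> 2s -> 4s -> 8s -> 16s.
const MAX_RETRIES = 5;
const TRANSIENT_STATUSES = new Set([429, 500, 502, 503]);

function backoffMs(attempt: number): number {
  return 1000 * 2 ** attempt; // attempt 0 -> 1000ms, attempt 4 -> 16000ms
}

function shouldRetry(status: number, attempt: number): boolean {
  // Permanent errors (400, 422) and exhausted attempts fall through
  // to the error-file path instead of retrying.
  return TRANSIENT_STATUSES.has(status) && attempt < MAX_RETRIES;
}
```

The actual loop would `await sleep(backoffMs(attempt))` between calls; keeping the schedule pure makes the policy trivially unit-testable.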
### Graceful Shutdown (SIGINT/SIGTERM)
On Ctrl+C:
1. Stop dispatching new work
2. Wait for in-flight requests to complete (already paid for)
3. Write session summary
4. Print final stats, exit 0
### Live Dashboard (stderr)
Updates every second:
```
SEC-cyBERT │ label:annotate │ google/gemini-3.1-flash-lite-preview │ v1.1
─────────────────────────────────────────────────────────────────────────
Progress 12,847 / 50,234 (25.6%) ETA 42m 18s
Session $1.23 │ 38m 12s elapsed │ 337.4 para/min
Totals $4.56 all-time │ 3 errors (0.02%) │ 7 retries
Latency p50: 289ms │ p95: 812ms │ p99: 1,430ms
Reasoning avg 47 tokens/para │ 12.3% of output tokens
```
Goes to stderr so stdout stays clean.
### Session Log
Every run appends to `data/metadata/sessions.jsonl`:
```json
{
"sessionId": "a1b2c3d4",
"command": "label:annotate",
"modelId": "google/gemini-3.1-flash-lite-preview",
"stage": "stage1",
"promptVersion": "v1.1",
"startedAt": "2026-03-29T10:00:00Z",
"endedAt": "2026-03-29T10:38:12Z",
"durationSeconds": 2292,
"paragraphsTotal": 50234,
"paragraphsProcessed": 12847,
"paragraphsSkippedResume": 37384,
"paragraphsErrored": 3,
"costUsd": 1.23,
"reasoningTokensTotal": 482000,
"avgLatencyMs": 450,
"p95LatencyMs": 812,
"throughputPerMinute": 337.4,
"concurrency": 12,
"exitReason": "complete"
}
```
`exitReason`: `complete` | `interrupted` (Ctrl+C) | `budget_exhausted` (402) | `error_threshold` (consecutive limit)
### OpenRouter Generation ID
Every annotation record includes the OpenRouter `generationId` from the response `id` field. This enables:
- **Audit trail**: look up any annotation on OpenRouter's dashboard
- **Rich stats**: `GET /api/v1/generation?id={generationId}` returns latency breakdown, provider routing, native token counts
- **Dispute resolution**: if a label looks wrong, inspect the exact generation that produced it
---
## Gold Set Protocol
### Sampling (1,200 paragraphs minimum)
Stratify by:
- Content category (all 7 represented, oversample rare categories)
- Specificity level (all 4 represented)
- GICS sector (financial services, tech, healthcare, manufacturing minimum)
- Filing year (FY2023 and FY2024)
### Human Labeling Process
Labeling is done through a purpose-built web tool that enforces quality:
1. **Rules quiz:** Every annotator must read the codebook and pass a quiz on the rules before each labeling session. The quiz tests the three most common confusion axes: Management Role vs RMP (person-vs-function test), materiality disclaimers (Strategy Integration vs None/Other), and QV fact counting.
2. **Warm-up:** First 5 paragraphs per session are warm-up (pre-labeled, with feedback). Not counted toward gold set.
3. **Independent labeling:** Three team members independently label the full gold set using this codebook.
4. Compute inter-rater reliability:
- Cohen's Kappa (for content category — nominal, pairwise)
- Krippendorff's Alpha (for specificity level — ordinal, all annotators)
- Per-class confusion matrices
- **Target: Kappa > 0.75, Alpha > 0.67**
5. Adjudicate disagreements: third annotator tiebreaker, or discussion consensus with documented rationale
6. Run the full GenAI pipeline on the gold set and compare to human labels
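Pairwise Cohen's Kappa for the category dimension (step 4) is a short computation. A sketch, assuming two annotators' labels are aligned by paragraph:

```typescript
// Pairwise Cohen's kappa for nominal labels. `a` and `b` are two
// annotators' category labels for the same paragraphs, in the same order.
function cohensKappa(a: string[], b: string[]): number {
  const n = a.length;
  let observed = 0;
  const aCounts = new Map<string, number>();
  const bCounts = new Map<string, number>();
  for (let i = 0; i < n; i++) {
    if (a[i] === b[i]) observed++;
    aCounts.set(a[i], (aCounts.get(a[i]) ?? 0) + 1);
    bCounts.set(b[i], (bCounts.get(b[i]) ?? 0) + 1);
  }
  const po = observed / n; // observed agreement
  let pe = 0; // expected agreement from each rater's marginal distribution
  for (const [label, ca] of aCounts) {
    pe += (ca / n) * ((bCounts.get(label) ?? 0) / n);
  }
  return (po - pe) / (1 - pe);
}
```

Krippendorff's Alpha for the ordinal specificity dimension is more involved (it weights near-misses less than far-misses), so in practice a stats library is the better choice there.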
### If Agreement Is Poor
- If Kappa < 0.60 on any category pair: revise that category's definition and boundary rules, re-pilot
- If Alpha < 0.50 on specificity: collapse 4-point to 3-point scale (merge 1+2 into "Non-specific" or 3+4 into "Substantive")
- Document the collapse decision and rationale in this codebook

(File diff suppressed because it is too large)

docs/NARRATIVE-v1.md (new file, 1,292 lines; file diff suppressed because it is too large)

(File diff suppressed because it is too large)

@@ -1,221 +1,195 @@
# Project Status — 2026-04-02 (evening) # Project Status — 2026-04-03 (v2 Reboot)
## What's Done **Deadline:** 2026-04-24 (21 days)
## What's Done (Carried Forward from v1)
### Data Pipeline ### Data Pipeline
- [x] 72,045 paragraphs extracted from ~9,000 10-K filings + 207 8-K filings - [x] 72,045 paragraphs extracted from ~9,000 10-K + 207 8-K filings
- [x] 14 filing generators identified, quality metrics per generator - [x] 14 filing generators identified, 6 surgical patches applied
- [x] 6 surgical patches applied (orphan words + heading stripping)
- [x] Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%) - [x] Quality tier system: clean (80.7%), headed (10.3%), degraded (6.0%), minor (3.0%)
- [x] Embedded bullet detection (2,163 paragraphs flagged degraded, 0.5x sample weight) - [x] 72 truncated filings identified and excluded
- [x] All data integrity rules formalized (frozen originals, UUID-linked patches) - [x] All data integrity rules formalized (frozen originals, UUID-linked patches)
### GenAI Labeling (Stage 1) ### Pre-Training
- [x] Prompt v2.5 locked after 12+ iterations - [x] DAPT: 1 epoch on 500M tokens, eval loss 0.7250, ~14.5h on RTX 3090
- [x] 3-model panel: gemini-flash-lite + mimo-v2-flash + grok-4.1-fast - [x] TAPT: 5 epochs on 72K paragraphs, eval loss 1.0754, ~50 min on RTX 3090
- [x] 150,009 annotations completed ($115.88, 0 failures) - [x] Custom `WholeWordMaskCollator` (upstream broken for BPE)
- [x] Orphan word re-annotation: 1,537 paragraphs re-run ($3.30), merged into `stage1.patched.jsonl` - [x] Checkpoints: `checkpoints/dapt/` and `checkpoints/tapt/`
- [x] Codebook v3.0 with 3 major rulings
### DAPT + TAPT Pre-Training ### v1 Labeling (preserved, not used for v2 training)
- [x] DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped) - [x] 150K Stage 1 annotations (v2.5 prompt, $115.88)
- [x] DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090. - [x] 10-model benchmark (8 suppliers, $45.63)
- [x] DAPT checkpoint at `checkpoints/dapt/modernbert-large/final/` - [x] Human labeling: 6 annotators × 600 paragraphs, category α=0.801, specificity α=0.546
- [x] TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08. - [x] Gold adjudication: 13-signal cross-analysis, 5-tier adjudication
- [x] TAPT checkpoint at `checkpoints/tapt/modernbert-large/final/` - [x] Codebook v1.0→v3.5 iteration (12+ prompt versions, 6 v3.5 rounds)
- [x] Custom `WholeWordMaskCollator` (upstream `transformers` collator broken for BPE tokenizers) - [x] All v1 data preserved at original paths + `docs/NARRATIVE-v1.md`
- [x] Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- [x] Procedure documented in `docs/DAPT-PROCEDURE.md`
### Human Labeling — Complete ### v2 Codebook (this session)
- [x] All 6 annotators completed 600 paragraphs each (3,600 labels total, 1,200 paragraphs × 3) - [x] LABELING-CODEBOOK.md v2: broadened Level 2, 1+ QV, "what question?" test
- [x] BIBD assignment: each paragraph labeled by exactly 3 of 6 annotators - [x] CODEBOOK-ETHOS.md: full reasoning, worked edge cases
- [x] Full data export: raw labels, timing, quiz sessions, metrics → `data/gold/` - [x] NARRATIVE.md: data/pretraining carried forward, pivot divider, v2 section started
- [x] Comprehensive IRR analysis → `data/gold/charts/` - [x] STATUS.md: this document
| Metric | Category | Specificity | Both | ---
|--------|----------|-------------|------|
| Consensus (3/3 agree) | 56.8% | 42.3% | 27.0% |
| Krippendorff's α | 0.801 | 0.546 | — |
| Avg Cohen's κ | 0.612 | 0.440 | — |
### Prompt v3.0 ## What's Next (v2 Pipeline)
- [x] Codebook v3.0 rulings: materiality disclaimers → SI, SPACs → N/O, person-vs-function test for MR↔RMP
- [x] Prompt version bumped from v2.5 → v3.0
### GenAI Holdout Benchmark — Complete ### Step 1: Codebook Finalization ← CURRENT
- [x] 6 benchmark models + Opus 4.6 on the 1,200 holdout paragraphs - [x] Draft v2 codebook with systemic changes
- [x] All 1,200 annotations per model (0 failures after minimax/kimi fence-stripping fix) - [x] Draft codebook ethos with full reasoning
- [x] Total benchmark cost: $45.63 - [ ] Get group approval on v2 codebook (share both docs)
- [ ] Incorporate any group feedback
| Model | Supplier | Cost | Cat % vs Opus | Both % vs Opus | ### Step 2: Prompt Iteration (dev set)
|-------|----------|------|---------------|----------------| - [ ] Draw ~200 paragraph dev set from existing Stage 1 labels (stratified, separate from holdout)
| openai/gpt-5.4 | OpenAI | $6.79 | 88.2% | 79.8% | - [ ] Update Stage 1 prompt to match v2 codebook
| google/gemini-3.1-pro-preview | Google | $16.09 | 87.4% | 80.0% | - [ ] Run 2-3 models on dev set, analyze results
| moonshotai/kimi-k2.5 | Moonshot | $7.70 | 85.1% | 76.8% | - [ ] Iterate prompt against judge panel until reasonable consensus
| z-ai/glm-5:exacto | Zhipu | $6.86 | 86.2% | 76.5% | - [ ] Update codebook with any rulings needed (should be minimal if rules are clean)
| xiaomi/mimo-v2-pro:exacto | Xiaomi | $6.59 | 85.7% | 76.3% | - [ ] Re-approval if codebook changed materially
| minimax/minimax-m2.7:exacto | MiniMax | $1.61 | 82.8% | 63.6% | - **Estimated cost:** ~$5-10
| anthropic/claude-opus-4.6 | Anthropic | $0 | — | — | - **Estimated time:** 1-2 sessions
Plus Stage 1 panel already on file = **10 models, 8 suppliers**. ### Step 3: Stage 1 Re-Run
- [ ] Lock v2 prompt
- [ ] Re-run Stage 1 on full corpus (~50K paragraphs × 3 models)
- [ ] Distribution check: verify Level 2 grew to ~20%, category distribution healthy
- [ ] If distribution is off → iterate codebook/prompt before proceeding
- **Estimated cost:** ~$120
- **Estimated time:** ~30 min execution
### 13-Signal Cross-Source Analysis — Complete ### Step 4: Holdout Selection
- [x] 30 diagnostic charts generated → `data/gold/charts/` - [ ] Draw stratified holdout from new Stage 1 labels
- [x] Leave-one-out analysis (no model privileged as reference) - ~170 per category class × 7 ≈ 1,190
- [x] Adjudication tier breakdown computed - Random within each stratum (NOT difficulty-weighted)
- Secondary constraint: minimum ~100 per specificity level
- Exclude dev set paragraphs
- [ ] Draw separate AI-labeled extension set (up to 20K) if desired
- **Depends on:** Step 3 complete + distribution check passed
**Adjudication tiers (13 signals per paragraph):** ### Step 5: Labelapp Update
- [ ] Update quiz questions for v2 codebook (new Level 2 definition, 1+ QV, "what question?" test)
- [ ] Update warmup paragraphs with v2 examples
- [ ] Update codebook sidebar content
- [ ] Load new holdout paragraphs into labelapp
- [ ] Generate new BIBD assignments (3 of 6 annotators per paragraph)
- [ ] Test the full flow (quiz → warmup → labeling)
- **Depends on:** Step 4 complete
### Step 6: Parallel Labeling

- [ ] **Humans:** Tell annotators to start labeling the v2 holdout
- [ ] **Models:** Run full benchmark panel on holdout (10+ models, 8+ suppliers)
  - Stage 1 panel (gemini-flash-lite, mimo-v2-flash, grok-4.1-fast)
  - Benchmark panel (gpt-5.4, gemini-pro, kimi-k2.5, glm-5, mimo-v2-pro, minimax-m2.7)
  - Opus 4.6 via Anthropic SDK (new addition, treated as another benchmark model)
- **Estimated model cost:** ~$45
- **Estimated human time:** 2-3 days (600 paragraphs per annotator)
- **Depends on:** Step 5 complete
### Step 7: Gold Set Assembly

- [ ] Compute human IRR (target: category α > 0.75, specificity α > 0.67)
- [ ] Gold = majority vote (where all 3 disagree, model consensus tiebreaker)
- [ ] Validate gold against model panel — check for systematic human errors (learned from v1 SI↔N/O)
- **Depends on:** Step 6 complete (both humans and models)
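The voting rule above is small enough to pin down in code. A sketch, assuming labels arrive as plain strings and the model panel's plurality label is an acceptable consensus proxy:

```python
from collections import Counter

def gold_label(human_votes, model_votes):
    """Majority of the 3 human votes; if all three disagree, fall back
    to the model panel's most common label (model consensus tiebreaker)."""
    top, n = Counter(human_votes).most_common(1)[0]
    if n >= 2:
        return top
    return Counter(model_votes).most_common(1)[0][0]
```

Note the fallback only fires on a three-way human split; a 2-1 split never consults the models, which keeps the gold set human-anchored.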
### Step 8: Stage 2 (if needed)

- [ ] Bench Stage 2 adjudication accuracy against gold
- [ ] If Stage 2 adds value → iterate prompt, run on disputed Stage 1 paragraphs
- [ ] If Stage 2 adds minimal value → document finding, skip production run
- **Estimated cost:** ~$20-40 if run
- **Depends on:** Step 7 complete
### Step 9: Training Data Assembly

- [ ] Unanimous Stage 1 labels → full weight
- [ ] Calibrated majority labels → full weight
- [ ] Judge high-confidence (if Stage 2 run) → full weight
- [ ] Quality tier weights: clean/headed/minor = 1.0, degraded = 0.5
- [ ] Nuke 72 truncated filings
- **Depends on:** Step 8 complete
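The weighting rules above collapse into a single pure function, which makes the assembly step easy to audit. Source and tier names here are illustrative shorthand, not a fixed schema:

```python
def sample_weight(source, quality_tier):
    """Per-example training weight: all three label sources carry full
    weight; the quality tier then scales it (degraded text halves it)."""
    base = {"unanimous": 1.0, "majority": 1.0, "judge": 1.0}[source]
    tier = 1.0 if quality_tier in {"clean", "headed", "minor"} else 0.5
    return base * tier
```

Keeping the source multiplier explicit (even though all are currently 1.0) leaves room to down-weight judge labels later without touching the assembly pipeline.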
### Step 10: Fine-Tuning

- [ ] Ablation matrix: {base, +DAPT, +DAPT+TAPT} × {±class weighting} × {CE vs focal loss}
- [ ] Dual-head classifier: shared ModernBERT backbone + category head (7-class) + specificity head (4-class ordinal)
- [ ] Ordinal regression (CORAL) for specificity
- [ ] SCL for boundary separation (optional, if time permits)
- **Estimated time:** 12-20h GPU
- **Depends on:** Step 9 complete
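The CORAL formulation reduces the 4-level ordinal specificity head to K−1 binary "is the level greater than k" decisions, so predictions can never skip rungs incoherently. A dependency-free sketch of the target encoding and the standard decode (count cumulative probabilities above threshold):

```python
def coral_encode(level, num_levels=4):
    """CORAL target for an ordinal label: K-1 binary indicators
    [level > 0, level > 1, ..., level > K-2]."""
    return [1 if level > k else 0 for k in range(num_levels - 1)]

def coral_decode(probs, threshold=0.5):
    """Predicted level = number of cumulative probabilities above
    threshold (CORAL's shared weights keep probs monotone in training)."""
    return sum(1 for p in probs if p > threshold)
```

In the dual-head model the specificity head would emit these three logits with shared weights and per-threshold biases; the sketch covers only the label transform, not the backbone wiring.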
### Step 11: Evaluation & Paper

- [ ] Macro F1 on holdout (target: > 0.80 for both heads)
- [ ] Per-class F1 breakdown
- [ ] Full GenAI benchmark table (10+ models × holdout)
- [ ] Cost/time/reproducibility comparison
- [ ] Error analysis on hardest cases
- [ ] IGNITE slides (20 slides, 15s each)
- [ ] Python notebooks for replication (assignment requirement)
- **Depends on:** Step 10 complete
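Since macro F1 is the headline target, it is worth keeping a reference implementation in the replication notebooks rather than trusting a library default silently. A minimal sketch (unweighted mean of per-class F1 over classes present in the gold labels, matching scikit-learn's `average="macro"` for that class set):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; classes with no true or
    predicted instances contribute 0.0."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because it averages classes equally, a rare category with poor F1 drags the score down as hard as a common one, which is the point of choosing macro over micro for this imbalanced label set.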
---
## Timeline Estimate
| Step | Days | Cumulative |
|------|------|------------|
| 1. Codebook approval | 1 | 1 |
| 2. Prompt iteration | 2 | 3 |
| 3. Stage 1 re-run | 0.5 | 3.5 |
| 4. Holdout selection | 0.5 | 4 |
| 5. Labelapp update | 1 | 5 |
| 6. Parallel labeling | 3 | 8 |
| 7. Gold assembly | 1 | 9 |
| 8. Stage 2 (if needed) | 1 | 10 |
| 9. Training data assembly | 0.5 | 10.5 |
| 10. Fine-tuning | 3-5 | 13.5-15.5 |
| 11. Evaluation + paper | 3-5 | 16.5-20.5 |
**Buffer:** 0.5-4.5 days. Tight but feasible if Steps 1-5 execute cleanly.
---

## Rubric Checklist (Assignment)
### C (F1 > .80): the goal

- [ ] Fine-tuned model with F1 > .80 — category likely, specificity needs v2 broadening
- [x] Performance comparison GenAI vs fine-tuned — 10 models benchmarked (will re-run on v2 holdout)
- [x] Labeled datasets — 150K Stage 1 + 1,200 gold (v1; will be redone for v2)
- [x] Documentation — extensive
- [ ] Python notebooks for replication
### B (3+ of 4): already have all 4

- [x] Cost, time, reproducibility — dollar amounts for every API call
- [x] 6+ models, 3+ suppliers — 10 models, 8 suppliers (+ Opus in v2)
- [x] Contemporary self-collected data — 72K paragraphs from SEC EDGAR
- [x] Compelling use case — SEC cyber disclosure quality assessment
### A (3+ of 4): have 3, working on the 4th

- [x] Error analysis — T5 deep-dive, confusion-axis analysis, model reasoning examination
- [x] Mitigation strategy — v1→v2 codebook evolution, experimental validation
- [ ] Additional baselines — dictionary/keyword approach (specificity IS/NOT lists as baseline)
- [x] Comparison to amateur labels — annotator before/after, human vs model agreement analysis
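The dictionary/keyword baseline is cheap to stand up: count hits from the codebook's IS list against hits from its NOT list and threshold the difference. A sketch with placeholder term lists (the real lists come from the codebook):

```python
def keyword_specificity(text, is_terms, not_terms):
    """Dictionary baseline: score = specific-indicator hits minus
    vague-indicator hits; positive score → 'specific'. Term lists are
    placeholders for the codebook's IS/NOT lists."""
    t = text.lower()
    score = sum(t.count(w) for w in is_terms) - sum(t.count(w) for w in not_terms)
    return "specific" if score > 0 else "boilerplate"
```

Even a weak baseline like this is useful in the paper: the gap between it and the fine-tuned model quantifies how much of the task is beyond surface lexical cues.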
---
## Key File Locations

| What | Where |
|------|-------|
| v2 codebook | `docs/LABELING-CODEBOOK.md` |
| v2 codebook ethos | `docs/CODEBOOK-ETHOS.md` |
| v2 narrative | `docs/NARRATIVE.md` |
| v1 codebook (preserved) | `docs/LABELING-CODEBOOK-v1.md` |
| v1 narrative (preserved) | `docs/NARRATIVE-v1.md` |
| Strategy notes | `docs/STRATEGY-NOTES.md` |
| Paragraphs | `data/paragraphs/paragraphs-clean.jsonl` (72,045) |
| Patched paragraphs | `data/paragraphs/paragraphs-clean.patched.jsonl` (49,795) |
| v1 Stage 1 annotations | `data/annotations/stage1.patched.jsonl` (150,009) |
| v1 gold labels | `data/gold/gold-adjudicated.jsonl` (1,200) |
| v1 human labels | `data/gold/human-labels-raw.jsonl` (3,600) |
| v1 benchmark annotations | `data/annotations/bench-holdout/*.jsonl` |
| DAPT checkpoint | `checkpoints/dapt/modernbert-large/final/` |
| TAPT checkpoint | `checkpoints/tapt/modernbert-large/final/` |
| DAPT corpus | `data/dapt-corpus/shard-*.jsonl` |
| Stage 1 prompt | `ts/src/label/prompts.ts` |
| Annotation runner | `ts/src/label/annotate.ts` |
| Labelapp | `labelapp/` |