Codebook Ethos — Design Reasoning & Edge Case Analysis
This document explains the reasoning behind every decision in the labeling codebook. It is the training companion for human annotators and the design record for the project. If you read this document and disagree with anything in the codebook, flag it — we want to resolve disagreements here, not at labeling time.
Why This Document Exists
The codebook (LABELING-CODEBOOK.md) tells you WHAT to do. This document tells you WHY. The distinction matters because:
- Models need clean instructions. The codebook is designed to go directly into an LLM system prompt. Extra explanation creates context pollution and can cause models to overfit to edge case reasoning rather than applying general rules.
- Humans need understanding. A human annotator who understands the reasoning behind a rule will correctly handle novel edge cases that the rule doesn't explicitly cover. A human who only knows the rule will freeze on ambiguity or make inconsistent judgment calls.
- Decisions need documentation. Every bright line in the codebook represents a deliberate choice. Documenting the reasoning makes those choices auditable, revisable, and defensible in the final paper.
Why v2? What Changed from v1
The v1 codebook (preserved at docs/LABELING-CODEBOOK-v1.md) was built over 12+ prompt iterations and served through 150K Stage 1 annotations, a 6-person human labeling round, and a 10-model benchmark. It worked — but it had structural problems that became visible only at evaluation time:
Problem 1: Specificity Level 2 was too narrow
The professor's construct defines Level 2 as "Sector-adapted — references industry but no firm-specific details." Our v1 codebook interpreted this as "names a specific recognized standard (NIST, ISO 27001, SOC 2, etc.)." That interpretation was too literal. Things like penetration testing, vulnerability scanning, SIEM, phishing simulations — these are all cybersecurity industry practices that a security professional instantly recognizes as domain-specific. Our codebook classified them as Level 1 (generic boilerplate), which squeezed Level 2 down to 3.9% of the holdout (47 samples).
At 47 samples, a swing of just ±3 correct predictions moves F1 by ~0.06. The measurement is too noisy for reliable per-class evaluation.
v2 fix: Level 2 is now "Domain-Adapted" — uses cybersecurity domain terminology recognizable to a security professional, not just named standards. The projected distribution shifts from ~44/4/37/14 to ~25/20/37/18. Every class has real mass.
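The F1-sensitivity claim above can be checked with a quick calculation. A minimal sketch, assuming symmetric errors (false positives equal false negatives, so precision equals recall) on a class with 47 holdout samples:

```python
def f1(tp, fp, fn):
    """Standard per-class F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

support = 47  # Level 2 support in the v1 holdout
# With FP == FN, precision == recall == TP / 47, so F1 == TP / 47.
before = f1(40, 7, 7)    # 40 of 47 correct
after = f1(37, 10, 10)   # flip 3 correct predictions to errors
print(round(before - after, 3))  # 0.064: a 3-sample swing moves F1 by 3/47
```

Under this symmetric-error assumption the delta is exactly 3/47 ≈ 0.064; asymmetric error mixes give similar magnitudes.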
Problem 2: Level 4 required 2+ QV facts (counting problem)
The professor's construct says: "(4) Quantified and verifiable — includes specific metrics, dollar amounts, incident timelines, or third-party audit references." That's a list of qualifying facts, not a "count two" rule. Our v1 codebook added the 2-fact threshold, which created a narrow Level 4 (14.1%) and forced annotators into a counting exercise that was error-prone and contentious.
v2 fix: 1+ QV-eligible fact → Level 4. No counting. The bright line is: "Can an external party independently verify this claim?" One verifiable dollar amount, one named third-party firm, one specific date — any of these is already more informative than a paragraph without them.
Problem 3: The BG/MR/RMP triangle was patched, not fixed
v1 accumulated six decision rules and ten borderline cases — many were patches for systemic ambiguity rather than clean rules. The v3.0 person-vs-function test and v3.5 three-step decision chain were good ideas, but they were bolted on as rulings to an unchanged set of definitions. Models had to process increasingly complex instructions with diminishing returns.
v2 fix: The "What question does this paragraph answer?" test replaces the patchwork. MR's headline is now "How is management organized to handle cybersecurity?" — broader than "who a specific person is" (which missed paragraphs about management structure without named individuals) and clearer than a multi-step mechanical test. The person-removal test survives as a confirmation tool, not the primary rule.
Problem 4: The holdout was adversarial by design
v1's holdout was stratified to OVER-SAMPLE confusion-axis paragraphs. This was great for codebook development (stress-testing rules on hard cases) but terrible for evaluation (inflating error rates and depressing F1). Combined with the narrow Level 2, this created a structurally unfavorable evaluation set.
v2 fix: Random stratified sample — equal allocation per category class, uniform random selection within each stratum. Hard cases are represented at their natural frequency, not overweighted.
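The v2 sampling procedure can be sketched in a few lines. A minimal illustration (the function name and seed are hypothetical, not the project's actual sampling code):

```python
import random

def stratified_holdout(items, label_fn, per_class, seed=42):
    """Equal allocation per class, uniform random draw within each stratum."""
    rng = random.Random(seed)
    strata = {}
    for item in items:
        strata.setdefault(label_fn(item), []).append(item)
    holdout = []
    for label in sorted(strata):            # deterministic stratum order
        members = strata[label]
        holdout.extend(rng.sample(members, min(per_class, len(members))))
    return holdout
```

Because the draw within each stratum is uniform, confusion-axis paragraphs land in the holdout at their natural frequency instead of being deliberately overweighted as in v1.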
Category Reasoning
Why "What Question Does This Paragraph Answer?"
Previous approaches tried to classify based on surface features: grammatical subjects, keyword presence, mechanical tests. These worked for clear-cut cases but failed on the governance chain (Board → Committee → Officer → Program) that appears in thousands of SEC filings.
The "what question?" test works because it asks about communicative PURPOSE, not surface features. A paragraph that chains "The Audit Committee oversees... our CISO reports quarterly... the program includes penetration testing" has keywords from all three of BG, MR, and RMP. The question test cuts through: what is this paragraph TRYING TO TELL YOU? It's trying to tell you how oversight works. → BG.
This is also the test that humans naturally apply. When you read a paragraph and "just know" it's about governance vs. process, you're implicitly asking what the paragraph's purpose is. The codebook now makes that implicit test explicit.
The Board Governance / Management Role Boundary
The core issue: SEC Item 106(c) has two parts — (c)(1) covers board oversight and (c)(2) covers management's role. Many filings interleave them in a single paragraph.
The rule: Governance-chain paragraphs default to BG. They become MR only when management's organizational role is the primary content.
Why this default? Because the governance chain exists TO DESCRIBE OVERSIGHT. When a paragraph says "The Audit Committee oversees our cybersecurity program. Our CISO reports quarterly to the Committee on threat landscape and program effectiveness," the paragraph is explaining how oversight works. The CISO is the mechanism through which the board gets information — the paragraph is about the board's oversight structure, not about the CISO as a person or management's organizational role.
MR captures something different: it answers "how is management organized?" This includes:
- Who holds cybersecurity responsibilities and how those responsibilities are divided
- What qualifies those people (credentials, experience, background)
- How management-level structures work (steering committees, reporting lines between officers)
- The identity and background of specific individuals
A paragraph about the CISO's 20 years of experience, CISSP certification, and team of 12 → MR. A paragraph about the board receiving quarterly reports from the CISO → BG. Same person mentioned, different purpose.
The directionality heuristic (confirmation tool, not primary rule):
- Board → Management (describing governance structure flowing down) → BG
- Management → Board (describing reporting relationship flowing up) → usually BG (the board is still the focus as the recipient)
- Management → Management (how roles are divided, who reports to whom in management) → MR
- Either mentioned, but most content is about actual processes → RMP
The Management Role / Risk Management Process Boundary
The core issue: This was the #1 disagreement axis in v1 (2,290 disputes). The pattern is always the same: a paragraph names a CISO/CIO/CTO in the opening clause, then describes what the cybersecurity program does. Is it about the person or the program?
The person-removal test: Remove all person-specific content. If a substantive description remains → RMP. This works because:
- If the paragraph is ABOUT the program, removing the person who oversees it leaves the program description intact
- If the paragraph is ABOUT the person, removing their details leaves nothing meaningful
Why this test and not a noun count or keyword list: We tried mechanical approaches in v1 (step-by-step decision chains, grammatical subject tests). They worked for easy cases but made hard cases harder — annotators had to run through a mental flowchart instead of reading the paragraph naturally. The person-removal test is a single thought experiment that maps to what humans already do intuitively.
The remaining hard case — management committee with process details:
"Our Cybersecurity Steering Committee, comprising the CISO, CIO, CFO, and General Counsel, meets monthly to review cybersecurity risks, assess emerging threats, and oversee our vulnerability management and incident response programs."
Person-removal test: remove committee membership → "monthly to review cybersecurity risks, assess emerging threats, and oversee vulnerability management and incident response programs." Still has content, but it's thin — the committee structure IS the primary content. → MR.
If the paragraph instead spent three more sentences describing how the vulnerability management program works → RMP (process becomes dominant). The test scales with paragraph length naturally.
The Strategy Integration / None/Other Boundary
The core issue: v1 had 1,094 disputes on this axis, almost all from materiality disclaimers. The sentence "risks have not materially affected our business strategy, results of operations, or financial condition" appears in thousands of filings. Is it SI (a materiality assessment) or N/O (boilerplate)?
The rule: It's SI. Even though the language is generic, the company IS fulfilling its SEC Item 106(b)(2) obligation to assess whether cyber risks affect business strategy. Category captures WHAT the paragraph discloses (a materiality assessment). Specificity captures HOW specific it is (generic boilerplate = Level 1). These are independent dimensions.
The "could" vs. "have not" distinction: This is a linguistic bright line, not a judgment call.
- "Have not materially affected" → past tense, definitive statement → assessment → SI
- "Are reasonably likely to materially affect" → SEC's required forward-looking language → assessment → SI
- "Could have a material adverse effect" → conditional, hypothetical → speculation → N/O (or classify by other content)
The keyword is "reasonably likely" — that's the SEC's Item 106(b)(2) threshold. "Could" is the generic risk-factor language that appears in every 10-K regardless of actual risk level.
Cross-references with materiality language: "For risks that may materially affect us, see Item 1A" is N/O. The paragraph's purpose is pointing elsewhere. The word "materially" describes what Item 1A discusses, not the company's own conclusion. But: "Risks have not materially affected us. See Item 1A" is SI — the first sentence IS an assessment, and the cross-reference is subordinate.
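Because the SI vs. N/O bright line is keyword-anchored, it can be approximated mechanically. A hedged sketch: an illustrative triage, not the labeling pipeline, and it only flags the language; purpose (e.g., whether a cross-reference dominates) still needs the full rule.

```python
import re

def materiality_triage(text):
    """Flag the linguistic bright line: assessment (SI) vs. speculation (N/O)."""
    t = text.lower()
    if re.search(r"reasonably likely to materially", t):
        return "SI"   # SEC Item 106(b)(2) forward-looking assessment language
    if re.search(r"have not materially affected", t):
        return "SI"   # definitive backward-looking assessment
    if re.search(r"could\b.{0,40}?material", t):
        return "N/O"  # conditional, hypothetical risk-factor language
    return "unclear"  # no bright-line keyword; apply the full rule
```

The check order matters: Case 6-style sentences contain both "have not materially affected" and "reasonably likely," and either branch correctly yields SI before the "could" branch is reached.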
Specificity Reasoning
Why Broaden Level 2: The ERM Test
The v1 definition of Level 2 ("names a specific recognized standard") was too narrow because it conflated "domain-specific" with "names a formal standard." A paragraph that says "we conduct penetration testing and vulnerability assessments" is clearly more informative than "we have processes to manage cybersecurity risks" — the first uses domain vocabulary, the second uses generic business language. But v1 classified both as Level 1.
The v2 test: "Would this term appear naturally in a generic enterprise risk management document?" This captures the construct's intent — "references industry" means using the industry's vocabulary, not just citing its standards.
Why "incident response plan" stays at Level 1: IRP is used across all risk management domains — cybersecurity, physical security, natural disasters, supply chain disruptions. A non-security ERM professional would use this term naturally. By contrast, "penetration testing" is uniquely cybersecurity — you don't penetration-test a supply chain or a natural disaster response.
Why "security awareness training" is Level 2: This is borderline. A businessperson might say "we train employees on security." But the specific phrase "security awareness training" is a recognized cybersecurity program type. The term itself references a domain-specific practice, even though it's become common. A non-security person would say "we train our employees" (Level 1), not "we provide security awareness training" (Level 2). The difference IS the domain vocabulary.
Why "tabletop exercises" stays at Level 1: Tabletop exercises are used in emergency management, business continuity, and general risk management — not just cybersecurity. "Cybersecurity tabletop exercises simulating ransomware scenarios" → Level 2 (the qualifier makes it domain-specific). But bare "tabletop exercises" could refer to any risk domain.
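The ERM test reduces to a vocabulary lookup for terms the codebook has already ruled on. A sketch with hypothetical, non-exhaustive term lists (the codebook's actual IS/NOT lists govern; unlisted terms remain judgment calls against the "generic ERM document" question):

```python
# Hypothetical, non-exhaustive lists for illustration only.
CYBER_ONLY_TERMS = {          # fail the ERM test -> Level 2 vocabulary
    "penetration testing", "vulnerability scanning", "siem",
    "phishing simulation", "security awareness training",
}
GENERIC_RISK_TERMS = {        # pass the ERM test -> stay Level 1
    "incident response plan", "tabletop exercises", "risk assessment",
}

def erm_test(term):
    """Would this term appear naturally in a generic ERM document?"""
    t = term.lower()
    if t in CYBER_ONLY_TERMS:
        return "Level 2"
    if t in GENERIC_RISK_TERMS:
        return "Level 1"
    return "judgment call"
```

Note this is term-level only; a qualifier can flip the outcome ("cybersecurity tabletop exercises simulating ransomware scenarios" is Level 2 even though bare "tabletop exercises" is Level 1).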
Why 1+ QV Fact: The External Verifiability Test
The v1 rule was 2+ QV facts. This created problems:
- Counting is error-prone. Annotators and models disagree on what counts. Is "CISO" a QV fact? Is "quarterly" a fact? The counting itself became a source of disagreement.
- The construct doesn't require counting. The professor's Level 4 definition lists types of qualifying facts, not a minimum count.
- One verifiable fact IS quantified and verifiable. A paragraph that says "We maintain $100M in cyber insurance coverage" is genuinely more informative and verifiable than one without dollar amounts. The 2-fact threshold was artificial.
The v2 test asks: Can an external party independently verify at least one claim in this paragraph? One specific number, one named third-party firm, one named certification held by an individual — any of these crosses the threshold.
Why named roles (CISO) are NOT QV: A role title tells you something about the company's structure (firm-specific, Level 3) but is not a quantified claim an outsider can verify. "Our CISO" is identification. "Our CISO holds CISSP certification" adds a verifiable claim (CISSP holders are in a public registry). The role gets you to Level 3; the certification pushes to Level 4.
Why named individuals alone are NOT QV: "Our CISO, Jane Smith" is firm-specific (Level 3). You could look her up, but the NAME itself isn't a quantified claim about cybersecurity posture. "Jane Smith, who has 20 years of cybersecurity experience" adds a verifiable quantity. The name identifies; the experience quantifies.
The certification trilogy — a critical distinction:
- "Our program is aligned with ISO 27001" → Level 2 (references a standard, no firm-specific claim)
- "We are working toward ISO 27001 certification" → Level 3 (firm-specific intent, but no verifiable achievement)
- "We maintain ISO 27001 certification" → Level 4 (verifiable claim — you can check if a company holds this certification)
The difference between "aligned with" and "maintain certification" is the difference between aspiration and audited fact.
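The 1+ QV rule lends itself to a first-pass screen. A hedged sketch: pattern-based screening only, and it cannot make the trilogy's distinctions (keywords alone can't separate "working toward ISO 27001 certification," Level 3, from "maintain ISO 27001 certification," Level 4), so it flags candidates rather than assigns Level 4.

```python
import re

# Illustrative QV-fact patterns; named third-party firms and products
# still require recognition beyond regex.
QV_PATTERNS = [
    r"\$\s?\d[\d,.]*",        # dollar amounts ($100 million, $8.5 million)
    r"\b(19|20)\d{2}\b",      # specific years (audit dates, incident dates)
    r"\b(cissp|cism)\b",      # named certifications in a public registry
    r"\b\d+\s+years of",      # quantified experience
    r"\bteam of \d+",         # headcounts
]

def qv_candidate(text):
    """v2 bright line: one externally verifiable fact is enough for Level 4."""
    t = text.lower()
    return any(re.search(p, t) for p in QV_PATTERNS)
```

A hit means "check this paragraph for Level 4"; a miss does not rule Level 4 out, since named tools and auditors aren't pattern-matchable.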
Worked Edge Cases
Case 1: The Governance Chain
"The Board of Directors, through its Audit Committee, oversees the Company's cybersecurity risk management program. The Audit Committee receives regular updates from the CISO on the results of penetration testing and vulnerability assessments."
"What question?" test: "How does the board oversee cybersecurity?" → BG
Specificity: "penetration testing," "vulnerability assessments" = domain terminology → Level 2
Why not RMP? The process details (pen testing, vuln assessments) are subordinate to the reporting structure. The paragraph exists to tell you that the Audit Committee oversees things and receives reports — the program details are examples of WHAT is reported.
Case 2: CISO Attribution + Program Description
"Our CISO oversees our cybersecurity program, which includes regular risk assessments, penetration testing, vulnerability scanning, and incident response planning aligned with the NIST CSF framework."
Person-removal test: "cybersecurity program, which includes regular risk assessments, penetration testing, vulnerability scanning, and incident response planning aligned with the NIST CSF framework" → complete program description → RMP
Specificity: Domain terms (pen testing, vuln scanning) + named standard (NIST CSF) → Level 2
Why not MR? The paragraph tells you nothing about the CISO as a person — no qualifications, no experience, no reporting line, no team. The CISO is an attribution tag, like a byline on a news article. The content is the program.
Case 3: CISO Qualifications
"Our Vice President of Information Security, who holds CISSP and CISM certifications and has over 20 years of experience in cybersecurity, reports directly to our Chief Information Officer. She leads a team of 12 dedicated cybersecurity professionals."
"What question?" test: "How is management organized / who is this person?" → MR
Specificity: CISSP/CISM (named certifications, QV), 20 years (specific number, QV), 12 professionals (headcount, QV) — any one of these → Level 4
Why not RMP? Every sentence is about the person: their title, credentials, experience, reporting line, team. Remove the person-specific content and nothing remains.
Case 4: CFO/VP Role Allocation (No Named Individuals)
"Our CFO and VP of IT jointly oversee our cybersecurity program. The CFO is responsible for risk governance and insurance, while the VP of IT manages technical operations. They report to the board quarterly on cybersecurity matters."
"What question?" test: "How is management organized?" → MR
Person-removal test: Remove all role content → "report to the board quarterly on cybersecurity matters" → barely anything → MR confirmed
Specificity: VP of IT = cybersecurity-specific title → Level 3 (firm-specific)
Why this is MR without named individuals: MR isn't "who a specific person is" — it's "how management is organized." This paragraph describes role allocation and reporting structure. The roles are named, the responsibilities are divided, the governance chain is defined. This is organizational disclosure.
Case 5: Management Committee with Process Details
"Our Cybersecurity Steering Committee, comprising the CISO, CIO, CFO, and General Counsel, meets monthly to review cybersecurity risks, assess emerging threats, and oversee our vulnerability management and incident response programs."
"What question?" test: "How is management organized?" → MR
Person-removal test: Remove committee membership → thin but the activities remain → borderline
Tiebreak: The paragraph's FRAME is the committee — it introduces the committee and describes what it does. The activities listed (review, assess, oversee) are verbs of management oversight, not operational descriptions of HOW those programs work. → MR, Specificity 3 (named committee + composition = firm-specific)
When this flips to RMP: If the paragraph spent most of its length describing how the vulnerability management program works (tools, methodology, frequency, findings), with the committee mentioned only as context → RMP.
Case 6: Materiality Assessment (Backward-Looking)
"Risks from cybersecurity threats have not materially affected, and are not reasonably likely to materially affect, our business strategy, results of operations, or financial condition."
Materiality test: Company stating a conclusion → SI
Specificity: Boilerplate language (every company says this) → Level 1
Why this is SI and not N/O: The company is fulfilling its SEC obligation to assess materiality. The fact that the language is generic makes it low-specificity, but the CATEGORY is about what the paragraph discloses (a materiality assessment), not how specific it is.
Case 7: Materiality Speculation
"Cybersecurity risks could have a material adverse effect on our business, financial condition, and results of operations."
Materiality test: "Could" = speculation, not a conclusion → N/O
Specificity: N/O always gets Level 1
Why this is N/O and not SI: This is generic risk-factor language that appears in virtually every 10-K, regardless of whether the company has ever experienced a cybersecurity incident. The company is not stating a conclusion about its cybersecurity posture — it's acknowledging that cybersecurity risks exist. This carries zero informational content about THIS company's cybersecurity situation.
Case 8: Forward-Looking Assessment (SEC Qualifier)
"We face risks from cybersecurity threats that, if realized and material, are reasonably likely to materially affect us, including our operations, business strategy, results of operations, or financial condition."
Materiality test: "Reasonably likely to materially affect" = SEC's Item 106(b)(2) threshold → SI
Specificity: Boilerplate → Level 1
Why "reasonably likely" is different from "could": "Reasonably likely" is the SEC's required assessment language. A company using this phrase is making a forward-looking materiality assessment, not idly speculating. It's still boilerplate (Spec 1), but it IS an assessment (SI).
Case 9: Cross-Reference with vs. without Assessment
N/O: "For a description of the risks from cybersecurity threats that may materially affect the Company, see Item 1A, 'Risk Factors.'" → The paragraph points elsewhere. "May materially affect" describes what Item 1A discusses. → N/O, Level 1
SI: "We have not identified any cybersecurity incidents or threats that have materially affected us. For more information, see Item 1A, Risk Factors." → The first sentence IS an assessment. The cross-reference is subordinate. → SI, Level 1
The test: does the paragraph MAKE a materiality conclusion, or only REFERENCE one that exists elsewhere?
Case 10: SPAC / No-Operations Company
"We are a special purpose acquisition company with no business operations. We have not adopted any cybersecurity risk management program. Our Board of Directors is generally responsible for oversight of cybersecurity risks, if any."
→ N/O, Level 1. The board mention is perfunctory ("generally responsible... if any"). The company explicitly has no program. The absence of a program is not a disclosure of a program, and an incidental governance mention in the context of "we have nothing" does not constitute substantive board governance disclosure.
Case 11: Named Tool as QV Fact
"We utilize CrowdStrike Falcon for endpoint detection and response across our enterprise."
Category: "What does the program do?" → RMP
Specificity: CrowdStrike Falcon = named product = QV-eligible fact → Level 4
Why this is Level 4: A company naming its specific EDR tool is genuinely more transparent and verifiable than "we use endpoint detection tools." You could confirm this claim. This is exactly what the construct means by "quantified and verifiable."
Case 12: Single Named Tool (v1 was Level 3, v2 is Level 4)
Under v1's 2-fact rule, a paragraph with only one named product was Level 3. Under v2's 1-fact rule, it's Level 4. This is intentional — the 2-fact threshold was artificial. One verifiable external reference IS "quantified and verifiable."
Case 13: Insurance with Dollar Amount
"We maintain cybersecurity insurance coverage with $100 million in aggregate coverage and a $5 million deductible per incident."
"What question?" test: "How does cybersecurity affect the business?" → SI (insurance is a financial/business-impact response)
Specificity: $100M and $5M = dollar amounts (QV) → Level 4
Case 14: Regulatory Compliance — Three Variants
N/O: "The Company is subject to various regulatory requirements related to cybersecurity, data protection, and privacy." → A truism. No disclosure of what the company DOES. → N/O, Level 1
RMP, Level 2: "We maintain compliance with PCI DSS, HIPAA, and GDPR through regular audits and monitoring of our security controls." → Names specific standards + describes compliance activities → RMP, Level 2
RMP, Level 4: "We passed our PCI DSS Level 1 audit in March 2024, conducted by Trustwave." → Names standard + specific date + named third-party auditor → RMP, Level 4
Case 15: "Under the Direction of" Attribution
"Under the direction of our CISO, the Company has implemented a comprehensive cybersecurity program including penetration testing, vulnerability assessments, and 24/7 security monitoring."
Person-removal test: "The Company has implemented a comprehensive cybersecurity program including penetration testing, vulnerability assessments, and 24/7 security monitoring." → Complete program description → RMP, Level 2
Case 16: ERM Integration
"Our cybersecurity risk management program is integrated into our overall enterprise risk management framework."
Category: This describes a program characteristic → RMP
Specificity: "Enterprise risk management" and "integrated" are generic business language → Level 1
Why not Level 2: "Enterprise risk management" is a general business concept, not cybersecurity domain terminology. The ERM test: would this sentence appear in a generic ERM document? Yes, it could describe integrating ANY risk program into ERM. → Level 1.
Case 17: "Dedicated Cybersecurity Team"
"We have a dedicated cybersecurity team that is responsible for managing our cybersecurity risks."
Category: RMP (what the team does — manages cyber risks)
Specificity: "Dedicated cybersecurity team" = domain-adapted organizational approach → Level 2
Why Level 2 and not Level 3: Many companies claim "dedicated" teams. The term describes a general organizational approach (having people dedicated to cybersecurity), not a fact unique to THIS company. Compare: "a dedicated team of 12 cybersecurity professionals" → Level 4 (the headcount is QV). The word "dedicated" itself doesn't differentiate.
Case 18: Multiple Category Paragraph — Incident + Cost
"On January 15, 2024, we detected unauthorized access to our customer support portal. We estimate the total cost of remediation at approximately $8.5 million."
Both ID and SI content. Which dominates? The incident (what happened) is the frame; the cost is a detail within the incident narrative. → ID, Level 4 (January 15, 2024 + $8.5M = QV facts)
If the paragraph were primarily financial analysis with one sentence mentioning an incident → SI.
Case 19: Negative Incident Assertion
"We have not experienced any material cybersecurity incidents during the reporting period."
Materiality test: Negative assertion with materiality framing → SI, Level 1
Why SI and not N/O: The company is STATING A CONCLUSION about the absence of material incidents. This is a materiality assessment even though it's negative.
Why not ID: No incident is described. The paragraph assesses business impact (no material incidents), not incident details.
What We Preserved from v1
Not everything changed. The following were validated through 150K annotations, 10-model benchmarks, and human labeling:
- 7 content categories mapped to SEC rule structure — the construct is sound
- 4 specificity levels as an ordinal scale — the graduated concept works
- IS/NOT list pattern — the single most effective prompt engineering technique from v1. Lists beat rules for specificity.
- Validation step — the "review your facts, remove NOT-list items" instruction elicits effective model self-correction
- Materiality assessment vs. speculation — linguistic bright line, well-calibrated in v3.5
- SPAC/no-operations rule — resolved cleanly
- TP vs RMP distinction — "who is being assessed?" test works
- ID for actual incidents only — hypothetical language doesn't trigger ID
These are proven components. v2 changes the boundaries and definitions around them, not the components themselves.