# Labeling Codebook — SEC Cybersecurity Disclosure Quality

This codebook is the authoritative reference for all human and GenAI labeling. Every annotator (human or model) must follow these definitions exactly. The LLM system prompt is generated directly from this document.

---

## Classification Design

**Unit of analysis:** One paragraph from an SEC filing (Item 1C of 10-K, or Item 1.05/8.01/7.01 of 8-K).

**Classification type:** Multi-class (single-label), NOT multi-label. Each paragraph receives exactly one content category.

**Each paragraph receives two labels:**

1. **Content Category** — single-label, one of 7 mutually exclusive classes
2. **Specificity Level** — ordinal integer 1-4

**None/Other policy:** Required. Since this is multi-class (not multi-label), we need a catch-all for paragraphs that don't fit the 6 substantive categories. A paragraph receives None/Other when it contains no cybersecurity-specific disclosure content (e.g., forward-looking statement disclaimers, section headers, general business language).
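For tooling that consumes these labels, the two-label scheme can be expressed as a small record type. This is an illustrative sketch — the class and field names are not part of the codebook — but it encodes the two hard constraints above: exactly one of the seven categories, and a specificity of 1-4 that is forced to 1 for None/Other.

```python
from dataclasses import dataclass

CATEGORIES = frozenset({
    "Board Governance",
    "Management Role",
    "Risk Management Process",
    "Third-Party Risk",
    "Incident Disclosure",
    "Strategy Integration",
    "None/Other",
})

@dataclass(frozen=True)
class ParagraphLabel:
    """One labeled paragraph: exactly one content category + ordinal specificity."""
    category: str
    specificity: int

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.specificity not in (1, 2, 3, 4):
            raise ValueError("specificity must be an integer in 1-4")
        # None/Other carries no disclosure substance to rate, so it is always 1.
        if self.category == "None/Other" and self.specificity != 1:
            raise ValueError("None/Other paragraphs take specificity 1")
```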

---

## Dimension 1: Content Category

Each paragraph is assigned exactly **one** content category. If a paragraph spans multiple categories, assign the **dominant** category — the one that best describes the paragraph's primary communicative purpose.

### Board Governance

- **SEC basis:** Item 106(c)(1)
- **Covers:** Board or committee oversight of cybersecurity risks, briefing frequency, board member cybersecurity expertise
- **Key markers:** "Audit Committee," "Board of Directors oversees," "quarterly briefings," "board-level expertise," "board committee"
- **Assign when:** The grammatical subject performing the primary action is the board or a board committee

**Example texts:**

> *"The Board of Directors oversees the Company's management of cybersecurity risks. The Board has delegated oversight of cybersecurity and data privacy matters to the Audit Committee."*
> → Board Governance, Specificity 3 (names Audit Committee — firm-specific delegation)

> *"Our Board of Directors recognizes the critical importance of maintaining the trust and confidence of our customers and stakeholders, and cybersecurity risk is an area of increasing focus for our Board."*
> → Board Governance, Specificity 1 (could apply to any company — generic statement of intent)

> *"The Audit Committee, which includes two members with significant technology and cybersecurity expertise, receives quarterly reports from the CISO and conducts an annual deep-dive review of the Company's cybersecurity program, threat landscape, and incident response readiness."*
> → Board Governance, Specificity 3 (names specific committee, describes specific briefing cadence and scope)

### Management Role

- **SEC basis:** Item 106(c)(2)
- **Covers:** The specific *person* filling a cybersecurity leadership position: their name, qualifications, career history, credentials, tenure, reporting lines, management committees responsible for cybersecurity
- **Key markers:** "Chief Information Security Officer," "reports to," "years of experience," "management committee," "CISSP," "CISM," named individuals, career background
- **Assign when:** The paragraph tells you something about *who the person is* — their background, credentials, experience, or reporting structure. A paragraph that names a CISO/CIO/CTO and then describes what the cybersecurity *program* does is NOT Management Role — it is Risk Management Process with an incidental role attribution. The test is whether the paragraph is about the **person** or about the **function**.

**The person-vs-function test:** If you removed the role holder's name, title, qualifications, and background from the paragraph and the remaining content still describes substantive cybersecurity activities, processes, or oversight → the paragraph is about the function (Risk Management Process), not the person (Management Role). Management Role requires the person's identity or credentials to be the primary content, not just a brief attribution of who runs the program.

**Example texts:**

> *"Our Vice President of Information Security, who holds CISSP and CISM certifications and has over 20 years of experience in cybersecurity, reports directly to our Chief Information Officer and is responsible for leading our cybersecurity program."*
> → Management Role, Specificity 3 — The paragraph is about the person: their credentials, experience, and reporting line. (named role, certifications, reporting line — all firm-specific)

> *"Management is responsible for assessing and managing cybersecurity risks within the organization."*
> → Management Role, Specificity 1 (generic, no named roles or structure)

> *"Our CISO, Sarah Chen, leads a dedicated cybersecurity team of 35 professionals and presents monthly threat briefings to the executive leadership team. Ms. Chen joined the Company in 2019 after serving as Deputy CISO at a Fortune 100 financial services firm."*
> → Management Role, Specificity 4 — The paragraph is about the person: their name, team size, background, prior role. (named individual, team size, specific frequency, prior employer — multiple verifiable facts)

> *"Our CISO oversees the Company's cybersecurity program, which includes risk assessments, vulnerability scanning, penetration testing, and incident response planning aligned with the NIST CSF framework."*
> → **Risk Management Process**, NOT Management Role — The CISO is mentioned once as attribution, but the paragraph is about what the program does. Remove "Our CISO oversees" and the paragraph still makes complete sense as a process description.

### Risk Management Process

- **SEC basis:** Item 106(b)
- **Covers:** Risk assessment methodology, framework adoption (NIST, ISO, etc.), vulnerability management, monitoring, incident response planning, tabletop exercises, ERM integration
- **Key markers:** "NIST CSF," "ISO 27001," "risk assessment," "vulnerability management," "tabletop exercises," "incident response plan," "SOC," "SIEM"
- **Assign when:** The paragraph primarily describes the company's internal cybersecurity processes, tools, or methodologies

**Example texts:**

> *"We maintain a cybersecurity risk management program that is integrated into our overall enterprise risk management framework. Our program is designed to identify, assess, and manage material cybersecurity risks to our business."*
> → Risk Management Process, Specificity 1 (generic, could apply to any company)

> *"Our cybersecurity program is aligned with the NIST Cybersecurity Framework and incorporates elements of ISO 27001. We conduct regular risk assessments, vulnerability scanning, and penetration testing as part of our continuous monitoring approach."*
> → Risk Management Process, Specificity 2 (names frameworks but no firm-specific detail)

> *"We operate a 24/7 Security Operations Center that uses Splunk SIEM and CrowdStrike Falcon endpoint detection. Our incident response team conducts quarterly tabletop exercises simulating ransomware, supply chain compromise, and insider threat scenarios."*
> → Risk Management Process, Specificity 4 (named tools, named vendor, specific exercise frequency and scenarios — verifiable)

### Third-Party Risk

- **SEC basis:** Item 106(b)
- **Covers:** Vendor/supplier risk oversight, external assessor engagement, contractual security requirements, supply chain risk management
- **Key markers:** "third-party," "service providers," "vendor risk," "external auditors," "supply chain," "SOC 2 report," "contractual requirements"
- **Assign when:** The central topic is oversight of external parties' cybersecurity, not the company's own internal processes

**Example texts:**

> *"We face cybersecurity risks associated with our use of third-party service providers who may have access to our systems and data."*
> → Third-Party Risk, Specificity 1 (generic risk statement)

> *"Our vendor risk management program requires all third-party service providers with access to sensitive data to meet minimum security standards, including SOC 2 Type II certification or equivalent third-party attestation."*
> → Third-Party Risk, Specificity 2 (names SOC 2 standard but no firm-specific detail about which vendors or how many)

> *"We assessed 312 vendors in fiscal 2024 through our Third-Party Risk Management program. All Tier 1 vendors (those with access to customer PII or financial data) are required to provide annual SOC 2 Type II reports. In fiscal 2024, 14 vendors were placed on remediation plans and 3 vendor relationships were terminated for non-compliance."*
> → Third-Party Risk, Specificity 4 (specific numbers, specific actions, specific criteria — all verifiable)

### Incident Disclosure

- **SEC basis:** 8-K Item 1.05 (and 8.01/7.01 post-May 2024)
- **Covers:** Description of cybersecurity incidents — nature, scope, timing, impact assessment, remediation actions, ongoing investigation
- **Key markers:** "unauthorized access," "detected," "incident," "remediation," "impacted," "forensic investigation," "breach," "compromised"
- **Assign when:** The paragraph primarily describes what happened in a cybersecurity incident

**Example texts:**

> *"We have experienced, and may in the future experience, cybersecurity incidents that could have a material adverse effect on our business, results of operations, and financial condition."*
> → Incident Disclosure, Specificity 1 (hypothetical — no actual incident described. Note: if this appears in Item 1C rather than an 8-K, consider None/Other instead since it's generic risk language)

> *"On January 15, 2024, we detected unauthorized access to our customer support portal. The threat actor exploited a known vulnerability in a third-party software component. Upon detection, we activated our incident response plan, contained the intrusion, and engaged Mandiant for forensic investigation."*
> → Incident Disclosure, Specificity 4 (specific date, specific system, named forensic firm, specific attack vector — all verifiable)

> *"In December 2023, the Company experienced a cybersecurity incident involving unauthorized access to certain internal systems. The Company promptly took steps to contain and remediate the incident, including engaging third-party cybersecurity experts."*
> → Incident Disclosure, Specificity 3 (specific month, specific action — but no named firms or quantified impact)

### Strategy Integration

- **SEC basis:** Item 106(b)(2)
- **Covers:** Material impact (or lack thereof) on business strategy or financials, cybersecurity insurance, investment/resource allocation, cost of incidents
- **Key markers:** "business strategy," "insurance," "investment," "material," "financial condition," "budget," "not materially affected," "results of operations"
- **Assign when:** The paragraph primarily discusses business/financial consequences or strategic response to cyber risk, not the risk management activities themselves
- **Includes materiality ASSESSMENTS:** A materiality assessment is the company stating a conclusion about whether cybersecurity has or will affect business outcomes. Backward-looking ("have not materially affected"), forward-looking with SEC qualifier ("reasonably likely to materially affect"), and negative assertions ("have not experienced material incidents") are all assessments → Strategy Integration (SI). Generic risk warnings ("could have a material adverse effect") are NOT assessments — they are boilerplate speculation that appears in every 10-K → classify by primary content. "Material" as an adjective ("managing material risks") is also not an assessment.

**Example texts:**

> *"Cybersecurity risks, including those described above, have not materially affected, and are not reasonably likely to materially affect, our business strategy, results of operations, or financial condition."*
> → Strategy Integration, Specificity 1 (boilerplate materiality statement — nearly identical language appears across thousands of filings, but it IS a materiality assessment)

> *"We have not identified any cybersecurity incidents or threats that have materially affected us. For more information, see Item 1A, Risk Factors."*
> → Strategy Integration, Specificity 1 — The materiality assessment is the substantive content. The cross-reference is noise and does not pull the paragraph to None/Other.

> *"We maintain cybersecurity insurance coverage as part of our overall risk management strategy to help mitigate potential financial losses from cybersecurity incidents."*
> → Strategy Integration, Specificity 2 (mentions insurance but no specifics)

> *"We increased our cybersecurity budget by 32% to $45M in fiscal 2024, representing 0.8% of revenue. We maintain cyber liability insurance with $100M in aggregate coverage through AIG and Chubb, with a $5M deductible per incident."*
> → Strategy Integration, Specificity 4 (dollar amounts, percentages, named insurers, specific deductible — all verifiable)

### None/Other

- **Covers:** Forward-looking statement disclaimers, section headers, cross-references to other filing sections, general business language that mentions cybersecurity incidentally, text erroneously extracted from outside Item 1C/1.05
- **No specificity scoring needed:** Always assign Specificity 1 for None/Other paragraphs (since there is no cybersecurity disclosure to rate)
- **SPACs and shell companies:** Companies that explicitly state they have no operations, no cybersecurity program, or no formal processes receive None/Other regardless of incidental mentions of board oversight or risk acknowledgment. The absence of a program is not a description of a program. Paragraphs like "We have not adopted any cybersecurity risk management program. Our board is generally responsible for oversight" are None/Other — the board mention is perfunctory, not substantive governance disclosure.
- **Distinguishing from Strategy Integration:** A pure cross-reference ("See Item 1A, Risk Factors") with no materiality assessment is None/Other. But if the paragraph includes an explicit materiality conclusion ("have not materially affected our business strategy"), it becomes Strategy Integration even if a cross-reference is also present. The test: does the paragraph make a substantive claim about cybersecurity's impact on the business? If yes → Strategy Integration. If it only points elsewhere → None/Other.

**Example texts:**

> *"This Annual Report on Form 10-K contains forward-looking statements within the meaning of Section 27A of the Securities Act of 1933, as amended, and Section 21E of the Securities Exchange Act of 1934, as amended."*
> → None/Other, Specificity 1

> *"Item 1C. Cybersecurity"*
> → None/Other, Specificity 1 (section header only)

> *"For additional information about risks related to our information technology systems, see Part I, Item 1A, 'Risk Factors.'"*
> → None/Other, Specificity 1 (cross-reference, no disclosure content)

> *"We are a special purpose acquisition company with no business operations. We have not adopted any cybersecurity risk management program. Our board of directors is generally responsible for oversight of cybersecurity risks, if any."*
> → None/Other, Specificity 1 — No substantive disclosure. The board mention is incidental; the company explicitly has no program to disclose.

> *"We do not consider that we face significant cybersecurity risk and have not adopted any formal processes for assessing cybersecurity risk."*
> → None/Other, Specificity 1 — Absence of a program is not a program description.

---

## Category Decision Rules

### Rule 1: Dominant Category

If a paragraph spans multiple categories, assign the one whose topic occupies the most text or is the paragraph's primary communicative purpose.

### Rule 2: Board vs. Management (the board-line test)

**Core principle:** The governance hierarchy has distinct layers — board/committee oversight at the top, management execution below. The paragraph's category depends on which layer is the primary focus.

| Layer | Category | Key signals |
|-------|----------|-------------|
| Board/committee directing, receiving reports, or overseeing | Board Governance | "Board oversees," "Committee reviews," "reports to the Board" (board is recipient) |
| Named officer's qualifications, responsibilities, reporting lines | Management Role | "CISO has 20 years experience," "responsible for," credentials |
| Program/framework/controls described | Risk Management Process | "program is designed to," "framework includes," "controls aligned with" |

**When a paragraph spans layers** (governance-chain paragraphs): apply the **purpose test** — what is the paragraph's communicative purpose?

- **Purpose = describing oversight/reporting structure** (who reports to whom, briefing cadence, committee responsibilities, how information flows to the board) → **Board Governance**, even if officers appear as grammatical subjects. The officers are intermediaries in the governance chain, not the focus.
- **Purpose = describing who a person is** (qualifications, credentials, experience, career history) → **Management Role**.
- **Governance-chain paragraphs are almost always Board Governance.** They become Management Role ONLY when the officer's personal qualifications/credentials are the dominant content.

| Signal | Category |
|--------|----------|
| Board/committee is the grammatical subject | Board Governance |
| Board delegates responsibility to management | Board Governance |
| Management role reports TO the board (describing reporting structure) | Board Governance (the purpose is describing how oversight works) |
| Management role's qualifications, experience, credentials described | Management Role |
| "Board oversees... CISO reports to Board quarterly" | Board Governance (oversight structure) |
| "CISO reports quarterly to the Board on..." | Board Governance (reporting structure, not about who the CISO is) |
| "The CISO has 20 years of experience and reports to the CIO" | Management Role (person's qualifications are the content) |
| Governance overview spanning board → committee → officer → program | **Board Governance** (purpose is describing the structure) |
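The purpose test can be written out as a tiny decision function. The boolean inputs stand in for the annotator's judgments — nothing here is computed from text; the sketch only fixes the rule's precedence (personal credentials dominate, otherwise a governance-chain paragraph defaults to Board Governance).

```python
def governance_layer(
    person_credentials_dominant: bool,    # qualifications/background are the main content
    describes_oversight_structure: bool,  # reporting lines, briefing cadence, committee duties
) -> str:
    """Rule 2 purpose test: which layer is the paragraph's communicative purpose?"""
    # Management Role only when the person's credentials are the dominant content.
    if person_credentials_dominant:
        return "Management Role"
    # Governance-chain paragraphs default to Board Governance.
    if describes_oversight_structure:
        return "Board Governance"
    # Neither layer dominates: the paragraph is really about the program itself.
    return "Risk Management Process"
```

For example, "CISO reports quarterly to the Board on..." maps to `governance_layer(False, True)` and comes back as Board Governance, matching the table above.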

### Rule 2b: Management Role vs. Risk Management Process (three-step decision chain)

This is the single most common source of annotator disagreement. Apply the following tests in order — stop at the first decisive result.

**Step 1 — Subject test:** What is the paragraph's grammatical subject?

- Clear process/framework/program as subject with no person detail → **Risk Management Process**. Stop.
- Person/role as subject → this is a **signal**, not decisive. Always continue to Step 2. Many SEC disclosures name an officer then describe the program — Step 2 determines which is the actual content.

**Step 2 — Person-removal test:** Could you delete all named roles, titles, qualifications, experience descriptions, and credentials from the paragraph and still have a coherent cybersecurity disclosure?

- **YES** → **Risk Management Process** (the process stands on its own; people are incidental)
- **NO** → **Management Role** (the paragraph is fundamentally about who these people are)
- Borderline → continue to Step 3

**Step 3 — Qualifications tiebreaker:** Does the paragraph include experience (years), certifications (CISSP, CISM), education, team size, or career history for named individuals?

- **YES** → **Management Role** (qualifications are MR-specific content; the SEC requires management role disclosure specifically because investors want to know WHO is responsible)
- **NO** → **Risk Management Process** (no person-specific content beyond a title attribution)

| Signal | Category |
|--------|----------|
| The person's background, credentials, tenure, experience, education, career history | Management Role |
| The person's name is given | Management Role (strong signal) |
| Reporting lines as primary content (who reports to whom, management committee structure) | Management Role |
| Role title mentioned as attribution ("Our CISO oversees...") followed by process description | **Risk Management Process** |
| Activities, tools, methodologies, frameworks as the primary content | **Risk Management Process** |
| The paragraph would still make sense if you removed the role title and replaced it with "the Company" | **Risk Management Process** |

**Key principle:** Naming a cybersecurity leadership title (CISO, CIO, CTO, VP of Security) does not make a paragraph Management Role. The title is often an incidental attribution — the paragraph names who is responsible then describes what the program does. If the paragraph's substantive content is about processes, activities, or tools, it is Risk Management Process regardless of how many times a role title appears. Management Role requires the paragraph's content to be about the *person* — who they are, what makes them qualified, how long they've served, what their background is.
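The three steps can be sketched as a short function whose inputs are the annotator's answers to each test (they are human judgments, not text features — the code only pins down the ordering and stopping behavior of the chain; the function name is illustrative).

```python
from typing import Optional

def mr_vs_rmp(
    process_is_subject: bool,                 # Step 1: clear process/program subject, no person detail
    person_removal_coherent: Optional[bool],  # Step 2: True/False, or None if borderline
    has_qualifications: bool,                 # Step 3: years, certs, team size, career history
) -> str:
    """Apply Rule 2b's tests in order, stopping at the first decisive result."""
    # Step 1 — subject test: only a clear process subject is decisive.
    if process_is_subject:
        return "Risk Management Process"
    # Step 2 — person-removal test.
    if person_removal_coherent is True:
        return "Risk Management Process"   # process stands on its own
    if person_removal_coherent is False:
        return "Management Role"           # paragraph is about who the people are
    # Step 3 — qualifications tiebreaker (reached only on borderline Step 2).
    return "Management Role" if has_qualifications else "Risk Management Process"
```

Note that a person/role as grammatical subject never decides the question on its own — the function falls through to Step 2 in that case, exactly as the rule requires.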

### Rule 3: Risk Management vs. Third-Party

| Signal | Category |
|--------|----------|
| Company's own internal processes, tools, teams | Risk Management Process |
| Third parties mentioned as ONE component of internal program | Risk Management Process |
| Vendor oversight is the CENTRAL topic | Third-Party Risk |
| External assessor hired to test the company | Risk Management Process (they serve the company) |
| Requirements imposed ON vendors | Third-Party Risk |

### Rule 4: Incident vs. Strategy

| Signal | Category |
|--------|----------|
| Describes what happened (timeline, scope, response) | Incident Disclosure |
| Describes business impact of an incident (costs, revenue, insurance claim) | Strategy Integration |
| Mixed: "We detected X... at a cost of $Y" | Assign based on which is dominant — if cost is one sentence in a paragraph about the incident → Incident Disclosure |

### Rule 5: None/Other Threshold

Assign None/Other ONLY when the paragraph contains no substantive cybersecurity disclosure content. If a paragraph mentions cybersecurity even briefly in service of a disclosure obligation, assign the relevant content category.

**Exception — SPACs and no-operations companies:** A paragraph that explicitly states the company has no cybersecurity program, no operations, or no formal processes is None/Other even if it perfunctorily mentions board oversight or risk acknowledgment. The absence of a program is not substantive disclosure.

### Rule 6: Materiality Language → Strategy Integration

Any paragraph that explicitly connects cybersecurity to business materiality is **Strategy Integration** — regardless of tense, mood, or how generic the language is. This includes:

- **Backward-looking assessments:** "have not materially affected our business strategy, results of operations, or financial condition"
- **Forward-looking assessments with SEC qualifier:** "are reasonably likely to materially affect," "if realized, are reasonably likely to materially affect"
- **Negative assertions with materiality framing:** "we have not experienced any material cybersecurity incidents"

**The test:** Is the company STATING A CONCLUSION about materiality?

- "Risks have not materially affected our business strategy" → YES, conclusion → SI
- "Risks are reasonably likely to materially affect us" → YES, forward-looking conclusion → SI
- "Risks could have a material adverse effect on our business" → NO, speculation → not SI (classify by primary content)
- "Managing material risks associated with cybersecurity" → NO, adjective → not SI

The key phrase is "reasonably likely" — that is the SEC's Item 106(b)(2) threshold for forward-looking materiality. Bare "could" is speculation, not an assessment.

**Why this is SI and not N/O:** The company is fulfilling its SEC Item 106(b)(2) obligation to assess whether cyber risks affect business strategy. The fact that the language is generic makes it Specificity 1, not None/Other. Category captures WHAT the paragraph discloses (a materiality assessment); specificity captures HOW specific that disclosure is (generic boilerplate = Spec 1).

**What remains N/O:** A cross-reference is N/O even if it contains materiality language — "For a description of the risks from cybersecurity threats that may materially affect the Company, see Item 1A" is N/O because the paragraph's purpose is pointing the reader elsewhere, not making an assessment. The word "materially" here describes what Item 1A discusses, not the company's own conclusion. Also N/O: generic IT-dependence language ("our IT systems are important to operations") with no materiality claim, and forward-looking boilerplate about risks generally without invoking materiality ("we face various risks").

**The distinction:** "Risks that may materially affect us — see Item 1A" = N/O (cross-reference). "Risks have not materially affected us. See Item 1A" = SI (the first sentence IS an assessment). The test is whether the company is MAKING a materiality conclusion vs DESCRIBING what another section covers.
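As a screening aid, the assessment-vs-speculation distinction can be approximated with a few keyword patterns. The pattern lists are illustrative and deliberately incomplete — this flags candidates for human review, it does not implement Rule 6 (in particular, it cannot detect the cross-reference exception, which depends on the paragraph's purpose, not its wording).

```python
import re

# Phrases that state a materiality conclusion (→ Strategy Integration per Rule 6).
ASSESSMENT_PATTERNS = [
    r"have\s+not\s+materially\s+affected",
    r"reasonably\s+likely\s+to\s+materially\s+affect",
    r"not\s+experienced\s+any\s+material\s+cybersecurity\s+incident",
]

# Conditional boilerplate that merely speculates (NOT an assessment).
SPECULATION_PATTERNS = [
    r"could\s+(?:potentially\s+)?have\s+a\s+material",
    r"could\s+materially\s+affect",
]

def materiality_signal(text: str) -> str:
    """Return 'assessment', 'speculation', or 'none' for a paragraph's wording."""
    t = text.lower()
    if any(re.search(p, t) for p in ASSESSMENT_PATTERNS):
        return "assessment"
    if any(re.search(p, t) for p in SPECULATION_PATTERNS):
        return "speculation"
    return "none"
```

Assessment patterns are checked first so that a sentence combining both signals (e.g., "if realized and material, are reasonably likely to materially affect us") resolves to an assessment, matching Case 9 below.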

---

## Borderline Cases

### Case 1: Framework mention + firm-specific fact

> *"We follow NIST CSF and our CISO oversees the program."*

The NIST mention → Level 2 anchor. The CISO reference → firm-specific. **Apply boundary rule 2→3: "Does it mention anything unique to THIS company?" Yes (CISO role exists at this company) → Level 3.**

### Case 2: Named role but generic description

> *"Our Chief Information Security Officer is responsible for managing cybersecurity risks."*

Names a role (CISO) → potentially Level 3. But the description is completely generic. **Apply judgment: the mere existence of a CISO title is firm-specific (not all companies have one). → Level 3.** If the paragraph said "a senior executive is responsible" without naming the role → Level 1.

### Case 3: Specificity-rich None/Other

> *"On March 15, 2025, we filed a Current Report on Form 8-K disclosing a cybersecurity incident. For details, see our Form 8-K filed March 15, 2025, accession number 0001193125-25-012345."*

Contains specific dates and filing numbers, but the paragraph itself contains no disclosure content — it's a cross-reference. → **None/Other, Specificity 1.** Specificity only applies to disclosure substance, not to metadata.

### Case 4: Hypothetical incident language in 10-K

> *"We may experience cybersecurity incidents that could disrupt our operations."*

This appears in Item 1C, not an 8-K. It describes no actual incident. → **Risk Management Process or Strategy Integration (depending on context), NOT Incident Disclosure.** Incident Disclosure is reserved for descriptions of events that actually occurred.

### Case 5: Dual-category paragraph

> *"The Audit Committee oversees our cybersecurity program, which is led by our CISO who holds CISSP certification and reports quarterly to the Committee."*

Board (Audit Committee oversees) + Management (CISO qualifications, reporting). The opening clause sets the frame: this is about the Audit Committee's oversight, and the CISO detail is subordinate. → **Board Governance, Specificity 3.**

### Case 6: Management Role vs. Risk Management Process — the person-vs-function test

> *"Our CISO oversees the Company's cybersecurity program, which includes risk assessments, vulnerability scanning, and incident response planning. The program is aligned with the NIST CSF framework and integrated into our enterprise risk management process."*

The CISO is named as attribution, but the paragraph is about what the program does — assessments, scanning, response planning, framework alignment, ERM integration. Remove "Our CISO oversees" and it still makes complete sense as a process description. → **Risk Management Process, Specificity 2** (NIST CSF framework, no firm-specific facts beyond that).

> *"Our CISO has over 20 years of experience in cybersecurity and holds CISSP and CISM certifications. She reports directly to the CIO and oversees a team of 12 security professionals. Prior to joining the Company in 2019, she served as VP of Security at a Fortune 500 technology firm."*

The entire paragraph is about the person: experience, certifications, reporting line, team size, tenure, prior role. → **Management Role, Specificity 4** (years of experience + team headcount + named certifications = multiple verifiable facts).

### Case 7: Materiality disclaimer — Strategy Integration vs. None/Other

> *"We have not identified any cybersecurity incidents or threats that have materially affected our business strategy, results of operations, or financial condition. However, like other companies, we have experienced threats from time to time. For more information, see Item 1A, Risk Factors."*

Contains an explicit materiality assessment ("materially affected... business strategy, results of operations, or financial condition"). The cross-reference and generic threat mention are noise. → **Strategy Integration, Specificity 1.**

> *"For additional information about risks related to our information technology systems, see Part I, Item 1A, 'Risk Factors.'"*

No materiality assessment. Pure cross-reference. → **None/Other, Specificity 1.**

### Case 8: SPAC / no-operations company

> *"We are a special purpose acquisition company with no business operations. We have not adopted any cybersecurity risk management program or formal processes. Our Board of Directors is generally responsible for oversight of cybersecurity risks, if any. We have not encountered any cybersecurity incidents since our IPO."*

Despite touching RMP (no program), Board Governance (board is responsible), and Strategy Integration (no incidents), the paragraph contains no substantive disclosure. The company explicitly has no program, and the board mention is perfunctory ("generally responsible... if any"). The absence of a program is not a program description. → **None/Other, Specificity 1.**

### Case 9: Materiality language — assessment vs. speculation (v3.5 revision)

> *"We face risks from cybersecurity threats that, if realized and material, are reasonably likely to materially affect us, including our operations, business strategy, results of operations, or financial condition."*

The phrase "reasonably likely to materially affect" is the SEC's Item 106(b)(2) qualifier — this is a forward-looking materiality **assessment**, not speculation. → **Strategy Integration, Specificity 1.**

> *"We have not identified any risks from cybersecurity threats that have materially affected or are reasonably likely to materially affect the Company."*

Backward-looking negative assertion + SEC-qualified forward-looking assessment. → **Strategy Integration, Specificity 1.**

> *"Information systems can be vulnerable to a range of cybersecurity threats that could potentially have a material impact on our business strategy, results of operations and financial condition."*

Despite mentioning "material impact" and "business strategy," the operative verb is "could" — this is boilerplate **speculation** present in virtually every 10-K risk factor section. The company is not stating a conclusion about whether cybersecurity HAS or IS REASONABLY LIKELY TO affect them; it is describing a hypothetical. → **None/Other, Specificity 1.** (Per Rule 6: "could have a material adverse effect" = speculation, not assessment.)

> *"We face various risks related to our IT systems."*

No materiality language, no connection to business strategy/financial condition. This is generic IT-dependence language. → **None/Other, Specificity 1.**

**The distinction:** "reasonably likely to materially affect" (SEC qualifier, forward-looking assessment) ≠ "could potentially have a material impact" (speculation). The former uses the SEC's required assessment language; the latter uses conditional language that every company uses regardless of actual risk.

### Case 10: Generic regulatory compliance language

> *"Regulatory Compliance: The Company is subject to various regulatory requirements related to cybersecurity, data protection, and privacy. Non-compliance with these regulations could result in financial penalties, legal liabilities, and reputational damage."*

This acknowledges that regulations exist and non-compliance would be bad — a truism for every public company. It does not describe any process, program, or framework the company uses to comply. It does not make a materiality assessment. It names no specific regulation. → **None/Other, Specificity 1.**

The key distinctions:

- If the paragraph names a specific regulation (GDPR, HIPAA, PCI DSS, CCPA) but still describes no company-specific program → **Risk Management Process, Specificity 2** (named standard triggers Sector-Adapted)
- If the paragraph assesses whether regulatory non-compliance has "materially affected" the business → **Strategy Integration** (materiality assessment per Rule 6)
- If the paragraph describes what the company *does* to comply (audits, controls, certifications) → **Risk Management Process** at appropriate specificity

---

## Dimension 2: Specificity Level

Each paragraph receives a specificity level (1-4) indicating how company-specific the disclosure is. Apply the decision test in order — stop at the first "yes."

### Decision Test

1. **Count hard verifiable facts ONLY** (specific dates, dollar amounts, headcounts/percentages, named third-party firms, named products/tools, named certifications). TWO or more? → **Quantified-Verifiable (4)**
2. **Does it contain at least one fact from the IS list below?** → **Firm-Specific (3)**
3. **Does it name a recognized standard** (NIST, ISO 27001, SOC 2, CIS, GDPR, PCI DSS, HIPAA)? → **Sector-Adapted (2)**
4. **None of the above?** → **Generic Boilerplate (1)**

None/Other paragraphs always receive Specificity 1.

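Since the pipeline is TypeScript, the ordering can be sketched directly. A minimal sketch, assuming an upstream pass has already extracted facts and checked them against the IS/NOT lists; the field names here are illustrative, not the production schema:

```typescript
// Sketch of the Dimension 2 decision test: stop at the first "yes".
type Specificity = 1 | 2 | 3 | 4;

interface ExtractedFacts {
  qvFactCount: number;    // hard verifiable facts (dates, dollars, named firms/tools/certs)
  hasIsListFact: boolean; // at least one fact from the IS list
  namesStandard: boolean; // names a recognized standard (NIST, ISO 27001, SOC 2, ...)
  isNoneOther: boolean;   // paragraph's content category is None/Other
}

function specificityLevel(f: ExtractedFacts): Specificity {
  if (f.isNoneOther) return 1;      // None/Other always gets Specificity 1
  if (f.qvFactCount >= 2) return 4; // step 1: two+ hard facts → Quantified-Verifiable
  if (f.hasIsListFact) return 3;    // step 2: any IS-list fact → Firm-Specific
  if (f.namesStandard) return 2;    // step 3: named standard → Sector-Adapted
  return 1;                         // step 4: Generic Boilerplate
}
```

Note how the ordering encodes the "one fact = Firm-Specific, not QV" rule: a single QV-eligible fact fails step 1 but, being on the IS list, passes step 2.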
### Level Definitions

| Level | Name | Description |
|-------|------|-------------|
| 1 | Generic Boilerplate | Could paste into any company's filing unchanged. No named entities, frameworks, roles, dates, or specific details. |
| 2 | Sector-Adapted | Names a specific recognized standard (NIST, ISO 27001, SOC 2, etc.) but contains nothing unique to THIS company. General practices (pen testing, vulnerability scanning, tabletop exercises) do NOT qualify — only named standards. |
| 3 | Firm-Specific | Contains at least one fact from the IS list that identifies something unique to THIS company's disclosure. |
| 4 | Quantified-Verifiable | Contains TWO or more hard verifiable facts (see QV-eligible list). One fact = Firm-Specific, not QV. |

### ✓ IS a Specific Fact (any ONE → at least Firm-Specific)

- **Cybersecurity-specific titles:** CISO, CTO, CIO, VP of IT/Security, Information Security Officer, Director of IT Security, HSE Director overseeing cybersecurity, Chief Digital Officer (when overseeing cyber), Cybersecurity Director
- **Named non-generic committees:** Technology Committee, Cybersecurity Committee, Risk Committee, ERM Committee (NOT "Audit Committee" — that exists at every public company)
- **Specific team/department compositions:** "Legal, Compliance, and Finance" (but NOT just "a cross-functional team")
- **Specific dates:** "In December 2023", "On May 6, 2024", "fiscal 2025"
- **Named internal programs with unique identifiers:** "Cyber Incident Response Plan (CIRP)" (must have a distinguishing name/abbreviation — generic "incident response plan" does not qualify)
- **Named products, systems, tools:** Splunk, CrowdStrike Falcon, Azure Sentinel, ServiceNow
- **Named third-party firms:** Mandiant, Deloitte, CrowdStrike, PwC
- **Specific numbers:** headcounts, dollar amounts, percentages, exact durations ("17 years", "12 professionals")
- **Certification claims:** "We maintain ISO 27001 certification" (holding a certification is more than naming a standard)
- **Named universities in credential context:** "Ph.D. from Princeton University" (independently verifiable)

### ✗ IS NOT a Specific Fact (do NOT use to justify Firm-Specific)

- **Generic governance:** "the Board", "Board of Directors", "management", "Audit Committee", "the Committee"
- **Generic C-suite:** CEO, CFO, COO, President, General Counsel — these exist at every company and are not cybersecurity-specific
- **Generic IT leadership (NOT cybersecurity-specific):** "Head of IT", "IT Manager", "Director of IT", "Chief Compliance Officer", "Associate Vice President of IT" — these are general corporate/IT titles, not cybersecurity roles per the IS list
- **Unnamed entities:** "third-party experts", "external consultants", "cybersecurity firms", "managed service provider"
- **Generic cadences:** "quarterly", "annual", "periodic", "regular" — without exact dates
- **Boilerplate phrases:** "cybersecurity risks", "material adverse effect", "business operations", "financial condition"
- **Standard incident language:** "forensic investigation", "law enforcement", "regulatory obligations", "incident response protocols"
- **Vague quantifiers:** "certain systems", "some employees", "a number of", "a portion of"
- **Common practices:** "penetration testing", "vulnerability scanning", "tabletop exercises", "phishing simulations", "security awareness training"
- **Generic program names:** "incident response plan", "business continuity plan", "cybersecurity program", "Third-Party Risk Management Program", "Company-wide training" — no unique identifier or distinguishing abbreviation
- **Company self-references:** the company's own name, "the Company", "the Bank", subsidiary names, filing form types
- **Company milestones:** "since our IPO", "since inception" — not cybersecurity facts

### QV-Eligible Facts (count toward the 2-fact threshold for Quantified-Verifiable)

✓ Specific dates (month+year or exact date)
✓ Dollar amounts, headcounts, percentages
✓ Named third-party firms (Mandiant, CrowdStrike, Deloitte)
✓ Named products/tools (Splunk, Azure Sentinel)
✓ Named certifications held by individuals (CISSP, CISM, CEH)
✓ Years of experience as a specific number ("17 years", "over 20 years")
✓ Named universities in credential context

**Do NOT count toward QV** (these trigger Firm-Specific but not QV):

✗ Named roles (CISO, CIO)
✗ Named committees
✗ Named frameworks (NIST, ISO 27001) — these trigger Sector-Adapted
✗ Team compositions, reporting structures
✗ Named internal programs
✗ Generic degrees without named university ("BS in Management")

### Validation Step

Before finalizing specificity, review the extracted facts. Remove any that appear on the NOT list. If no facts remain after filtering → Generic Boilerplate (or Sector-Adapted if a named standard is present). Do not let NOT-list items inflate the specificity rating.

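A sketch of this filtering step. `NOT_LIST` here is an illustrative subset of the full list above, and the naive case-insensitive substring match is for brevity only (for example, "management" alone would also match "vulnerability management", so a production matcher needs stricter phrase boundaries):

```typescript
// Sketch of the validation step: drop NOT-list items from the extracted
// facts before rating specificity.
const NOT_LIST = [
  "audit committee",
  "board of directors",
  "third-party experts",
  "quarterly",
  "penetration testing",
  "incident response plan",
];

function filterFacts(extracted: string[]): string[] {
  return extracted.filter(
    (fact) => !NOT_LIST.some((item) => fact.toLowerCase().includes(item)),
  );
}
```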

---

## LLM Response Schema

The exact Zod schema passed to `generateObject`. This is the contract between the LLM and our pipeline.

```typescript
import { z } from "zod";

export const ContentCategory = z.enum([
  "Board Governance",
  "Management Role",
  "Risk Management Process",
  "Third-Party Risk",
  "Incident Disclosure",
  "Strategy Integration",
  "None/Other",
]);

export const SpecificityLevel = z.union([
  z.literal(1),
  z.literal(2),
  z.literal(3),
  z.literal(4),
]);

export const Confidence = z.enum(["high", "medium", "low"]);

export const LabelOutput = z.object({
  content_category: ContentCategory
    .describe("The single most applicable content category for this paragraph"),
  specificity_level: SpecificityLevel
    .describe("1=generic boilerplate, 2=sector-adapted, 3=firm-specific, 4=quantified-verifiable"),
  category_confidence: Confidence
    .describe("high=clear-cut, medium=some ambiguity, low=genuinely torn between categories"),
  specificity_confidence: Confidence
    .describe("high=clear-cut, medium=borderline adjacent levels, low=could argue for 2+ levels"),
  reasoning: z.string()
    .describe("Brief 1-2 sentence justification citing specific evidence from the text"),
});
```

**Output example:**

```json
{
  "content_category": "Risk Management Process",
  "specificity_level": 3,
  "category_confidence": "high",
  "specificity_confidence": "medium",
  "reasoning": "Names NIST CSF (sector-adapted) and describes quarterly tabletop exercises specific to this company's program, pushing to firm-specific. Specificity borderline 2/3 — tabletop exercises could be generic or firm-specific depending on interpretation."
}
```

---

## System Prompt

> **Note:** The system prompt below is the v1.0 template from the initial codebook. The production Stage 1 prompt is **v2.5** (in `ts/src/label/prompts.ts`), which incorporates the IS/NOT lists, calibration examples, validation step, and decision test from this codebook. The Stage 2 judge prompt (`buildJudgePrompt()` in the same file) adds dynamic disambiguation rules and confidence calibration. **This codebook is the source of truth; the prompt mirrors it.**

The v1.0 template is preserved below for reference. See `ts/src/label/prompts.ts` for the current production prompt.

```
You are an expert annotator classifying paragraphs from SEC cybersecurity disclosures (Form 10-K Item 1C and Form 8-K Item 1.05 filings) under SEC Release 33-11216.

For each paragraph, assign exactly two labels:

(a) content_category — the single most applicable category:
- "Board Governance": Board/committee oversight of cyber risk, briefing cadence, board member cyber expertise. SEC basis: Item 106(c)(1).
- "Management Role": CISO/CTO/CIO identification, qualifications, reporting lines, management committees. SEC basis: Item 106(c)(2).
- "Risk Management Process": Risk assessment methods, framework adoption (NIST, ISO), vulnerability management, monitoring, incident response planning, tabletop exercises, ERM integration. SEC basis: Item 106(b).
- "Third-Party Risk": Vendor/supplier security oversight, external assessor requirements, contractual security standards, supply chain risk. SEC basis: Item 106(b).
- "Incident Disclosure": Description of actual cybersecurity incidents — nature, scope, timing, impact, remediation. SEC basis: 8-K Item 1.05.
- "Strategy Integration": Material impact on business strategy/financials, cyber insurance, investment/resource allocation. SEC basis: Item 106(b)(2).
- "None/Other": Forward-looking disclaimers, section headers, cross-references, non-cybersecurity content.

If a paragraph spans multiple categories, assign the DOMINANT one — the category that best describes the paragraph's primary communicative purpose.

(b) specificity_level — integer 1 through 4:
1 = Generic Boilerplate: Could apply to any company unchanged. Conditional language ("may," "could"). No named entities or frameworks.
2 = Sector-Adapted: Names frameworks/standards (NIST, ISO, SOC 2) or industry-specific terms, but nothing unique to THIS company.
3 = Firm-Specific: Contains at least one fact unique to this company — named roles, specific committees, concrete reporting lines, named programs.
4 = Quantified-Verifiable: Two or more verifiable facts — dollar amounts, dates, headcounts, percentages, named third-party firms, audit results.

BOUNDARY RULES (apply when torn between adjacent levels):
1 vs 2: "Does it name ANY framework, standard, or industry-specific term?" → Yes = 2
2 vs 3: "Does it mention anything unique to THIS company?" → Yes = 3
3 vs 4: "Does it contain TWO OR MORE independently verifiable facts?" → Yes = 4

SPECIAL RULES:
- None/Other paragraphs always get specificity_level = 1.
- Hypothetical incident language ("we may experience...") in a 10-K is NOT Incident Disclosure. It is Risk Management Process or Strategy Integration.
- Incident Disclosure is only for descriptions of events that actually occurred.

CONFIDENCE RATINGS (per dimension):
- "high": Clear-cut classification with no reasonable alternative.
- "medium": Some ambiguity, but one option is clearly stronger.
- "low": Genuinely torn between two or more options.
Be honest — overconfident ratings on hard cases are worse than admitting uncertainty.

Respond with valid JSON matching the required schema. The "reasoning" field should cite specific words or facts from the paragraph that justify your labels (1-2 sentences).
```

---

## User Prompt Template

```
Company: {company_name} ({ticker})
Filing type: {filing_type}
Filing date: {filing_date}
Section: {sec_item}

Paragraph:
{paragraph_text}
```
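A minimal sketch of filling this template. The `{placeholder}` syntax matches the template above; unknown keys are deliberately left intact so gaps surface during review instead of silently rendering as "undefined":

```typescript
// Fill {word} placeholders from a variable map, leaving unknown keys as-is.
function fillTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (match, key: string) =>
    key in vars ? vars[key] : match,
  );
}
```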

---

## Stage 2 Judge Prompt

Used when Stage 1 annotators disagree. The judge sees the paragraph plus all three prior annotations in randomized order.

```
You are adjudicating a labeling disagreement among three independent annotators. Each applied the same codebook but reached different conclusions.

Review all three opinions below, then provide YOUR OWN independent label based on the codebook definitions above. Do not default to majority vote — use your own expert judgment. If you agree with one annotator's reasoning, explain why their interpretation is correct.

Company: {company_name} ({ticker})
Filing type: {filing_type}
Filing date: {filing_date}
Section: {sec_item}

Paragraph:
{paragraph_text}

--- Prior annotations (randomized order) ---

Annotator A: content_category="{cat_a}", specificity_level={spec_a}
Reasoning: "{reason_a}"

Annotator B: content_category="{cat_b}", specificity_level={spec_b}
Reasoning: "{reason_b}"

Annotator C: content_category="{cat_c}", specificity_level={spec_c}
Reasoning: "{reason_c}"
```
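The randomized ordering can be sketched with a Fisher-Yates shuffle over the three Stage 1 annotations before they are assigned to the A/B/C slots; the injectable `rng` parameter is a testing convenience, not part of the production code:

```typescript
// Fisher-Yates shuffle on a copy; rng defaults to Math.random.
function shuffled<T>(items: T[], rng: () => number = Math.random): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```

Shuffling before slot assignment removes any positional bias toward a particular Stage 1 model.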

---

## Cost and Time Tracking

### Per-Annotation Record

Every API call produces an `Annotation` record with full provenance:

```typescript
provenance: {
  modelId: string,         // OpenRouter model ID e.g. "google/gemini-3.1-flash-lite-preview"
  provider: string,        // Upstream provider e.g. "google", "xai", "anthropic"
  generationId: string,    // OpenRouter generation ID (from response id field)
  stage: "stage1" | "stage2-judge" | "benchmark",
  runId: string,           // UUID per batch run
  promptVersion: string,   // "v1.0" — tracks prompt iterations
  inputTokens: number,     // From usage.prompt_tokens
  outputTokens: number,    // From usage.completion_tokens
  reasoningTokens: number, // From usage.completion_tokens_details.reasoning_tokens
  costUsd: number,         // REAL cost from OpenRouter usage.cost (not estimated)
  latencyMs: number,       // Wall clock per request
  requestedAt: string,     // ISO datetime
}
```
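A sketch of assembling this record from a response. The response shape below mirrors only the fields this section describes (`id`, `usage.prompt_tokens`, `usage.cost`, reasoning token details); the `RunContext` type and its field names are illustrative:

```typescript
// Map an OpenRouter response plus run context into a provenance record.
interface OpenRouterResponse {
  id: string;
  usage: {
    prompt_tokens: number;
    completion_tokens: number;
    cost: number;
    completion_tokens_details?: { reasoning_tokens?: number };
  };
}

interface RunContext {
  modelId: string;
  provider: string;
  stage: "stage1" | "stage2-judge" | "benchmark";
  runId: string;
  promptVersion: string;
}

function toProvenance(res: OpenRouterResponse, ctx: RunContext, latencyMs: number, requestedAt: string) {
  return {
    ...ctx,
    generationId: res.id,
    inputTokens: res.usage.prompt_tokens,
    outputTokens: res.usage.completion_tokens,
    reasoningTokens: res.usage.completion_tokens_details?.reasoning_tokens ?? 0,
    costUsd: res.usage.cost, // real cost, not estimated
    latencyMs,
    requestedAt,
  };
}
```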

### Cost Source

OpenRouter returns **actual cost** in every response body under `usage.cost` (USD). No estimation needed. Each response also includes a `generationId` (the `id` field) which we store in every annotation record. This enables:

- Audit trail: look up any annotation on OpenRouter's dashboard
- Richer stats via `GET /api/v1/generation?id={generationId}` (latency breakdown, provider routing, native token counts)

### Aggregation Levels

| Level | What | Where |
|-------|------|-------|
| Per-annotation | Single API call cost + latency | In each Annotation JSONL record |
| Per-model | Sum across all annotations for that model | `bun sec label:cost` |
| Per-stage | Stage 1 total, Stage 2 total | `bun sec label:cost` |
| Per-phase | Labeling total, benchmarking total | `bun sec label:cost` |
| Project total | Everything | `bun sec label:cost` |
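The per-model and per-stage rollups reduce to a group-and-sum over annotation records; a sketch, with an illustrative minimal record type:

```typescript
// Group annotation records by a key and sum their real costs.
interface CostRecord {
  modelId: string;
  stage: string;
  costUsd: number;
}

function sumCostBy(records: CostRecord[], key: "modelId" | "stage"): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r[key], (totals.get(r[key]) ?? 0) + r.costUsd);
  }
  return totals;
}
```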
### Time Tracking

| Metric | How |
|--------|-----|
| Per-annotation latency | `Date.now()` before/after API call |
| Batch throughput | paragraphs/minute computed from batch start/end |
| Stage 1 wall clock | Logged at batch start and end |
| Stage 2 wall clock | Logged at batch start and end |
| Total labeling time | Sum of all batch durations |
| Per-model benchmark time | Tracked during benchmark runs |

All timing is logged to `data/metadata/cost-log.jsonl` with entries like:

```json
{
  "event": "batch_complete",
  "stage": "stage1",
  "modelId": "openai/gpt-oss-120b",
  "paragraphsProcessed": 50000,
  "wallClockSeconds": 14400,
  "totalCostUsd": 38.50,
  "throughputPerMinute": 208.3,
  "timestamp": "2026-03-29T10:30:00Z"
}
```

---

## NIST CSF 2.0 Mapping

For academic grounding:

| Our Category | NIST CSF 2.0 |
|-------------|-------------|
| Board Governance | GOVERN (GV.OV, GV.RR) |
| Management Role | GOVERN (GV.RR, GV.RM) |
| Risk Management Process | IDENTIFY (ID.RA), GOVERN (GV.RM), PROTECT (all) |
| Third-Party Risk | GOVERN (GV.SC) |
| Incident Disclosure | DETECT, RESPOND, RECOVER |
| Strategy Integration | GOVERN (GV.OC, GV.RM) |

---

## Prompt Versioning

Track prompt changes so we can attribute label quality to specific prompt versions:

| Version | Date | N | Change |
|---------|------|---|--------|
| v1.0 | 2026-03-27 | 40 | Initial codebook-aligned prompt |
| v1.1 | 2026-03-28 | 40 | Added calibration examples, category decision rules. Cat 95%, Spec 68%, Both 62%. |
| v1.2 | 2026-03-28 | 40 | Expanded "what counts as unique" + materiality rule. REGRESSED (88% cat). |
| v2.0 | 2026-03-28 | 40 | Chain-of-thought schema with specific_facts array + algorithmic specificity. Gemini/Grok 5/5, GPT-OSS broken. |
| v2.1 | 2026-03-28 | 40 | Two-tier facts (organizational vs verifiable) + text enum labels. Gemini/Grok perfect but nano overrates. |
| v2.2 | 2026-03-28 | 40 | Decision-test format, simplified facts, "NOT a fact" list. Cat 95%, Spec 68%, Both 65%, Consensus 100%. |
| v2.2 | 2026-03-28 | 500 | 500-sample baseline. Cat 85.0%, Spec 60.8%, Both 51.4%, Consensus 99.6%, Spread 0.240. |
| v2.3 | 2026-03-28 | 500 | Tightened Sector-Adapted, expanded IS/NOT lists, QV boundary rules. Spec 72.0%, Both 59.2%. [1,1,2] eliminated. |
| v2.4 | 2026-03-28 | 500 | Validation step, schema constraint on specific_facts. Spec 78.6%, Both 66.8%. Nano overrating fixed. |
| v2.5 | 2026-03-28 | 500 | Improved Inc↔Strat tiebreaker, QV calibration examples. **PRODUCTION**: Cat 86.8%, Spec 81.0%, Both 70.8%, Consensus 99.4%, Spread 0.130. Inc↔Strat eliminated. |
| v2.6 | 2026-03-28 | 500 | Changed category defs to TEST: format. REGRESSED (Both 67.8%). |
| v2.7 | 2026-03-28 | 500 | Added COMMON MISTAKES section. 100% consensus but Both 67.6%. |
| v3.0 | 2026-03-29 | — | **Codebook overhaul.** Three rulings: (A) materiality disclaimers → Strategy Integration, (B) SPACs/no-ops → None/Other, (C) person-vs-function test for Mgmt Role vs RMP. Added full IS/NOT lists and QV-eligible list to codebook. Added Rule 2b, Rule 6, 4 new borderline cases. Prompt update pending. |
| v3.5 | 2026-04-02 | 26 | **Post-gold-analysis rulings, 6 iteration rounds on 26 regression paragraphs ($1.02).** Driven by 13-signal cross-analysis + targeted prompt iteration. (A) Rule 6 refined: materiality ASSESSMENTS → SI (backward-looking conclusions + "reasonably likely" forward-looking). Generic "could have a material adverse effect" is NOT an assessment — it stays N/O/RMP. Cross-references with materiality language also stay N/O. (B) Rule 2 expanded: purpose test for BG — governance structure descriptions are BG, but a one-sentence committee mention doesn't flip the category. (C) Rule 2b expanded: three-step MR↔RMP decision chain; Step 1 only decisive for RMP (process is subject), never short-circuits to MR. (D) N/O vs RMP clarified: actual measures implemented = RMP even in risk-factor framing. Result: +4 paragraphs on the 26 hardest vs v3.0 (18→22/26). |

When the prompt changes (after pilot testing, rubric revision, etc.), bump the version and log what changed. Every annotation record carries `promptVersion` so we can filter/compare.

---

## Iterative Prompt Tuning Protocol

The v1.0 system prompt is built from theory and synthetic examples. Before firing the full 50K run, we iterate on real data to find and fix failure modes while it costs cents, not dollars.

### Phase 0: Seed sample (before extraction is ready)

Grab 20-30 real Item 1C paragraphs manually from EDGAR full-text search (`efts.sec.gov/LATEST/search-index?q="Item 1C" cybersecurity`). Paste into a JSONL by hand. This lets prompt tuning start immediately while extraction code is still being built.

### Phase 1: Micro-pilot (30 paragraphs, all 3 Stage 1 models)

1. Select ~30 real paragraphs covering:
   - At least 2 per content category (incl. None/Other)
   - At least 2 per specificity level
   - Mix of industries and filing years
   - 5+ deliberately tricky borderline cases

2. Run all 3 Stage 1 models on these 30 with prompt v1.0.

3. **You and at least one teammate independently label the same 30** using the codebook. These are your reference labels.

4. Compare:
   - Per-model accuracy vs reference
   - Inter-model agreement (where do they diverge?)
   - Per-category confusion (which categories do models mix up?)
   - Per-specificity bias (do models systematically over/under-rate?)
   - Are confidence ratings calibrated? (Do "high" labels match correct ones?)

5. **Identify failure patterns.** Common ones:
   - Models gravitating to "Risk Management Process" (largest category — pull)
   - Models rating specificity too high (any named entity → firm-specific)
   - Board Governance / Management Role confusion
   - Missing None/Other (labeling boilerplate as Strategy Integration)

### Phase 2: Prompt revision (v1.1)

Based on Phase 1 failures, revise the system prompt:

- Add "common mistakes" section with explicit corrections
- Add few-shot examples for confused categories
- Sharpen boundary rules where models diverge
- Add negative examples ("This is NOT Incident Disclosure because...")

**Do not change the Zod schema or category definitions** — only the system prompt text. Bump to v1.1. Re-run the same 30 paragraphs. Compare to v1.0.

### Phase 3: Scale pilot (200 paragraphs)

1. Extract 200 real paragraphs (stratified, broader set of filings).

2. Run all 3 Stage 1 models with the best prompt version.

3. Compute:
   - **Inter-model Fleiss' Kappa** on category: target ≥ 0.65
   - **Inter-model Spearman correlation** on specificity: target ≥ 0.70
   - **Consensus rate**: % with 2/3+ agreement on both dims. Target ≥ 75%.
   - **Confidence calibration**: are "high confidence" labels more likely agreed-upon?

4. If targets not met:
   - Analyze disagreements — genuine ambiguity or prompt failure?
   - Prompt failure → revise to v1.2, re-run
   - Genuine ambiguity → consider rubric adjustment (merge categories, collapse specificity)
   - Repeat until targets met or documented why they can't be

5. **Cost check**: extrapolate from 200 to 50K. Reasoning token usage reasonable?
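The consensus-rate metric from step 3 can be sketched as follows. This version treats agreement jointly on the (category, specificity) pair, which is the stricter reading of "both dims":

```typescript
// Share of paragraphs where at least 2 of the 3 models agree on the
// (category, specificity) pair.
interface ModelLabel {
  category: string;
  specificity: number;
}

function hasMajority(labels: ModelLabel[]): boolean {
  const counts = new Map<string, number>();
  for (const l of labels) {
    const key = `${l.category}|${l.specificity}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Math.max(...counts.values()) >= 2;
}

function consensusRate(perParagraph: ModelLabel[][]): number {
  return perParagraph.filter(hasMajority).length / perParagraph.length;
}
```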
### Phase 4: Green light

Once scale pilot passes:

- Lock prompt version (no changes during full run)
- Lock model configuration (reasoning effort, temperature)
- Document final prompt, configs, and pilot results
- Fire the full 50K annotation run

---

## Pipeline Reliability & Observability

### Resumability

All API-calling scripts (annotation, judging, benchmarking) use the same pattern:

1. Load output JSONL → parse each line → collect completed paragraph IDs into a Set
2. Lines that fail `JSON.parse` are skipped (truncated from a crash)
3. Filter input to only paragraphs NOT in the completed set
4. For each completion, append one valid JSON line + `flush()`

JSONL line-append is atomic on Linux. Worst case on crash: one truncated line, skipped on reload. No data loss, no duplicate work, no duplicate API spend.
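A sketch of steps 1-3, assuming an illustrative `paragraph_id` field on each record. A truncated trailing line fails `JSON.parse` and is simply redone on the next run:

```typescript
// Collect completed paragraph IDs from the output JSONL, skipping bad lines.
function completedIds(outputJsonl: string): Set<string> {
  const done = new Set<string>();
  for (const line of outputJsonl.split("\n")) {
    if (!line.trim()) continue;
    try {
      done.add(JSON.parse(line).paragraph_id as string);
    } catch {
      // truncated line from a crash; it will be reprocessed
    }
  }
  return done;
}

// Filter the input down to paragraphs not yet completed.
function remaining<T extends { paragraph_id: string }>(input: T[], done: Set<string>): T[] {
  return input.filter((p) => !done.has(p.paragraph_id));
}
```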
### Error Handling

| Error Type | Examples | Strategy |
|------------|----------|----------|
| Transient | 429, 500, 502, 503, ECONNRESET, timeout | Exponential backoff: 1s→2s→4s→8s→16s. Max 5 retries. |
| Permanent | 400, 422 (bad request) | Log to `{output}-errors.jsonl`, skip |
| Validation | Zod parse fail on LLM response | Retry once, then log + skip |
| Budget | 402 (out of credits) | Stop immediately, write session summary, exit |
| Consecutive | 10+ errors in a row | Stop — likely systemic (model down, prompt broken) |

Error paragraphs get their own file. Retry later with `--retry-errors`.
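A sketch of the transient-error policy from the table, with the sleep function injectable so the backoff schedule can be tested without real delays:

```typescript
// Retry with exponential backoff: 1s, 2s, 4s, 8s, 16s, at most 5 retries.
// Non-transient errors (or exhausted retries) are rethrown for logging.
async function withRetry<T>(
  fn: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
  maxRetries = 5,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isTransient(err) || attempt >= maxRetries) throw err;
      await sleep(1000 * 2 ** attempt); // 1s, 2s, 4s, 8s, 16s
    }
  }
}
```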
### Graceful Shutdown (SIGINT/SIGTERM)

On Ctrl+C:

1. Stop dispatching new work
2. Wait for in-flight requests to complete (already paid for)
3. Write session summary
4. Print final stats, exit 0

### Live Dashboard (stderr)

Updates every second:

```
SEC-cyBERT │ label:annotate │ google/gemini-3.1-flash-lite-preview │ v1.1
─────────────────────────────────────────────────────────────────────────
Progress   12,847 / 50,234 (25.6%)   ETA 42m 18s
Session    $1.23 │ 38m 12s elapsed │ 337.4 para/min
Totals     $4.56 all-time │ 3 errors (0.02%) │ 7 retries
Latency    p50: 289ms │ p95: 812ms │ p99: 1,430ms
Reasoning  avg 47 tokens/para │ 12.3% of output tokens
```

Goes to stderr so stdout stays clean.

### Session Log

Every run appends to `data/metadata/sessions.jsonl`:

```json
{
  "sessionId": "a1b2c3d4",
  "command": "label:annotate",
  "modelId": "google/gemini-3.1-flash-lite-preview",
  "stage": "stage1",
  "promptVersion": "v1.1",
  "startedAt": "2026-03-29T10:00:00Z",
  "endedAt": "2026-03-29T10:38:12Z",
  "durationSeconds": 2292,
  "paragraphsTotal": 50234,
  "paragraphsProcessed": 12847,
  "paragraphsSkippedResume": 37384,
  "paragraphsErrored": 3,
  "costUsd": 1.23,
  "reasoningTokensTotal": 482000,
  "avgLatencyMs": 450,
  "p95LatencyMs": 812,
  "throughputPerMinute": 337.4,
  "concurrency": 12,
  "exitReason": "complete"
}
```

`exitReason`: `complete` | `interrupted` (Ctrl+C) | `budget_exhausted` (402) | `error_threshold` (consecutive limit)

### OpenRouter Generation ID

Every annotation record includes the OpenRouter `generationId` from the response `id` field. This enables:

- **Audit trail**: look up any annotation on OpenRouter's dashboard
- **Rich stats**: `GET /api/v1/generation?id={generationId}` returns latency breakdown, provider routing, native token counts
- **Dispute resolution**: if a label looks wrong, inspect the exact generation that produced it

---

## Gold Set Protocol

### Sampling (1,200 paragraphs minimum)

Stratify by:

- Content category (all 7 represented, oversample rare categories)
- Specificity level (all 4 represented)
- GICS sector (financial services, tech, healthcare, manufacturing minimum)
- Filing year (FY2023 and FY2024)

### Human Labeling Process

Labeling is done through a purpose-built web tool that enforces quality:

1. **Rules quiz:** Every annotator must read the codebook and pass a quiz on the rules before each labeling session. The quiz tests the three most common confusion axes: Management Role vs RMP (person-vs-function test), materiality disclaimers (Strategy Integration vs None/Other), and QV fact counting.

2. **Warm-up:** First 5 paragraphs per session are warm-up (pre-labeled, with feedback). Not counted toward gold set.

3. **Independent labeling:** Three team members independently label the full gold set using this codebook.

4. Compute inter-rater reliability:
   - Cohen's Kappa (for content category — nominal, pairwise)
   - Krippendorff's Alpha (for specificity level — ordinal, all annotators)
   - Per-class confusion matrices
   - **Target: Kappa > 0.75, Alpha > 0.67**

5. Adjudicate disagreements: third annotator tiebreaker, or discussion consensus with documented rationale

6. Run the full GenAI pipeline on the gold set and compare to human labels
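Pairwise Cohen's Kappa for the nominal category dimension is small enough to sketch inline (Krippendorff's Alpha for the ordinal specificity dimension is more involved and better left to a stats library):

```typescript
// Cohen's Kappa for two annotators' nominal labels over the same paragraphs.
function cohensKappa(a: string[], b: string[]): number {
  const n = a.length;
  let observed = 0;
  for (let i = 0; i < n; i++) if (a[i] === b[i]) observed++;
  const po = observed / n; // observed agreement
  let pe = 0; // chance agreement from each annotator's marginals
  for (const c of new Set([...a, ...b])) {
    const pa = a.filter((x) => x === c).length / n;
    const pb = b.filter((x) => x === c).length / n;
    pe += pa * pb;
  }
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}
```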

### If Agreement Is Poor

- If Kappa < 0.60 on any category pair: revise that category's definition and boundary rules, re-pilot
- If Alpha < 0.50 on specificity: collapse 4-point to 3-point scale (merge 1+2 into "Non-specific" or 3+4 into "Substantive")
- Document the collapse decision and rationale in this codebook