# Construct of Interest and Data Sign-off
**Team:** S1 Team 4 | **Construct:** Project 3 — Cybersecurity Governance and Incident Disclosure Quality (SEC-Aligned)
---
## 1. Construct Definition
Our construct of interest is **cybersecurity disclosure quality** in SEC filings, operationalized as two simultaneous classification dimensions applied to each paragraph.
**Dimension 1: Content Category** (single-label, multi-class). Each paragraph receives exactly one of seven mutually exclusive categories derived from [SEC Release 33-11216](https://www.sec.gov/files/rules/final/2023/33-11216.pdf) (July 2023). The rule's mandated content domains map to six substantive categories — with Third-Party Risk separated from Risk Management Process because the rule specifically enumerates third-party oversight as a distinct disclosure requirement under 106(b) — plus a None/Other catch-all:
| Category | SEC Basis | Covers |
| ----------------------- | --------- | ----------------------------------------------------------------------------------- |
| Board Governance | 106(c)(1) | Board/committee oversight, briefing frequency, board cyber expertise |
| Management Role | 106(c)(2) | CISO/CTO identification, qualifications, reporting structure |
| Risk Management Process | 106(b) | Assessment methodology, framework adoption (NIST, ISO), monitoring, ERM integration |
| Third-Party Risk | 106(b) | Vendor oversight, external assessors, supply chain risk |
| Incident Disclosure | 8-K 1.05 | Incident nature, scope, timing, material impact, remediation |
| Strategy Integration | 106(b)(2) | Material impact on business strategy/financials, cyber insurance |
| None/Other | — | Boilerplate intros, legal disclaimers, non-cybersecurity content |
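The two-dimensional label schema above can be encoded directly in the pipeline's type system. The sketch below is illustrative only — the identifier names are our shorthand, not from the SEC rule text or any existing codebase:

```typescript
// Illustrative encoding of the label schema; names are assumptions,
// chosen to mirror the table above.

/** The seven mutually exclusive content categories (Dimension 1). */
const CONTENT_CATEGORIES = [
  "BoardGovernance",       // 106(c)(1)
  "ManagementRole",        // 106(c)(2)
  "RiskManagementProcess", // 106(b)
  "ThirdPartyRisk",        // 106(b)
  "IncidentDisclosure",    // 8-K Item 1.05
  "StrategyIntegration",   // 106(b)(2)
  "NoneOther",             // catch-all
] as const;
type ContentCategory = (typeof CONTENT_CATEGORIES)[number];

/** Ordinal specificity (Dimension 2): 1 boilerplate … 4 quantified. */
type Specificity = 1 | 2 | 3 | 4;

/** Every paragraph receives both labels simultaneously. */
interface ParagraphLabel {
  category: ContentCategory;
  specificity: Specificity;
}
```

Encoding the category list with `as const` keeps a single source of truth: the runtime array drives validation while the derived union type drives compile-time checks.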
**Dimension 2: Disclosure Specificity** (ordinal, 1–4). Measures how informative a paragraph is: (1) Generic Boilerplate — could apply to any company unchanged; (2) Sector-Adapted — references named frameworks but no firm-specific detail; (3) Firm-Specific — names unique roles, committees, or programs; (4) Quantified-Verifiable — includes metrics, dates, dollar amounts, or independently confirmable facts.
**Decision rules for borderline cases:** Does it name any framework or standard? (yes → at least 2). Does it mention anything unique to this company? (yes → at least 3). Does it contain two or more specific, verifiable facts? (yes → 4). Assign the highest level whose test passes; if none pass, assign 1.
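These decision rules reduce to a simple cascade. The following is a minimal sketch, assuming upstream feature detectors have already produced the boolean cues; the function and field names are hypothetical:

```typescript
// Illustrative cascade for the borderline-case decision rules.
// The inputs stand in for upstream text-feature checks; this is
// not code from the actual annotation pipeline.
interface SpecificityCues {
  namesFrameworkOrStandard: boolean; // e.g. "NIST CSF", "ISO 27001"
  hasFirmUniqueDetail: boolean;      // named roles, committees, programs
  verifiableFactCount: number;       // metrics, dates, dollar amounts
}

function scoreSpecificity(c: SpecificityCues): 1 | 2 | 3 | 4 {
  // Apply the strongest rule that matches; fall through to boilerplate.
  if (c.verifiableFactCount >= 2) return 4; // Quantified-Verifiable
  if (c.hasFirmUniqueDetail) return 3;      // Firm-Specific
  if (c.namesFrameworkOrStandard) return 2; // Sector-Adapted
  return 1;                                 // Generic Boilerplate
}
```

Checking rules strongest-first makes the "assign the highest level that matches" precedence explicit in control flow.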
**Example annotations from our codebook:**
> _"Our Board of Directors recognizes the critical importance of maintaining the trust and confidence of our customers, and cybersecurity risk is an area of increasing focus for our Board."_
> → Board Governance, Specificity 1 (could apply to any company — generic statement of intent)
> _"We assessed 312 vendors in fiscal 2024 through our Third-Party Risk Management program. All Tier 1 vendors are required to provide annual SOC 2 Type II reports. In fiscal 2024, 14 vendors were placed on remediation plans and 3 vendor relationships were terminated."_
> → Third-Party Risk, Specificity 4 (specific numbers, specific actions, specific criteria — all verifiable)
> _"Our CISO, Sarah Chen, leads a dedicated cybersecurity team of 35 professionals. Ms. Chen joined the Company in 2019 after serving as Deputy CISO at a Fortune 100 financial services firm."_
> → Management Role, Specificity 4 (named individual, team size, prior role — multiple verifiable facts)
## 2. Sources and Citations
The construct is **theoretically grounded** in disclosure theory ([Verrecchia, 2001](<https://doi.org/10.1016/S0165-4101(01)00025-8>)) and regulatory compliance as an information-provision mechanism. The SEC's final rule provides the taxonomic backbone: it specifies four content domains — governance, risk management, strategy integration, and incident disclosure — creating a natural multi-class classification task directly from the regulatory text. Our categories further map to [NIST CSF 2.0](https://www.nist.gov/cyberframework) functions (GOVERN, IDENTIFY, PROTECT, DETECT, RESPOND, RECOVER), giving the taxonomy a grounding independent of the regulatory text.
The **specificity dimension** draws on the disclosure quality literature. [Hope, Hu, and Lu (2016)](https://doi.org/10.1007/s11142-016-9371-1) demonstrate that boilerplate risk-factor disclosures are uninformative to investors, while specific disclosures predict future outcomes. [Gordon, Loeb, and Sohail (2010)](https://doi.org/10.2307/25750692) establish that voluntary IT security disclosures vary in informativeness and that more specific disclosures correlate with market valuations. [Von Solms and Von Solms (2004)](https://doi.org/10.1016/j.cose.2004.05.002) provide the information security governance framework connecting board oversight to operational risk management. The [Gibson Dunn annual surveys](https://www.gibsondunn.com/cybersecurity-disclosure-survey-of-form-10-k-cybersecurity-disclosures-by-sp-100-cos/) of S&P 100 cybersecurity disclosures empirically document the variation in quality across firms, confirming that the specificity gradient is observable in practice.
The **methodological foundation** is the [Ringel (2023)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4542949) synthetic experts pipeline — frontier LLMs generate training labels, then a small open-weights model is fine-tuned to approximate the GenAI labeler at near-zero marginal cost. [Ma et al. (2026)](https://arxiv.org/abs/2601.09142) provide the multi-model consensus labeling architecture we adopt for quality assurance. **No validated classifier or public labeled dataset for SEC cybersecurity disclosure quality currently exists** — this is the gap our project fills.
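The consensus step of the labeling architecture can be sketched as a majority vote over the three model panels, with full disagreements routed to review. This is a hedged sketch under our own assumptions, not the published architecture's code; all names are hypothetical:

```typescript
// Illustrative three-model consensus: unanimous and 2-of-3 agreements
// become training labels; three-way splits are flagged for review.
type Consensus<T> =
  | { kind: "unanimous" | "majority"; label: T }
  | { kind: "disagreement"; labels: T[] };

function consensusOfThree<T>(a: T, b: T, c: T): Consensus<T> {
  if (a === b && b === c) return { kind: "unanimous", label: a };
  if (a === b || a === c) return { kind: "majority", label: a };
  if (b === c) return { kind: "majority", label: b };
  return { kind: "disagreement", labels: [a, b, c] };
}
```

In practice each label dimension (category and specificity) would be voted on independently, and disagreement rates per dimension become a quality-assurance metric in their own right.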
## 3. Data Description
**What data:** Paragraphs extracted from SEC EDGAR filings — specifically, Item 1C of annual 10-K filings (cybersecurity risk management, strategy, and governance) and Items 1.05/8.01/7.01 of 8-K filings (cybersecurity incident disclosures). Two full annual filing cycles exist (FY2023–FY2024), covering ~9,000 10-K filings containing Item 1C and 207 cybersecurity 8-K filings.
**How acquired:** All data is publicly available through the [SEC EDGAR system](https://www.sec.gov/search-filings/edgar-application-programming-interfaces). We built a TypeScript extraction pipeline that bulk-downloads filings via the EDGAR API, parses Item 1C sections from 10-K HTML across 14 identified filing generators, and segments into paragraphs (20–500 words, with bullet-list merging and continuation-line detection). For 8-K incident filings, a separate scanner processes the SEC's bulk `submissions.zip` to deterministically capture all cybersecurity 8-Ks, including the post-May 2024 shift from Item 1.05 to Items 8.01/7.01. The corpus currently contains **72,045 paragraphs**. We will label ~48,000 paragraphs for training via a three-model GenAI panel following the Ringel (2023) pipeline, with a locked 1,200-paragraph holdout to be human-labeled by 6 annotators (3 per paragraph) for validation.
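The paragraph-segmentation constraints above (20–500 words, bullet-list merging) can be illustrated with a simplified filter. The thresholds come from the text; the bullet-merging heuristic here is illustrative only and far cruder than the generator-aware parser described:

```typescript
// Simplified sketch of the paragraph filter: merge bullet lines into
// the preceding paragraph, then keep 20–500-word segments.
const MIN_WORDS = 20;
const MAX_WORDS = 500;

const wordCount = (s: string): number =>
  s.trim().split(/\s+/).filter(Boolean).length;

const isBullet = (s: string): boolean => /^\s*[-•*]\s/.test(s);

function segmentParagraphs(blocks: string[]): string[] {
  const merged: string[] = [];
  for (const block of blocks) {
    // Fold bullet items into the paragraph that introduces them.
    if (isBullet(block) && merged.length > 0) {
      merged[merged.length - 1] += " " + block.trim();
    } else {
      merged.push(block.trim());
    }
  }
  return merged.filter((p) => {
    const n = wordCount(p);
    return n >= MIN_WORDS && n <= MAX_WORDS;
  });
}
```

The word-count bounds matter for labeling quality: fragments under 20 words rarely carry enough context to classify, and blocks over 500 words usually indicate a parsing failure rather than a genuine paragraph.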
**Why at scale:** Every publicly traded U.S. company must now file Item 1C annually, generating thousands of new disclosures each cycle. Investors, compliance teams, regulators, and cybersecurity consultants need to assess disclosure quality across hundreds or thousands of filings simultaneously — infeasible by manual reading. A validated classifier enables longitudinal trend analysis (are disclosures becoming more specific over time?), cross-sectional benchmarking (which industries lag in governance disclosure?), and event-driven monitoring of incidents. The [iXBRL CYD taxonomy](https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf) (effective December 2024) further increases the volume of machine-parseable filings. The data is abundant, recurring annually, and the classification task is too nuanced for keyword dictionaries but well-defined enough for a fine-tuned specialist model — the textbook case for a vertical AI.