# Project 3: SEC Cybersecurity Disclosure Quality Classifier

## Capstone 2026 — BUSI488/COMP488 — Team Knowledge Transfer

**Project:** Build a validated, reusable classifier that labels SEC cybersecurity disclosures by content category and specificity level, then fine-tune an open-weights model for deployment at scale.

**Methodology:** Ringel (2023) "Synthetic Experts" pipeline — use frontier LLMs to generate training labels, then distill into a small open-weights encoder model.

**Why this project:** No HuggingFace dataset of extracted Item 1C disclosures exists. No trained classifier for cybersecurity disclosure quality exists. No domain-adapted ModernBERT on SEC filings exists. The iXBRL CYD taxonomy just went live (Dec 2024). We produce **three publishable artifacts**: a novel dataset, a labeling methodology, and a SOTA classifier.

---

## Table of Contents

1. [Regulatory Background](#1-regulatory-background)
2. [Labeling Rubric](#2-labeling-rubric)
3. [Data Acquisition](#3-data-acquisition)
4. [GenAI Labeling Pipeline](#4-genai-labeling-pipeline)
5. [Model Strategy](#5-model-strategy)
6. [Evaluation & Validation](#6-evaluation--validation)
7. [Release Artifacts](#7-release-artifacts)
8. [3-Week Schedule (6 People)](#8-3-week-schedule-6-people)
9. [Budget](#9-budget)
10. [Reference Links](#10-reference-links)

---

## 1. Regulatory Background

### The Rule: SEC Release 33-11216 (July 2023)

The SEC adopted final rules requiring public companies to disclose cybersecurity risk management, strategy, governance, and material incidents. This created a massive new text corpus with natural variation in quality — perfect for classification.

Full rule PDF:
Fact sheet:

### Item 1C — Annual Disclosure (10-K)

Appears as **Regulation S-K Item 106**, reported in **Item 1C** of the 10-K. Two mandated subsections:

**Item 106(b) — Risk Management and Strategy:**

1. Processes for assessing, identifying, and managing material cybersecurity risks
2. Whether/how cybersecurity processes integrate into overall enterprise risk management (ERM)
3. Whether the company engages external assessors, consultants, or auditors
4. Processes to oversee/identify risks from third-party service providers
5. Whether cybersecurity risks (including prior incidents) have materially affected or are reasonably likely to affect business strategy, results, or financial condition

**Item 106(c) — Governance:**

*Board Oversight (106(c)(1)):*

- Description of the board's oversight of cybersecurity risks
- Identification of the responsible board committee/subcommittee
- Processes by which the board/committee is informed about risks

*Management's Role (106(c)(2)):*

- Which management positions/committees are responsible
- Relevant expertise of those persons
- How management monitors prevention, detection, mitigation, and remediation
- Whether and how frequently management reports to the board

**Key design note:** The SEC uses "describe" — it does not prescribe specific items. The enumerated sub-items are non-exclusive suggestions. This principles-based approach creates natural variation in specificity and content, which is exactly what our rubric captures.

### Item 1.05 — Incident Disclosure (8-K)

Required within **4 business days** of determining a cybersecurity incident is material:

1. Material aspects of the nature, scope, and timing of the incident
2. Material impact or reasonably likely material impact on the registrant

**Key nuances:**

- The 4-day clock starts at the **materiality determination**, not the incident itself
- Companies explicitly do NOT need to disclose technical details that would impede response/remediation
- The U.S. Attorney General can delay disclosure up to 120 days for national security
- Companies must amend the 8-K when new material information becomes available

**The May 2024 shift:** After SEC Director Erik Gerding clarified that Item 1.05 is only for *material* incidents, companies pivoted from Item 1.05 to Items 8.01/7.01 for non-material disclosures:

- Pre-guidance: 72% used Item 1.05, 28% used 8.01/7.01
- Post-guidance: 34% used Item 1.05, 66% used 8.01/7.01

**Our extraction must capture all three item types.**

### Compliance Timeline

| Date | Milestone |
|------|-----------|
| Jul 26, 2023 | Rule adopted |
| Sep 5, 2023 | Rule effective |
| Dec 15, 2023 | Item 1C required in 10-Ks (FY ending on/after this date) |
| Dec 18, 2023 | Item 1.05 required in 8-Ks |
| Jun 15, 2024 | Item 1.05 required for smaller reporting companies |
| Dec 15, 2024 | iXBRL tagging of Item 106 (CYD taxonomy) required |
| Dec 18, 2024 | iXBRL tagging of 8-K Item 1.05 required |

### iXBRL CYD Taxonomy

The SEC published the **Cybersecurity Disclosure (CYD) Taxonomy** on Sep 16, 2024. Starting with filings after Dec 15, 2024, Item 1C disclosures are tagged in Inline XBRL using the `cyd` prefix. This means 2025 filings can be parsed programmatically via XBRL rather than HTML scraping.
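Once a filing's iXBRL facts are in hand (from any XBRL parser), the CYD-tagged disclosures can be isolated by namespace prefix alone. A minimal sketch — the `(prefix, concept, text)` tuple layout and the concept name below are illustrative assumptions, not a real parser's output format or the taxonomy's actual element list:

```python
# Minimal sketch: filter CYD-taxonomy facts out of a parsed iXBRL fact list.
# Assumes facts arrive as (prefix, concept, text) tuples from some XBRL parser;
# the concept names here are illustrative placeholders, not official CYD elements.

def cyd_facts(facts):
    """Return only facts tagged with the SEC's `cyd` namespace prefix."""
    return [f for f in facts if f[0] == "cyd"]

facts = [
    ("us-gaap", "Revenues", "1000000"),
    ("cyd", "BoardOversightTextBlock", "The Audit Committee oversees..."),  # hypothetical element name
    ("dei", "EntityRegistrantName", "Example Corp"),
]

for prefix, concept, text in cyd_facts(facts):
    print(concept)
```

Because the prefix alone identifies the disclosure, this works regardless of which CYD elements a filer actually uses.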
Taxonomy schema: `http://xbrl.sec.gov/cyd/2024`
Taxonomy guide:

### Corpus Size

| Filing Type | Estimated Count (as of early 2026) |
|-------------|-----------------------------------|
| 10-K with Item 1C (FY2023 cycle) | ~4,500 |
| 10-K with Item 1C (FY2024 cycle) | ~4,500 |
| 8-K cybersecurity incidents | ~80 filings (55 incidents + amendments) |
| **Total filings** | **~9,000-10,000** |
| **Estimated paragraphs** (from Item 1C) | **~50,000-80,000** |

---

## 2. Labeling Rubric

### Dimension 1: Content Category (single-label per paragraph)

Derived directly from the SEC rule structure. Each paragraph receives exactly one category:

| Category | SEC Basis | What It Covers | Example Markers |
|----------|-----------|----------------|-----------------|
| **Board Governance** | 106(c)(1) | Board/committee oversight, briefing frequency, board cyber expertise | "Audit Committee," "Board of Directors oversees," "quarterly briefings" |
| **Management Role** | 106(c)(2) | CISO/CTO identification, qualifications, reporting structure | "Chief Information Security Officer," "reports to," "years of experience" |
| **Risk Management Process** | 106(b) | Assessment/identification processes, ERM integration, framework references | "NIST CSF," "ISO 27001," "risk assessment," "vulnerability management" |
| **Third-Party Risk** | 106(b) | Vendor oversight, external assessors/consultants, supply chain risk | "third-party," "service providers," "penetration testing by," "external auditors" |
| **Incident Disclosure** | 8-K 1.05 | Nature/scope/timing of incidents, material impact, remediation | "unauthorized access," "detected," "incident," "remediation," "impacted" |
| **Strategy Integration** | 106(b)(2) | Material impact on business strategy, cyber insurance, resource allocation | "business strategy," "insurance," "investment," "material," "financial condition" |
| **None/Other** | — | Boilerplate intros, legal disclaimers, non-cybersecurity content | Forward-looking statement disclaimers, general risk language |

### Dimension 2: Specificity (4-point ordinal per paragraph)

Grounded in Berkman et al. (2018), Gibson Dunn surveys, and PwC quality tiers:

| Level | Label | Definition | Decision Test |
|-------|-------|------------|---------------|
| **1** | **Generic Boilerplate** | Could apply to any company. Conditional language ("may," "could"). No named entities. Passive voice. | "Could I paste this into a different company's filing unchanged?" → Yes |
| **2** | **Sector-Adapted** | References industry context or named frameworks (NIST, ISO) but no firm-specific detail. | "Does this name something specific but not unique to THIS company?" → Yes |
| **3** | **Firm-Specific** | Names roles (CISO by name), committees, reporting lines, specific programs, or processes unique to the firm. Active voice with accountability. | "Does this contain at least one fact unique to THIS company?" → Yes |
| **4** | **Quantified-Verifiable** | Includes metrics, dollar amounts, dates, frequencies, third-party audit references, or independently verifiable facts. Multiple firm-specific facts with operational detail. | "Could an outsider verify a specific claim in this paragraph?" → Yes |

**Boundary rules for annotators:**

- If torn between 1 and 2: "Does it name ANY framework, standard, or industry term?" → Yes = 2
- If torn between 2 and 3: "Does it mention anything unique to THIS company?" → Yes = 3
- If torn between 3 and 4: "Does it contain TWO OR MORE specific, verifiable facts?" → Yes = 4

**Important:** EvasionBench (Ma et al., 2026) found that a 5-level ordinal scale failed (kappa < 0.5) and had to be collapsed to 3 levels. **Pilot test this 4-level scale on 50 paragraphs early.** Be prepared to merge levels 1-2 or 3-4 if inter-annotator agreement is poor.

### Boilerplate vs. Substantive Markers (from the literature)

**Boilerplate indicators:**

- Conditional language: "may," "could," "might"
- Generic risk statements without company-specific context
- No named individuals, committees, or frameworks
- Identical language across same-industry filings (cosine similarity > 0.8)
- Passive voice: "cybersecurity risks are managed"

**Substantive indicators:**

- Named roles and reporting structures ("Our CISO, Jane Smith, reports quarterly to the Audit Committee")
- Specific frameworks by name (NIST CSF, ISO 27001, SOC 2, PCI-DSS)
- Concrete processes (penetration testing frequency, tabletop exercises)
- Quantification (dollar investment, headcount, incident counts, training completion rates)
- Third-party names or types of assessments
- Temporal specificity (dates, frequencies, durations)

### Mapping to NIST CSF 2.0

For academic grounding, our content categories map to NIST CSF 2.0 functions:

| Our Category | NIST CSF 2.0 |
|-------------|-------------|
| Board Governance | GOVERN (GV.OV, GV.RR) |
| Management Role | GOVERN (GV.RR, GV.RM) |
| Risk Management Process | IDENTIFY (ID.RA), GOVERN (GV.RM), PROTECT (all) |
| Third-Party Risk | GOVERN (GV.SC) |
| Incident Disclosure | DETECT, RESPOND, RECOVER |
| Strategy Integration | GOVERN (GV.OC, GV.RM) |

---

## 3. Data Acquisition

### 3.1 Extracting 10-K Item 1C

**Recommended pipeline:**

```
sec-edgar-downloader → edgar-crawler  → paragraph segmentation → dataset
  (bulk download)     (parse Item 1C)    (split into units)
```

**Tools:**

| Tool | Purpose | Install | Notes |
|------|---------|---------|-------|
| `sec-edgar-downloader` | Bulk download 10-K filings by CIK | `pip install sec-edgar-downloader` | Pure downloader, no parsing |
| `edgar-crawler` | Extract specific item sections to JSON | `git clone github.com/lefterisloukas/edgar-crawler` | Best for bulk extraction; configure `['1C']` in items list |
| `edgartools` | Interactive exploration, XBRL parsing | `pip install edgartools` | `tenk['Item 1C']` accessor; great for prototyping |
| `sec-api` | Commercial API, zero parsing headaches | `pip install sec-api` | `extractorApi.get_section(url, "1C", "text")` — paid, free tier available |

**EDGAR API requirements:**

- Rate limit: 10 requests/second
- Required: custom `User-Agent` header with name and email (e.g., `"TeamName team@email.com"`)
- SEC blocks requests without a proper User-Agent (returns 403)

**For iXBRL-tagged filings (2025+):** Use the `edgartools` XBRL parser to extract CYD taxonomy elements directly. This gives pre-structured data aligned with regulatory categories.

**Fallback corpus:** `PleIAs/SEC` on HuggingFace (373K 10-K full texts, CC0 license) — but sections are NOT pre-parsed; you must extract Item 1C yourself.
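The rate-limit and User-Agent requirements above are easy to get wrong, so it helps to centralize them in one place. A minimal standard-library sketch against the public `data.sec.gov` submissions endpoint; the helper names and the throttling constant are ours, not part of any tool listed:

```python
import time
import urllib.request

# SEC requires an identifying User-Agent; anonymous requests get HTTP 403.
HEADERS = {"User-Agent": "TeamName team@email.com"}
MIN_INTERVAL = 0.1  # stay under EDGAR's 10 requests/second limit

_last_request = 0.0

def submissions_url(cik):
    """Build the EDGAR submissions-index URL for a CIK (zero-padded to 10 digits)."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"

def polite_get(url):
    """GET with the required header, throttled below the EDGAR rate limit."""
    global _last_request
    wait = MIN_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

print(submissions_url(320193))  # Apple's CIK → https://data.sec.gov/submissions/CIK0000320193.json
```

Routing every download through a single throttled helper keeps the whole pipeline compliant even when several extraction scripts run concurrently from one machine.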
### 3.2 Extracting 8-K Incident Disclosures

| Tool | Purpose | URL |
|------|---------|-----|
| `sec-8k-item105` | Extract Item 1.05 from 8-Ks, iXBRL + HTML fallback | `github.com/JMousqueton/sec-8k-item105` |
| `SECurityTr8Ker` | Monitor SEC RSS for new cyber 8-Ks, Slack/Teams alerts | `github.com/pancak3lullz/SECurityTr8Ker` |
| Debevoise 8-K Tracker | Curated list with filing links, dates, amendments | `debevoisedatablog.com/2024/03/06/cybersecurity-form-8-k-tracker/` |
| Board Cybersecurity Tracker | Links filings to MITRE ATT&CK, impact assessments | `board-cybersecurity.com/incidents/tracker` |

**Critical:** Must capture Item 1.05 AND Items 8.01/7.01 (post-May 2024 shift).

### 3.3 Paragraph Segmentation

Once Item 1C text is extracted, segment into paragraphs:

- Split on double newlines or `<p>` tags (depending on extraction format)
- Minimum paragraph length: 20 words (filter out headers, whitespace)
- Maximum paragraph length: 500 words (split longer blocks at sentence boundaries)
- Preserve metadata: company name, CIK, ticker, filing date, fiscal year

Expected yield: ~5-8 paragraphs per Item 1C disclosure × ~9,000 filings = **~50,000-70,000 paragraphs**

### 3.4 Pre-Existing Datasets and Resources

| Resource | What It Is | URL |
|----------|-----------|-----|
| PleIAs/SEC | 373K full 10-K texts (CC0) | `huggingface.co/datasets/PleIAs/SEC` |
| EDGAR-CORPUS | 220K filings with sections pre-parsed (Apache 2.0) | `huggingface.co/datasets/eloukas/edgar-corpus` |
| Board Cybersecurity 23-Feature Analysis | Regex-based extraction of 23 governance/security features from 4,538 10-Ks | `board-cybersecurity.com/research/insights/` |
| Gibson Dunn S&P 100 Survey | Detailed feature analysis of disclosure content | `corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-...` |
| Florackis et al. (2023) "Cybersecurity Risk" | Firm-level cyber risk measure from 10-K text, RFS publication | SSRN: 3725130, data companion: 4319606 |
| zeroshot/cybersecurity-corpus | General cybersecurity text (not SEC-specific, useful for DAPT) | `huggingface.co/datasets/zeroshot/cybersecurity-corpus` |

---

## 4. GenAI Labeling Pipeline

### 4.1 Multi-Model Consensus (EvasionBench Architecture)

We follow Ma et al. (2026, arXiv:2601.09142) — the EvasionBench pipeline designed for an almost identical task (ordinal classification of financial text). Their approach achieved Cohen's Kappa = 0.835 with human annotators.

**Stage 1 — Dual Independent Annotation (all ~50K paragraphs):**

- Annotator A: **Claude Sonnet 4.6** (batch API — $1.50/$7.50 per M input/output tokens)
- Annotator B: **Gemini 2.5 Flash** ($0.30/$2.50 per M tokens)
- Architectural diversity (Anthropic vs. Google) minimizes correlated errors
- ~83% of paragraphs will have immediate agreement

**Stage 2 — Judge Panel for Disagreements (~17% = ~8,500 cases):**

- Judge 1: **Claude Opus 4.6** (batch — $2.50/$12.50 per M tokens)
- Judge 2: **GPT-5** (batch — $0.63/$5.00 per M tokens)
- Judge 3: **Gemini 2.5 Pro** (~$2-4/$12-18 per M tokens)
- Majority vote (2/3) resolves disagreements
- Anti-bias: randomize label presentation order

**Stage 3 — Active Learning Pass:**

- Cluster remaining low-confidence cases
- Human-review ~5% (~2,500 cases) to identify systematic errors
- Iterate the rubric if needed, re-run affected subsets

### 4.2 Prompt Template

```
SYSTEM PROMPT:

You are an expert annotator classifying paragraphs from SEC cybersecurity
disclosures (10-K Item 1C and 8-K Item 1.05 filings). For each paragraph,
assign:
(a) content_category: exactly one of ["Board Governance", "Management Role",
    "Risk Management Process", "Third-Party Risk", "Incident Disclosure",
    "Strategy Integration", "None/Other"]
(b) specificity_level: integer 1-4

CONTENT CATEGORIES:
- Board Governance: Board/committee oversight of cybersecurity risks, briefing
  frequency, board member cyber expertise
- Management Role: CISO/CTO/CIO identification, qualifications, reporting
  structure, management committees
- Risk Management Process: Risk assessment methodology, framework adoption
  (NIST, ISO, etc.), vulnerability management, monitoring, incident response
  planning, tabletop exercises
- Third-Party Risk: Vendor/supplier risk oversight, external assessor
  engagement, contractual security requirements, supply chain risk
- Incident Disclosure: Description of cybersecurity incidents, scope, timing,
  impact, remediation actions
- Strategy Integration: Material impact on business strategy or financials,
  cyber insurance, investment/resource allocation
- None/Other: Boilerplate introductions, legal disclaimers, forward-looking
  statement warnings, non-cybersecurity content

SPECIFICITY SCALE:
1 - Generic Boilerplate: Could apply to any company. Conditional language
    ("may," "could"). No named entities.
    Example: "We face cybersecurity risks that could materially affect our
    business operations."
2 - Sector-Adapted: References industry context or named frameworks but no
    firm-specific details.
    Example: "We employ a cybersecurity framework aligned with the NIST
    Cybersecurity Framework to manage cyber risk."
3 - Firm-Specific: Contains facts unique to this company — named roles,
    committees, specific programs, reporting lines.
    Example: "Our CISO reports quarterly to the Audit Committee on
    cybersecurity risk posture and incident trends."
4 - Quantified-Verifiable: Includes metrics, dollar amounts, dates,
    frequencies, third-party audit references, or independently verifiable
    facts.
    Example: "Following the March 2024 incident affecting our payment systems,
    we engaged CrowdStrike and implemented network segmentation at a cost of
    $4.2M, completing remediation in Q3 2024."

BOUNDARY RULES:
- If torn between 1 and 2: "Does it name ANY framework, standard, or industry
  term?" If yes → 2
- If torn between 2 and 3: "Does it mention anything unique to THIS company?"
  If yes → 3
- If torn between 3 and 4: "Does it contain TWO OR MORE specific, verifiable
  facts?" If yes → 4

Respond with valid JSON only. Include a brief reasoning field.

USER PROMPT:

Company: {company_name}
Filing Date: {filing_date}
Paragraph: {paragraph_text}
```

**Expected output:**

```json
{
  "content_category": "Board Governance",
  "specificity_level": 3,
  "reasoning": "Identifies Audit Committee by name and describes quarterly briefing cadence, both firm-specific facts."
}
```

### 4.3 Practical Labeling Notes

- **Always use the Batch API.** Both OpenAI and Anthropic offer a 50% discount for async/batch processing (24-hour turnaround). There is no reason to use real-time endpoints.
- **Prompt caching:** The system prompt (~800 tokens) is identical for every request. With Anthropic's prompt caching, cached reads cost 10% of the base price. Combined with the batch discount = 5% of standard price.
- **Structured output mode:** Use JSON mode / structured outputs on all providers. Reduces parsing errors by ~90%.
- **Reasoning models (o3, extended thinking):** Use ONLY as judges for disagreement cases, not as primary annotators. They are overkill for clear-cut classification and expensive due to reasoning token consumption.

### 4.4 Gold Set Protocol

**Non-negotiable for publication quality.**

1. Sample 300-500 paragraphs, stratified by:
   - Expected content category (ensure all 7 represented)
   - Expected specificity level (ensure all 4 represented)
   - Industry (financial services, tech, healthcare, manufacturing)
   - Filing year (FY2023 vs FY2024)
2. Two team members independently label the full gold set
3. Compute:
   - Cohen's Kappa (binary/nominal categories)
   - Krippendorff's Alpha (ordinal specificity scale)
   - Per-class confusion matrices
   - Target: Kappa > 0.75 ("substantial agreement")
4. Adjudicate disagreements with a third team member
5. Run the full MMC pipeline on the gold set and compare

---

## 5. Model Strategy

### 5.1 Primary: SEC-ModernBERT-large

**This model does not exist publicly. Building it is a core contribution.**

**Base model:** `answerdotai/ModernBERT-large`

- 395M parameters
- 8,192-token native context (vs. 512 for DeBERTa-v3-large)
- RoPE + alternating local/global attention + FlashAttention
- 2-4x faster than DeBERTa-v3-large
- Apache 2.0 license
- GLUE: 90.4 (only 1 point behind DeBERTa-v3-large's 91.4)

**Step 1 — Domain-Adaptive Pre-Training (DAPT):**

Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":

- **Training corpus:** 200-500M tokens of SEC filings (from PleIAs/SEC or your own EDGAR download). Include 10-Ks, 10-Qs, 8-Ks, proxy statements.
- **MLM objective:** 30% masking rate (ModernBERT convention)
- **Learning rate:** ~5e-5 (much lower than from-scratch pre-training)
- **Hardware (RTX 3090):** bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32
- **VRAM estimate:** ~12-15GB at batch=4, seq=2048 with gradient checkpointing — fits on a 3090

**Evidence DAPT works:**

- Gururangan et al. (2020): consistent improvements across all tested domains
- Patent-domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
- Scaling-law analysis on SEC filings (arXiv:2512.12384): consistent improvement, with the largest gains in the first 200M tokens
- Databricks customer report: 70% → 95% accuracy with domain-specific pre-training

**Step 2 — Classification Fine-Tuning:**

Fine-tune SEC-ModernBERT-large on the 50K labeled paragraphs:

- **Sequence length:** 2048 tokens (captures full regulatory paragraphs that 512-token models truncate)
- **Two classification heads:** content_category (7-class softmax) + specificity_level (4-class ordinal or softmax)
- **Add supervised contrastive loss (SCL):** Combine standard cross-entropy with SCL that pulls same-class embeddings together. Gunel et al. (2020) showed +0.5-1.5% improvement, especially for rare/imbalanced classes.
- **VRAM:** ~11-13GB at batch=8, seq=2048 in bf16 — comfortable on a 3090
- **3090 supports bf16** natively via Ampere Tensor Cores. Use `bf16=True` in the HuggingFace Trainer. No loss scaling needed (unlike fp16).
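The CE + SCL combination in Step 2 can be prototyped independently of the encoder. A minimal numpy sketch of the combined objective, assuming L2-normalizable embeddings per paragraph — the batch shapes, temperature, and `lam` weighting are illustrative choices, and in real training the embeddings would come from the encoder's pooled output:

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def supcon_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss: pull same-class embeddings together."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)          # exclude self-pairs from the denominator
    log_denom = np.log(np.exp(sim).sum(axis=1))
    total, count = 0.0, 0
    for i in range(len(labels)):
        pos = np.where((labels == labels[i]) & (np.arange(len(labels)) != i))[0]
        if len(pos) == 0:
            continue                         # anchors without positives are skipped
        total += -(sim[i, pos] - log_denom[i]).mean()
        count += 1
    return total / max(count, 1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 7))             # 8 paragraphs, 7 content categories
embeds = rng.normal(size=(8, 16))            # stand-in for pooled encoder embeddings
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])

lam = 0.3                                    # illustrative weighting between the two terms
loss = (1 - lam) * cross_entropy(logits, labels) + lam * supcon_loss(embeds, labels)
print(round(float(loss), 4))
```

Dropping this into a HuggingFace `Trainer` amounts to overriding `compute_loss` to add the contrastive term on the hidden states, with `lam` tuned on the validation split.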
### 5.2 Dark Horse: NeoBERT

`chandar-lab/NeoBERT`

- **250M parameters** (100M fewer than ModernBERT-large, 185M fewer than DeBERTa-v3-large)
- 4,096-token context
- SwiGLU, RoPE, Pre-RMSNorm, FlashAttention
- GLUE: 89.0 (close to DeBERTa-v3-large's 91.4)
- MTEB: 51.3 (crushes everything else — ModernBERT-large is 46.9)
- MIT license
- Requires `trust_remote_code=True`
- Almost nobody is using it for domain-specific tasks

Same DAPT + fine-tuning pipeline as ModernBERT-large, with even less VRAM.

### 5.3 Baseline: DeBERTa-v3-large

`microsoft/deberta-v3-large`

- 304M backbone + 131M embedding = ~435M total
- 512-token native context (can push to ~1024)
- Disentangled attention + ELECTRA-style RTD pre-training
- GLUE: **91.4** — still the highest among all encoders
- MIT license
- **Weakness:** no long-context support, completely fails at retrieval tasks

Include as a baseline to show the improvement from (a) long context and (b) DAPT.

### 5.4 Ablation Design

| Experiment | Model | Context | DAPT | SCL | Purpose |
|-----------|-------|---------|------|-----|---------|
| Baseline | DeBERTa-v3-large | 512 | No | No | "Standard" approach per syllabus |
| + Long context | ModernBERT-large | 2048 | No | No | Shows context window benefit |
| + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | Shows DAPT benefit |
| + Contrastive | SEC-ModernBERT-large | 2048 | Yes | Yes | Shows SCL benefit |
| Efficiency | NeoBERT (+ DAPT) | 2048 | Yes | Yes | 40% fewer params, comparable? |
| **Ensemble** | SEC-ModernBERT + DeBERTa | mixed | mixed | — | Maximum performance |

The ensemble averages logits from SEC-ModernBERT-large (long context, domain-adapted) and DeBERTa-v3-large (highest raw NLU). Their architecturally different attention mechanisms should make their errors largely uncorrelated.
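The ensemble row reduces to averaging the two models' class probabilities and taking the argmax. A minimal sketch (3 classes for brevity; the logit arrays are stand-ins for real model outputs):

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, numerically stabilized."""
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_a, logits_b):
    """Average class probabilities from two models, then pick the top class."""
    probs = (softmax(logits_a) + softmax(logits_b)) / 2
    return probs.argmax(axis=-1)

# Two models disagree on the second paragraph; the more confident one wins.
modernbert_logits = np.array([[4.0, 0.1, 0.1], [0.2, 2.0, 0.1]])
deberta_logits    = np.array([[3.5, 0.2, 0.1], [0.1, 0.3, 4.0]])
print(ensemble_predict(modernbert_logits, deberta_logits))  # → [0 2]
```

Averaging probabilities (rather than raw logits) keeps the two models on a common scale even if their logit magnitudes differ, which matters when the backbones are this different.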
### 5.5 Training Framework

- **Encoder fine-tuning:** HuggingFace `transformers` + `Trainer` with `AutoModelForSequenceClassification`
- **DAPT continued pre-training:** HuggingFace `transformers` with `DataCollatorForLanguageModeling`
- **SCL implementation:** Custom training loop, or modify `Trainer` with a dual loss
- **Few-shot prototyping:** `SetFit` (sentence-transformers based) for a rapid baseline in <30 seconds

**Key reference:** Phil Schmid's ModernBERT fine-tuning tutorial:

### 5.6 Domain-Specific Encoder Models (for comparison only)

These exist but are all BERT-base (110M params, 512 context) — architecturally outdated:

| Model | HuggingFace ID | Domain | Params |
|-------|---------------|--------|--------|
| SEC-BERT | `nlpaueb/sec-bert-base` | 260K 10-K filings | 110M |
| SEC-BERT-SHAPE | `nlpaueb/sec-bert-shape` | Same, with number normalization | 110M |
| FinBERT | `ProsusAI/finbert` | Financial sentiment | 110M |
| Legal-BERT | `nlpaueb/legal-bert-base-uncased` | 12GB legal text | 110M |
| SecureBERT | arXiv:2204.02685 | Cybersecurity text | 110M |

Our DAPT approach on a modern architecture (ModernBERT-large or NeoBERT) should outperform all of these. Include SEC-BERT as an additional baseline if time permits.

---

## 6. Evaluation & Validation

### 6.1 Required Metrics (from syllabus)

| Metric | Target | Notes |
|--------|--------|-------|
| **Macro-F1** on human holdout | Report per-class and overall | Minimum 1.2K holdout examples |
| **Per-class F1** | Identify weak categories | Expect "None/Other" to be noisiest |
| **Krippendorff's Alpha** | > 0.67 (adequate), > 0.75 (good) | GenAI labels vs. human gold set |
| **Calibration plots** | Reliability diagrams | For probabilistic outputs (softmax) |
| **Robustness splits** | Report by time period, industry, filing size | FY2023 vs FY2024; GICS sector; word-count quartiles |

### 6.2 Downstream Validity Tests

These demonstrate that the classifier's predictions correlate with real-world outcomes.

**Test 1 — Breach Prediction (strongest):**

- Do firms with lower specificity scores subsequently appear in breach databases?
- Cross-reference with:
  - **Privacy Rights Clearinghouse** (80K+ breaches; Mendeley dataset provides ticker/CIK matching: `doi.org/10.17632/w33nhh3282.1`)
  - **VCDB** (8K+ incidents, VERIS schema: `github.com/vz-risk/VCDB`)
  - **Board Cybersecurity Incident Tracker** (direct SEC filing links: `board-cybersecurity.com/incidents/tracker`)
  - **CISA KEV Catalog** (known exploited vulnerabilities: `cisa.gov/known-exploited-vulnerabilities-catalog`)

**Test 2 — Market Reaction (if time permits):**

- Event study: abnormal returns in a [-1, +3] window around the 8-K Item 1.05 filing
- Does prior Item 1C disclosure quality predict the magnitude of the reaction?
- Small sample (~55 incidents) but high signal
- Regression: CAR = f(specificity_score, incident_severity, firm_size, industry)

**Test 3 — Known-Groups Validity (easy, always include):**

- Do regulated industries (financial services under NYDFS, healthcare under HIPAA) produce systematically higher-specificity disclosures?
- Do larger firms (by market cap) have more specific disclosures?
- These are expected results — confirming them validates the measure

**Test 4 — Boilerplate Index (easy, always include):**

- Compute the cosine similarity of each company's Item 1C to the industry-median disclosure
- Does our specificity score inversely correlate with this similarity measure?
- This is an independent, construct-free validation of the "uniqueness" dimension ### 6.3 External Benchmark Per syllabus: "include an external benchmark approach (i.e., previous best practice)." - **Board Cybersecurity's 23-feature regex extraction** is the natural benchmark. Their binary (present/absent) feature coding is the prior best practice. Our classifier should capture everything their regex captures plus the quality/specificity dimension they cannot measure. - **Florackis et al. (2023) cybersecurity risk measure** from Item 1A text is another comparison — different section (1A vs 1C), different methodology (dictionary vs. classifier), different era (pre-rule vs. post-rule). --- ## 7. Release Artifacts By project end, publish: 1. **HuggingFace Dataset:** Extracted Item 1C paragraphs with labels — first public dataset of its kind 2. **SEC-ModernBERT-large:** Domain-adapted model weights — first SEC-specific ModernBERT 3. **Fine-tuned classifiers:** Content category + specificity models, ready to deploy 4. **Labeling rubric + prompt templates:** Reusable for future SEC disclosure research 5. **Extraction pipeline code:** EDGAR → structured paragraphs → labeled dataset 6. **Evaluation notebook:** All metrics, ablations, validation tests --- ## 8. 
3-Week Schedule (6 People) ### Team Roles | Role | Person(s) | Primary Responsibility | |------|-----------|----------------------| | **Data Lead** | Person A | EDGAR extraction pipeline, paragraph segmentation, data cleaning | | **Data Support** | Person B | 8-K extraction, breach database cross-referencing, dataset QA | | **Labeling Lead** | Person C | Rubric refinement, GenAI prompt engineering, MMC pipeline orchestration | | **Annotation** | Person D | Gold set human labeling, inter-rater reliability, active learning review | | **Model Lead** | Person E | DAPT pre-training, classification fine-tuning, ablation experiments | | **Eval & Writing** | Person F | Validation tests, metrics computation, final presentation, documentation | ### Week 1: Data + Rubric | Day | Person A (Data Lead) | Person B (Data Support) | Person C (Labeling Lead) | Person D (Annotation) | Person E (Model Lead) | Person F (Eval & Writing) | |-----|---------------------|------------------------|-------------------------|----------------------|----------------------|--------------------------| | **Mon** | Set up EDGAR extraction pipeline (edgar-crawler + sec-edgar-downloader) | Set up 8-K extraction (sec-8k-item105) | Draft labeling rubric v1 from SEC rule | Read SEC rule + Gibson Dunn survey | Download ModernBERT-large, set up training env | Outline evaluation plan, identify breach databases | | **Tue** | Begin bulk 10-K download (FY2023 cycle) | Extract all 8-K cyber filings (Items 1.05, 8.01, 7.01) | Pilot rubric on 30 paragraphs with Claude Opus | Pilot rubric on same 30 paragraphs independently | Download PleIAs/SEC corpus, prepare DAPT data | Download PRC Mendeley dataset, VCDB, set up cross-ref | | **Wed** | Continue download (FY2024 cycle), begin Item 1C parsing | Build company metadata table (CIK → ticker → GICS sector → market cap) | Compare pilot labels with Person D, revise rubric boundary rules | Compute initial inter-rater agreement, flag problem areas | Begin DAPT 
pre-training (SEC-ModernBERT-large, ~2-3 days on 3090) | Map VCDB incidents to SEC filers by name matching | | **Thu** | Paragraph segmentation pipeline, quality checks | Merge 8-K incidents with Board Cybersecurity Tracker data | Rubric v2 finalized; set up batch API calls for dual annotation | Begin gold set sampling (300-500 paragraphs, stratified) | DAPT continues (monitor loss, checkpoint) | Draft presentation outline | | **Fri** | **Milestone: Full paragraph corpus ready (~50K+ paragraphs)** | **Milestone: 8-K incident dataset complete** | Launch Stage 1 dual annotation (Sonnet + Gemini Flash) on full corpus | Continue gold set labeling (target: finish 150/300) | DAPT continues | **Milestone: Evaluation framework + breach cross-ref ready** | ### Week 2: Labeling + Training | Day | Person A | Person B | Person C | Person D | Person E | Person F | |-----|----------|----------|----------|----------|----------|----------| | **Mon** | Data cleaning — fix extraction errors, handle edge cases | Assist Person D with gold set labeling (second annotator) | Monitor dual annotation results (should be ~60% complete) | Continue gold set labeling, begin second pass | DAPT finishes; begin DeBERTa-v3-large baseline fine-tuning | Compute gold set inter-rater reliability (Kappa, Alpha) | | **Tue** | Build train/holdout split logic (stratified by industry, year, specificity) | Continue gold set second-annotator pass | Dual annotation complete → extract disagreements (~17%) | Finish gold set, adjudicate disagreements with Person C | Baseline results in; begin ModernBERT-large (no DAPT) fine-tuning | Analyze gold set confusion patterns, recommend rubric tweaks | | **Wed** | Final dataset assembly | Assist Person C with judge panel setup | Launch Stage 2 judge panel (Opus + GPT-5 + Gemini Pro) on disagreements | Run MMC pipeline on gold set, compare with human labels | ModernBERT-large done; begin SEC-ModernBERT-large fine-tuning | **Milestone: Gold set validated, Kappa computed** 
| | **Thu** | Prepare HuggingFace dataset card | Begin active learning — cluster low-confidence cases | Judge panel results in; assemble final labeled dataset | Human-review ~500 low-confidence cases from active learning | SEC-ModernBERT-large done; begin NeoBERT experiment | Robustness split analysis (by industry, year, filing size) | | **Fri** | **Milestone: Labeled dataset finalized (~50K paragraphs)** | **Milestone: Active learning pass complete** | QA final labels — spot-check 100 random samples | Assist Person E with evaluation | Begin ensemble experiment (SEC-ModernBERT + DeBERTa) | **Milestone: All baseline + ablation training complete** | ### Week 3: Evaluation + Presentation | Day | Person A | Person B | Person C | Person D | Person E | Person F | |-----|----------|----------|----------|----------|----------|----------| | **Mon** | Publish dataset to HuggingFace | Run breach prediction validation (PRC + VCDB cross-ref) | Write labeling methodology section | Calibration plots for all models | Final ensemble tuning; publish model weights to HuggingFace | Compile all metrics into evaluation tables | | **Tue** | Write data acquisition section | Run known-groups validity (industry, size effects) | Write GenAI labeling section | Boilerplate index validation (cosine similarity) | Write model strategy section | Draft full results section | | **Wed** | Code cleanup, README for extraction pipeline | Market reaction analysis if feasible (optional) | Review/edit all written sections | Create figures: confusion matrices, calibration plots | Review/edit model section | Assemble presentation slides | | **Thu** | **Full team: review presentation, rehearse, polish** | | | | | | | **Fri** | **Presentation day** | | | | | | ### Critical Path & Dependencies ``` Week 1: Data extraction (A,B) ──────────────────┐ Rubric design (C,D) ───→ Pilot test ───→ Rubric v2 ──→ GenAI labeling launch (Fri) DAPT pre-training (E) ──────────────────────────────────→ (continues into Week 2) 
Eval framework (F) ─────────────────────────────────────→ (ready for Week 2)

Week 2: GenAI labeling (C) ───→ Judge panel ───→ Active learning ───→ Final labels (Fri)
Gold set (D + B) ──────────────────────→ Validated (Wed)
Fine-tuning experiments (E) ───→ Baseline → ModernBERT → SEC-ModernBERT → NeoBERT → Ensemble
Metrics (F) ───────────────────→ Robustness splits

Week 3: Validation tests (B,D,F) ───→ Breach prediction, known-groups, boilerplate index
Writing (all) ──────────────→ Sections → Review → Presentation
Release (A,E) ──────────────→ HuggingFace dataset + model weights
```

---

## 9. Budget

| Item | Cost |
|------|------|
| GenAI labeling — Stage 1 dual annotation (50K × 2 models, batch) | ~$115 |
| GenAI labeling — Stage 2 judge panel (~8.5K × 3 models, batch) | ~$55 |
| Prompt caching savings | -$30 to -$40 |
| SEC EDGAR data | $0 (public domain) |
| Breach databases (PRC open data, VCDB, CISA KEV) | $0 |
| Compute (RTX 3090, already owned) | $0 |
| **Total** | **~$130-170** |

For comparison, human annotation at $0.50/label would cost $25,000+ for a single-annotated corpus and $75,000+ for a triple-annotated one.

---

## 10. Reference Links

### SEC Rule & Guidance

- [SEC Final Rule 33-11216 (PDF)](https://www.sec.gov/files/rules/final/2023/33-11216.pdf)
- [SEC Fact Sheet](https://www.sec.gov/files/33-11216-fact-sheet.pdf)
- [SEC Small Business Compliance Guide](https://www.sec.gov/resources-small-businesses/small-business-compliance-guides/cybersecurity-risk-management-strategy-governance-incident-disclosure)
- [CYD iXBRL Taxonomy Guide (PDF)](https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf)

### Law Firm Surveys & Analysis

- [Gibson Dunn S&P 100 Survey (Harvard Law Forum)](https://corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-a-survey-of-form-10-k-cybersecurity-disclosures-by-sp-100-companies/)
- [PwC First Wave of 10-K Cyber Disclosures](https://www.pwc.com/us/en/services/consulting/cybersecurity-risk-regulatory/sec-final-cybersecurity-disclosure-rules/sec-10-k-cyber-disclosures.html)
- [Debevoise 8-K Lessons Learned](https://www.debevoisedatablog.com/2024/03/06/cybersecurity-form-8-k-tracker/)
- [Greenberg Traurig 2025 Trends Update](https://www.gtlaw.com/en/insights/2025/2/sec-cybersecurity-disclosure-trends-2025-update-on-corporate-reporting-practices)
- [Known Trends: First Year of 8-K Filings](https://www.knowntrends.com/2025/02/snapshot-the-first-year-of-cybersecurity-incident-filings-on-form-8-k-since-adoption-of-new-rules/)
- [NYU: Lessons Learned from 8-K Reporting](https://wp.nyu.edu/compliance_enforcement/2025/03/25/lessons-learned-one-year-of-form-8-k-material-cybersecurity-incident-reporting/)

### Data Extraction Tools

- [edgar-crawler (GitHub)](https://github.com/lefterisloukas/edgar-crawler)
- [edgartools (GitHub)](https://github.com/dgunning/edgartools)
- [sec-edgar-downloader (PyPI)](https://pypi.org/project/sec-edgar-downloader/)
- [sec-8k-item105 (GitHub)](https://github.com/JMousqueton/sec-8k-item105)
- [SECurityTr8Ker (GitHub)](https://github.com/pancak3lullz/SECurityTr8Ker)
- [SEC EDGAR APIs](https://www.sec.gov/search-filings/edgar-application-programming-interfaces)
- [SEC EDGAR Full-Text Search](https://efts.sec.gov/LATEST/search-index)

### Datasets

- [PleIAs/SEC — 373K 10-K texts (HuggingFace, CC0)](https://huggingface.co/datasets/PleIAs/SEC)
- [EDGAR-CORPUS — 220K filings, sections parsed (HuggingFace, Apache 2.0)](https://huggingface.co/datasets/eloukas/edgar-corpus)
- [Board Cybersecurity 23-Feature Analysis](https://www.board-cybersecurity.com/research/insights/risk-frameworks-security-standards-in-10k-item-1c-cybersecurity-disclosures-through-2024-06-30/)
- [Board Cybersecurity Incident Tracker](https://www.board-cybersecurity.com/incidents/tracker)
- [PRC Mendeley Breach Dataset (with tickers)](http://dx.doi.org/10.17632/w33nhh3282.1)
- [VCDB (GitHub)](https://github.com/vz-risk/VCDB)
- [CISA KEV Catalog](https://www.cisa.gov/known-exploited-vulnerabilities-catalog)
- [zeroshot/cybersecurity-corpus (HuggingFace)](https://huggingface.co/datasets/zeroshot/cybersecurity-corpus)

### Models

- [ModernBERT-large (HuggingFace, Apache 2.0)](https://huggingface.co/answerdotai/ModernBERT-large)
- [ModernBERT-base (HuggingFace, Apache 2.0)](https://huggingface.co/answerdotai/ModernBERT-base)
- [NeoBERT (HuggingFace, MIT)](https://huggingface.co/chandar-lab/NeoBERT)
- [DeBERTa-v3-large (HuggingFace, MIT)](https://huggingface.co/microsoft/deberta-v3-large)
- [SEC-BERT (HuggingFace)](https://huggingface.co/nlpaueb/sec-bert-base)
- [ProsusAI FinBERT (HuggingFace)](https://huggingface.co/ProsusAI/finbert)
- [EvasionBench Eva-4B-V2 (HuggingFace)](https://huggingface.co/FutureMa/Eva-4B-V2)

### Key Papers

- Ringel (2023), "Creating Synthetic Experts with Generative AI" — [SSRN:4542949](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4542949)
- Ludwig et al. (2026), "Extracting Consumer Insight from Text" — [arXiv:2602.15312](https://arxiv.org/abs/2602.15312)
- Ma et al. (2026), "EvasionBench" — [arXiv:2601.09142](https://arxiv.org/abs/2601.09142)
- Florackis et al. (2023), "Cybersecurity Risk" — [SSRN:3725130](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3725130)
- Gururangan et al. (2020), "Don't Stop Pretraining" — [arXiv:2004.10964](https://arxiv.org/abs/2004.10964)
- ModernBERT paper — [arXiv:2412.13663](https://arxiv.org/abs/2412.13663)
- NeoBERT paper — [arXiv:2502.19587](https://arxiv.org/abs/2502.19587)
- ModernBERT vs DeBERTa-v3 comparison — [arXiv:2504.08716](https://arxiv.org/abs/2504.08716)
- Patent domain ModernBERT DAPT — [arXiv:2509.14926](https://arxiv.org/abs/2509.14926)
- SEC filing scaling laws for continued pre-training — [arXiv:2512.12384](https://arxiv.org/abs/2512.12384)
- Gunel et al. (2020), Supervised Contrastive Learning for fine-tuning — [OpenReview](https://openreview.net/forum?id=cu7IUiOhujH)
- Phil Schmid, "Fine-tune classifier with ModernBERT in 2025" — [philschmid.de](https://www.philschmid.de/fine-tune-modern-bert-in-2025)
- Berkman et al. (2018), Cybersecurity disclosure quality scoring
- Li, No, and Boritz (2023), BERT-based classification of cybersecurity disclosures
- Scalable 10-K Analysis with LLMs — [arXiv:2409.17581](https://arxiv.org/abs/2409.17581)
- SecureBERT — [arXiv:2204.02685](https://arxiv.org/abs/2204.02685)
- Gilardi et al. (2023), "ChatGPT Outperforms Crowd-Workers" (PNAS) — [arXiv:2303.15056](https://arxiv.org/abs/2303.15056)
- Pangakis et al. (2023), "Automated Annotation Requires Validation" — [arXiv:2306.00176](https://arxiv.org/abs/2306.00176)

### Methodological Playbook

- [Ringel 2026 Capstone Pipeline Example (ZIP)](http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.zip)
- [Class 21 Exemplary Presentation (PDF)](http://www.ringel.ai/UNC/2026/BUSI488/Class21/Ringel_488-2026_Class21.pdf)
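### Appendix: Sketch of the Gold-Set Reliability Check

The Week 2 "Kappa computed" milestone (Person F) reduces to a short calculation. Below is a minimal stdlib sketch of Cohen's kappa; the two label arrays are hypothetical stand-ins for the two annotators' gold-set specificity labels, and in practice `sklearn.metrics.cohen_kappa_score` (plus Krippendorff's alpha via a dedicated package) would be used instead.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(a)
    # Observed agreement: fraction of items where the annotators match.
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each annotator's label marginals.
    counts_a, counts_b = Counter(a), Counter(b)
    p_exp = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical specificity labels for 10 gold-set paragraphs
# (0 = boilerplate, 1 = intermediate, 2 = specific).
annotator_a = [2, 1, 0, 2, 2, 1, 0, 0, 1, 2]
annotator_b = [2, 1, 0, 2, 1, 1, 0, 1, 1, 2]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

By convention, kappa above roughly 0.8 signals strong reliability; values well below that would trigger the rubric adjudication step scheduled for Week 2 Wednesday.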