
Project 3: SEC Cybersecurity Disclosure Quality Classifier

Capstone 2026 — BUSI488/COMP488 — Team Knowledge Transfer

Project: Build a validated, reusable classifier that labels SEC cybersecurity disclosures by content category and specificity level, then fine-tune an open-weights model for deployment at scale.

Methodology: Ringel (2023) "Synthetic Experts" pipeline — use frontier LLMs to generate training labels, then distill into a small open-weights encoder model.

Why this project: No HuggingFace dataset of extracted Item 1C disclosures exists. No trained classifier for cybersecurity disclosure quality exists. No domain-adapted ModernBERT on SEC filings exists. The iXBRL CYD taxonomy just went live (Dec 2024). We produce three publishable artifacts: a novel dataset, a labeling methodology, and a SOTA classifier.


Table of Contents

  1. Regulatory Background
  2. Labeling Rubric
  3. Data Acquisition
  4. GenAI Labeling Pipeline
  5. Model Strategy
  6. Evaluation & Validation
  7. Release Artifacts
  8. 3-Week Schedule (6 People)
  9. Budget
  10. Reference Links

1. Regulatory Background

The Rule: SEC Release 33-11216 (July 2023)

The SEC adopted final rules requiring public companies to disclose cybersecurity risk management, strategy, governance, and material incidents. This created a massive new text corpus with natural variation in quality — perfect for classification.

Full rule PDF: https://www.sec.gov/files/rules/final/2023/33-11216.pdf
Fact sheet: https://www.sec.gov/files/33-11216-fact-sheet.pdf

Item 1C — Annual Disclosure (10-K)

Appears as Regulation S-K Item 106, reported in Item 1C of the 10-K. Two mandated subsections:

Item 106(b) — Risk Management and Strategy:

  1. Processes for assessing, identifying, and managing material cybersecurity risks
  2. Whether/how cybersecurity processes integrate into overall enterprise risk management (ERM)
  3. Whether the company engages external assessors, consultants, or auditors
  4. Processes to oversee/identify risks from third-party service providers
  5. Whether cybersecurity risks (including prior incidents) have materially affected or are reasonably likely to affect business strategy, results, or financial condition

Item 106(c) — Governance:

Board Oversight (106(c)(1)):

  • Description of board's oversight of cybersecurity risks
  • Identification of responsible board committee/subcommittee
  • Processes by which the board/committee is informed about risks

Management's Role (106(c)(2)):

  • Which management positions/committees are responsible
  • Relevant expertise of those persons
  • How management monitors prevention, detection, mitigation, and remediation
  • Whether and how frequently management reports to the board

Key design note: The SEC uses "describe" — it does not prescribe specific items. The enumerated sub-items are non-exclusive suggestions. This principles-based approach creates natural variation in specificity and content, which is exactly what our rubric captures.

Item 1.05 — Incident Disclosure (8-K)

Required within 4 business days of determining a cybersecurity incident is material:

  1. Material aspects of the nature, scope, and timing of the incident
  2. Material impact or reasonably likely material impact on the registrant

Key nuances:

  • The 4-day clock starts at the materiality determination, not the incident itself
  • Companies explicitly do NOT need to disclose technical details that would impede response/remediation
  • The U.S. Attorney General can delay disclosure by up to 120 days when disclosure would pose a substantial risk to national security or public safety
  • Companies must amend the 8-K when new material information becomes available

The May 2024 shift: After SEC Director Erik Gerding clarified that Item 1.05 is only for material incidents, companies pivoted from Item 1.05 to Items 8.01/7.01 for non-material disclosures:

  • Pre-guidance: 72% used Item 1.05, 28% used 8.01/7.01
  • Post-guidance: 34% used Item 1.05, 66% used 8.01/7.01

Our extraction must capture all three item types.

Compliance Timeline

| Date | Milestone |
| --- | --- |
| Jul 26, 2023 | Rule adopted |
| Sep 5, 2023 | Rule effective |
| Dec 15, 2023 | Item 1C required in 10-Ks (FY ending on/after this date) |
| Dec 18, 2023 | Item 1.05 required in 8-Ks |
| Jun 15, 2024 | Item 1.05 required for smaller reporting companies |
| Dec 15, 2024 | iXBRL tagging of Item 106 (CYD taxonomy) required |
| Dec 18, 2024 | iXBRL tagging of 8-K Item 1.05 required |

iXBRL CYD Taxonomy

The SEC published the Cybersecurity Disclosure (CYD) Taxonomy on Sep 16, 2024. Starting with filings after Dec 15, 2024, Item 1C disclosures are tagged in Inline XBRL using the cyd prefix. This means 2025 filings can be parsed programmatically via XBRL rather than HTML scraping.

Taxonomy schema: http://xbrl.sec.gov/cyd/2024
Taxonomy guide: https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf

Corpus Size

| Filing Type | Estimated Count (as of early 2026) |
| --- | --- |
| 10-K with Item 1C (FY2023 cycle) | ~4,500 |
| 10-K with Item 1C (FY2024 cycle) | ~4,500 |
| 8-K cybersecurity incidents | ~80 filings (55 incidents + amendments) |
| Total filings | ~9,000-10,000 |
| Estimated paragraphs (from Item 1C) | ~50,000-80,000 |

2. Labeling Rubric

Dimension 1: Content Category (single-label per paragraph)

Derived directly from the SEC rule structure. Each paragraph receives exactly one category:

| Category | SEC Basis | What It Covers | Example Markers |
| --- | --- | --- | --- |
| Board Governance | 106(c)(1) | Board/committee oversight, briefing frequency, board cyber expertise | "Audit Committee," "Board of Directors oversees," "quarterly briefings" |
| Management Role | 106(c)(2) | CISO/CTO identification, qualifications, reporting structure | "Chief Information Security Officer," "reports to," "years of experience" |
| Risk Management Process | 106(b) | Assessment/identification processes, ERM integration, framework references | "NIST CSF," "ISO 27001," "risk assessment," "vulnerability management" |
| Third-Party Risk | 106(b) | Vendor oversight, external assessors/consultants, supply chain risk | "third-party," "service providers," "penetration testing by," "external auditors" |
| Incident Disclosure | 8-K 1.05 | Nature/scope/timing of incidents, material impact, remediation | "unauthorized access," "detected," "incident," "remediation," "impacted" |
| Strategy Integration | 106(b)(2) | Material impact on business strategy, cyber insurance, resource allocation | "business strategy," "insurance," "investment," "material," "financial condition" |
| None/Other | — | Boilerplate intros, legal disclaimers, non-cybersecurity content | Forward-looking statement disclaimers, general risk language |

Dimension 2: Specificity (4-point ordinal per paragraph)

Grounded in Berkman et al. (2018), Gibson Dunn surveys, and PwC quality tiers:

| Level | Label | Definition | Decision Test |
| --- | --- | --- | --- |
| 1 | Generic Boilerplate | Could apply to any company. Conditional language ("may," "could"). No named entities. Passive voice. | "Could I paste this into a different company's filing unchanged?" → Yes |
| 2 | Sector-Adapted | References industry context or named frameworks (NIST, ISO) but no firm-specific detail. | "Does this name something specific but not unique to THIS company?" → Yes |
| 3 | Firm-Specific | Names roles (CISO by name), committees, reporting lines, specific programs, or processes unique to the firm. Active voice with accountability. | "Does this contain at least one fact unique to THIS company?" → Yes |
| 4 | Quantified-Verifiable | Includes metrics, dollar amounts, dates, frequencies, third-party audit references, or independently verifiable facts. Multiple firm-specific facts with operational detail. | "Could an outsider verify a specific claim in this paragraph?" → Yes |

Boundary rules for annotators:

  • If torn between 1 and 2: "Does it name ANY framework, standard, or industry term?" → Yes = 2
  • If torn between 2 and 3: "Does it mention anything unique to THIS company?" → Yes = 3
  • If torn between 3 and 4: "Does it contain TWO OR MORE specific, verifiable facts?" → Yes = 4

Important: EvasionBench (Ma et al., 2026) found that a 5-level ordinal scale failed (kappa < 0.5) and had to be collapsed to 3 levels. Pilot test this 4-level scale on 50 paragraphs early. Be prepared to merge levels 1-2 or 3-4 if inter-annotator agreement is poor.

Boilerplate vs. Substantive Markers (from the literature)

Boilerplate indicators:

  • Conditional language: "may," "could," "might"
  • Generic risk statements without company-specific context
  • No named individuals, committees, or frameworks
  • Identical language across same-industry filings (cosine similarity > 0.8)
  • Passive voice: "cybersecurity risks are managed"

Substantive indicators:

  • Named roles and reporting structures ("Our CISO, Jane Smith, reports quarterly to the Audit Committee")
  • Specific frameworks by name (NIST CSF, ISO 27001, SOC 2, PCI-DSS)
  • Concrete processes (penetration testing frequency, tabletop exercises)
  • Quantification (dollar investment, headcount, incident counts, training completion rates)
  • Third-party names or types of assessments
  • Temporal specificity (dates, frequencies, durations)
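These markers can be operationalized as a quick screening heuristic for triage and QA, not as a classifier. The marker lists, regexes, and field names below are illustrative assumptions, not part of the rubric:

```python
import re

# Hypothetical marker patterns distilled from the indicator lists above.
CONDITIONAL = re.compile(r"\b(may|could|might)\b", re.IGNORECASE)
FRAMEWORKS = re.compile(r"\b(NIST(?:\s+CSF)?|ISO\s*27001|SOC\s*2|PCI[- ]?DSS)\b")
QUANTIFIED = re.compile(r"(\$[\d.,]+|[0-9]+%|\bquarterly\b|\bannually\b)", re.IGNORECASE)

def marker_screen(paragraph):
    """Count boilerplate vs. substantive surface markers in one paragraph."""
    return {
        "conditional_hits": len(CONDITIONAL.findall(paragraph)),
        "framework_hits": len(FRAMEWORKS.findall(paragraph)),
        "quantified_hits": len(QUANTIFIED.findall(paragraph)),
        "passive_managed": bool(
            re.search(r"\brisks are managed\b", paragraph, re.IGNORECASE)
        ),
    }

print(marker_screen("We may face cybersecurity risks that could harm us."))
print(marker_screen("Our CISO reports quarterly; we align with NIST CSF and ISO 27001."))
```

Paragraphs with only conditional hits are Level 1 candidates; framework hits without firm-specific facts point to Level 2. Useful for stratifying the gold-set sample and sanity-checking GenAI labels.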

Mapping to NIST CSF 2.0

For academic grounding, our content categories map to NIST CSF 2.0 functions:

| Our Category | NIST CSF 2.0 |
| --- | --- |
| Board Governance | GOVERN (GV.OV, GV.RR) |
| Management Role | GOVERN (GV.RR, GV.RM) |
| Risk Management Process | IDENTIFY (ID.RA), GOVERN (GV.RM), PROTECT (all) |
| Third-Party Risk | GOVERN (GV.SC) |
| Incident Disclosure | DETECT, RESPOND, RECOVER |
| Strategy Integration | GOVERN (GV.OC, GV.RM) |

3. Data Acquisition

3.1 Extracting 10-K Item 1C

Recommended pipeline:

sec-edgar-downloader  →  edgar-crawler  →  paragraph segmentation  →  dataset
  (bulk download)       (parse Item 1C)    (split into units)

Tools:

| Tool | Purpose | Install | Notes |
| --- | --- | --- | --- |
| sec-edgar-downloader | Bulk download 10-K filings by CIK | `pip install sec-edgar-downloader` | Pure downloader, no parsing |
| edgar-crawler | Extract specific item sections to JSON | `git clone github.com/lefterisloukas/edgar-crawler` | Best for bulk extraction; configure `['1C']` in items list |
| edgartools | Interactive exploration, XBRL parsing | `pip install edgartools` | `tenk['Item 1C']` accessor; great for prototyping |
| sec-api | Commercial API, zero parsing headaches | `pip install sec-api` | `extractorApi.get_section(url, "1C", "text")` — paid, free tier available |

EDGAR API requirements:

  • Rate limit: 10 requests/second
  • Required: Custom User-Agent header with name and email (e.g., "TeamName team@email.com")
  • SEC blocks requests without proper User-Agent (returns 403)
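A sketch of a compliant request builder plus a naive client-side throttle. The identity string is a placeholder (swap in your real team name and email), and the `Throttle` class is our own helper, not part of any SEC tooling:

```python
import time
import urllib.request

# SEC requires a descriptive User-Agent with a contact email;
# requests without one are rejected with HTTP 403.
HEADERS = {"User-Agent": "TeamName team@email.com"}  # placeholder identity

class Throttle:
    """Naive client-side limiter: at most `rate` calls per second."""
    def __init__(self, rate=10.0):
        self.min_interval = 1.0 / rate
        self.last = 0.0

    def wait(self):
        # Sleep just long enough to keep under the rate limit.
        delta = time.monotonic() - self.last
        if delta < self.min_interval:
            time.sleep(self.min_interval - delta)
        self.last = time.monotonic()

def edgar_request(url):
    """Build (not send) a request carrying the required header."""
    return urllib.request.Request(url, headers=HEADERS)

req = edgar_request("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany")
```

In the bulk download loop, call `Throttle(10).wait()` before each `urlopen` to stay under EDGAR's 10 requests/second ceiling.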

For iXBRL-tagged filings (2025+): Use edgartools XBRL parser to extract CYD taxonomy elements directly. This gives pre-structured data aligned with regulatory categories.

Fallback corpus: PleIAs/SEC on HuggingFace (373K 10-K full texts, CC0 license) — but sections are NOT pre-parsed; you must extract Item 1C yourself.

3.2 Extracting 8-K Incident Disclosures

| Tool | Purpose | URL |
| --- | --- | --- |
| sec-8k-item105 | Extract Item 1.05 from 8-Ks, iXBRL + HTML fallback | github.com/JMousqueton/sec-8k-item105 |
| SECurityTr8Ker | Monitor SEC RSS for new cyber 8-Ks, Slack/Teams alerts | github.com/pancak3lullz/SECurityTr8Ker |
| Debevoise 8-K Tracker | Curated list with filing links, dates, amendments | debevoisedatablog.com/2024/03/06/cybersecurity-form-8-k-tracker/ |
| Board Cybersecurity Tracker | Links filings to MITRE ATT&CK, impact assessments | board-cybersecurity.com/incidents/tracker |

Critical: Must capture Item 1.05 AND Items 8.01/7.01 (post-May 2024 shift).

3.3 Paragraph Segmentation

Once Item 1C text is extracted, segment into paragraphs:

  • Split on double newlines or <p> tags (depending on extraction format)
  • Minimum paragraph length: 20 words (filter out headers, whitespace)
  • Maximum paragraph length: 500 words (split longer blocks at sentence boundaries)
  • Preserve metadata: company name, CIK, ticker, filing date, fiscal year

Expected yield: ~5-8 paragraphs per Item 1C disclosure × ~9,000 filings = ~50,000-70,000 paragraphs
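A minimal sketch of these segmentation rules. The sentence-boundary regex is a simplification (a production pipeline might use a proper sentence splitter), and the function names are ours:

```python
import re

MIN_WORDS, MAX_WORDS = 20, 500

def segment(text):
    """Split extracted Item 1C text into paragraph units per the rules above."""
    out = []
    for block in re.split(r"\n\s*\n", text):       # split on blank lines
        block = " ".join(block.split())            # normalize whitespace
        words = block.split()
        if len(words) < MIN_WORDS:                 # drop headers/fragments
            continue
        if len(words) <= MAX_WORDS:
            out.append(block)
            continue
        # Oversized block: split at sentence boundaries, greedily repack.
        sentences = re.split(r"(?<=[.!?])\s+", block)
        chunk = []
        for s in sentences:
            if chunk and len(" ".join(chunk).split()) + len(s.split()) > MAX_WORDS:
                out.append(" ".join(chunk))
                chunk = []
            chunk.append(s)
        if chunk and len(" ".join(chunk).split()) >= MIN_WORDS:
            out.append(" ".join(chunk))
    return out
```

Metadata (company, CIK, ticker, filing date, fiscal year) should be attached to each emitted unit by the caller so it survives into the labeled dataset.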

3.4 Pre-Existing Datasets and Resources

| Resource | What It Is | URL |
| --- | --- | --- |
| PleIAs/SEC | 373K full 10-K texts (CC0) | huggingface.co/datasets/PleIAs/SEC |
| EDGAR-CORPUS | 220K filings with sections pre-parsed (Apache 2.0) | huggingface.co/datasets/eloukas/edgar-corpus |
| Board Cybersecurity 23-Feature Analysis | Regex-based extraction of 23 governance/security features from 4,538 10-Ks | board-cybersecurity.com/research/insights/ |
| Gibson Dunn S&P 100 Survey | Detailed feature analysis of disclosure content | corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-... |
| Florackis et al. (2023) "Cybersecurity Risk" | Firm-level cyber risk measure from 10-K text, RFS publication | SSRN: 3725130, data companion: 4319606 |
| zeroshot/cybersecurity-corpus | General cybersecurity text (not SEC-specific, useful for DAPT) | huggingface.co/datasets/zeroshot/cybersecurity-corpus |

4. GenAI Labeling Pipeline

4.1 Multi-Model Consensus (EvasionBench Architecture)

We follow Ma et al. (2026, arXiv:2601.09142) — the EvasionBench pipeline designed for an almost identical task (ordinal classification of financial text). Their approach achieved Cohen's Kappa = 0.835 with human annotators.

Stage 1 — Dual Independent Annotation (all ~50K paragraphs):

  • Annotator A: Claude Sonnet 4.6 (batch API — $1.50/$7.50 per M input/output tokens)
  • Annotator B: Gemini 2.5 Flash ($0.30/$2.50 per M tokens)
  • Architectural diversity (Anthropic vs. Google) minimizes correlated errors
  • ~83% of paragraphs will have immediate agreement

Stage 2 — Judge Panel for Disagreements (~17% = ~8,500 cases):

  • Judge 1: Claude Opus 4.6 (batch — $2.50/$12.50 per M tokens)
  • Judge 2: GPT-5 (batch — $0.63/$5.00 per M tokens)
  • Judge 3: Gemini 2.5 Pro (~$2-4/$12-18 per M tokens)
  • Majority vote (2/3) resolves disagreements
  • Anti-bias: randomize label presentation order

Stage 3 — Active Learning Pass:

  • Cluster remaining low-confidence cases
  • Human-review ~5% (~2,500 cases) to identify systematic errors
  • Iterate rubric if needed, re-run affected subsets
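With the model calls abstracted away, the Stage 1/Stage 2 control flow reduces to a few lines. A sketch using hypothetical label dicts (the real pipeline would also carry confidence scores, reasoning strings, and logging):

```python
from collections import Counter

def resolve(annotator_a, annotator_b, judges=None):
    """Stage 1: accept on agreement; Stage 2: 2-of-3 judge majority otherwise.

    Each label is a dict: {"content_category": str, "specificity_level": int}.
    Returns (label, source); label is None when human review is needed.
    """
    if annotator_a == annotator_b:
        return annotator_a, "stage1_agreement"
    if not judges:
        return None, "needs_judges"
    votes = Counter(
        (j["content_category"], j["specificity_level"]) for j in judges
    )
    (cat, level), n = votes.most_common(1)[0]
    if n >= 2:  # majority of the three judges
        return {"content_category": cat, "specificity_level": level}, "stage2_majority"
    return None, "needs_human_review"  # 3-way split -> Stage 3 active learning
```

Three-way judge splits fall through to the Stage 3 active-learning queue rather than being force-resolved.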

4.2 Prompt Template

SYSTEM PROMPT:
You are an expert annotator classifying paragraphs from SEC cybersecurity
disclosures (10-K Item 1C and 8-K Item 1.05 filings).

For each paragraph, assign:
(a) content_category: exactly one of ["Board Governance", "Management Role",
    "Risk Management Process", "Third-Party Risk", "Incident Disclosure",
    "Strategy Integration", "None/Other"]
(b) specificity_level: integer 1-4

CONTENT CATEGORIES:
- Board Governance: Board/committee oversight of cybersecurity risks, briefing
  frequency, board member cyber expertise
- Management Role: CISO/CTO/CIO identification, qualifications, reporting
  structure, management committees
- Risk Management Process: Risk assessment methodology, framework adoption
  (NIST, ISO, etc.), vulnerability management, monitoring, incident response
  planning, tabletop exercises
- Third-Party Risk: Vendor/supplier risk oversight, external assessor engagement,
  contractual security requirements, supply chain risk
- Incident Disclosure: Description of cybersecurity incidents, scope, timing,
  impact, remediation actions
- Strategy Integration: Material impact on business strategy or financials,
  cyber insurance, investment/resource allocation
- None/Other: Boilerplate introductions, legal disclaimers, forward-looking
  statement warnings, non-cybersecurity content

SPECIFICITY SCALE:
1 - Generic Boilerplate: Could apply to any company. Conditional language
    ("may," "could"). No named entities.
    Example: "We face cybersecurity risks that could materially affect our
    business operations."

2 - Sector-Adapted: References industry context or named frameworks but no
    firm-specific details.
    Example: "We employ a cybersecurity framework aligned with the NIST
    Cybersecurity Framework to manage cyber risk."

3 - Firm-Specific: Contains facts unique to this company — named roles,
    committees, specific programs, reporting lines.
    Example: "Our CISO reports quarterly to the Audit Committee on
    cybersecurity risk posture and incident trends."

4 - Quantified-Verifiable: Includes metrics, dollar amounts, dates,
    frequencies, third-party audit references, or independently verifiable facts.
    Example: "Following the March 2024 incident affecting our payment systems,
    we engaged CrowdStrike and implemented network segmentation at a cost of
    $4.2M, completing remediation in Q3 2024."

BOUNDARY RULES:
- If torn between 1 and 2: "Does it name ANY framework, standard, or industry
  term?" If yes → 2
- If torn between 2 and 3: "Does it mention anything unique to THIS company?"
  If yes → 3
- If torn between 3 and 4: "Does it contain TWO OR MORE specific, verifiable
  facts?" If yes → 4

Respond with valid JSON only. Include a brief reasoning field.

USER PROMPT:
Company: {company_name}
Filing Date: {filing_date}
Paragraph:
{paragraph_text}

Expected output:

{
  "content_category": "Board Governance",
  "specificity_level": 3,
  "reasoning": "Identifies Audit Committee by name and describes quarterly briefing cadence, both firm-specific facts."
}
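Model responses should be validated against this schema before they enter the dataset. A minimal validator sketch (function name is ours; error handling deliberately simplified):

```python
import json

CATEGORIES = {"Board Governance", "Management Role", "Risk Management Process",
              "Third-Party Risk", "Incident Disclosure", "Strategy Integration",
              "None/Other"}

def parse_annotation(raw):
    """Return the parsed label dict, or None if the response is malformed."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if obj.get("content_category") not in CATEGORIES:
        return None
    if obj.get("specificity_level") not in (1, 2, 3, 4):
        return None
    return obj
```

Responses that return None go back into the retry queue; with structured-output mode enabled this path should be rare.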

4.3 Practical Labeling Notes

  • Always use the Batch API. Both OpenAI and Anthropic offer a 50% discount for async/batch processing (24-hour turnaround). There is no reason to use real-time calls here.
  • Prompt caching: The system prompt (~800 tokens) is identical for every request. With Anthropic's prompt caching, cached reads cost 10% of base price. Combined with batch discount = 5% of standard price.
  • Structured output mode: Use JSON mode / structured outputs on all providers. Reduces parsing errors by ~90%.
  • Reasoning models (o3, extended thinking): Use ONLY as judges for disagreement cases, not as primary annotators. They're overkill for clear-cut classification and expensive due to reasoning token consumption.

4.4 Gold Set Protocol

Non-negotiable for publication quality.

  1. Sample 300-500 paragraphs, stratified by:

    • Expected content category (ensure all 7 represented)
    • Expected specificity level (ensure all 4 represented)
    • Industry (financial services, tech, healthcare, manufacturing)
    • Filing year (FY2023 vs FY2024)
  2. Two team members independently label the full gold set

  3. Compute:

    • Cohen's Kappa (binary/nominal categories)
    • Krippendorff's Alpha (ordinal specificity scale)
    • Per-class confusion matrices
    • Target: Kappa > 0.75 ("substantial agreement")
  4. Adjudicate disagreements with a third team member

  5. Run the full MMC pipeline on the gold set and compare
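Cohen's Kappa is simple enough to compute from first principles; a sketch for the two human passes is below (for the ordinal Krippendorff's Alpha, use an existing implementation, e.g. the `krippendorff` package on PyPI). The function name is ours:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Kappa = (p_o - p_e) / (1 - p_e): observed vs. chance agreement.

    Undefined when p_e == 1 (both annotators use a single identical class).
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal class rates.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Compute it per dimension (content category and specificity separately); the 0.75 target applies to each.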


5. Model Strategy

5.1 Primary: SEC-ModernBERT-large

This model does not exist publicly. Building it is a core contribution.

Base model: answerdotai/ModernBERT-large

  • 395M parameters
  • 8,192-token native context (vs. 512 for DeBERTa-v3-large)
  • RoPE + alternating local/global attention + FlashAttention
  • 2-4x faster than DeBERTa-v3-large
  • Apache 2.0 license
  • GLUE: 90.4 (only 1 point behind DeBERTa-v3-large's 91.4)

Step 1 — Domain-Adaptive Pre-Training (DAPT):

Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":

  • Training corpus: 200-500M tokens of SEC filings (from PleIAs/SEC or your own EDGAR download). Include 10-Ks, 10-Qs, 8-Ks, proxy statements.
  • MLM objective: 30% masking rate (ModernBERT convention)
  • Learning rate: ~5e-5 (much lower than from-scratch pre-training)
  • Hardware (RTX 3090): bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32
  • VRAM estimate: ~12-15GB at batch=4, seq=2048 with gradient checkpointing — fits on 3090
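The 30% masking objective is easy to illustrate on a toy token list. This is a pure-Python stand-in for what `DataCollatorForLanguageModeling(mlm_probability=0.3)` does during DAPT (the real collator also applies the standard 80/10/10 mask/random/keep replacement scheme, omitted here):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_for_mlm(tokens, mask_rate=0.30, seed=0):
    """Mask ~30% of positions; labels keep originals only at masked spots."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK_TOKEN)
            labels.append(tok)       # model must predict the original here
        else:
            inputs.append(tok)
            labels.append(None)      # position ignored by the MLM loss
    return inputs, labels
```

The loss is computed only over the masked positions, which is why a higher masking rate gives the model more supervision signal per sequence.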

Evidence DAPT works:

  • Gururangan et al. (2020): consistent improvements across all tested domains
  • Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
  • Scaling-law analysis on SEC filings (arXiv:2512.12384): consistent improvement with largest gains in first 200M tokens
  • Databricks customer report: 70% → 95% accuracy with domain-specific pre-training

Step 2 — Classification Fine-Tuning:

Fine-tune SEC-ModernBERT-large on the 50K labeled paragraphs:

  • Sequence length: 2048 tokens (captures full regulatory paragraphs that 512-token models truncate)
  • Two classification heads: content_category (7-class softmax) + specificity_level (4-class ordinal or softmax)
  • Add supervised contrastive loss (SCL): Combine standard cross-entropy with SCL that pulls same-class embeddings together. Gunel et al. (2020) showed +0.5-1.5% improvement, especially for rare/imbalanced classes.
  • VRAM: ~11-13GB at batch=8, seq=2048 in bf16 — comfortable on 3090
  • 3090 supports bf16 natively via Ampere Tensor Cores. Use bf16=True in HuggingFace Trainer. No loss scaling needed (unlike fp16).
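The SCL term can be sketched in NumPy on toy embeddings. The formulation follows the Khosla et al. (2020) form that Gunel et al. adopt; the toy data and function name are ours, and in practice this would sit alongside cross-entropy inside a custom `Trainer.compute_loss`:

```python
import numpy as np

def supcon_loss(emb, labels, tau=0.1):
    """Supervised contrastive loss: pull same-class embeddings together.

    For each anchor i, positives are other samples sharing its label;
    the denominator ranges over all other samples in the batch.
    """
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # L2-normalize
    sim = z @ z.T / tau                                    # scaled similarities
    n = len(labels)
    total, count = 0.0, 0
    for i in range(n):
        pos = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not pos:
            continue  # anchors with no same-class partner contribute nothing
        denom = np.sum(np.exp([sim[i, a] for a in range(n) if a != i]))
        total += -np.mean([sim[i, p] - np.log(denom) for p in pos])
        count += 1
    return total / max(count, 1)
```

Batches where same-class embeddings cluster tightly score lower than batches where classes are interleaved, which is exactly the gradient pressure that helps the rare categories.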

5.2 Dark Horse: NeoBERT

chandar-lab/NeoBERT

  • 250M parameters (~145M fewer than ModernBERT-large, ~185M fewer than DeBERTa-v3-large)
  • 4,096-token context
  • SwiGLU, RoPE, Pre-RMSNorm, FlashAttention
  • GLUE: 89.0 (close to DeBERTa-v3-large's 91.4)
  • MTEB: 51.3 (crushes everything else — ModernBERT-large is 46.9)
  • MIT license
  • Requires trust_remote_code=True
  • Almost nobody is using it for domain-specific tasks

Same DAPT + fine-tuning pipeline as ModernBERT-large, with even less VRAM.

5.3 Baseline: DeBERTa-v3-large

microsoft/deberta-v3-large

  • 304M backbone + 131M embedding = ~435M total
  • 512-token native context (can push to ~1024)
  • Disentangled attention + ELECTRA-style RTD pre-training
  • GLUE: 91.4 — still the highest among all encoders
  • MIT license
  • Weakness: no long context support, completely fails at retrieval tasks

Include as baseline to show improvement from (a) long context and (b) DAPT.

5.4 Ablation Design

| Experiment | Model | Context | DAPT | SCL | Purpose |
| --- | --- | --- | --- | --- | --- |
| Baseline | DeBERTa-v3-large | 512 | No | No | "Standard" approach per syllabus |
| + Long context | ModernBERT-large | 2048 | No | No | Shows context window benefit |
| + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | Shows DAPT benefit |
| + Contrastive | SEC-ModernBERT-large | 2048 | Yes | Yes | Shows SCL benefit |
| Efficiency | NeoBERT (+ DAPT) | 2048 | Yes | Yes | ~37% fewer params, comparable? |
| Ensemble | SEC-ModernBERT + DeBERTa | mixed | mixed | mixed | Maximum performance |

The ensemble averages logits from SEC-ModernBERT-large (long context, domain-adapted) and DeBERTa-v3-large (highest raw NLU). Their architecturally different attention mechanisms mean uncorrelated errors.
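Logit averaging itself is a one-liner. A toy sketch with made-up logits (2 paragraphs × 7 content categories; in practice the logits come from each fine-tuned model's forward pass on the same paragraph):

```python
import numpy as np

def ensemble_predict(logits_modernbert, logits_deberta):
    """Average per-class logits from the two encoders, then argmax."""
    avg = (logits_modernbert + logits_deberta) / 2.0
    return avg.argmax(axis=-1)

# Toy example: 2 paragraphs x 7 content categories.
a = np.array([[2.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0]])
b = np.array([[1.5, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 2.5, 2.0, 0.0, 0.0, 0.0, 0.0]])
print(ensemble_predict(a, b))  # -> [0 2]
```

Note the second paragraph: DeBERTa alone would flip toward class 1, but ModernBERT's stronger class-2 logit carries the average, which is the uncorrelated-errors benefit in action.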

5.5 Training Framework

  • Encoder fine-tuning: HuggingFace transformers + Trainer with AutoModelForSequenceClassification
  • DAPT continued pre-training: HuggingFace transformers with DataCollatorForLanguageModeling
  • SCL implementation: Custom training loop or modify Trainer with dual loss
  • Few-shot prototyping: SetFit (sentence-transformers based) for rapid baseline in <30 seconds

Key reference: Phil Schmid's ModernBERT fine-tuning tutorial: https://www.philschmid.de/fine-tune-modern-bert-in-2025

5.6 Domain-Specific Encoder Models (for comparison only)

These exist but are all BERT-base (110M params, 512 context) — architecturally outdated:

| Model | HuggingFace ID | Domain | Params |
| --- | --- | --- | --- |
| SEC-BERT | nlpaueb/sec-bert-base | 260K 10-K filings | 110M |
| SEC-BERT-SHAPE | nlpaueb/sec-bert-shape | Same, with number normalization | 110M |
| FinBERT | ProsusAI/finbert | Financial sentiment | 110M |
| Legal-BERT | nlpaueb/legal-bert-base-uncased | 12GB legal text | 110M |
| SecureBERT | arXiv:2204.02685 | Cybersecurity text | 110M |

Our DAPT approach on a modern architecture (ModernBERT-large or NeoBERT) will outperform all of these. Include SEC-BERT as an additional baseline if time permits.


6. Evaluation & Validation

6.1 Required Metrics (from syllabus)

| Metric | Target | Notes |
| --- | --- | --- |
| Macro-F1 on human holdout | Report per-class and overall | Minimum 1.2K holdout examples |
| Per-class F1 | Identify weak categories | Expect "None/Other" to be noisiest |
| Krippendorff's Alpha | > 0.67 (adequate), > 0.75 (good) | GenAI labels vs. human gold set |
| Calibration plots | Reliability diagrams | For probabilistic outputs (softmax) |
| Robustness splits | Report by time period, industry, filing size | FY2023 vs FY2024; GICS sector; word count quartiles |

6.2 Downstream Validity Tests

These demonstrate that the classifier's predictions correlate with real-world outcomes:

Test 1 — Breach Prediction (strongest):

  • Do firms with lower specificity scores subsequently appear in breach databases?
  • Cross-reference with:
    • Privacy Rights Clearinghouse (80K+ breaches; Mendeley dataset provides ticker/CIK matching: doi.org/10.17632/w33nhh3282.1)
    • VCDB (8K+ incidents, VERIS schema: github.com/vz-risk/VCDB)
    • Board Cybersecurity Incident Tracker (direct SEC filing links: board-cybersecurity.com/incidents/tracker)
    • CISA KEV Catalog (known exploited vulnerabilities: cisa.gov/known-exploited-vulnerabilities-catalog)

Test 2 — Market Reaction (if time permits):

  • Event study: abnormal returns in [-1, +3] window around 8-K Item 1.05 filing
  • Does prior Item 1C disclosure quality predict magnitude of reaction?
  • Small sample (~55 incidents) but high signal
  • Regression: CAR = f(specificity_score, incident_severity, firm_size, industry)

Test 3 — Known-Groups Validity (easy, always include):

  • Do regulated industries (financial services under NYDFS, healthcare under HIPAA) produce systematically higher-specificity disclosures?
  • Do larger firms (by market cap) have more specific disclosures?
  • These are expected results — confirming them validates the measure

Test 4 — Boilerplate Index (easy, always include):

  • Compute cosine similarity of each company's Item 1C to the industry-median disclosure
  • Does our specificity score inversely correlate with this similarity measure?
  • This is an independent, construct-free validation of the "uniqueness" dimension
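A bag-of-words sketch of the similarity computation (raw term-frequency vectors for brevity; the real index would use TF-IDF over full Item 1C texts compared against the industry-median disclosure):

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two disclosure texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

boiler = "we face cybersecurity risks that could materially affect our business"
peer   = "we face cybersecurity risks that could materially affect our operations"
print(round(cosine(boiler, peer), 2))  # -> 0.9
```

A near-duplicate pair like the one above lands right at the 0.8+ boilerplate threshold used in the rubric section; firm-specific disclosures should sit well below it.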

6.3 External Benchmark

Per syllabus: "include an external benchmark approach (i.e., previous best practice)."

  • Board Cybersecurity's 23-feature regex extraction is the natural benchmark. Their binary (present/absent) feature coding is the prior best practice. Our classifier should capture everything their regex captures plus the quality/specificity dimension they cannot measure.
  • Florackis et al. (2023) cybersecurity risk measure from Item 1A text is another comparison — different section (1A vs 1C), different methodology (dictionary vs. classifier), different era (pre-rule vs. post-rule).

7. Release Artifacts

By project end, publish:

  1. HuggingFace Dataset: Extracted Item 1C paragraphs with labels — first public dataset of its kind
  2. SEC-ModernBERT-large: Domain-adapted model weights — first SEC-specific ModernBERT
  3. Fine-tuned classifiers: Content category + specificity models, ready to deploy
  4. Labeling rubric + prompt templates: Reusable for future SEC disclosure research
  5. Extraction pipeline code: EDGAR → structured paragraphs → labeled dataset
  6. Evaluation notebook: All metrics, ablations, validation tests

8. 3-Week Schedule (6 People)

Team Roles

| Role | Person(s) | Primary Responsibility |
| --- | --- | --- |
| Data Lead | Person A | EDGAR extraction pipeline, paragraph segmentation, data cleaning |
| Data Support | Person B | 8-K extraction, breach database cross-referencing, dataset QA |
| Labeling Lead | Person C | Rubric refinement, GenAI prompt engineering, MMC pipeline orchestration |
| Annotation | Person D | Gold set human labeling, inter-rater reliability, active learning review |
| Model Lead | Person E | DAPT pre-training, classification fine-tuning, ablation experiments |
| Eval & Writing | Person F | Validation tests, metrics computation, final presentation, documentation |

Week 1: Data + Rubric

| Day | Person A (Data Lead) | Person B (Data Support) | Person C (Labeling Lead) | Person D (Annotation) | Person E (Model Lead) | Person F (Eval & Writing) |
| --- | --- | --- | --- | --- | --- | --- |
| Mon | Set up EDGAR extraction pipeline (edgar-crawler + sec-edgar-downloader) | Set up 8-K extraction (sec-8k-item105) | Draft labeling rubric v1 from SEC rule | Read SEC rule + Gibson Dunn survey | Download ModernBERT-large, set up training env | Outline evaluation plan, identify breach databases |
| Tue | Begin bulk 10-K download (FY2023 cycle) | Extract all 8-K cyber filings (Items 1.05, 8.01, 7.01) | Pilot rubric on 30 paragraphs with Claude Opus | Pilot rubric on same 30 paragraphs independently | Download PleIAs/SEC corpus, prepare DAPT data | Download PRC Mendeley dataset, VCDB, set up cross-ref |
| Wed | Continue download (FY2024 cycle), begin Item 1C parsing | Build company metadata table (CIK → ticker → GICS sector → market cap) | Compare pilot labels with Person D, revise rubric boundary rules | Compute initial inter-rater agreement, flag problem areas | Begin DAPT pre-training (SEC-ModernBERT-large, ~2-3 days on 3090) | Map VCDB incidents to SEC filers by name matching |
| Thu | Paragraph segmentation pipeline, quality checks | Merge 8-K incidents with Board Cybersecurity Tracker data | Rubric v2 finalized; set up batch API calls for dual annotation | Begin gold set sampling (300-500 paragraphs, stratified) | DAPT continues (monitor loss, checkpoint) | Draft presentation outline |
| Fri | Milestone: Full paragraph corpus ready (~50K+ paragraphs) | Milestone: 8-K incident dataset complete | Launch Stage 1 dual annotation (Sonnet + Gemini Flash) on full corpus | Continue gold set labeling (target: finish 150/300) | DAPT continues | Milestone: Evaluation framework + breach cross-ref ready |

Week 2: Labeling + Training

| Day | Person A | Person B | Person C | Person D | Person E | Person F |
| --- | --- | --- | --- | --- | --- | --- |
| Mon | Data cleaning — fix extraction errors, handle edge cases | Assist Person D with gold set labeling (second annotator) | Monitor dual annotation results (should be ~60% complete) | Continue gold set labeling, begin second pass | DAPT finishes; begin DeBERTa-v3-large baseline fine-tuning | Compute gold set inter-rater reliability (Kappa, Alpha) |
| Tue | Build train/holdout split logic (stratified by industry, year, specificity) | Continue gold set second-annotator pass | Dual annotation complete → extract disagreements (~17%) | Finish gold set, adjudicate disagreements with Person C | Baseline results in; begin ModernBERT-large (no DAPT) fine-tuning | Analyze gold set confusion patterns, recommend rubric tweaks |
| Wed | Final dataset assembly | Assist Person C with judge panel setup | Launch Stage 2 judge panel (Opus + GPT-5 + Gemini Pro) on disagreements | Run MMC pipeline on gold set, compare with human labels | ModernBERT-large done; begin SEC-ModernBERT-large fine-tuning | Milestone: Gold set validated, Kappa computed |
| Thu | Prepare HuggingFace dataset card | Begin active learning — cluster low-confidence cases | Judge panel results in; assemble final labeled dataset | Human-review ~500 low-confidence cases from active learning | SEC-ModernBERT-large done; begin NeoBERT experiment | Robustness split analysis (by industry, year, filing size) |
| Fri | Milestone: Labeled dataset finalized (~50K paragraphs) | Milestone: Active learning pass complete | QA final labels — spot-check 100 random samples | Assist Person E with evaluation | Begin ensemble experiment (SEC-ModernBERT + DeBERTa) | Milestone: All baseline + ablation training complete |

Week 3: Evaluation + Presentation

| Day | Person A | Person B | Person C | Person D | Person E | Person F |
| --- | --- | --- | --- | --- | --- | --- |
| Mon | Publish dataset to HuggingFace | Run breach prediction validation (PRC + VCDB cross-ref) | Write labeling methodology section | Calibration plots for all models | Final ensemble tuning; publish model weights to HuggingFace | Compile all metrics into evaluation tables |
| Tue | Write data acquisition section | Run known-groups validity (industry, size effects) | Write GenAI labeling section | Boilerplate index validation (cosine similarity) | Write model strategy section | Draft full results section |
| Wed | Code cleanup, README for extraction pipeline | Market reaction analysis if feasible (optional) | Review/edit all written sections | Create figures: confusion matrices, calibration plots | Review/edit model section | Assemble presentation slides |
| Thu | Full team: review presentation, rehearse, polish | | | | | |
| Fri | Presentation day | | | | | |

Critical Path & Dependencies

Week 1:
  Data extraction (A,B) ──────────────────┐
  Rubric design (C,D) ───→ Pilot test ───→ Rubric v2 ──→ GenAI labeling launch (Fri)
  DAPT pre-training (E) ──────────────────────────────────→ (continues into Week 2)
  Eval framework (F) ─────────────────────────────────────→ (ready for Week 2)

Week 2:
  GenAI labeling (C) ───→ Judge panel ───→ Active learning ───→ Final labels (Fri)
  Gold set (D + B) ──────────────────────→ Validated (Wed)
  Fine-tuning experiments (E) ───→ Baseline → ModernBERT → SEC-ModernBERT → NeoBERT → Ensemble
  Metrics (F) ───────────────────→ Robustness splits

Week 3:
  Validation tests (B,D,F) ───→ Breach prediction, known-groups, boilerplate index
  Writing (all) ──────────────→ Sections → Review → Presentation
  Release (A,E) ──────────────→ HuggingFace dataset + model weights

9. Budget

| Item | Cost |
| --- | --- |
| GenAI labeling — Stage 1 dual annotation (50K × 2 models, batch) | ~$115 |
| GenAI labeling — Stage 2 judge panel (~8.5K × 3 models, batch) | ~$55 |
| Prompt caching savings | -$30 to -$40 |
| SEC EDGAR data | $0 (public domain) |
| Breach databases (PRC open data, VCDB, CISA KEV) | $0 |
| Compute (RTX 3090, already owned) | $0 |
| Total | ~$130-170 |

For comparison, human annotation at $0.50/label would cost $25,000+ for single-annotated, $75,000+ for triple-annotated.


10. Reference Links

SEC Rule & Guidance

Law Firm Surveys & Analysis

Data Extraction Tools

Datasets

Models

Key Papers

Methodological Playbook