Project 3: SEC Cybersecurity Disclosure Quality Classifier
Capstone 2026 — BUSI488/COMP488 — Team Knowledge Transfer
Project: Build a validated, reusable classifier that labels SEC cybersecurity disclosures by content category and specificity level, then fine-tune an open-weights model for deployment at scale.
Methodology: Ringel (2023) "Synthetic Experts" pipeline — use frontier LLMs to generate training labels, then distill into a small open-weights encoder model.
Why this project: No HuggingFace dataset of extracted Item 1C disclosures exists. No trained classifier for cybersecurity disclosure quality exists. No domain-adapted ModernBERT on SEC filings exists. The iXBRL CYD taxonomy just went live (Dec 2024). We produce three publishable artifacts: a novel dataset, a labeling methodology, and a SOTA classifier.
Table of Contents
- Regulatory Background
- Labeling Rubric
- Data Acquisition
- GenAI Labeling Pipeline
- Model Strategy
- Evaluation & Validation
- Release Artifacts
- 3-Week Schedule (6 People)
- Budget
- Reference Links
1. Regulatory Background
The Rule: SEC Release 33-11216 (July 2023)
The SEC adopted final rules requiring public companies to disclose cybersecurity risk management, strategy, governance, and material incidents. This created a massive new text corpus with natural variation in quality — perfect for classification.
Full rule PDF: https://www.sec.gov/files/rules/final/2023/33-11216.pdf
Fact sheet: https://www.sec.gov/files/33-11216-fact-sheet.pdf
Item 1C — Annual Disclosure (10-K)
Appears as Regulation S-K Item 106, reported in Item 1C of the 10-K. Two mandated subsections:
Item 106(b) — Risk Management and Strategy:
- Processes for assessing, identifying, and managing material cybersecurity risks
- Whether/how cybersecurity processes integrate into overall enterprise risk management (ERM)
- Whether the company engages external assessors, consultants, or auditors
- Processes to oversee/identify risks from third-party service providers
- Whether cybersecurity risks (including prior incidents) have materially affected or are reasonably likely to affect business strategy, results, or financial condition
Item 106(c) — Governance:
Board Oversight (106(c)(1)):
- Description of board's oversight of cybersecurity risks
- Identification of responsible board committee/subcommittee
- Processes by which the board/committee is informed about risks
Management's Role (106(c)(2)):
- Which management positions/committees are responsible
- Relevant expertise of those persons
- How management monitors prevention, detection, mitigation, and remediation
- Whether and how frequently management reports to the board
Key design note: The SEC uses "describe" — it does not prescribe specific items. The enumerated sub-items are non-exclusive suggestions. This principles-based approach creates natural variation in specificity and content, which is exactly what our rubric captures.
Item 1.05 — Incident Disclosure (8-K)
Required within 4 business days of determining a cybersecurity incident is material:
- Material aspects of the nature, scope, and timing of the incident
- Material impact or reasonably likely material impact on the registrant
Key nuances:
- The 4-day clock starts at the materiality determination, not the incident itself
- Companies explicitly do NOT need to disclose technical details that would impede response/remediation
- The U.S. Attorney General can delay disclosure by up to 120 days on national security or public safety grounds
- Companies must amend the 8-K when new material information becomes available
The May 2024 shift: After SEC Director Erik Gerding clarified that Item 1.05 is only for material incidents, companies pivoted from Item 1.05 to Items 8.01/7.01 for non-material disclosures:
- Pre-guidance: 72% used Item 1.05, 28% used 8.01/7.01
- Post-guidance: 34% used Item 1.05, 66% used 8.01/7.01
Our extraction must capture all three item types.
Compliance Timeline
| Date | Milestone |
|---|---|
| Jul 26, 2023 | Rule adopted |
| Sep 5, 2023 | Rule effective |
| Dec 15, 2023 | Item 1C required in 10-Ks (FY ending on/after this date) |
| Dec 18, 2023 | Item 1.05 required in 8-Ks |
| Jun 15, 2024 | Item 1.05 required for smaller reporting companies |
| Dec 15, 2024 | iXBRL tagging of Item 106 (CYD taxonomy) required |
| Dec 18, 2024 | iXBRL tagging of 8-K Item 1.05 required |
iXBRL CYD Taxonomy
The SEC published the Cybersecurity Disclosure (CYD) Taxonomy on Sep 16, 2024. Starting with filings after Dec 15, 2024, Item 1C disclosures are tagged in Inline XBRL using the cyd prefix. This means 2025 filings can be parsed programmatically via XBRL rather than HTML scraping.
Taxonomy schema: http://xbrl.sec.gov/cyd/2024
Taxonomy guide: https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf
Corpus Size
| Filing Type | Estimated Count (as of early 2026) |
|---|---|
| 10-K with Item 1C (FY2023 cycle) | ~4,500 |
| 10-K with Item 1C (FY2024 cycle) | ~4,500 |
| 8-K cybersecurity incidents | ~80 filings (55 incidents + amendments) |
| Total filings | ~9,000-10,000 |
| Estimated paragraphs (from Item 1C) | ~50,000-80,000 |
2. Labeling Rubric
Dimension 1: Content Category (single-label per paragraph)
Derived directly from the SEC rule structure. Each paragraph receives exactly one category:
| Category | SEC Basis | What It Covers | Example Markers |
|---|---|---|---|
| Board Governance | 106(c)(1) | Board/committee oversight, briefing frequency, board cyber expertise | "Audit Committee," "Board of Directors oversees," "quarterly briefings" |
| Management Role | 106(c)(2) | CISO/CTO identification, qualifications, reporting structure | "Chief Information Security Officer," "reports to," "years of experience" |
| Risk Management Process | 106(b) | Assessment/identification processes, ERM integration, framework references | "NIST CSF," "ISO 27001," "risk assessment," "vulnerability management" |
| Third-Party Risk | 106(b) | Vendor oversight, external assessors/consultants, supply chain risk | "third-party," "service providers," "penetration testing by," "external auditors" |
| Incident Disclosure | 8-K 1.05 | Nature/scope/timing of incidents, material impact, remediation | "unauthorized access," "detected," "incident," "remediation," "impacted" |
| Strategy Integration | 106(b)(2) | Material impact on business strategy, cyber insurance, resource allocation | "business strategy," "insurance," "investment," "material," "financial condition" |
| None/Other | — | Boilerplate intros, legal disclaimers, non-cybersecurity content | Forward-looking statement disclaimers, general risk language |
Dimension 2: Specificity (4-point ordinal per paragraph)
Grounded in Berkman et al. (2018), Gibson Dunn surveys, and PwC quality tiers:
| Level | Label | Definition | Decision Test |
|---|---|---|---|
| 1 | Generic Boilerplate | Could apply to any company. Conditional language ("may," "could"). No named entities. Passive voice. | "Could I paste this into a different company's filing unchanged?" → Yes |
| 2 | Sector-Adapted | References industry context or named frameworks (NIST, ISO) but no firm-specific detail. | "Does this name something specific but not unique to THIS company?" → Yes |
| 3 | Firm-Specific | Names roles (CISO by name), committees, reporting lines, specific programs, or processes unique to the firm. Active voice with accountability. | "Does this contain at least one fact unique to THIS company?" → Yes |
| 4 | Quantified-Verifiable | Includes metrics, dollar amounts, dates, frequencies, third-party audit references, or independently verifiable facts. Multiple firm-specific facts with operational detail. | "Could an outsider verify a specific claim in this paragraph?" → Yes |
Boundary rules for annotators:
- If torn between 1 and 2: "Does it name ANY framework, standard, or industry term?" → Yes = 2
- If torn between 2 and 3: "Does it mention anything unique to THIS company?" → Yes = 3
- If torn between 3 and 4: "Does it contain TWO OR MORE specific, verifiable facts?" → Yes = 4
Important: EvasionBench (Ma et al., 2026) found that a 5-level ordinal scale failed (kappa < 0.5) and had to be collapsed to 3 levels. Pilot test this 4-level scale on 50 paragraphs early. Be prepared to merge levels 1-2 or 3-4 if inter-annotator agreement is poor.
Boilerplate vs. Substantive Markers (from the literature)
Boilerplate indicators:
- Conditional language: "may," "could," "might"
- Generic risk statements without company-specific context
- No named individuals, committees, or frameworks
- Identical language across same-industry filings (cosine similarity > 0.8)
- Passive voice: "cybersecurity risks are managed"
Substantive indicators:
- Named roles and reporting structures ("Our CISO, Jane Smith, reports quarterly to the Audit Committee")
- Specific frameworks by name (NIST CSF, ISO 27001, SOC 2, PCI-DSS)
- Concrete processes (penetration testing frequency, tabletop exercises)
- Quantification (dollar investment, headcount, incident counts, training completion rates)
- Third-party names or types of assessments
- Temporal specificity (dates, frequencies, durations)
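The marker lists above can be operationalized as a quick pre-screening heuristic. This is a minimal sketch with illustrative (uncalibrated) regex lists and a hypothetical `marker_profile` helper, useful for sanity-checking GenAI labels rather than as a classifier itself:

```python
import re

# Illustrative marker lists distilled from the boilerplate/substantive
# indicators above; patterns and thresholds are NOT calibrated.
CONDITIONAL = re.compile(r"\b(may|could|might)\b", re.IGNORECASE)
FRAMEWORKS = re.compile(r"\b(NIST|ISO 27001|SOC 2|PCI-DSS)\b")
ROLES = re.compile(r"\b(CISO|Chief Information Security Officer|Audit Committee)\b")
QUANTIFIED = re.compile(r"(\$[\d,.]+|\b\d{4}\b|\bquarterly\b|\bannually\b)")

def marker_profile(paragraph: str) -> dict:
    """Count boilerplate vs. substantive surface markers in one paragraph."""
    return {
        "conditional_hits": len(CONDITIONAL.findall(paragraph)),
        "framework_hits": len(FRAMEWORKS.findall(paragraph)),
        "role_hits": len(ROLES.findall(paragraph)),
        "quantified_hits": len(QUANTIFIED.findall(paragraph)),
    }

p = ("Our CISO reports quarterly to the Audit Committee under a program "
     "aligned with NIST CSF.")
profile = marker_profile(p)
```

A paragraph with zero framework/role/quantified hits and several conditional hits is a strong Level-1 candidate; disagreements between this heuristic and the GenAI label are cheap flags for human review.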
Mapping to NIST CSF 2.0
For academic grounding, our content categories map to NIST CSF 2.0 functions:
| Our Category | NIST CSF 2.0 |
|---|---|
| Board Governance | GOVERN (GV.OV, GV.RR) |
| Management Role | GOVERN (GV.RR, GV.RM) |
| Risk Management Process | IDENTIFY (ID.RA), GOVERN (GV.RM), PROTECT (all) |
| Third-Party Risk | GOVERN (GV.SC) |
| Incident Disclosure | DETECT, RESPOND, RECOVER |
| Strategy Integration | GOVERN (GV.OC, GV.RM) |
3. Data Acquisition
3.1 Extracting 10-K Item 1C
Recommended pipeline:
sec-edgar-downloader → edgar-crawler → paragraph segmentation → dataset
(bulk download) (parse Item 1C) (split into units)
Tools:
| Tool | Purpose | Install | Notes |
|---|---|---|---|
| `sec-edgar-downloader` | Bulk download 10-K filings by CIK | `pip install sec-edgar-downloader` | Pure downloader, no parsing |
| `edgar-crawler` | Extract specific item sections to JSON | `git clone github.com/lefterisloukas/edgar-crawler` | Best for bulk extraction; configure `['1C']` in items list |
| `edgartools` | Interactive exploration, XBRL parsing | `pip install edgartools` | `tenk['Item 1C']` accessor; great for prototyping |
| `sec-api` | Commercial API, zero parsing headaches | `pip install sec-api` | `extractorApi.get_section(url, "1C", "text")` — paid, free tier available |
EDGAR API requirements:
- Rate limit: 10 requests/second
- Required: custom `User-Agent` header with name and email (e.g., `"TeamName team@email.com"`)
- SEC blocks requests without a proper `User-Agent` (returns 403)
For iXBRL-tagged filings (2025+): Use edgartools XBRL parser to extract CYD taxonomy elements directly. This gives pre-structured data aligned with regulatory categories.
Fallback corpus: PleIAs/SEC on HuggingFace (373K 10-K full texts, CC0 license) — but sections are NOT pre-parsed; you must extract Item 1C yourself.
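A minimal stdlib sketch of a compliant EDGAR request (the team name and email are placeholders; in practice `sec-edgar-downloader` handles this header for you when you pass a company name and email to its `Downloader`):

```python
import urllib.request

USER_AGENT = "TeamName team@email.com"  # placeholder — SEC returns 403 without a real one

def edgar_request(url: str) -> urllib.request.Request:
    """Build an EDGAR request carrying the mandatory User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

# Example: browse Apple's 10-K filing index (CIK 0000320193)
req = edgar_request(
    "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany"
    "&CIK=0000320193&type=10-K&count=10"
)
```

Remember to throttle to at most 10 requests/second (e.g., `time.sleep(0.1)` between calls) when iterating over thousands of filings.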
3.2 Extracting 8-K Incident Disclosures
| Tool | Purpose | URL |
|---|---|---|
| `sec-8k-item105` | Extract Item 1.05 from 8-Ks, iXBRL + HTML fallback | github.com/JMousqueton/sec-8k-item105 |
| `SECurityTr8Ker` | Monitor SEC RSS for new cyber 8-Ks, Slack/Teams alerts | github.com/pancak3lullz/SECurityTr8Ker |
| Debevoise 8-K Tracker | Curated list with filing links, dates, amendments | debevoisedatablog.com/2024/03/06/cybersecurity-form-8-k-tracker/ |
| Board Cybersecurity Tracker | Links filings to MITRE ATT&CK, impact assessments | board-cybersecurity.com/incidents/tracker |
Critical: Must capture Item 1.05 AND Items 8.01/7.01 (post-May 2024 shift).
3.3 Paragraph Segmentation
Once Item 1C text is extracted, segment into paragraphs:
- Split on double newlines or `<p>` tags (depending on extraction format)
- Minimum paragraph length: 20 words (filter out headers, whitespace)
- Maximum paragraph length: 500 words (split longer blocks at sentence boundaries)
- Preserve metadata: company name, CIK, ticker, filing date, fiscal year
Expected yield: ~5-8 paragraphs per Item 1C disclosure × ~9,000 filings = ~50,000-70,000 paragraphs
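The segmentation rules above can be sketched as a single function (the function name and sentence-split regex are illustrative choices, not a fixed spec):

```python
import re

def segment_item_1c(text: str, min_words: int = 20, max_words: int = 500) -> list[str]:
    """Split extracted Item 1C text into paragraph units per the rules above."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):   # split on blank lines
        words = block.split()
        if len(words) < min_words:             # drop headers / stray whitespace
            continue
        if len(words) <= max_words:
            paragraphs.append(" ".join(words))
            continue
        # Long block: re-chunk at sentence boundaries up to max_words
        chunk: list[str] = []
        for sent in re.split(r"(?<=[.!?])\s+", block):
            if chunk and len(" ".join(chunk).split()) + len(sent.split()) > max_words:
                paragraphs.append(" ".join(" ".join(chunk).split()))
                chunk = []
            chunk.append(sent)
        if chunk:
            paragraphs.append(" ".join(" ".join(chunk).split()))
    return paragraphs
```

Carry the filing metadata (CIK, ticker, filing date, fiscal year) alongside each paragraph in a parallel record rather than embedding it in the text.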
3.4 Pre-Existing Datasets and Resources
| Resource | What It Is | URL |
|---|---|---|
| PleIAs/SEC | 373K full 10-K texts (CC0) | huggingface.co/datasets/PleIAs/SEC |
| EDGAR-CORPUS | 220K filings with sections pre-parsed (Apache 2.0) | huggingface.co/datasets/eloukas/edgar-corpus |
| Board Cybersecurity 23-Feature Analysis | Regex-based extraction of 23 governance/security features from 4,538 10-Ks | board-cybersecurity.com/research/insights/ |
| Gibson Dunn S&P 100 Survey | Detailed feature analysis of disclosure content | corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-... |
| Florackis et al. (2023) "Cybersecurity Risk" | Firm-level cyber risk measure from 10-K text, RFS publication | SSRN: 3725130, data companion: 4319606 |
| zeroshot/cybersecurity-corpus | General cybersecurity text (not SEC-specific, useful for DAPT) | huggingface.co/datasets/zeroshot/cybersecurity-corpus |
4. GenAI Labeling Pipeline
4.1 Multi-Model Consensus (EvasionBench Architecture)
We follow Ma et al. (2026, arXiv:2601.09142) — the EvasionBench pipeline designed for an almost identical task (ordinal classification of financial text). Their approach achieved Cohen's Kappa = 0.835 with human annotators.
Stage 1 — Dual Independent Annotation (all ~50K paragraphs):
- Annotator A: Claude Sonnet 4.6 (batch API — $1.50/$7.50 per M input/output tokens)
- Annotator B: Gemini 2.5 Flash ($0.30/$2.50 per M tokens)
- Architectural diversity (Anthropic vs. Google) minimizes correlated errors
- ~83% of paragraphs will have immediate agreement
Stage 2 — Judge Panel for Disagreements (~17% = ~8,500 cases):
- Judge 1: Claude Opus 4.6 (batch — $2.50/$12.50 per M tokens)
- Judge 2: GPT-5 (batch — $0.63/$5.00 per M tokens)
- Judge 3: Gemini 2.5 Pro (~$2-4/$12-18 per M tokens)
- Majority vote (2/3) resolves disagreements
- Anti-bias: randomize label presentation order
Stage 3 — Active Learning Pass:
- Cluster remaining low-confidence cases
- Human-review ~5% (~2,500 cases) to identify systematic errors
- Iterate rubric if needed, re-run affected subsets
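The three-stage resolution logic reduces to a small function. A minimal sketch (function and argument names are ours, not from EvasionBench):

```python
from collections import Counter

def resolve_label(ann_a, ann_b, judge_votes=None):
    """Stage 1: immediate agreement wins (~83% expected).
    Stage 2: 2-of-3 judge-panel majority on disagreements.
    Returns None when no majority emerges -> route to Stage 3 human review."""
    if ann_a == ann_b:
        return ann_a
    if judge_votes:
        label, count = Counter(judge_votes).most_common(1)[0]
        if count >= 2:
            return label
    return None
```

Randomizing label presentation order (the anti-bias note above) happens at prompt-construction time, before votes ever reach this function.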
4.2 Prompt Template
SYSTEM PROMPT:
You are an expert annotator classifying paragraphs from SEC cybersecurity
disclosures (10-K Item 1C and 8-K Item 1.05 filings).
For each paragraph, assign:
(a) content_category: exactly one of ["Board Governance", "Management Role",
"Risk Management Process", "Third-Party Risk", "Incident Disclosure",
"Strategy Integration", "None/Other"]
(b) specificity_level: integer 1-4
CONTENT CATEGORIES:
- Board Governance: Board/committee oversight of cybersecurity risks, briefing
frequency, board member cyber expertise
- Management Role: CISO/CTO/CIO identification, qualifications, reporting
structure, management committees
- Risk Management Process: Risk assessment methodology, framework adoption
(NIST, ISO, etc.), vulnerability management, monitoring, incident response
planning, tabletop exercises
- Third-Party Risk: Vendor/supplier risk oversight, external assessor engagement,
contractual security requirements, supply chain risk
- Incident Disclosure: Description of cybersecurity incidents, scope, timing,
impact, remediation actions
- Strategy Integration: Material impact on business strategy or financials,
cyber insurance, investment/resource allocation
- None/Other: Boilerplate introductions, legal disclaimers, forward-looking
statement warnings, non-cybersecurity content
SPECIFICITY SCALE:
1 - Generic Boilerplate: Could apply to any company. Conditional language
("may," "could"). No named entities.
Example: "We face cybersecurity risks that could materially affect our
business operations."
2 - Sector-Adapted: References industry context or named frameworks but no
firm-specific details.
Example: "We employ a cybersecurity framework aligned with the NIST
Cybersecurity Framework to manage cyber risk."
3 - Firm-Specific: Contains facts unique to this company — named roles,
committees, specific programs, reporting lines.
Example: "Our CISO reports quarterly to the Audit Committee on
cybersecurity risk posture and incident trends."
4 - Quantified-Verifiable: Includes metrics, dollar amounts, dates,
frequencies, third-party audit references, or independently verifiable facts.
Example: "Following the March 2024 incident affecting our payment systems,
we engaged CrowdStrike and implemented network segmentation at a cost of
$4.2M, completing remediation in Q3 2024."
BOUNDARY RULES:
- If torn between 1 and 2: "Does it name ANY framework, standard, or industry
term?" If yes → 2
- If torn between 2 and 3: "Does it mention anything unique to THIS company?"
If yes → 3
- If torn between 3 and 4: "Does it contain TWO OR MORE specific, verifiable
facts?" If yes → 4
Respond with valid JSON only. Include a brief reasoning field.
USER PROMPT:
Company: {company_name}
Filing Date: {filing_date}
Paragraph:
{paragraph_text}
Expected output:
{
"content_category": "Board Governance",
"specificity_level": 3,
"reasoning": "Identifies Audit Committee by name and describes quarterly briefing cadence, both firm-specific facts."
}
4.3 Practical Labeling Notes
- Always use Batch API. Both OpenAI and Anthropic offer 50% discount for async/batch processing (24-hour turnaround). No reason to use real-time.
- Prompt caching: The system prompt (~800 tokens) is identical for every request. With Anthropic's prompt caching, cached reads cost 10% of base price. Combined with batch discount = 5% of standard price.
- Structured output mode: Use JSON mode / structured outputs on all providers. Reduces parsing errors by ~90%.
- Reasoning models (o3, extended thinking): Use ONLY as judges for disagreement cases, not as primary annotators. They're overkill for clear-cut classification and expensive due to reasoning token consumption.
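Even with structured output mode, responses should be validated against the rubric schema before entering the dataset. A minimal validator sketch (the function name and re-queue convention are ours):

```python
import json

CATEGORIES = {"Board Governance", "Management Role", "Risk Management Process",
              "Third-Party Risk", "Incident Disclosure", "Strategy Integration",
              "None/Other"}

def parse_annotation(raw: str):
    """Validate one model response against the rubric schema.
    Returns the parsed dict, or None to signal a re-queue."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if obj.get("content_category") not in CATEGORIES:
        return None
    if obj.get("specificity_level") not in {1, 2, 3, 4}:
        return None
    return obj
```

Rejected responses are cheap to re-submit in the next batch; silently coercing malformed labels is what corrupts downstream Kappa estimates.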
4.4 Gold Set Protocol
Non-negotiable for publication quality.
1. Sample 300-500 paragraphs, stratified by:
   - Expected content category (ensure all 7 represented)
   - Expected specificity level (ensure all 4 represented)
   - Industry (financial services, tech, healthcare, manufacturing)
   - Filing year (FY2023 vs FY2024)
2. Two team members independently label the full gold set
3. Compute:
   - Cohen's Kappa (nominal content categories)
   - Krippendorff's Alpha (ordinal specificity scale)
   - Per-class confusion matrices
   - Target: Kappa > 0.75 ("substantial agreement")
4. Adjudicate disagreements with a third team member
5. Run the full MMC pipeline on the gold set and compare against the adjudicated human labels
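The Kappa target can be sanity-checked with a from-scratch implementation (in practice, use `sklearn.metrics.cohen_kappa_score` and the `krippendorff` package; this dependency-free version is for verifying the pipeline wiring):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Note that Kappa is undefined when chance agreement is 1 (a degenerate single-class gold set), which the stratified sampling in step 1 rules out.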
5. Model Strategy
5.1 Primary: SEC-ModernBERT-large
This model does not exist publicly. Building it is a core contribution.
Base model: answerdotai/ModernBERT-large
- 395M parameters
- 8,192-token native context (vs. 512 for DeBERTa-v3-large)
- RoPE + alternating local/global attention + FlashAttention
- 2-4x faster than DeBERTa-v3-large
- Apache 2.0 license
- GLUE: 90.4 (only 1 point behind DeBERTa-v3-large's 91.4)
Step 1 — Domain-Adaptive Pre-Training (DAPT):
Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":
- Training corpus: 200-500M tokens of SEC filings (from PleIAs/SEC or your own EDGAR download). Include 10-Ks, 10-Qs, 8-Ks, proxy statements.
- MLM objective: 30% masking rate (ModernBERT convention)
- Learning rate: ~5e-5 (much lower than from-scratch pre-training)
- Hardware (RTX 3090): bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32
- VRAM estimate: ~12-15GB at batch=4, seq=2048 with gradient checkpointing — fits on 3090
Evidence DAPT works:
- Gururangan et al. (2020): consistent improvements across all tested domains
- Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
- Scaling-law analysis on SEC filings (arXiv:2512.12384): consistent improvement with largest gains in first 200M tokens
- Databricks customer report: 70% → 95% accuracy with domain-specific pre-training
Step 2 — Classification Fine-Tuning:
Fine-tune SEC-ModernBERT-large on the 50K labeled paragraphs:
- Sequence length: 2048 tokens (captures full regulatory paragraphs that 512-token models truncate)
- Two classification heads: content_category (7-class softmax) + specificity_level (4-class ordinal or softmax)
- Add supervised contrastive loss (SCL): Combine standard cross-entropy with SCL that pulls same-class embeddings together. Gunel et al. (2020) showed +0.5-1.5% improvement, especially for rare/imbalanced classes.
- VRAM: ~11-13GB at batch=8, seq=2048 in bf16 — comfortable on 3090
- The 3090 supports bf16 natively via Ampere Tensor Cores. Use `bf16=True` in the HuggingFace `Trainer`. No loss scaling needed (unlike fp16).
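A minimal numpy sketch of the SCL term referenced above (the Gunel et al. 2020 formulation: for each anchor, pull same-class embeddings together against all others). Real training would compute this in PyTorch on the encoder's pooled embeddings and add it to cross-entropy with a weighting hyperparameter; names here are illustrative:

```python
import numpy as np

def supcon_loss(embeddings, labels, tau: float = 0.1) -> float:
    """Supervised contrastive loss over L2-normalized embeddings.
    Anchors with no same-class positive in the batch are skipped."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / tau                      # scaled cosine similarities
    n = len(labels)
    loss, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        denom = sum(np.exp(sim[i, a]) for a in range(n) if a != i)
        loss += -sum(np.log(np.exp(sim[i, p]) / denom) for p in positives) / len(positives)
        anchors += 1
    return loss / anchors
```

The intuition check: a batch whose same-class embeddings coincide should score a much lower loss than one where classes are interleaved.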
5.2 Dark Horse: NeoBERT
chandar-lab/NeoBERT
- 250M parameters (100M fewer than ModernBERT-large, 185M fewer than DeBERTa-v3-large)
- 4,096-token context
- SwiGLU, RoPE, Pre-RMSNorm, FlashAttention
- GLUE: 89.0 (close to DeBERTa-v3-large's 91.4)
- MTEB: 51.3 (crushes everything else — ModernBERT-large is 46.9)
- MIT license
- Requires `trust_remote_code=True`
- Almost nobody is using it for domain-specific tasks
Same DAPT + fine-tuning pipeline as ModernBERT-large, with even less VRAM.
5.3 Baseline: DeBERTa-v3-large
microsoft/deberta-v3-large
- 304M backbone + 131M embedding = ~435M total
- 512-token native context (can push to ~1024)
- Disentangled attention + ELECTRA-style RTD pre-training
- GLUE: 91.4 — still the highest among all encoders
- MIT license
- Weakness: no long context support, completely fails at retrieval tasks
Include as baseline to show improvement from (a) long context and (b) DAPT.
5.4 Ablation Design
| Experiment | Model | Context | DAPT | SCL | Purpose |
|---|---|---|---|---|---|
| Baseline | DeBERTa-v3-large | 512 | No | No | "Standard" approach per syllabus |
| + Long context | ModernBERT-large | 2048 | No | No | Shows context window benefit |
| + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | Shows DAPT benefit |
| + Contrastive | SEC-ModernBERT-large | 2048 | Yes | Yes | Shows SCL benefit |
| Efficiency | NeoBERT (+ DAPT) | 2048 | Yes | Yes | 40% fewer params, comparable? |
| Ensemble | SEC-ModernBERT + DeBERTa | mixed | mixed | — | Maximum performance |
The ensemble averages logits from SEC-ModernBERT-large (long context, domain-adapted) and DeBERTa-v3-large (highest raw NLU). Their architecturally different attention mechanisms mean uncorrelated errors.
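Logit averaging itself is a few lines. A minimal sketch (equal weights shown; a tuned weight on the holdout set is a reasonable variant, and both models must score the same paragraph order):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_modernbert, logits_deberta, weight: float = 0.5):
    """Weighted average of the two models' logits, then argmax per example."""
    avg = weight * logits_modernbert + (1 - weight) * logits_deberta
    return softmax(avg).argmax(axis=-1)
```

Averaging logits (pre-softmax) rather than probabilities keeps a confidently-wrong model from dominating via saturated probabilities.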
5.5 Training Framework
- Encoder fine-tuning: HuggingFace `transformers` + `Trainer` with `AutoModelForSequenceClassification`
- DAPT continued pre-training: HuggingFace `transformers` with `DataCollatorForLanguageModeling`
- SCL implementation: Custom training loop, or modify `Trainer` with a dual loss
- Few-shot prototyping: `SetFit` (sentence-transformers based) for a rapid baseline in <30 seconds
Key reference: Phil Schmid's ModernBERT fine-tuning tutorial: https://www.philschmid.de/fine-tune-modern-bert-in-2025
5.6 Domain-Specific Encoder Models (for comparison only)
These exist but are all BERT-base (110M params, 512 context) — architecturally outdated:
| Model | HuggingFace ID | Domain | Params |
|---|---|---|---|
| SEC-BERT | `nlpaueb/sec-bert-base` | 260K 10-K filings | 110M |
| SEC-BERT-SHAPE | `nlpaueb/sec-bert-shape` | Same, with number normalization | 110M |
| FinBERT | `ProsusAI/finbert` | Financial sentiment | 110M |
| Legal-BERT | `nlpaueb/legal-bert-base-uncased` | 12GB legal text | 110M |
| SecureBERT | arXiv:2204.02685 | Cybersecurity text | 110M |
Our DAPT approach on a modern architecture (ModernBERT-large or NeoBERT) will outperform all of these. Include SEC-BERT as an additional baseline if time permits.
6. Evaluation & Validation
6.1 Required Metrics (from syllabus)
| Metric | Target | Notes |
|---|---|---|
| Macro-F1 on human holdout | Report per-class and overall | Minimum 1.2K holdout examples |
| Per-class F1 | Identify weak categories | Expect "None/Other" to be noisiest |
| Krippendorff's Alpha | > 0.67 (adequate), > 0.75 (good) | GenAI labels vs. human gold set |
| Calibration plots | Reliability diagrams | For probabilistic outputs (softmax) |
| Robustness splits | Report by time period, industry, filing size | FY2023 vs FY2024; GICS sector; word count quartiles |
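Macro-F1 (per-class F1 averaged with equal class weight) is the headline number; a dependency-free reference implementation is useful for cross-checking `sklearn.metrics.f1_score(average="macro")`:

```python
def macro_f1(y_true, y_pred) -> float:
    """Per-class F1 averaged with equal weight per class (macro-F1)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)
```

Because macro-F1 weights rare classes equally, a weak "None/Other" or "Incident Disclosure" class drags the overall number down — which is exactly the behavior we want for an imbalanced 7-class problem.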
6.2 Downstream Validity Tests
These demonstrate that the classifier's predictions correlate with real-world outcomes:
Test 1 — Breach Prediction (strongest):
- Do firms with lower specificity scores subsequently appear in breach databases?
- Cross-reference with:
  - Privacy Rights Clearinghouse (80K+ breaches; Mendeley dataset provides ticker/CIK matching: doi.org/10.17632/w33nhh3282.1)
  - VCDB (8K+ incidents, VERIS schema: github.com/vz-risk/VCDB)
  - Board Cybersecurity Incident Tracker (direct SEC filing links: board-cybersecurity.com/incidents/tracker)
  - CISA KEV Catalog (known exploited vulnerabilities: cisa.gov/known-exploited-vulnerabilities-catalog)
Test 2 — Market Reaction (if time permits):
- Event study: abnormal returns in [-1, +3] window around 8-K Item 1.05 filing
- Does prior Item 1C disclosure quality predict magnitude of reaction?
- Small sample (~55 incidents) but high signal
- Regression: CAR = f(specificity_score, incident_severity, firm_size, industry)
Test 3 — Known-Groups Validity (easy, always include):
- Do regulated industries (financial services under NYDFS, healthcare under HIPAA) produce systematically higher-specificity disclosures?
- Do larger firms (by market cap) have more specific disclosures?
- These are expected results — confirming them validates the measure
Test 4 — Boilerplate Index (easy, always include):
- Compute cosine similarity of each company's Item 1C to the industry-median disclosure
- Does our specificity score inversely correlate with this similarity measure?
- This is an independent, construct-free validation of the "uniqueness" dimension
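The boilerplate index only needs a document-similarity function; in practice TF-IDF vectors (e.g., sklearn's `TfidfVectorizer`) against the industry-median disclosure are the right tool, but the computation reduces to a bag-of-words cosine, sketched here:

```python
import math
from collections import Counter

def cosine_sim(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity between two disclosure texts.
    Values > 0.8 flag near-boilerplate overlap (per the markers above)."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

A strong negative correlation between this similarity-to-industry-median and our specificity score is the expected validation result.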
6.3 External Benchmark
Per syllabus: "include an external benchmark approach (i.e., previous best practice)."
- Board Cybersecurity's 23-feature regex extraction is the natural benchmark. Their binary (present/absent) feature coding is the prior best practice. Our classifier should capture everything their regex captures plus the quality/specificity dimension they cannot measure.
- Florackis et al. (2023) cybersecurity risk measure from Item 1A text is another comparison — different section (1A vs 1C), different methodology (dictionary vs. classifier), different era (pre-rule vs. post-rule).
7. Release Artifacts
By project end, publish:
- HuggingFace Dataset: Extracted Item 1C paragraphs with labels — first public dataset of its kind
- SEC-ModernBERT-large: Domain-adapted model weights — first SEC-specific ModernBERT
- Fine-tuned classifiers: Content category + specificity models, ready to deploy
- Labeling rubric + prompt templates: Reusable for future SEC disclosure research
- Extraction pipeline code: EDGAR → structured paragraphs → labeled dataset
- Evaluation notebook: All metrics, ablations, validation tests
8. 3-Week Schedule (6 People)
Team Roles
| Role | Person(s) | Primary Responsibility |
|---|---|---|
| Data Lead | Person A | EDGAR extraction pipeline, paragraph segmentation, data cleaning |
| Data Support | Person B | 8-K extraction, breach database cross-referencing, dataset QA |
| Labeling Lead | Person C | Rubric refinement, GenAI prompt engineering, MMC pipeline orchestration |
| Annotation | Person D | Gold set human labeling, inter-rater reliability, active learning review |
| Model Lead | Person E | DAPT pre-training, classification fine-tuning, ablation experiments |
| Eval & Writing | Person F | Validation tests, metrics computation, final presentation, documentation |
Week 1: Data + Rubric
| Day | Person A (Data Lead) | Person B (Data Support) | Person C (Labeling Lead) | Person D (Annotation) | Person E (Model Lead) | Person F (Eval & Writing) |
|---|---|---|---|---|---|---|
| Mon | Set up EDGAR extraction pipeline (edgar-crawler + sec-edgar-downloader) | Set up 8-K extraction (sec-8k-item105) | Draft labeling rubric v1 from SEC rule | Read SEC rule + Gibson Dunn survey | Download ModernBERT-large, set up training env | Outline evaluation plan, identify breach databases |
| Tue | Begin bulk 10-K download (FY2023 cycle) | Extract all 8-K cyber filings (Items 1.05, 8.01, 7.01) | Pilot rubric on 30 paragraphs with Claude Opus | Pilot rubric on same 30 paragraphs independently | Download PleIAs/SEC corpus, prepare DAPT data | Download PRC Mendeley dataset, VCDB, set up cross-ref |
| Wed | Continue download (FY2024 cycle), begin Item 1C parsing | Build company metadata table (CIK → ticker → GICS sector → market cap) | Compare pilot labels with Person D, revise rubric boundary rules | Compute initial inter-rater agreement, flag problem areas | Begin DAPT pre-training (SEC-ModernBERT-large, ~2-3 days on 3090) | Map VCDB incidents to SEC filers by name matching |
| Thu | Paragraph segmentation pipeline, quality checks | Merge 8-K incidents with Board Cybersecurity Tracker data | Rubric v2 finalized; set up batch API calls for dual annotation | Begin gold set sampling (300-500 paragraphs, stratified) | DAPT continues (monitor loss, checkpoint) | Draft presentation outline |
| Fri | Milestone: Full paragraph corpus ready (~50K+ paragraphs) | Milestone: 8-K incident dataset complete | Launch Stage 1 dual annotation (Sonnet + Gemini Flash) on full corpus | Continue gold set labeling (target: finish 150/300) | DAPT continues | Milestone: Evaluation framework + breach cross-ref ready |
Week 2: Labeling + Training
| Day | Person A | Person B | Person C | Person D | Person E | Person F |
|---|---|---|---|---|---|---|
| Mon | Data cleaning — fix extraction errors, handle edge cases | Assist Person D with gold set labeling (second annotator) | Monitor dual annotation results (should be ~60% complete) | Continue gold set labeling, begin second pass | DAPT finishes; begin DeBERTa-v3-large baseline fine-tuning | Compute gold set inter-rater reliability (Kappa, Alpha) |
| Tue | Build train/holdout split logic (stratified by industry, year, specificity) | Continue gold set second-annotator pass | Dual annotation complete → extract disagreements (~17%) | Finish gold set, adjudicate disagreements with Person C | Baseline results in; begin ModernBERT-large (no DAPT) fine-tuning | Analyze gold set confusion patterns, recommend rubric tweaks |
| Wed | Final dataset assembly | Assist Person C with judge panel setup | Launch Stage 2 judge panel (Opus + GPT-5 + Gemini Pro) on disagreements | Run MMC pipeline on gold set, compare with human labels | ModernBERT-large done; begin SEC-ModernBERT-large fine-tuning | Milestone: Gold set validated, Kappa computed |
| Thu | Prepare HuggingFace dataset card | Begin active learning — cluster low-confidence cases | Judge panel results in; assemble final labeled dataset | Human-review ~500 low-confidence cases from active learning | SEC-ModernBERT-large done; begin NeoBERT experiment | Robustness split analysis (by industry, year, filing size) |
| Fri | Milestone: Labeled dataset finalized (~50K paragraphs) | Milestone: Active learning pass complete | QA final labels — spot-check 100 random samples | Assist Person E with evaluation | Begin ensemble experiment (SEC-ModernBERT + DeBERTa) | Milestone: All baseline + ablation training complete |
Week 3: Evaluation + Presentation
| Day | Person A | Person B | Person C | Person D | Person E | Person F |
|---|---|---|---|---|---|---|
| Mon | Publish dataset to HuggingFace | Run breach prediction validation (PRC + VCDB cross-ref) | Write labeling methodology section | Calibration plots for all models | Final ensemble tuning; publish model weights to HuggingFace | Compile all metrics into evaluation tables |
| Tue | Write data acquisition section | Run known-groups validity (industry, size effects) | Write GenAI labeling section | Boilerplate index validation (cosine similarity) | Write model strategy section | Draft full results section |
| Wed | Code cleanup, README for extraction pipeline | Market reaction analysis if feasible (optional) | Review/edit all written sections | Create figures: confusion matrices, calibration plots | Review/edit model section | Assemble presentation slides |
| Thu | Full team: review presentation, rehearse, polish | | | | | |
| Fri | Presentation day | | | | | |
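Person D's boilerplate index check (Tuesday above) can be prototyped as maximum cosine similarity between a disclosure paragraph and a library of known template paragraphs. The bag-of-words version below is a minimal sketch; the real pipeline would more likely use TF-IDF weighting or sentence embeddings, and the template texts here are invented examples.

```python
import math
import re
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between two texts over raw word-count vectors."""
    va, vb = (Counter(re.findall(r"[a-z']+", t.lower())) for t in (a, b))
    dot = sum(va[w] * vb[w] for w in va)
    na, nb = (math.sqrt(sum(c * c for c in v.values())) for v in (va, vb))
    return dot / (na * nb) if na and nb else 0.0

def boilerplate_index(paragraph: str, templates: list[str]) -> float:
    """Similarity of a paragraph to its closest known boilerplate template.

    Scores near 1.0 suggest copy-paste boilerplate; low scores suggest
    firm-specific language.
    """
    return max(cosine_sim(paragraph, t) for t in templates)
```

A validity check would then be that paragraphs labeled low-specificity by the classifier score systematically higher on this index than high-specificity ones.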
Critical Path & Dependencies
Week 1:

    Data extraction (A,B) ──────────────────┐
    Rubric design (C,D) ───→ Pilot test ───→ Rubric v2 ──→ GenAI labeling launch (Fri)
    DAPT pre-training (E) ──────────────────────────────────→ (continues into Week 2)
    Eval framework (F) ─────────────────────────────────────→ (ready for Week 2)

Week 2:

    GenAI labeling (C) ───→ Judge panel ───→ Active learning ───→ Final labels (Fri)
    Gold set (D + B) ──────────────────────→ Validated (Wed)
    Fine-tuning experiments (E) ───→ Baseline → ModernBERT → SEC-ModernBERT → NeoBERT → Ensemble
    Metrics (F) ───────────────────→ Robustness splits

Week 3:

    Validation tests (B,D,F) ───→ Breach prediction, known-groups, boilerplate index
    Writing (all) ──────────────→ Sections → Review → Presentation
    Release (A,E) ──────────────→ HuggingFace dataset + model weights
9. Budget
| Item | Cost |
|---|---|
| GenAI labeling — Stage 1 dual annotation (50K × 2 models, batch) | ~$115 |
| GenAI labeling — Stage 2 judge panel (~8.5K × 3 models, batch) | ~$55 |
| Prompt caching savings | -$30 to -$40 |
| SEC EDGAR data | $0 (public domain) |
| Breach databases (PRC open data, VCDB, CISA KEV) | $0 |
| Compute (RTX 3090, already owned) | $0 |
| Total | ~$130-140 (~$170 before caching savings) |
For comparison, human annotation at $0.50/label would cost $25,000+ for a single-annotated dataset of this size and $75,000+ for triple annotation.
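The budget lines above reduce to simple arithmetic; all dollar figures below are the table's own estimates, not vendor quotes.

```python
# Back-of-envelope check of the budget table.
N_PARAGRAPHS = 50_000

stage1 = 115                 # Stage 1: 50K paragraphs x 2 annotator models (batch)
stage2 = 55                  # Stage 2: ~8.5K disagreements x 3 judge models (batch)
caching_savings = (30, 40)   # prompt-caching savings, low/high estimates

total_lo = stage1 + stage2 - caching_savings[1]   # best case
total_hi = stage1 + stage2 - caching_savings[0]   # worst case
print(f"GenAI labeling total: ~${total_lo}-{total_hi}")  # ~$130-140

# Human-annotation comparison at $0.50 per label.
human_single = 0.50 * N_PARAGRAPHS   # single-annotated
human_triple = 3 * human_single      # triple-annotated
```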
10. Reference Links
SEC Rule & Guidance
- SEC Final Rule 33-11216 (PDF)
- SEC Fact Sheet
- SEC Small Business Compliance Guide
- CYD iXBRL Taxonomy Guide (PDF)
Law Firm Surveys & Analysis
- Gibson Dunn S&P 100 Survey (Harvard Law Forum)
- PwC First Wave of 10-K Cyber Disclosures
- Debevoise 8-K Lessons Learned
- Greenberg Traurig 2025 Trends Update
- Known Trends: First Year of 8-K Filings
- NYU: Lessons Learned from 8-K Reporting
Data Extraction Tools
- edgar-crawler (GitHub)
- edgartools (GitHub)
- sec-edgar-downloader (PyPI)
- sec-8k-item105 (GitHub)
- SECurityTr8Ker (GitHub)
- SEC EDGAR APIs
- SEC EDGAR Full-Text Search
Datasets
- PleIAs/SEC — 373K 10-K texts (HuggingFace, CC0)
- EDGAR-CORPUS — 220K filings, sections parsed (HuggingFace, Apache 2.0)
- Board Cybersecurity 23-Feature Analysis
- Board Cybersecurity Incident Tracker
- PRC Mendeley Breach Dataset (with tickers)
- VCDB (GitHub)
- CISA KEV Catalog
- zeroshot/cybersecurity-corpus (HuggingFace)
Models
- ModernBERT-large (HuggingFace, Apache 2.0)
- ModernBERT-base (HuggingFace, Apache 2.0)
- NeoBERT (HuggingFace, MIT)
- DeBERTa-v3-large (HuggingFace, MIT)
- SEC-BERT (HuggingFace)
- ProsusAI FinBERT (HuggingFace)
- EvasionBench Eva-4B-V2 (HuggingFace)
Key Papers
- Ringel (2023), "Creating Synthetic Experts with Generative AI" — SSRN:4542949
- Ludwig et al. (2026), "Extracting Consumer Insight from Text" — arXiv:2602.15312
- Ma et al. (2026), "EvasionBench" — arXiv:2601.09142
- Florackis et al. (2023), "Cybersecurity Risk" — SSRN:3725130
- Gururangan et al. (2020), "Don't Stop Pretraining" — arXiv:2004.10964
- ModernBERT paper — arXiv:2412.13663
- NeoBERT paper — arXiv:2502.19587
- ModernBERT vs DeBERTa-v3 comparison — arXiv:2504.08716
- Patent domain ModernBERT DAPT — arXiv:2509.14926
- SEC filing scaling laws for continued pre-training — arXiv:2512.12384
- Gunel et al. (2020), Supervised Contrastive Learning for fine-tuning — OpenReview
- Phil Schmid, "Fine-tune classifier with ModernBERT in 2025" — philschmid.de
- Berkman et al. (2018), Cybersecurity disclosure quality scoring
- Li, No, and Boritz (2023), BERT-based classification of cybersecurity disclosures
- Scalable 10-K Analysis with LLMs — arXiv:2409.17581
- SecureBERT — arXiv:2204.02685
- Gilardi et al. (2023), "ChatGPT Outperforms Crowd-Workers" (PNAS) — arXiv:2303.15056
- Pangakis et al. (2023), "Automated Annotation Requires Validation" — arXiv:2306.00176