
Project 3: SEC Cybersecurity Disclosure Quality Classifier

Capstone 2026 — BUSI488/COMP488 — Team Knowledge Transfer

Project: Build a validated, reusable classifier that labels SEC cybersecurity disclosures by content category and specificity level, then fine-tune an open-weights model for deployment at scale.

Methodology: Ringel (2023) "Synthetic Experts" pipeline — use frontier LLMs to generate training labels, then distill into a small open-weights encoder model.

Why this project: No HuggingFace dataset of extracted Item 1C disclosures exists. No trained classifier for cybersecurity disclosure quality exists. No domain-adapted ModernBERT on SEC filings exists. The iXBRL CYD taxonomy just went live (Dec 2024). We produce three publishable artifacts: a novel dataset, a labeling methodology, and a SOTA classifier.


Table of Contents

  1. Regulatory Background
  2. Labeling Rubric
  3. Data Acquisition
  4. GenAI Labeling Pipeline
  5. Model Strategy
  6. Evaluation & Validation
  7. Release Artifacts
  8. 3-Week Schedule (6 People)
  9. Budget
  10. Reference Links

1. Regulatory Background

The Rule: SEC Release 33-11216 (July 2023)

The SEC adopted final rules requiring public companies to disclose cybersecurity risk management, strategy, governance, and material incidents. This created a massive new text corpus with natural variation in quality — perfect for classification.

Full rule PDF: https://www.sec.gov/files/rules/final/2023/33-11216.pdf
Fact sheet: https://www.sec.gov/files/33-11216-fact-sheet.pdf

Item 1C — Annual Disclosure (10-K)

Appears as Regulation S-K Item 106, reported in Item 1C of the 10-K. Two mandated subsections:

Item 106(b) — Risk Management and Strategy:

  1. Processes for assessing, identifying, and managing material cybersecurity risks
  2. Whether/how cybersecurity processes integrate into overall enterprise risk management (ERM)
  3. Whether the company engages external assessors, consultants, or auditors
  4. Processes to oversee/identify risks from third-party service providers
  5. Whether cybersecurity risks (including prior incidents) have materially affected or are reasonably likely to affect business strategy, results, or financial condition

Item 106(c) — Governance:

Board Oversight (106(c)(1)):

  • Description of board's oversight of cybersecurity risks
  • Identification of responsible board committee/subcommittee
  • Processes by which the board/committee is informed about risks

Management's Role (106(c)(2)):

  • Which management positions/committees are responsible
  • Relevant expertise of those persons
  • How management monitors prevention, detection, mitigation, and remediation
  • Whether and how frequently management reports to the board

Key design note: The SEC uses "describe" — it does not prescribe specific items. The enumerated sub-items are non-exclusive suggestions. This principles-based approach creates natural variation in specificity and content, which is exactly what our rubric captures.

Item 1.05 — Incident Disclosure (8-K)

Required within 4 business days of determining a cybersecurity incident is material:

  1. Material aspects of the nature, scope, and timing of the incident
  2. Material impact or reasonably likely material impact on the registrant

Key nuances:

  • The 4-day clock starts at the materiality determination, not the incident itself
  • Companies explicitly do NOT need to disclose technical details that would impede response/remediation
  • The U.S. Attorney General can delay disclosure by up to 120 days when disclosure would pose a substantial risk to national security or public safety
  • Companies must amend the 8-K when new material information becomes available

The May 2024 shift: After SEC Director Erik Gerding clarified that Item 1.05 is only for material incidents, companies pivoted from Item 1.05 to Items 8.01/7.01 for non-material disclosures:

  • Pre-guidance: 72% used Item 1.05, 28% used 8.01/7.01
  • Post-guidance: 34% used Item 1.05, 66% used 8.01/7.01

Our extraction must capture all three item types.

Compliance Timeline

| Date | Milestone |
| --- | --- |
| Jul 26, 2023 | Rule adopted |
| Sep 5, 2023 | Rule effective |
| Dec 15, 2023 | Item 1C required in 10-Ks (FY ending on/after this date) |
| Dec 18, 2023 | Item 1.05 required in 8-Ks |
| Jun 15, 2024 | Item 1.05 required for smaller reporting companies |
| Dec 15, 2024 | iXBRL tagging of Item 106 (CYD taxonomy) required |
| Dec 18, 2024 | iXBRL tagging of 8-K Item 1.05 required |

iXBRL CYD Taxonomy

The SEC published the Cybersecurity Disclosure (CYD) Taxonomy on Sep 16, 2024. Starting with filings after Dec 15, 2024, Item 1C disclosures are tagged in Inline XBRL using the cyd prefix. This means 2025 filings can be parsed programmatically via XBRL rather than HTML scraping.

Taxonomy schema: http://xbrl.sec.gov/cyd/2024
Taxonomy guide: https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf

Corpus Size

| Filing Type | Estimated Count (as of early 2026) |
| --- | --- |
| 10-K with Item 1C (FY2023 cycle) | ~4,500 |
| 10-K with Item 1C (FY2024 cycle) | ~4,500 |
| 8-K cybersecurity incidents | ~80 filings (55 incidents + amendments) |
| Total filings | ~9,000-10,000 |
| Estimated paragraphs (from Item 1C) | ~50,000-80,000 |

2. Labeling Rubric

Dimension 1: Content Category (single-label per paragraph)

Derived directly from the SEC rule structure. Each paragraph receives exactly one category:

| Category | SEC Basis | What It Covers | Example Markers |
| --- | --- | --- | --- |
| Board Governance | 106(c)(1) | Board/committee oversight, briefing frequency, board cyber expertise | "Audit Committee," "Board of Directors oversees," "quarterly briefings" |
| Management Role | 106(c)(2) | CISO/CTO identification, qualifications, reporting structure | "Chief Information Security Officer," "reports to," "years of experience" |
| Risk Management Process | 106(b) | Assessment/identification processes, ERM integration, framework references | "NIST CSF," "ISO 27001," "risk assessment," "vulnerability management" |
| Third-Party Risk | 106(b) | Vendor oversight, external assessors/consultants, supply chain risk | "third-party," "service providers," "penetration testing by," "external auditors" |
| Incident Disclosure | 8-K 1.05 | Nature/scope/timing of incidents, material impact, remediation | "unauthorized access," "detected," "incident," "remediation," "impacted" |
| Strategy Integration | 106(b)(2) | Material impact on business strategy, cyber insurance, resource allocation | "business strategy," "insurance," "investment," "material," "financial condition" |
| None/Other | — | Boilerplate intros, legal disclaimers, non-cybersecurity content | Forward-looking statement disclaimers, general risk language |

Dimension 2: Specificity (4-point ordinal per paragraph)

Grounded in Berkman et al. (2018), Gibson Dunn surveys, and PwC quality tiers:

| Level | Label | Definition | Decision Test |
| --- | --- | --- | --- |
| 1 | Generic Boilerplate | Could apply to any company. Conditional language ("may," "could"). No named entities. Passive voice. | "Could I paste this into a different company's filing unchanged?" → Yes |
| 2 | Sector-Adapted | References industry context or named frameworks (NIST, ISO) but no firm-specific detail. | "Does this name something specific but not unique to THIS company?" → Yes |
| 3 | Firm-Specific | Names roles (CISO by name), committees, reporting lines, specific programs, or processes unique to the firm. Active voice with accountability. | "Does this contain at least one fact unique to THIS company?" → Yes |
| 4 | Quantified-Verifiable | Includes metrics, dollar amounts, dates, frequencies, third-party audit references, or independently verifiable facts. Multiple firm-specific facts with operational detail. | "Could an outsider verify a specific claim in this paragraph?" → Yes |

Boundary rules for annotators:

  • If torn between 1 and 2: "Does it name ANY framework, standard, or industry term?" → Yes = 2
  • If torn between 2 and 3: "Does it mention anything unique to THIS company?" → Yes = 3
  • If torn between 3 and 4: "Does it contain TWO OR MORE specific, verifiable facts?" → Yes = 4

Important: EvasionBench (Ma et al., 2026) found that a 5-level ordinal scale failed (kappa < 0.5) and had to be collapsed to 3 levels. Pilot test this 4-level scale on 50 paragraphs early. Be prepared to merge levels 1-2 or 3-4 if inter-annotator agreement is poor.

Boilerplate vs. Substantive Markers (from the literature)

Boilerplate indicators:

  • Conditional language: "may," "could," "might"
  • Generic risk statements without company-specific context
  • No named individuals, committees, or frameworks
  • Identical language across same-industry filings (cosine similarity > 0.8)
  • Passive voice: "cybersecurity risks are managed"

Substantive indicators:

  • Named roles and reporting structures ("Our CISO, Jane Smith, reports quarterly to the Audit Committee")
  • Specific frameworks by name (NIST CSF, ISO 27001, SOC 2, PCI-DSS)
  • Concrete processes (penetration testing frequency, tabletop exercises)
  • Quantification (dollar investment, headcount, incident counts, training completion rates)
  • Third-party names or types of assessments
  • Temporal specificity (dates, frequencies, durations)
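These markers can be operationalized as a quick screening heuristic for triage and QA, not as a classifier. The marker lists, regexes, and field names below are illustrative assumptions, not part of the rubric:

```python
import re

# Hypothetical marker patterns distilled from the indicator lists above.
CONDITIONAL = re.compile(r"\b(may|could|might)\b", re.IGNORECASE)
FRAMEWORKS = re.compile(r"\b(NIST(?:\s+CSF)?|ISO\s*27001|SOC\s*2|PCI[- ]?DSS)\b")
QUANTIFIED = re.compile(r"(\$[\d.,]+|[0-9]+%|\bquarterly\b|\bannually\b)", re.IGNORECASE)

def marker_screen(paragraph):
    """Count boilerplate vs. substantive surface markers in one paragraph."""
    return {
        "conditional_hits": len(CONDITIONAL.findall(paragraph)),
        "framework_hits": len(FRAMEWORKS.findall(paragraph)),
        "quantified_hits": len(QUANTIFIED.findall(paragraph)),
        "passive_managed": bool(
            re.search(r"\brisks are managed\b", paragraph, re.IGNORECASE)
        ),
    }

print(marker_screen("We may face cybersecurity risks that could harm us."))
print(marker_screen("Our CISO reports quarterly; we align with NIST CSF and ISO 27001."))
```

Paragraphs with only conditional hits are Level 1 candidates; framework hits without firm-specific facts point to Level 2. Useful for stratifying the gold-set sample and sanity-checking GenAI labels.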

Mapping to NIST CSF 2.0

For academic grounding, our content categories map to NIST CSF 2.0 functions:

| Our Category | NIST CSF 2.0 |
| --- | --- |
| Board Governance | GOVERN (GV.OV, GV.RR) |
| Management Role | GOVERN (GV.RR, GV.RM) |
| Risk Management Process | IDENTIFY (ID.RA), GOVERN (GV.RM), PROTECT (all) |
| Third-Party Risk | GOVERN (GV.SC) |
| Incident Disclosure | DETECT, RESPOND, RECOVER |
| Strategy Integration | GOVERN (GV.OC, GV.RM) |

3. Data Acquisition

3.1 Extracting 10-K Item 1C

Recommended pipeline:

sec-edgar-downloader  →  edgar-crawler  →  paragraph segmentation  →  dataset
  (bulk download)       (parse Item 1C)    (split into units)

Tools:

| Tool | Purpose | Install | Notes |
| --- | --- | --- | --- |
| sec-edgar-downloader | Bulk download 10-K filings by CIK | `pip install sec-edgar-downloader` | Pure downloader, no parsing |
| edgar-crawler | Extract specific item sections to JSON | `git clone github.com/lefterisloukas/edgar-crawler` | Best for bulk extraction; configure `['1C']` in items list |
| edgartools | Interactive exploration, XBRL parsing | `pip install edgartools` | `tenk['Item 1C']` accessor; great for prototyping |
| sec-api | Commercial API, zero parsing headaches | `pip install sec-api` | `extractorApi.get_section(url, "1C", "text")` — paid, free tier available |

EDGAR API requirements:

  • Rate limit: 10 requests/second
  • Required: Custom User-Agent header with name and email (e.g., "TeamName team@email.com")
  • SEC blocks requests without proper User-Agent (returns 403)
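A sketch of a compliant request builder plus a naive client-side throttle. The identity string is a placeholder (swap in your real team name and email), and the `Throttle` class is our own helper, not part of any SEC tooling:

```python
import time
import urllib.request

# SEC requires a descriptive User-Agent with a contact email;
# requests without one are rejected with HTTP 403.
HEADERS = {"User-Agent": "TeamName team@email.com"}  # placeholder identity

class Throttle:
    """Naive client-side limiter: at most `rate` calls per second."""
    def __init__(self, rate=10.0):
        self.min_interval = 1.0 / rate
        self.last = 0.0

    def wait(self):
        # Sleep just long enough to keep under the rate limit.
        delta = time.monotonic() - self.last
        if delta < self.min_interval:
            time.sleep(self.min_interval - delta)
        self.last = time.monotonic()

def edgar_request(url):
    """Build (not send) a request carrying the required header."""
    return urllib.request.Request(url, headers=HEADERS)

req = edgar_request("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany")
```

In the bulk download loop, call `Throttle(10).wait()` before each `urlopen` to stay under EDGAR's 10 requests/second ceiling.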

For iXBRL-tagged filings (2025+): Use edgartools XBRL parser to extract CYD taxonomy elements directly. This gives pre-structured data aligned with regulatory categories.

Fallback corpus: PleIAs/SEC on HuggingFace (373K 10-K full texts, CC0 license) — but sections are NOT pre-parsed; you must extract Item 1C yourself.

3.2 Extracting 8-K Incident Disclosures

| Tool | Purpose | URL |
| --- | --- | --- |
| sec-8k-item105 | Extract Item 1.05 from 8-Ks, iXBRL + HTML fallback | github.com/JMousqueton/sec-8k-item105 |
| SECurityTr8Ker | Monitor SEC RSS for new cyber 8-Ks, Slack/Teams alerts | github.com/pancak3lullz/SECurityTr8Ker |
| Debevoise 8-K Tracker | Curated list with filing links, dates, amendments | debevoisedatablog.com/2024/03/06/cybersecurity-form-8-k-tracker/ |
| Board Cybersecurity Tracker | Links filings to MITRE ATT&CK, impact assessments | board-cybersecurity.com/incidents/tracker |

Critical: Must capture Item 1.05 AND Items 8.01/7.01 (post-May 2024 shift).

3.3 Paragraph Segmentation

Once Item 1C text is extracted, segment into paragraphs:

  • Split on double newlines or <p> tags (depending on extraction format)
  • Minimum paragraph length: 20 words (filter out headers, whitespace)
  • Maximum paragraph length: 500 words (split longer blocks at sentence boundaries)
  • Preserve metadata: company name, CIK, ticker, filing date, fiscal year

Expected yield: ~5-8 paragraphs per Item 1C disclosure × ~9,000 filings = ~50,000-70,000 paragraphs
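A minimal sketch of these segmentation rules. The sentence-boundary regex is a simplification (a production pipeline might use a proper sentence splitter), and the function names are ours:

```python
import re

MIN_WORDS, MAX_WORDS = 20, 500

def segment(text):
    """Split extracted Item 1C text into paragraph units per the rules above."""
    out = []
    for block in re.split(r"\n\s*\n", text):       # split on blank lines
        block = " ".join(block.split())            # normalize whitespace
        words = block.split()
        if len(words) < MIN_WORDS:                 # drop headers/fragments
            continue
        if len(words) <= MAX_WORDS:
            out.append(block)
            continue
        # Oversized block: split at sentence boundaries, greedily repack.
        sentences = re.split(r"(?<=[.!?])\s+", block)
        chunk = []
        for s in sentences:
            if chunk and len(" ".join(chunk).split()) + len(s.split()) > MAX_WORDS:
                out.append(" ".join(chunk))
                chunk = []
            chunk.append(s)
        if chunk and len(" ".join(chunk).split()) >= MIN_WORDS:
            out.append(" ".join(chunk))
    return out
```

Metadata (company, CIK, ticker, filing date, fiscal year) should be attached to each emitted unit by the caller so it survives into the labeled dataset.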

3.4 Pre-Existing Datasets and Resources

| Resource | What It Is | URL |
| --- | --- | --- |
| PleIAs/SEC | 373K full 10-K texts (CC0) | huggingface.co/datasets/PleIAs/SEC |
| EDGAR-CORPUS | 220K filings with sections pre-parsed (Apache 2.0) | huggingface.co/datasets/eloukas/edgar-corpus |
| Board Cybersecurity 23-Feature Analysis | Regex-based extraction of 23 governance/security features from 4,538 10-Ks | board-cybersecurity.com/research/insights/ |
| Gibson Dunn S&P 100 Survey | Detailed feature analysis of disclosure content | corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-... |
| Florackis et al. (2023) "Cybersecurity Risk" | Firm-level cyber risk measure from 10-K text, RFS publication | SSRN: 3725130, data companion: 4319606 |
| zeroshot/cybersecurity-corpus | General cybersecurity text (not SEC-specific, useful for DAPT) | huggingface.co/datasets/zeroshot/cybersecurity-corpus |

4. GenAI Labeling Pipeline

4.1 Multi-Model Consensus (EvasionBench Architecture)

We follow Ma et al. (2026, arXiv:2601.09142) — the EvasionBench pipeline designed for an almost identical task (ordinal classification of financial text). Their approach achieved Cohen's Kappa = 0.835 with human annotators.

Stage 1 — Dual Independent Annotation (all ~50K paragraphs):

  • Annotator A: Claude Sonnet 4.6 (batch API — $1.50/$7.50 per M input/output tokens)
  • Annotator B: Gemini 2.5 Flash ($0.30/$2.50 per M tokens)
  • Architectural diversity (Anthropic vs. Google) minimizes correlated errors
  • ~83% of paragraphs will have immediate agreement

Stage 2 — Judge Panel for Disagreements (~17% = ~8,500 cases):

  • Judge 1: Claude Opus 4.6 (batch — $2.50/$12.50 per M tokens)
  • Judge 2: GPT-5 (batch — $0.63/$5.00 per M tokens)
  • Judge 3: Gemini 2.5 Pro (~$2-4/$12-18 per M tokens)
  • Majority vote (2/3) resolves disagreements
  • Anti-bias: randomize label presentation order

Stage 3 — Active Learning Pass:

  • Cluster remaining low-confidence cases
  • Human-review ~5% (~2,500 cases) to identify systematic errors
  • Iterate rubric if needed, re-run affected subsets
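With the model calls abstracted away, the Stage 1/Stage 2 control flow reduces to a few lines. A sketch using hypothetical label dicts (the real pipeline would also carry confidence scores, reasoning strings, and logging):

```python
from collections import Counter

def resolve(annotator_a, annotator_b, judges=None):
    """Stage 1: accept on agreement; Stage 2: 2-of-3 judge majority otherwise.

    Each label is a dict: {"content_category": str, "specificity_level": int}.
    Returns (label, source); label is None when human review is needed.
    """
    if annotator_a == annotator_b:
        return annotator_a, "stage1_agreement"
    if not judges:
        return None, "needs_judges"
    votes = Counter(
        (j["content_category"], j["specificity_level"]) for j in judges
    )
    (cat, level), n = votes.most_common(1)[0]
    if n >= 2:  # majority of the three judges
        return {"content_category": cat, "specificity_level": level}, "stage2_majority"
    return None, "needs_human_review"  # 3-way split -> Stage 3 active learning
```

Three-way judge splits fall through to the Stage 3 active-learning queue rather than being force-resolved.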

4.2 Prompt Template

SYSTEM PROMPT:
You are an expert annotator classifying paragraphs from SEC cybersecurity
disclosures (10-K Item 1C and 8-K Item 1.05 filings).

For each paragraph, assign:
(a) content_category: exactly one of ["Board Governance", "Management Role",
    "Risk Management Process", "Third-Party Risk", "Incident Disclosure",
    "Strategy Integration", "None/Other"]
(b) specificity_level: integer 1-4

CONTENT CATEGORIES:
- Board Governance: Board/committee oversight of cybersecurity risks, briefing
  frequency, board member cyber expertise
- Management Role: CISO/CTO/CIO identification, qualifications, reporting
  structure, management committees
- Risk Management Process: Risk assessment methodology, framework adoption
  (NIST, ISO, etc.), vulnerability management, monitoring, incident response
  planning, tabletop exercises
- Third-Party Risk: Vendor/supplier risk oversight, external assessor engagement,
  contractual security requirements, supply chain risk
- Incident Disclosure: Description of cybersecurity incidents, scope, timing,
  impact, remediation actions
- Strategy Integration: Material impact on business strategy or financials,
  cyber insurance, investment/resource allocation
- None/Other: Boilerplate introductions, legal disclaimers, forward-looking
  statement warnings, non-cybersecurity content

SPECIFICITY SCALE:
1 - Generic Boilerplate: Could apply to any company. Conditional language
    ("may," "could"). No named entities.
    Example: "We face cybersecurity risks that could materially affect our
    business operations."

2 - Sector-Adapted: References industry context or named frameworks but no
    firm-specific details.
    Example: "We employ a cybersecurity framework aligned with the NIST
    Cybersecurity Framework to manage cyber risk."

3 - Firm-Specific: Contains facts unique to this company — named roles,
    committees, specific programs, reporting lines.
    Example: "Our CISO reports quarterly to the Audit Committee on
    cybersecurity risk posture and incident trends."

4 - Quantified-Verifiable: Includes metrics, dollar amounts, dates,
    frequencies, third-party audit references, or independently verifiable facts.
    Example: "Following the March 2024 incident affecting our payment systems,
    we engaged CrowdStrike and implemented network segmentation at a cost of
    $4.2M, completing remediation in Q3 2024."

BOUNDARY RULES:
- If torn between 1 and 2: "Does it name ANY framework, standard, or industry
  term?" If yes → 2
- If torn between 2 and 3: "Does it mention anything unique to THIS company?"
  If yes → 3
- If torn between 3 and 4: "Does it contain TWO OR MORE specific, verifiable
  facts?" If yes → 4

Respond with valid JSON only. Include a brief reasoning field.

USER PROMPT:
Company: {company_name}
Filing Date: {filing_date}
Paragraph:
{paragraph_text}

Expected output:

{
  "content_category": "Board Governance",
  "specificity_level": 3,
  "reasoning": "Identifies Audit Committee by name and describes quarterly briefing cadence, both firm-specific facts."
}
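Model responses should be validated against this schema before they enter the dataset. A minimal validator sketch (function name is ours; error handling deliberately simplified):

```python
import json

CATEGORIES = {"Board Governance", "Management Role", "Risk Management Process",
              "Third-Party Risk", "Incident Disclosure", "Strategy Integration",
              "None/Other"}

def parse_annotation(raw):
    """Return the parsed label dict, or None if the response is malformed."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if obj.get("content_category") not in CATEGORIES:
        return None
    if obj.get("specificity_level") not in (1, 2, 3, 4):
        return None
    return obj
```

Responses that return None go back into the retry queue; with structured-output mode enabled this path should be rare.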

4.3 Practical Labeling Notes

  • Always use the Batch API. Both OpenAI and Anthropic offer a 50% discount for async/batch processing (24-hour turnaround). There is no reason to use real-time calls here.
  • Prompt caching: The system prompt (~800 tokens) is identical for every request. With Anthropic's prompt caching, cached reads cost 10% of base price. Combined with batch discount = 5% of standard price.
  • Structured output mode: Use JSON mode / structured outputs on all providers. Reduces parsing errors by ~90%.
  • Reasoning models (o3, extended thinking): Use ONLY as judges for disagreement cases, not as primary annotators. They're overkill for clear-cut classification and expensive due to reasoning token consumption.

4.4 Gold Set Protocol

Non-negotiable for publication quality.

  1. Sample 300-500 paragraphs, stratified by:

    • Expected content category (ensure all 7 represented)
    • Expected specificity level (ensure all 4 represented)
    • Industry (financial services, tech, healthcare, manufacturing)
    • Filing year (FY2023 vs FY2024)
  2. Two team members independently label the full gold set

  3. Compute:

    • Cohen's Kappa (binary/nominal categories)
    • Krippendorff's Alpha (ordinal specificity scale)
    • Per-class confusion matrices
    • Target: Kappa > 0.75 ("substantial agreement")
  4. Adjudicate disagreements with a third team member

  5. Run the full MMC pipeline on the gold set and compare
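Cohen's Kappa is simple enough to compute from first principles; a sketch for the two human passes is below (for the ordinal Krippendorff's Alpha, use an existing implementation, e.g. the `krippendorff` package on PyPI). The function name is ours:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Kappa = (p_o - p_e) / (1 - p_e): observed vs. chance agreement.

    Undefined when p_e == 1 (both annotators use a single identical class).
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal class rates.
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Compute it per dimension (content category and specificity separately); the 0.75 target applies to each.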


5. Model Strategy

5.1 Primary: SEC-ModernBERT-large

This model does not exist publicly. Building it is a core contribution.

Base model: answerdotai/ModernBERT-large

  • 395M parameters
  • 8,192-token native context (vs. 512 for DeBERTa-v3-large)
  • RoPE + alternating local/global attention + FlashAttention
  • 2-4x faster than DeBERTa-v3-large
  • Apache 2.0 license
  • GLUE: 90.4 (only 1 point behind DeBERTa-v3-large's 91.4)

Step 1 — Domain-Adaptive Pre-Training (DAPT):

Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":

  • Training corpus: 200-500M tokens of SEC filings (from PleIAs/SEC or your own EDGAR download). Include 10-Ks, 10-Qs, 8-Ks, proxy statements.
  • MLM objective: 30% masking rate (ModernBERT convention)
  • Learning rate: ~5e-5 (much lower than from-scratch pre-training)
  • Hardware (RTX 3090): bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32
  • VRAM estimate: ~12-15GB at batch=4, seq=2048 with gradient checkpointing — fits on 3090
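The 30% masking objective is easy to illustrate on a toy token list. This is a pure-Python stand-in for what `DataCollatorForLanguageModeling(mlm_probability=0.3)` does during DAPT (the real collator also applies the standard 80/10/10 mask/random/keep replacement scheme, omitted here):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_for_mlm(tokens, mask_rate=0.30, seed=0):
    """Mask ~30% of positions; labels keep originals only at masked spots."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK_TOKEN)
            labels.append(tok)       # model must predict the original here
        else:
            inputs.append(tok)
            labels.append(None)      # position ignored by the MLM loss
    return inputs, labels
```

The loss is computed only over the masked positions, which is why a higher masking rate gives the model more supervision signal per sequence.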

Evidence DAPT works:

  • Gururangan et al. (2020): consistent improvements across all tested domains
  • Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
  • Scaling-law analysis on SEC filings (arXiv:2512.12384): consistent improvement with largest gains in first 200M tokens
  • Databricks customer report: 70% → 95% accuracy with domain-specific pre-training

Step 2 — Classification Fine-Tuning:

Fine-tune SEC-ModernBERT-large on the 50K labeled paragraphs:

  • Sequence length: 2048 tokens (captures full regulatory paragraphs that 512-token models truncate)
  • Two classification heads: content_category (7-class softmax) + specificity_level (4-class ordinal or softmax)
  • Add supervised contrastive loss (SCL): Combine standard cross-entropy with SCL that pulls same-class embeddings together. Gunel et al. (2020) showed +0.5-1.5% improvement, especially for rare/imbalanced classes.
  • VRAM: ~11-13GB at batch=8, seq=2048 in bf16 — comfortable on 3090
  • 3090 supports bf16 natively via Ampere Tensor Cores. Use bf16=True in HuggingFace Trainer. No loss scaling needed (unlike fp16).
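The SCL term can be sketched in NumPy on toy embeddings. The formulation follows the Khosla et al. (2020) form that Gunel et al. adopt; the toy data and function name are ours, and in practice this would sit alongside cross-entropy inside a custom `Trainer.compute_loss`:

```python
import numpy as np

def supcon_loss(emb, labels, tau=0.1):
    """Supervised contrastive loss: pull same-class embeddings together.

    For each anchor i, positives are other samples sharing its label;
    the denominator ranges over all other samples in the batch.
    """
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # L2-normalize
    sim = z @ z.T / tau                                    # scaled similarities
    n = len(labels)
    total, count = 0.0, 0
    for i in range(n):
        pos = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not pos:
            continue  # anchors with no same-class partner contribute nothing
        denom = np.sum(np.exp([sim[i, a] for a in range(n) if a != i]))
        total += -np.mean([sim[i, p] - np.log(denom) for p in pos])
        count += 1
    return total / max(count, 1)
```

Batches where same-class embeddings cluster tightly score lower than batches where classes are interleaved, which is exactly the gradient pressure that helps the rare categories.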

5.2 Dark Horse: NeoBERT

chandar-lab/NeoBERT

  • 250M parameters (~145M fewer than ModernBERT-large, ~185M fewer than DeBERTa-v3-large)
  • 4,096-token context
  • SwiGLU, RoPE, Pre-RMSNorm, FlashAttention
  • GLUE: 89.0 (close to DeBERTa-v3-large's 91.4)
  • MTEB: 51.3 (crushes everything else — ModernBERT-large is 46.9)
  • MIT license
  • Requires trust_remote_code=True
  • Almost nobody is using it for domain-specific tasks

Same DAPT + fine-tuning pipeline as ModernBERT-large, with even less VRAM.

5.3 Baseline: DeBERTa-v3-large

microsoft/deberta-v3-large

  • 304M backbone + 131M embedding = ~435M total
  • 512-token native context (can push to ~1024)
  • Disentangled attention + ELECTRA-style RTD pre-training
  • GLUE: 91.4 — still the highest among all encoders
  • MIT license
  • Weakness: no long context support, completely fails at retrieval tasks

Include as baseline to show improvement from (a) long context and (b) DAPT.

5.4 Ablation Design

| Experiment | Model | Context | DAPT | SCL | Purpose |
| --- | --- | --- | --- | --- | --- |
| Baseline | DeBERTa-v3-large | 512 | No | No | "Standard" approach per syllabus |
| + Long context | ModernBERT-large | 2048 | No | No | Shows context window benefit |
| + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | Shows DAPT benefit |
| + Contrastive | SEC-ModernBERT-large | 2048 | Yes | Yes | Shows SCL benefit |
| Efficiency | NeoBERT (+ DAPT) | 2048 | Yes | Yes | ~37% fewer params, comparable? |
| Ensemble | SEC-ModernBERT + DeBERTa | mixed | mixed | mixed | Maximum performance |

The ensemble averages logits from SEC-ModernBERT-large (long context, domain-adapted) and DeBERTa-v3-large (highest raw NLU). Their architecturally different attention mechanisms mean uncorrelated errors.
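Logit averaging itself is a one-liner. A toy sketch with made-up logits (2 paragraphs × 7 content categories; in practice the logits come from each fine-tuned model's forward pass on the same paragraph):

```python
import numpy as np

def ensemble_predict(logits_modernbert, logits_deberta):
    """Average per-class logits from the two encoders, then argmax."""
    avg = (logits_modernbert + logits_deberta) / 2.0
    return avg.argmax(axis=-1)

# Toy example: 2 paragraphs x 7 content categories.
a = np.array([[2.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0]])
b = np.array([[1.5, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 2.5, 2.0, 0.0, 0.0, 0.0, 0.0]])
print(ensemble_predict(a, b))  # -> [0 2]
```

Note the second paragraph: DeBERTa alone would flip toward class 1, but ModernBERT's stronger class-2 logit carries the average, which is the uncorrelated-errors benefit in action.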

5.5 Training Framework

  • Encoder fine-tuning: HuggingFace transformers + Trainer with AutoModelForSequenceClassification
  • DAPT continued pre-training: HuggingFace transformers with DataCollatorForLanguageModeling
  • SCL implementation: Custom training loop or modify Trainer with dual loss
  • Few-shot prototyping: SetFit (sentence-transformers based) for rapid baseline in <30 seconds

Key reference: Phil Schmid's ModernBERT fine-tuning tutorial: https://www.philschmid.de/fine-tune-modern-bert-in-2025

5.6 Domain-Specific Encoder Models (for comparison only)

These exist but are all BERT-base (110M params, 512 context) — architecturally outdated:

| Model | HuggingFace ID | Domain | Params |
| --- | --- | --- | --- |
| SEC-BERT | nlpaueb/sec-bert-base | 260K 10-K filings | 110M |
| SEC-BERT-SHAPE | nlpaueb/sec-bert-shape | Same, with number normalization | 110M |
| FinBERT | ProsusAI/finbert | Financial sentiment | 110M |
| Legal-BERT | nlpaueb/legal-bert-base-uncased | 12GB legal text | 110M |
| SecureBERT | arXiv:2204.02685 | Cybersecurity text | 110M |

Our DAPT approach on a modern architecture (ModernBERT-large or NeoBERT) will outperform all of these. Include SEC-BERT as an additional baseline if time permits.


6. Evaluation & Validation

6.1 Required Metrics (from syllabus)

| Metric | Target | Notes |
| --- | --- | --- |
| Macro-F1 on human holdout | Report per-class and overall | Minimum 1.2K holdout examples |
| Per-class F1 | Identify weak categories | Expect "None/Other" to be noisiest |
| Krippendorff's Alpha | > 0.67 (adequate), > 0.75 (good) | GenAI labels vs. human gold set |
| Calibration plots | Reliability diagrams | For probabilistic outputs (softmax) |
| Robustness splits | Report by time period, industry, filing size | FY2023 vs FY2024; GICS sector; word count quartiles |

6.2 Downstream Validity Tests

These demonstrate that the classifier's predictions correlate with real-world outcomes:

Test 1 — Breach Prediction (strongest):

  • Do firms with lower specificity scores subsequently appear in breach databases?
  • Cross-reference with:
    • Privacy Rights Clearinghouse (80K+ breaches; Mendeley dataset provides ticker/CIK matching: doi.org/10.17632/w33nhh3282.1)
    • VCDB (8K+ incidents, VERIS schema: github.com/vz-risk/VCDB)
    • Board Cybersecurity Incident Tracker (direct SEC filing links: board-cybersecurity.com/incidents/tracker)
    • CISA KEV Catalog (known exploited vulnerabilities: cisa.gov/known-exploited-vulnerabilities-catalog)

Test 2 — Market Reaction (if time permits):

  • Event study: abnormal returns in [-1, +3] window around 8-K Item 1.05 filing
  • Does prior Item 1C disclosure quality predict magnitude of reaction?
  • Small sample (~55 incidents) but high signal
  • Regression: CAR = f(specificity_score, incident_severity, firm_size, industry)

Test 3 — Known-Groups Validity (easy, always include):

  • Do regulated industries (financial services under NYDFS, healthcare under HIPAA) produce systematically higher-specificity disclosures?
  • Do larger firms (by market cap) have more specific disclosures?
  • These are expected results — confirming them validates the measure

Test 4 — Boilerplate Index (easy, always include):

  • Compute cosine similarity of each company's Item 1C to the industry-median disclosure
  • Does our specificity score inversely correlate with this similarity measure?
  • This is an independent, construct-free validation of the "uniqueness" dimension
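A bag-of-words sketch of the similarity computation (raw term-frequency vectors for brevity; the real index would use TF-IDF over full Item 1C texts compared against the industry-median disclosure):

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two disclosure texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

boiler = "we face cybersecurity risks that could materially affect our business"
peer   = "we face cybersecurity risks that could materially affect our operations"
print(round(cosine(boiler, peer), 2))  # -> 0.9
```

A near-duplicate pair like the one above lands right at the 0.8+ boilerplate threshold used in the rubric section; firm-specific disclosures should sit well below it.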

6.3 External Benchmark

Per syllabus: "include an external benchmark approach (i.e., previous best practice)."

  • Board Cybersecurity's 23-feature regex extraction is the natural benchmark. Their binary (present/absent) feature coding is the prior best practice. Our classifier should capture everything their regex captures plus the quality/specificity dimension they cannot measure.
  • Florackis et al. (2023) cybersecurity risk measure from Item 1A text is another comparison — different section (1A vs 1C), different methodology (dictionary vs. classifier), different era (pre-rule vs. post-rule).

7. Release Artifacts

By project end, publish:

  1. HuggingFace Dataset: Extracted Item 1C paragraphs with labels — first public dataset of its kind
  2. SEC-ModernBERT-large: Domain-adapted model weights — first SEC-specific ModernBERT
  3. Fine-tuned classifiers: Content category + specificity models, ready to deploy
  4. Labeling rubric + prompt templates: Reusable for future SEC disclosure research
  5. Extraction pipeline code: EDGAR → structured paragraphs → labeled dataset
  6. Evaluation notebook: All metrics, ablations, validation tests

8. 3-Week Schedule (6 People)

Team Roles

| Role | Person(s) | Primary Responsibility |
| --- | --- | --- |
| Data Lead | Person A | EDGAR extraction pipeline, paragraph segmentation, data cleaning |
| Data Support | Person B | 8-K extraction, breach database cross-referencing, dataset QA |
| Labeling Lead | Person C | Rubric refinement, GenAI prompt engineering, MMC pipeline orchestration |
| Annotation | Person D | Gold set human labeling, inter-rater reliability, active learning review |
| Model Lead | Person E | DAPT pre-training, classification fine-tuning, ablation experiments |
| Eval & Writing | Person F | Validation tests, metrics computation, final presentation, documentation |

Week 1: Data + Rubric

| Day | Person A (Data Lead) | Person B (Data Support) | Person C (Labeling Lead) | Person D (Annotation) | Person E (Model Lead) | Person F (Eval & Writing) |
| --- | --- | --- | --- | --- | --- | --- |
| Mon | Set up EDGAR extraction pipeline (edgar-crawler + sec-edgar-downloader) | Set up 8-K extraction (sec-8k-item105) | Draft labeling rubric v1 from SEC rule | Read SEC rule + Gibson Dunn survey | Download ModernBERT-large, set up training env | Outline evaluation plan, identify breach databases |
| Tue | Begin bulk 10-K download (FY2023 cycle) | Extract all 8-K cyber filings (Items 1.05, 8.01, 7.01) | Pilot rubric on 30 paragraphs with Claude Opus | Pilot rubric on same 30 paragraphs independently | Download PleIAs/SEC corpus, prepare DAPT data | Download PRC Mendeley dataset, VCDB, set up cross-ref |
| Wed | Continue download (FY2024 cycle), begin Item 1C parsing | Build company metadata table (CIK → ticker → GICS sector → market cap) | Compare pilot labels with Person D, revise rubric boundary rules | Compute initial inter-rater agreement, flag problem areas | Begin DAPT pre-training (SEC-ModernBERT-large, ~2-3 days on 3090) | Map VCDB incidents to SEC filers by name matching |
| Thu | Paragraph segmentation pipeline, quality checks | Merge 8-K incidents with Board Cybersecurity Tracker data | Rubric v2 finalized; set up batch API calls for dual annotation | Begin gold set sampling (300-500 paragraphs, stratified) | DAPT continues (monitor loss, checkpoint) | Draft presentation outline |
| Fri | Milestone: Full paragraph corpus ready (~50K+ paragraphs) | Milestone: 8-K incident dataset complete | Launch Stage 1 dual annotation (Sonnet + Gemini Flash) on full corpus | Continue gold set labeling (target: finish 150/300) | DAPT continues | Milestone: Evaluation framework + breach cross-ref ready |

Week 2: Labeling + Training

| Day | Person A | Person B | Person C | Person D | Person E | Person F |
| --- | --- | --- | --- | --- | --- | --- |
| Mon | Data cleaning — fix extraction errors, handle edge cases | Assist Person D with gold set labeling (second annotator) | Monitor dual annotation results (should be ~60% complete) | Continue gold set labeling, begin second pass | DAPT finishes; begin DeBERTa-v3-large baseline fine-tuning | Compute gold set inter-rater reliability (Kappa, Alpha) |
| Tue | Build train/holdout split logic (stratified by industry, year, specificity) | Continue gold set second-annotator pass | Dual annotation complete → extract disagreements (~17%) | Finish gold set, adjudicate disagreements with Person C | Baseline results in; begin ModernBERT-large (no DAPT) fine-tuning | Analyze gold set confusion patterns, recommend rubric tweaks |
| Wed | Final dataset assembly | Assist Person C with judge panel setup | Launch Stage 2 judge panel (Opus + GPT-5 + Gemini Pro) on disagreements | Run MMC pipeline on gold set, compare with human labels | ModernBERT-large done; begin SEC-ModernBERT-large fine-tuning | Milestone: Gold set validated, Kappa computed |
| Thu | Prepare HuggingFace dataset card | Begin active learning — cluster low-confidence cases | Judge panel results in; assemble final labeled dataset | Human-review ~500 low-confidence cases from active learning | SEC-ModernBERT-large done; begin NeoBERT experiment | Robustness split analysis (by industry, year, filing size) |
| Fri | Milestone: Labeled dataset finalized (~50K paragraphs) | Milestone: Active learning pass complete | QA final labels — spot-check 100 random samples | Assist Person E with evaluation | Begin ensemble experiment (SEC-ModernBERT + DeBERTa) | Milestone: All baseline + ablation training complete |

Week 3: Evaluation + Presentation

| Day | Person A | Person B | Person C | Person D | Person E | Person F |
| --- | --- | --- | --- | --- | --- | --- |
| Mon | Publish dataset to HuggingFace | Run breach prediction validation (PRC + VCDB cross-ref) | Write labeling methodology section | Calibration plots for all models | Final ensemble tuning; publish model weights to HuggingFace | Compile all metrics into evaluation tables |
| Tue | Write data acquisition section | Run known-groups validity (industry, size effects) | Write GenAI labeling section | Boilerplate index validation (cosine similarity) | Write model strategy section | Draft full results section |
| Wed | Code cleanup, README for extraction pipeline | Market reaction analysis if feasible (optional) | Review/edit all written sections | Create figures: confusion matrices, calibration plots | Review/edit model section | Assemble presentation slides |
| Thu | Full team: review presentation, rehearse, polish | | | | | |
| Fri | Presentation day | | | | | |

Critical Path & Dependencies

Week 1:
  Data extraction (A,B) ──────────────────┐
  Rubric design (C,D) ───→ Pilot test ───→ Rubric v2 ──→ GenAI labeling launch (Fri)
  DAPT pre-training (E) ──────────────────────────────────→ (continues into Week 2)
  Eval framework (F) ─────────────────────────────────────→ (ready for Week 2)

Week 2:
  GenAI labeling (C) ───→ Judge panel ───→ Active learning ───→ Final labels (Fri)
  Gold set (D + B) ──────────────────────→ Validated (Wed)
  Fine-tuning experiments (E) ───→ Baseline → ModernBERT → SEC-ModernBERT → NeoBERT → Ensemble
  Metrics (F) ───────────────────→ Robustness splits

Week 3:
  Validation tests (B,D,F) ───→ Breach prediction, known-groups, boilerplate index
  Writing (all) ──────────────→ Sections → Review → Presentation
  Release (A,E) ──────────────→ HuggingFace dataset + model weights

9. Budget

| Item | Cost |
| --- | --- |
| GenAI labeling — Stage 1 dual annotation (50K × 2 models, batch) | ~$115 |
| GenAI labeling — Stage 2 judge panel (~8.5K × 3 models, batch) | ~$55 |
| Prompt caching savings | -$30 to -$40 |
| SEC EDGAR data | $0 (public domain) |
| Breach databases (PRC open data, VCDB, CISA KEV) | $0 |
| Compute (RTX 3090, already owned) | $0 |
| Total | ~$130-170 |

For comparison, human annotation at $0.50/label would cost $25,000+ for single-annotated, $75,000+ for triple-annotated.


10. Reference Links

SEC Rule & Guidance

Law Firm Surveys & Analysis

Data Extraction Tools

Datasets

Models

Key Papers

Methodological Playbook