docs & reference files

This commit is contained in:
Joey Eamigh 2026-03-31 16:27:47 -04:00
parent 32cd5ecfa8
commit 96246d0197
No known key found for this signature in database
GPG Key ID: CE8C05DFFC53C9CB
7 changed files with 5666 additions and 13 deletions


@@ -795,7 +795,24 @@ The TAPT corpus is 72K Item 1C paragraphs (~10M tokens) — 50x smaller than the
**Whole-word masking and tokenization:** Whole-word masking requires `offset_mapping` from the tokenizer to determine word boundaries. This is incompatible with DAPT's concatenate-and-chunk approach (which destroys offset_mapping by merging documents). TAPT tokenizes each paragraph individually with truncation, preserving offset_mapping. The data collator handles dynamic padding per batch. This is a different code path from DAPT's concatenation, but the data justifies it: paragraphs are natural self-contained units, unlike DAPT's long filings that must be chunked.
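The word-boundary recovery can be sketched without any tokenizer dependency. The function below is an illustrative stand-in (not the project's collator): given a paragraph and the `(start, end)` character spans a fast tokenizer returns via `return_offsets_mapping=True`, it groups token indices into whole words, treating a token as a continuation when its span abuts the previous one with no intervening whitespace:

```python
def group_whole_words(text, offset_mapping):
    """Group token indices into whole words using character offsets.

    offset_mapping: list of (start, end) character spans, one per token,
    as returned by Hugging Face fast tokenizers. A token starts a new word
    when there is a gap or whitespace before its span; otherwise it
    continues the previous word (e.g. a BPE subword piece).
    """
    words, current = [], []
    prev_end = None
    for i, (start, end) in enumerate(offset_mapping):
        if start == end:  # special tokens ([CLS], [SEP]) have empty spans
            continue
        new_word = (
            prev_end is None
            or start > prev_end            # gap between tokens => new word
            or text[start - 1].isspace()   # preceding char is whitespace
        )
        if new_word and current:
            words.append(current)
            current = []
        current.append(i)
        prev_end = end
    if current:
        words.append(current)
    return words
```

Fast tokenizers also expose `word_ids()`, which provides equivalent grouping directly; the sketch just makes the offset logic explicit.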
**Training time:** ~2,139 steps/epoch × 5 epochs = ~10,695 total steps. 50 minutes on the RTX 3090 at ~3.56 steps/s (averaged over full run including torch.compile warmup).
### TAPT Results
| Metric | Value |
|--------|-------|
| Epochs | 5 |
| Total steps | 10,695 |
| Training time | 50 minutes |
| Initial loss | 1.46 |
| Final train loss (avg) | 0.6428 |
| Final eval loss | 1.0754 |
| Final perplexity | 2.11 |
| Throughput | 114 samples/s, 3.56 steps/s |
Loss dropped from 1.46 → 1.08 over 5 epochs. For comparison, DAPT ended at eval loss 0.72 with standard subword masking at the same 30% rate — the gap reflects the harder whole-word masking objective (no subword hints), not a weaker model. The model learns to predict masked domain terms ("CISO", "materiality", "tabletop") from surrounding paragraph context alone, which is exactly the inductive bias TAPT is designed to create.
The TAPT checkpoint is saved at `checkpoints/tapt/modernbert-large/final/` and is ready for fine-tuning.
### TAPT Launch — Whole-Word Masking Bugs
@@ -851,8 +868,8 @@ Only nano's portion ($21.24) of the first run was wasted — the gemini and grok
| Documentation + narrative | ~2h | Codebook updates, narrative writing, technical guide updates |
| Labelapp build + infrastructure | ~8h | Monorepo restructure, Next.js app, quiz/warmup/labeling flows, BIBD assignment, sampling, Docker deployment, timer + migration infrastructure |
| DAPT pre-training | ~14.5h GPU | 1 epoch on 500M tokens, RTX 3090. Two sessions (resumed from checkpoint-1280). |
| TAPT debugging + pre-training | ~2h dev + ~50min GPU | 4 bugs in transformers whole-word masking + Python 3.14 rollback. Training: 5 epochs on 72K paragraphs, 50 min. |
| **Total to date** | **~52h** | Includes ~15.3h GPU time |
### Remaining Work (estimated)


@@ -21,7 +21,8 @@
- [x] DAPT corpus: 14,568 documents, ~1.056B tokens, cleaned (XBRL, URLs, page numbers stripped)
- [x] DAPT training complete: eval loss 0.7250, perplexity 1.65. 1 epoch on 500M tokens, ~14.5h on RTX 3090.
- [x] DAPT checkpoint at `checkpoints/dapt/modernbert-large/final/`
- [x] TAPT config: 5 epochs, whole-word masking, seq_len=512, batch=32
- [x] TAPT training complete: eval loss 1.0754, perplexity 2.11. 5 epochs, whole-word masking, ~50 min on RTX 3090. Loss: 1.46 → 1.08.
- [x] TAPT checkpoint at `checkpoints/tapt/modernbert-large/final/`
- [x] Custom `WholeWordMaskCollator` (upstream `transformers` collator broken for BPE tokenizers)
- [x] Python 3.14 → 3.13 rollback (dill/datasets pickle incompatibility)
- [x] Procedure documented in `docs/DAPT-PROCEDURE.md`
@@ -30,16 +31,10 @@
- [x] `docs/DATA-QUALITY-AUDIT.md` — full audit with all patches and quality tiers
- [x] `docs/EDGAR-FILING-GENERATORS.md` — 14 generators with signatures and quality profiles
- [x] `docs/DAPT-PROCEDURE.md` — pre-flight checklist, commands, monitoring guide
- [x] `docs/NARRATIVE.md` — 11 phases documented through TAPT completion
## What's In Progress
### Human Labeling (139/1,200)
- 3 of 6 annotators started: 68 + 50 + 21 paragraphs completed
- Deployed via labelapp with quiz gating + warmup
@@ -82,7 +77,7 @@ Full GenAI benchmark (9 models) on 1,200 holdout. Comparison tables. Write-up.
## Parallel Tracks
```
Track A (GPU): DAPT ✓ → TAPT → Fine-tuning → Eval
Track B (API): Judge v3 → Judge run ───────────┤
@@ -91,7 +86,7 @@ Track C (Human): Labeling (139/1200) → Gold set validation
Track D (Code): Fine-tune pipeline build ───────┘
```
DAPT + TAPT complete. Track D (fine-tune pipeline) can proceed now. Track B can start (prompt update) but production run waits for Track C. Everything converges at fine-tuning.
## Key File Locations


@@ -0,0 +1,449 @@
# Capstone: Build a Business-Grade Text Classifier
**Due:** April 23 by 12pm | **Points:** 35
---
## Team Assignment: Build a Business-Grade Text Classifier
**Team size:** 5–6 students; at least half must be enrolled in COMP488 or BUSI488.
One-sentence summary: Your team will build an end-to-end system that turns raw text (reviews, filings, reports, speeches, etc.) into reliable business labels—and you'll compare two approaches: genAI labeling vs. a fine-tuned specialist model.
## Form Your Team
**People > Groups > CAPSTONE : People**
- Each team member must join an EXISTING project group for THEIR SECTION
- DO NOT create your own (additional) project groups.
- Each group must satisfy the following:
  - Have at least two students enrolled in COMP488
  - Have at least two students enrolled in BUSI488
  - You can have at most three students enrolled in COMP488 and/or BUSI488 on your team, conditional on:
    - All other teams having at least two of each on their team (see above).
    - If other teams have fewer than two COMP488 and BUSI488 students, you cannot join a team that already meets this minimum requirement.
- **ALL TEAM MEMBERS MUST BE IN THE SAME SECTION (1 vs. 2)**
## Why This Assignment Matters (Business Reality)
### Businesses Are Drowning in Text
- Customer reviews, support chats, emails
- Social posts, news, press releases
- 10-K/10-Q filings, earnings calls, ESG reports
- Policies, regulatory communications, safety notices
- Research abstracts, patents, job postings
### Text Classification Enables Business Value
Text classification is one of the most common, high-ROI uses of AI because it converts messy language into structured signals that businesses can:
- Count — How many "billing issues" this week?
- Trend — Are "delivery failures" rising?
- Segment — Which regions complain about what?
- Route — Send messages/tickets to the right team automatically
- Flag risk — Safety/adverse events, compliance issues, reputational threats
- Inform strategy — Competitor moves, pricing tactics, CX weaknesses
- Input into (predictive) models — As independent variables
### Why Not Manual Reading?
Firms don't just "read it manually" because volume is too high, humans are slow and inconsistent across people/time, and decisions often need speed (minutes/hours, not weeks). Leadership needs dashboards and measurable KPIs.
This will feel like a stretch—and that's the point. You will be supported by structure, templates, and teamwork.
## What You Will Produce (End Result)
A working classifier for a well-documented, established, theoretically grounded construct (your choice), plus a business-style memo explaining:
- What the construct of interest is that you selected from the seven options:
https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf
- How it was established and where it is theoretically anchored and motivated
- Why the business should care
- How well your classifier works
- What it costs (time + money)
- Whether it's reliable and reproducible
## Choose a Construct of Interest: Meaningful and Labelable
Your construct should be:
- **Business-relevant** — addresses a real business decision
- **Theoretically grounded** — anchored in established literature
- **Well documented** — clearly defined in academic or industry sources
- **Observable in text** — detectable in your data source
- **Definable with clear rules** — specific enough for reliable labeling
- **Complex & nuanced** — more than just sentiment (not positive/negative)
You must pick one of the seven provided constructs of interest from here:
https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf
## Potential Public Data Sources
Below are places to obtain public text data for your classifier (downloads and/or official APIs). For each of the seven constructs of interest, there are already some data sources suggested. Below are additional sources to consider. Use sources with clear educational/research access terms whenever possible. Or use your own source and data (but observe guardrails & ethics as outlined below).
**Important:** Choose a source that fits your construct and is realistically useful to a firm.
### Customer Reviews & Product Voice
**Amazon Reviews (research datasets)**
- McAuley Lab Amazon datasets
- Stanford SNAP Amazon dataset page
- Hugging Face: Amazon Reviews 2023
**Yelp Open Dataset**
- Kaggle: Yelp Open Dataset
### Corporate Filings & Investor Communications
**SEC EDGAR (10-K / 10-Q / 8-K etc.) — Official APIs**
- EDGAR Application Programming Interfaces
- SEC Developer Resources
- SEC Disclosure Data API announcement
### Regulation, Enforcement, Safety & Compliance
**FDA Warning Letters**
- FDA: Warning Letters
- Data.gov: Warning Letters dataset
**U.S. Consumer Product Safety Commission (CPSC) Recalls — API**
- CPSC Recalls API information
### Health & Public Policy (Public)
**ClinicalTrials.gov (Modernized API)**
- ClinicalTrials.gov Data API
**PubMed / NCBI (E-utilities API)**
- NCBI E-utilities documentation
### Research & Innovation
**arXiv (bulk data)**
- arXiv bulk data help
- arXiv bulk data on AWS S3
**Patents (USPTO + PatentsView)**
- USPTO Open Data Portal: bulk data
- PatentsView: bulk downloads
### Government Documents & Speeches
**govinfo API + documentation** (Congressional docs, Federal Register, etc.)
- govinfo API overview
- govinfo API docs
**Congress.gov API**
- Congress.gov API (GPO)
**U.S. Presidential speeches** (Miller Center data portal)
- Miller Center: Presidential Speech Data
### News / Media Monitoring
**GDELT** (global news monitoring)
- GDELT data downloads
- GDELT 2.0 API documentation (blog)
**Common Crawl News dataset** (CC-NEWS)
- Common Crawl: News dataset announcement
### Security / Technical Risk
**NIST National Vulnerability Database (NVD)** — CVE APIs & feeds
- NVD: Vulnerabilities API
- NVD: Data feeds
### Earnings Calls / Transcripts (public option)
**Academic/open dataset option**
- STRUX dataset page
### Tips & Rules
- **Tip:** Before committing, do a quick pilot on 100–200 texts to confirm your construct appears in the source and that your labels are workable
- **Rule:** Use public data or properly de-identified data only. No sensitive internal company data.
---
## The Workflow: What You Must Do (Step-by-Step)
### Step 1 — Construct Choice + Business Case
**Goal:** Pick a well-documented, theoretically founded construct of interest. Explain why a firm would want to classify this construct at scale.
https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf
**Include:**
- Who is the stakeholder? (CX lead, compliance, product manager, investor relations, etc.)
- What decision/action will classification enable?
- What happens if the classifier is wrong? (false positives vs false negatives)
- Why now? (volume, speed, competitive need)
**Deliverable:** 1–2 page concept brief
### Step 2 — Define the Construct Precisely
**Goal:** Turn the concept into labels humans can apply consistently.
**You must create:**
- Label set (classes)
- Clear definitions and decision rules
- Borderline cases: what to do when unclear
- "None/other" policy IF APPLICABLE (multi-class yes, multi-label no)
- 2-3 example texts per label (your own)
**Decide:** multi-class (one label) vs multi-label (multiple labels can apply).
**Deliverable:** Labeling codebook (PDF)
### Step 3 — Identify and Justify the Data Source
**Goal:** Show that your source actually contains your construct.
**Include:**
- Why this source fits the business purpose
- How a firm would use it regularly (weekly monitoring, quarterly reporting, etc.)
- Basic access plan (download/API/scrape—follow site rules)
**Deliverable:** Data source plan + sampling approach
### Step 4 — Collect Your Dataset
**Targets:**
- 20,000 texts total for train/test
- 1,200 texts for a locked holdout set
**Deliverable:** Raw dataset file + collection notebook/script + documentation
### Step 5 — Preprocess and Split
**You must:**
- Clean text (remove duplicates, empty, obvious spam)
- Create: train/test totaling 20,000; holdout = 1,200 (lock it in early—don't tune on it)
- **Beware of label imbalance!** May need to curate strategically!
- Report length stats and basic summaries
- Plan for imbalance (it's normal—just don't ignore it)
**Deliverable:** Preprocessing notebook + split files + basic descriptive stats
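A minimal sketch of the cleaning pass described above (the function name and the whitespace/case normalization rule are illustrative choices, not prescribed):

```python
import re

def clean_corpus(texts):
    """Drop empty texts, collapse whitespace, and remove exact duplicates
    (after case/whitespace normalization), preserving first-seen order."""
    seen, out = set(), []
    for t in texts:
        norm = re.sub(r"\s+", " ", t or "").strip()
        if not norm:
            continue                 # empty after cleaning
        key = norm.lower()
        if key in seen:
            continue                 # duplicate of an earlier text
        seen.add(key)
        out.append(norm)
    return out
```

Near-duplicate and spam filtering (the other cleaning targets) need fuzzier matching on top of this.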
### Step 6 — Human Label the Holdout
**Requirements:**
- The 1,200 holdout must be labeled by humans
- At least 3 people must label each example
- Either independent labels + agreement report (Krippendorff's alpha recommended), or discussion-based consensus + documented process
**Deliverable:** Holdout with 3+ labels per item (or consensus + notes) + reliability report
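For the independent-labels route, Krippendorff's alpha for nominal labels can be computed from a coincidence matrix in a few lines. This is a from-scratch sketch for intuition; in your notebook, prefer a vetted implementation such as the `krippendorff` PyPI package:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one label list per item, holding the labels assigned by the
    coders who rated that item (missing ratings simply omitted).
    Items with fewer than two labels are skipped.
    """
    o = Counter()  # coincidence matrix o[(c, k)]
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for c, k in permutations(labels, 2):   # all ordered label pairs
            o[(c, k)] += 1 / (m - 1)
    n_c = Counter()
    for (c, _), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())
    if n <= 1:
        return 1.0
    d_o = sum(v for (c, k), v in o.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1 - d_o / d_e
```

Perfect agreement gives 1.0; agreement at chance level gives 0.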
### Step 7 — Benchmark GenAI Labeling (≥6 Models)
**Goal:** Treat genAI as alternative "labelers" and compare.
**Requirements:**
- Run at least 6 models from at least 3 different suppliers (e.g., OpenAI, Anthropic, Meta, xAI, Google, FireworksAI, Deepseek, Moonshot)
- Fix prompts to make them comparable across models
- Track runtime, estimated cost, and reproducibility strategy
**Metrics to report:**
- Krippendorff's alpha (vs human labels)
- F1 (macro and per-class if possible)
- AUC (where applicable)
- MCC
**Deliverable:** Benchmark notebook + results table + cost/time log
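The headline metrics are worth sanity-checking against small hand-worked examples. A self-contained sketch (binary MCC only; for multi-class MCC and per-class F1 use `sklearn.metrics.matthews_corrcoef` and `f1_score`):

```python
import math

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def mcc_binary(y_true, y_pred, pos=1):
    """Matthews correlation coefficient for a binary task."""
    tp = sum(t == pos and p == pos for t, p in zip(y_true, y_pred))
    tn = sum(t != pos and p != pos for t, p in zip(y_true, y_pred))
    fp = sum(t != pos and p == pos for t, p in zip(y_true, y_pred))
    fn = sum(t == pos and p != pos for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Macro averaging weights every class equally, which is why it (together with MCC) is the right headline under class imbalance.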
### Step 8 — Select GenAI Labeling Strategy and Label Train/Test
Choose the best single model or a combination (e.g., majority vote). Then label your 20,000 train/test set.
**Deliverable:** Labeled train/test dataset + labeling script + total cost/time summary
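A majority-vote combiner is only a few lines; the tie-break policy here (fall back to your single best model's label, else alphabetical for determinism) is one reasonable choice, not a requirement:

```python
from collections import Counter

def majority_label(labels, tie_break=None):
    """Pick the modal label across model outputs for one text.

    labels: one label per model. Ties go to `tie_break` (e.g. the label
    from your best benchmark model) when provided, otherwise to the
    alphabetically first tied label so reruns are reproducible.
    """
    counts = Counter(labels)
    top = max(counts.values())
    tied = sorted(c for c, v in counts.items() if v == top)
    if tie_break is not None and tie_break in tied:
        return tie_break
    return tied[0]
```

Use an odd number of voters where possible to keep ties rare.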
### Step 9 — Fine-Tune a Specialist Model
**Goal:** Create a smaller, task-specific classifier that can match/exceed genAI labeling quality.
**Minimum requirements:**
- Fine-tune at least one pretrained model (RoBERTa or similar)
- Try at least four configurations (model choice or hyperparameters)
- Evaluate on holdout (only at the end)
**Deliverable:** Training notebook + saved model + evaluation on holdout
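One way to satisfy the four-configuration minimum is a small grid over model choice and a hyperparameter; the models and learning rates below are illustrative placeholders, not prescribed values:

```python
from itertools import product

# Hypothetical search space: 2 base models x 2 learning rates = 4 configs,
# meeting the "at least four configurations" requirement.
models = ["roberta-base", "microsoft/deberta-v3-base"]
learning_rates = [2e-5, 5e-5]

configs = [
    {"model": m, "lr": lr, "epochs": 3, "batch_size": 32, "seed": 42}
    for m, lr in product(models, learning_rates)
]
# Train each config on the train split, compare on the test split,
# and touch the 1,200-item holdout only once, with the final winner.
```

Fixing the seed and logging every config keeps the experiments reproducible for the appendix.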
### Step 10 — Final Comparison: GenAI vs Specialist
Your final analysis must answer:
- Which is better on holdout and by how much?
- Which is cheaper per 1,000,000 texts?
- Which is faster?
- Which is more consistent and reproducible?
- What would you recommend a firm deploy, and why?
**Deliverable:** Final comparison section + recommendation
---
## What To Submit (Deliverables Checklist)
### A) Executive Memo (Executive Style, Max 5 Pages)
- Construct + why it matters + theoretically grounded and well documented
- Data source + governance/ethics
- Label schema overview
- Results summary: best genAI vs best specialist
- Cost/time/reproducibility comparison
- Recommendation for a real firm
### B) Technical Appendix (Slides or PDF)
- Pipeline diagram (data → labels → model → evaluation)
- Label codebook (or link/appendix)
- Benchmark table (≥6 genAI models)
- Fine-tuning experiments + results
- Error analysis: where does it fail and why?
### C) Code + Artifacts (Datasets)
- Colab notebooks (reproducible)
- **Datasets:**
- Holdout with human labels
- Train/test with genAI labels
- Any other data used
- All labels of all models for each run + majority labels
- Saved fine-tuned model + inference script (provide link to webspace/Google Drive/Dropbox; **do not upload to Canvas!**)
- Cost/time log
### D) IGNITE Talk in Class
- **20 PPTX Slides**, 15 sec per slide (automatic transitions), 5 min total
- Submit your PPTX slides (not PDF, not Google Slides, not other format!) set to auto transitions after 15 sec on Canvas
- I will have all slides loaded and ready to go for you in class
- **Key messages and insights only**
- **Every team member must present at least 2 slides**
- There will be a 3 min Q&A after each IGNITE talk
**Learn more:**
- https://en.wikipedia.org/wiki/Ignite_(event)
- https://www.ignitetalks.io/
- https://robbiesenbach.com/deliver-successful-ignite-talk/
---
- **Start simple, then improve:** Your first codebook will be imperfect. Iterate.
- **Design labels for reliability:** If humans disagree a lot, the model will struggle. Fix definitions before scaling.
- **Lock the holdout early:** It counts 35% of your grade. Don't change direction repeatedly—fix it early and make sure it's representative for what you ultimately want to achieve from a business perspective.
- **Do a small pilot before spending money:** Test prompts and schema on 100–200 examples first.
- **Don't hide class imbalance:** Imbalance is normal. But it can damage evaluation and training. **You must solve this!** Use macro F1 and MCC, and document prevalence.
## Team Structure (Recommended Roles)
- **Project lead:** Scope, project plan, milestones, quality control
- **Data lead:** Collection, cleaning, dataset documentation
- **Labeling lead:** Codebook, human labeling workflow, reliability stats
- **GenAI lead:** API scripts, benchmarking, cost/time tracking
- **Modeling lead:** Fine-tuning, hyperparameters, reproducibility, evaluation
- **Delivery lead:** Assemble all outputs, organize write-ups, final quality control
## Grading Rubric (100 Points)
| Criterion | Points |
| ----------------------------------------- | ------ |
| Business framing & construct clarity | 20 |
| Data pipeline quality + documentation | 15 |
| Human labeling process + reliability | 15 |
| GenAI benchmarking rigor | 20 |
| Fine-tuning rigor + evaluation discipline | 20 |
| Final comparison + recommendation quality | 10 |
### Minimum Requirements Per Letter Grade:
#### C (- to +)
- Fine-tuned model with F1 score > 0.80
- Performance comparison genAI vs. fine-tuned model
- Labeled datasets
- Documentation
- Python notebook(s) to replicate pre-processing, training, and evaluation
#### B (- to +)
All of the above plus **at least three** of the following:
- Cost, time, reproducibility analysis
- Comparison of 6 or more models from at least 3 different suppliers
- Contemporary data that you collected (not an off-the-shelf dataset)
- Compelling use-case for your classifier with complete business case
#### A (- to A)
All of the above plus **at least three** of the following:
- Error-analysis (corner cases, rare or complex texts)
- Mitigation strategy and implementation to overcome identified model weaknesses
- Additional baselines (e.g., alternative classification approaches like dictionaries, topic models)
- Comparison to amateur labels
## Guardrails (Ethics & Compliance)
- Public data or approved de-identified data only
- Remove or avoid sensitive personal information
- Document limitations and potential bias
- If your construct is sensitive (health, safety, harassment), include a brief risk statement and mitigation steps
## Estimated Effort
| Task | Hours |
| --------------------------------------------------------- | ----- |
| Construct + codebook v1 + data source plan | 4 |
| Data collection + preprocessing + splits | 4 |
| 1.2K human labeling + reliability + codebook v2 | 8 |
| GenAI benchmarking (≥6 models) + choose labeling strategy | 3 |
| Label 20k + fine-tune specialist (2+ configs) | 2 |
| Final evaluation + memo + presentation | 3 |
**Total: ~24 hours per student**
## Best Work Featured in Vertical AI Paper
I will select (and improve) the best 1–2 classifiers to be featured in my Vertical AI Paper, where each team member will be acknowledged for their application of the synthetic expert/specialist approach. I will ask members whether they want to be named or not (choosing to remain anonymous will not impact your grade).
## Getting Started
To help you with your Capstone, I wrote a full pipeline in a Python notebook that does all the key steps you need for your Capstone Project by example of classifying 10K sentences into business functions. This includes querying genAI via API at scale, creating holdout and training datasets, fine-tuning a pretrained LLM, and evaluating the performance of genAI and your fine-tuned (vertical AI) model.
**What it does not do** is give you a construct of interest, collect your data, clean and preprocess your data, or draw conclusions and write reports for you.
**Python notebook:** http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.ipynb
**Zip file with outputs:** http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.zip (excludes the actual trained vertical AI because it is 1.5GB)
All subfolders and datasets are included. This is a great blueprint for what data you need to deliver with your capstone on a shared drive (provide link to me) or uploaded if sufficiently small (less than 20MB).
The contents of the zip file also help you see what the expected output is (by example of a multi-label classification problem). You will need to adapt this code to your problem. Use genAI (e.g., Claude Opus 4.6) for this. The pipeline gives you a solid base to work off.


@@ -0,0 +1,688 @@
# Project 3: SEC Cybersecurity Disclosure Quality Classifier
## Capstone 2026 — BUSI488/COMP488 — Team Knowledge Transfer
**Project:** Build a validated, reusable classifier that labels SEC cybersecurity disclosures by content category and specificity level, then fine-tune an open-weights model for deployment at scale.
**Methodology:** Ringel (2023) "Synthetic Experts" pipeline — use frontier LLMs to generate training labels, then distill into a small open-weights encoder model.
**Why this project:** No HuggingFace dataset of extracted Item 1C disclosures exists. No trained classifier for cybersecurity disclosure quality exists. No domain-adapted ModernBERT on SEC filings exists. The iXBRL CYD taxonomy just went live (Dec 2024). We produce **three publishable artifacts**: a novel dataset, a labeling methodology, and a SOTA classifier.
---
## Table of Contents
1. [Regulatory Background](#1-regulatory-background)
2. [Labeling Rubric](#2-labeling-rubric)
3. [Data Acquisition](#3-data-acquisition)
4. [GenAI Labeling Pipeline](#4-genai-labeling-pipeline)
5. [Model Strategy](#5-model-strategy)
6. [Evaluation & Validation](#6-evaluation--validation)
7. [Release Artifacts](#7-release-artifacts)
8. [3-Week Schedule (6 People)](#8-3-week-schedule-6-people)
9. [Budget](#9-budget)
10. [Reference Links](#10-reference-links)
---
## 1. Regulatory Background
### The Rule: SEC Release 33-11216 (July 2023)
The SEC adopted final rules requiring public companies to disclose cybersecurity risk management, strategy, governance, and material incidents. This created a massive new text corpus with natural variation in quality — perfect for classification.
Full rule PDF: <https://www.sec.gov/files/rules/final/2023/33-11216.pdf>
Fact sheet: <https://www.sec.gov/files/33-11216-fact-sheet.pdf>
### Item 1C — Annual Disclosure (10-K)
Appears as **Regulation S-K Item 106**, reported in **Item 1C** of the 10-K. Two mandated subsections:
**Item 106(b) — Risk Management and Strategy:**
1. Processes for assessing, identifying, and managing material cybersecurity risks
2. Whether/how cybersecurity processes integrate into overall enterprise risk management (ERM)
3. Whether the company engages external assessors, consultants, or auditors
4. Processes to oversee/identify risks from third-party service providers
5. Whether cybersecurity risks (including prior incidents) have materially affected or are reasonably likely to affect business strategy, results, or financial condition
**Item 106(c) — Governance:**
*Board Oversight (106(c)(1)):*
- Description of board's oversight of cybersecurity risks
- Identification of responsible board committee/subcommittee
- Processes by which the board/committee is informed about risks
*Management's Role (106(c)(2)):*
- Which management positions/committees are responsible
- Relevant expertise of those persons
- How management monitors prevention, detection, mitigation, and remediation
- Whether and how frequently management reports to the board
**Key design note:** The SEC uses "describe" — it does not prescribe specific items. The enumerated sub-items are non-exclusive suggestions. This principles-based approach creates natural variation in specificity and content, which is exactly what our rubric captures.
### Item 1.05 — Incident Disclosure (8-K)
Required within **4 business days** of determining a cybersecurity incident is material:
1. Material aspects of the nature, scope, and timing of the incident
2. Material impact or reasonably likely material impact on the registrant
**Key nuances:**
- The 4-day clock starts at the **materiality determination**, not the incident itself
- Companies explicitly do NOT need to disclose technical details that would impede response/remediation
- The U.S. Attorney General can delay disclosure up to 120 days for national security
- Companies must amend the 8-K when new material information becomes available
**The May 2024 shift:** After SEC Director Erik Gerding clarified that Item 1.05 is only for *material* incidents, companies pivoted from Item 1.05 to Items 8.01/7.01 for non-material disclosures:
- Pre-guidance: 72% used Item 1.05, 28% used 8.01/7.01
- Post-guidance: 34% used Item 1.05, 66% used 8.01/7.01
**Our extraction must capture all three item types.**
### Compliance Timeline
| Date | Milestone |
|------|-----------|
| Jul 26, 2023 | Rule adopted |
| Sep 5, 2023 | Rule effective |
| Dec 15, 2023 | Item 1C required in 10-Ks (FY ending on/after this date) |
| Dec 18, 2023 | Item 1.05 required in 8-Ks |
| Jun 15, 2024 | Item 1.05 required for smaller reporting companies |
| Dec 15, 2024 | iXBRL tagging of Item 106 (CYD taxonomy) required |
| Dec 18, 2024 | iXBRL tagging of 8-K Item 1.05 required |
### iXBRL CYD Taxonomy
The SEC published the **Cybersecurity Disclosure (CYD) Taxonomy** on Sep 16, 2024. Starting with filings after Dec 15, 2024, Item 1C disclosures are tagged in Inline XBRL using the `cyd` prefix. This means 2025 filings can be parsed programmatically via XBRL rather than HTML scraping.
Taxonomy schema: `http://xbrl.sec.gov/cyd/2024`
Taxonomy guide: <https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf>
### Corpus Size
| Filing Type | Estimated Count (as of early 2026) |
|-------------|-----------------------------------|
| 10-K with Item 1C (FY2023 cycle) | ~4,500 |
| 10-K with Item 1C (FY2024 cycle) | ~4,500 |
| 8-K cybersecurity incidents | ~80 filings (55 incidents + amendments) |
| **Total filings** | **~9,000-10,000** |
| **Estimated paragraphs** (from Item 1C) | **~50,000-80,000** |
---
## 2. Labeling Rubric
### Dimension 1: Content Category (single-label per paragraph)
Derived directly from the SEC rule structure. Each paragraph receives exactly one category:
| Category | SEC Basis | What It Covers | Example Markers |
|----------|-----------|----------------|-----------------|
| **Board Governance** | 106(c)(1) | Board/committee oversight, briefing frequency, board cyber expertise | "Audit Committee," "Board of Directors oversees," "quarterly briefings" |
| **Management Role** | 106(c)(2) | CISO/CTO identification, qualifications, reporting structure | "Chief Information Security Officer," "reports to," "years of experience" |
| **Risk Management Process** | 106(b) | Assessment/identification processes, ERM integration, framework references | "NIST CSF," "ISO 27001," "risk assessment," "vulnerability management" |
| **Third-Party Risk** | 106(b) | Vendor oversight, external assessors/consultants, supply chain risk | "third-party," "service providers," "penetration testing by," "external auditors" |
| **Incident Disclosure** | 8-K 1.05 | Nature/scope/timing of incidents, material impact, remediation | "unauthorized access," "detected," "incident," "remediation," "impacted" |
| **Strategy Integration** | 106(b)(2) | Material impact on business strategy, cyber insurance, resource allocation | "business strategy," "insurance," "investment," "material," "financial condition" |
| **None/Other** | — | Boilerplate intros, legal disclaimers, non-cybersecurity content | Forward-looking statement disclaimers, general risk language |
### Dimension 2: Specificity (4-point ordinal per paragraph)
Grounded in Berkman et al. (2018), Gibson Dunn surveys, and PwC quality tiers:
| Level | Label | Definition | Decision Test |
|-------|-------|------------|---------------|
| **1** | **Generic Boilerplate** | Could apply to any company. Conditional language ("may," "could"). No named entities. Passive voice. | "Could I paste this into a different company's filing unchanged?" → Yes |
| **2** | **Sector-Adapted** | References industry context or named frameworks (NIST, ISO) but no firm-specific detail. | "Does this name something specific but not unique to THIS company?" → Yes |
| **3** | **Firm-Specific** | Names roles (CISO by name), committees, reporting lines, specific programs, or processes unique to the firm. Active voice with accountability. | "Does this contain at least one fact unique to THIS company?" → Yes |
| **4** | **Quantified-Verifiable** | Includes metrics, dollar amounts, dates, frequencies, third-party audit references, or independently verifiable facts. Multiple firm-specific facts with operational detail. | "Could an outsider verify a specific claim in this paragraph?" → Yes |
**Boundary rules for annotators:**
- If torn between 1 and 2: "Does it name ANY framework, standard, or industry term?" → Yes = 2
- If torn between 2 and 3: "Does it mention anything unique to THIS company?" → Yes = 3
- If torn between 3 and 4: "Does it contain TWO OR MORE specific, verifiable facts?" → Yes = 4
**Important:** EvasionBench (Ma et al., 2026) found that a 5-level ordinal scale failed (kappa < 0.5) and had to be collapsed to 3 levels. **Pilot test this 4-level scale on 50 paragraphs early.** Be prepared to merge levels 1-2 or 3-4 if inter-annotator agreement is poor.
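For pilot sanity checks, the boundary questions can be approximated with a crude marker heuristic. This is an illustrative baseline only (every regex list below is an assumption); human annotators follow the rubric, not these patterns:

```python
import re

# Illustrative marker lists -- the rubric's decision tests, not these
# regexes, remain the ground truth.
FRAMEWORKS = re.compile(r"NIST|ISO 27001|SOC 2|PCI-DSS", re.I)
FIRM_SPECIFIC = re.compile(
    r"our (CISO|Chief Information Security Officer)|Audit Committee|reports to",
    re.I,
)
VERIFIABLE = re.compile(
    r"\d{4}|\$\s?\d|annually|quarterly|monthly|\b\d+%|\b\d+ (?:hours|days|employees)",
    re.I,
)

def specificity_guess(paragraph):
    """Rough level 1-4 guess following the boundary questions above."""
    verifiable_facts = len(VERIFIABLE.findall(paragraph))
    if verifiable_facts >= 2:
        return 4  # two or more verifiable facts
    if FIRM_SPECIFIC.search(paragraph):
        return 3  # something unique to this company
    if FRAMEWORKS.search(paragraph):
        return 2  # named framework, nothing firm-specific
    return 1      # generic boilerplate
```

Disagreements between this heuristic and human labels are a cheap way to surface borderline paragraphs for the pilot.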
### Boilerplate vs. Substantive Markers (from the literature)
**Boilerplate indicators:**
- Conditional language: "may," "could," "might"
- Generic risk statements without company-specific context
- No named individuals, committees, or frameworks
- Identical language across same-industry filings (cosine similarity > 0.8)
- Passive voice: "cybersecurity risks are managed"
**Substantive indicators:**
- Named roles and reporting structures ("Our CISO, Jane Smith, reports quarterly to the Audit Committee")
- Specific frameworks by name (NIST CSF, ISO 27001, SOC 2, PCI-DSS)
- Concrete processes (penetration testing frequency, tabletop exercises)
- Quantification (dollar investment, headcount, incident counts, training completion rates)
- Third-party names or types of assessments
- Temporal specificity (dates, frequencies, durations)
### Mapping to NIST CSF 2.0
For academic grounding, our content categories map to NIST CSF 2.0 functions:
| Our Category | NIST CSF 2.0 |
|-------------|-------------|
| Board Governance | GOVERN (GV.OV, GV.RR) |
| Management Role | GOVERN (GV.RR, GV.RM) |
| Risk Management Process | IDENTIFY (ID.RA), GOVERN (GV.RM), PROTECT (all) |
| Third-Party Risk | GOVERN (GV.SC) |
| Incident Disclosure | DETECT, RESPOND, RECOVER |
| Strategy Integration | GOVERN (GV.OC, GV.RM) |
---
## 3. Data Acquisition
### 3.1 Extracting 10-K Item 1C
**Recommended pipeline:**
```
sec-edgar-downloader → edgar-crawler → paragraph segmentation → dataset
(bulk download) (parse Item 1C) (split into units)
```
**Tools:**
| Tool | Purpose | Install | Notes |
|------|---------|---------|-------|
| `sec-edgar-downloader` | Bulk download 10-K filings by CIK | `pip install sec-edgar-downloader` | Pure downloader, no parsing |
| `edgar-crawler` | Extract specific item sections to JSON | `git clone github.com/lefterisloukas/edgar-crawler` | Best for bulk extraction; configure `['1C']` in items list |
| `edgartools` | Interactive exploration, XBRL parsing | `pip install edgartools` | `tenk['Item 1C']` accessor; great for prototyping |
| `sec-api` | Commercial API, zero parsing headaches | `pip install sec-api` | `extractorApi.get_section(url, "1C", "text")` — paid, free tier available |
**EDGAR API requirements:**
- Rate limit: 10 requests/second
- Required: Custom `User-Agent` header with name and email (e.g., `"TeamName team@email.com"`)
- SEC blocks requests without proper User-Agent (returns 403)
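The requirements above can be sketched as a small polite-download helper. This is a minimal illustration, not part of any of the listed tools; `USER_AGENT`, `RateLimiter`, and `polite_get` are illustrative names, and the User-Agent string must be replaced with your real contact details.

```python
import time
import urllib.request

# SEC requires a descriptive User-Agent; requests without one return HTTP 403.
USER_AGENT = "TeamName team@email.com"  # replace with your actual name/email


class RateLimiter:
    """Caps request rate at `max_per_sec` (SEC allows 10 requests/second)."""

    def __init__(self, max_per_sec: float = 10.0):
        self.min_interval = 1.0 / max_per_sec
        self.last = 0.0

    def wait(self):
        sleep_for = self.min_interval - (time.monotonic() - self.last)
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last = time.monotonic()


limiter = RateLimiter(10.0)


def polite_get(url: str) -> bytes:
    """Fetch an EDGAR URL with the required header, at most 10 req/s."""
    limiter.wait()
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

`sec-edgar-downloader` and `edgar-crawler` handle this internally; the sketch is mainly useful for any ad-hoc EDGAR requests outside those tools.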
**For iXBRL-tagged filings (2025+):** Use `edgartools` XBRL parser to extract CYD taxonomy elements directly. This gives pre-structured data aligned with regulatory categories.
**Fallback corpus:** `PleIAs/SEC` on HuggingFace (373K 10-K full texts, CC0 license) — but sections are NOT pre-parsed; you must extract Item 1C yourself.
### 3.2 Extracting 8-K Incident Disclosures
| Tool | Purpose | URL |
|------|---------|-----|
| `sec-8k-item105` | Extract Item 1.05 from 8-Ks, iXBRL + HTML fallback | `github.com/JMousqueton/sec-8k-item105` |
| `SECurityTr8Ker` | Monitor SEC RSS for new cyber 8-Ks, Slack/Teams alerts | `github.com/pancak3lullz/SECurityTr8Ker` |
| Debevoise 8-K Tracker | Curated list with filing links, dates, amendments | `debevoisedatablog.com/2024/03/06/cybersecurity-form-8-k-tracker/` |
| Board Cybersecurity Tracker | Links filings to MITRE ATT&CK, impact assessments | `board-cybersecurity.com/incidents/tracker` |
**Critical:** Must capture Item 1.05 AND Items 8.01/7.01 (post-May 2024 shift).
### 3.3 Paragraph Segmentation
Once Item 1C text is extracted, segment into paragraphs:
- Split on double newlines or `<p>` tags (depending on extraction format)
- Minimum paragraph length: 20 words (filter out headers, whitespace)
- Maximum paragraph length: 500 words (split longer blocks at sentence boundaries)
- Preserve metadata: company name, CIK, ticker, filing date, fiscal year
Expected yield: ~5-8 paragraphs per Item 1C disclosure × ~9,000 filings = **~50,000-70,000 paragraphs**
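The segmentation rules above can be sketched as follows. Function names are illustrative, and metadata attachment (company, CIK, filing date) is omitted for brevity; in the real pipeline each paragraph would be emitted as a record carrying those fields.

```python
import re

MIN_WORDS, MAX_WORDS = 20, 500


def segment_item_1c(text: str) -> list[str]:
    """Split extracted Item 1C text into paragraph units per the rules above."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):   # split on blank lines
        block = " ".join(block.split())        # normalize whitespace
        if len(block.split()) < MIN_WORDS:     # drop headers / stray lines
            continue
        paragraphs.extend(_split_long(block))
    return paragraphs


def _split_long(block: str) -> list[str]:
    """Split blocks over MAX_WORDS at sentence boundaries."""
    if len(block.split()) <= MAX_WORDS:
        return [block]
    sentences = re.split(r"(?<=[.!?])\s+", block)
    chunks, current = [], []
    for sent in sentences:
        if current and len(" ".join(current + [sent]).split()) > MAX_WORDS:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For HTML-sourced extractions, splitting on `<p>` tags upstream and feeding the resulting text blocks through the same filters gives equivalent behavior.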
### 3.4 Pre-Existing Datasets and Resources
| Resource | What It Is | URL |
|----------|-----------|-----|
| PleIAs/SEC | 373K full 10-K texts (CC0) | `huggingface.co/datasets/PleIAs/SEC` |
| EDGAR-CORPUS | 220K filings with sections pre-parsed (Apache 2.0) | `huggingface.co/datasets/eloukas/edgar-corpus` |
| Board Cybersecurity 23-Feature Analysis | Regex-based extraction of 23 governance/security features from 4,538 10-Ks | `board-cybersecurity.com/research/insights/` |
| Gibson Dunn S&P 100 Survey | Detailed feature analysis of disclosure content | `corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-...` |
| Florackis et al. (2023) "Cybersecurity Risk" | Firm-level cyber risk measure from 10-K text, RFS publication | SSRN: 3725130, data companion: 4319606 |
| zeroshot/cybersecurity-corpus | General cybersecurity text (not SEC-specific, useful for DAPT) | `huggingface.co/datasets/zeroshot/cybersecurity-corpus` |
---
## 4. GenAI Labeling Pipeline
### 4.1 Multi-Model Consensus (EvasionBench Architecture)
We follow Ma et al. (2026, arXiv:2601.09142) — the EvasionBench pipeline designed for an almost identical task (ordinal classification of financial text). Their approach achieved Cohen's Kappa = 0.835 with human annotators.
**Stage 1 — Dual Independent Annotation (all ~50K paragraphs):**
- Annotator A: **Claude Sonnet 4.6** (batch API — $1.50/$7.50 per M input/output tokens)
- Annotator B: **Gemini 2.5 Flash** ($0.30/$2.50 per M tokens)
- Architectural diversity (Anthropic vs. Google) minimizes correlated errors
- ~83% of paragraphs will have immediate agreement
**Stage 2 — Judge Panel for Disagreements (~17% = ~8,500 cases):**
- Judge 1: **Claude Opus 4.6** (batch — $2.50/$12.50 per M tokens)
- Judge 2: **GPT-5** (batch — $0.63/$5.00 per M tokens)
- Judge 3: **Gemini 2.5 Pro** (~$2-4/$12-18 per M tokens)
- Majority vote (2/3) resolves disagreements
- Anti-bias: randomize label presentation order
**Stage 3 — Active Learning Pass:**
- Cluster remaining low-confidence cases
- Human-review ~5% (~2,500 cases) to identify systematic errors
- Iterate rubric if needed, re-run affected subsets
### 4.2 Prompt Template
```
SYSTEM PROMPT:
You are an expert annotator classifying paragraphs from SEC cybersecurity
disclosures (10-K Item 1C and 8-K Item 1.05 filings).
For each paragraph, assign:
(a) content_category: exactly one of ["Board Governance", "Management Role",
"Risk Management Process", "Third-Party Risk", "Incident Disclosure",
"Strategy Integration", "None/Other"]
(b) specificity_level: integer 1-4
CONTENT CATEGORIES:
- Board Governance: Board/committee oversight of cybersecurity risks, briefing
frequency, board member cyber expertise
- Management Role: CISO/CTO/CIO identification, qualifications, reporting
structure, management committees
- Risk Management Process: Risk assessment methodology, framework adoption
(NIST, ISO, etc.), vulnerability management, monitoring, incident response
planning, tabletop exercises
- Third-Party Risk: Vendor/supplier risk oversight, external assessor engagement,
contractual security requirements, supply chain risk
- Incident Disclosure: Description of cybersecurity incidents, scope, timing,
impact, remediation actions
- Strategy Integration: Material impact on business strategy or financials,
cyber insurance, investment/resource allocation
- None/Other: Boilerplate introductions, legal disclaimers, forward-looking
statement warnings, non-cybersecurity content
SPECIFICITY SCALE:
1 - Generic Boilerplate: Could apply to any company. Conditional language
("may," "could"). No named entities.
Example: "We face cybersecurity risks that could materially affect our
business operations."
2 - Sector-Adapted: References industry context or named frameworks but no
firm-specific details.
Example: "We employ a cybersecurity framework aligned with the NIST
Cybersecurity Framework to manage cyber risk."
3 - Firm-Specific: Contains facts unique to this company — named roles,
committees, specific programs, reporting lines.
Example: "Our CISO reports quarterly to the Audit Committee on
cybersecurity risk posture and incident trends."
4 - Quantified-Verifiable: Includes metrics, dollar amounts, dates,
frequencies, third-party audit references, or independently verifiable facts.
Example: "Following the March 2024 incident affecting our payment systems,
we engaged CrowdStrike and implemented network segmentation at a cost of
$4.2M, completing remediation in Q3 2024."
BOUNDARY RULES:
- If torn between 1 and 2: "Does it name ANY framework, standard, or industry
term?" If yes → 2
- If torn between 2 and 3: "Does it mention anything unique to THIS company?"
If yes → 3
- If torn between 3 and 4: "Does it contain TWO OR MORE specific, verifiable
facts?" If yes → 4
Respond with valid JSON only. Include a brief reasoning field.
USER PROMPT:
Company: {company_name}
Filing Date: {filing_date}
Paragraph:
{paragraph_text}
```
**Expected output:**
```json
{
"content_category": "Board Governance",
"specificity_level": 3,
"reasoning": "Identifies Audit Committee by name and describes quarterly briefing cadence, both firm-specific facts."
}
```
### 4.3 Practical Labeling Notes
- **Always use Batch API.** Both OpenAI and Anthropic offer 50% discount for async/batch processing (24-hour turnaround). No reason to use real-time.
- **Prompt caching:** The system prompt (~800 tokens) is identical for every request. With Anthropic's prompt caching, cached reads cost 10% of base price. Combined with batch discount = 5% of standard price.
- **Structured output mode:** Use JSON mode / structured outputs on all providers. Reduces parsing errors by ~90%.
- **Reasoning models (o3, extended thinking):** Use ONLY as judges for disagreement cases, not as primary annotators. They're overkill for clear-cut classification and expensive due to reasoning token consumption.
### 4.4 Gold Set Protocol
**Non-negotiable for publication quality.**
1. Sample 300-500 paragraphs, stratified by:
- Expected content category (ensure all 7 represented)
- Expected specificity level (ensure all 4 represented)
- Industry (financial services, tech, healthcare, manufacturing)
- Filing year (FY2023 vs FY2024)
2. Two team members independently label the full gold set
3. Compute:
- Cohen's Kappa (binary/nominal categories)
- Krippendorff's Alpha (ordinal specificity scale)
- Per-class confusion matrices
- Target: Kappa > 0.75 ("substantial agreement")
4. Adjudicate disagreements with a third team member
5. Run the full MMC pipeline on the gold set and compare
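For the agreement statistics in step 3, `sklearn.metrics.cohen_kappa_score` and the `krippendorff` package are the natural choices; as a minimal self-contained reference for what Cohen's Kappa computes (chance-corrected agreement between two annotators), a sketch:

```python
from collections import Counter


def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_obs - p_exp) / (1 - p_exp), where p_exp is the agreement
    expected by chance from each annotator's marginal label distribution.
    Undefined when p_exp == 1 (both annotators use a single label).
    """
    assert len(a) == len(b)
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[lab] * cb[lab] for lab in set(ca) | set(cb)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)
```

For the ordinal specificity scale, prefer a weighted variant (quadratic weights) or Krippendorff's Alpha at the ordinal level, since plain kappa treats a 1-vs-4 disagreement the same as 3-vs-4.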
---
## 5. Model Strategy
### 5.1 Primary: SEC-ModernBERT-large
**This model does not exist publicly. Building it is a core contribution.**
**Base model:** `answerdotai/ModernBERT-large`
- 395M parameters
- 8,192-token native context (vs. 512 for DeBERTa-v3-large)
- RoPE + alternating local/global attention + FlashAttention
- 2-4x faster than DeBERTa-v3-large
- Apache 2.0 license
- GLUE: 90.4 (only 1 point behind DeBERTa-v3-large's 91.4)
**Step 1 — Domain-Adaptive Pre-Training (DAPT):**
Continue MLM pre-training on SEC filing text to create "SEC-ModernBERT-large":
- **Training corpus:** 200-500M tokens of SEC filings (from PleIAs/SEC or your own EDGAR download). Include 10-Ks, 10-Qs, 8-Ks, proxy statements.
- **MLM objective:** 30% masking rate (ModernBERT convention)
- **Learning rate:** ~5e-5 (much lower than from-scratch pre-training)
- **Hardware (RTX 3090):** bf16, gradient checkpointing, seq_len=1024-2048, batch_size=2-4 + gradient accumulation to effective batch 16-32
- **VRAM estimate:** ~12-15GB at batch=4, seq=2048 with gradient checkpointing — fits on 3090
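The DAPT hyperparameters above translate to roughly the following HuggingFace setup. This is a hedged configuration sketch, not a full training script: the dataset preparation is omitted, and the output path is illustrative.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-large")
model.gradient_checkpointing_enable()  # trades compute for VRAM on the 3090

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,  # ModernBERT's 30% masking convention
)

args = TrainingArguments(
    output_dir="checkpoints/dapt/modernbert-large",  # illustrative path
    learning_rate=5e-5,                 # far below from-scratch LR
    bf16=True,                          # Ampere-native, no loss scaling
    per_device_train_batch_size=4,      # at seq_len ~2048
    gradient_accumulation_steps=8,      # effective batch 32
    num_train_epochs=1,
    save_steps=1000,
    logging_steps=50,
)
```

Pass `args`, `collator`, the tokenized corpus, and `model` to a `Trainer` and call `train()`; monitor eval loss to decide how far into the 200-500M token budget to run.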
**Evidence DAPT works:**
- Gururangan et al. (2020): consistent improvements across all tested domains
- Patent domain ModernBERT (arXiv:2509.14926): +0.9 to +2.8 F1 from continued pre-training on 31.6B tokens
- Scaling-law analysis on SEC filings (arXiv:2512.12384): consistent improvement with largest gains in first 200M tokens
- Databricks customer report: 70% → 95% accuracy with domain-specific pre-training
**Step 2 — Classification Fine-Tuning:**
Fine-tune SEC-ModernBERT-large on the 50K labeled paragraphs:
- **Sequence length:** 2048 tokens (captures full regulatory paragraphs that 512-token models truncate)
- **Two classification heads:** content_category (7-class softmax) + specificity_level (4-class ordinal or softmax)
- **Add supervised contrastive loss (SCL):** Combine standard cross-entropy with SCL that pulls same-class embeddings together. Gunel et al. (2020) showed +0.5-1.5% improvement, especially for rare/imbalanced classes.
- **VRAM:** ~11-13GB at batch=8, seq=2048 in bf16 — comfortable on 3090
- **3090 supports bf16** natively via Ampere Tensor Cores. Use `bf16=True` in HuggingFace Trainer. No loss scaling needed (unlike fp16).
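The dual-head objective with SCL can be sketched as below, assuming a pooled `[CLS]` embedding from the encoder (1024-dim for ModernBERT-large). The class names, `scl_weight`, and temperature are illustrative rather than tuned values; in practice this logic would live in a subclassed model or an overridden `Trainer.compute_loss`.

```python
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(feats, labels, tau=0.1):
    """SCL (Gunel et al., 2020): pull same-class embeddings together."""
    feats = F.normalize(feats, dim=1)
    n = feats.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=feats.device)
    sim = (feats @ feats.T / tau).masked_fill(eye, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(eye, 0.0)      # self-pairs excluded
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    per_anchor = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor[pos.any(1)].mean()           # anchors with >=1 positive


class DualHead(torch.nn.Module):
    """Shared encoder embedding -> category head + specificity head."""

    def __init__(self, hidden=1024, n_cat=7, n_spec=4):
        super().__init__()
        self.cat_head = torch.nn.Linear(hidden, n_cat)
        self.spec_head = torch.nn.Linear(hidden, n_spec)

    def forward(self, cls_emb, cat_y, spec_y, scl_weight=0.1):
        cat_logits = self.cat_head(cls_emb)
        spec_logits = self.spec_head(cls_emb)
        loss = (F.cross_entropy(cat_logits, cat_y)
                + F.cross_entropy(spec_logits, spec_y)
                + scl_weight * supervised_contrastive_loss(cls_emb, cat_y))
        return loss, cat_logits, spec_logits
```

An ordinal treatment of the specificity head (e.g. cumulative-link or binary-decomposition losses) is a drop-in replacement for the second cross-entropy term if the softmax variant underperforms.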
### 5.2 Dark Horse: NeoBERT
`chandar-lab/NeoBERT`
- **250M parameters** (100M fewer than ModernBERT-large, 185M fewer than DeBERTa-v3-large)
- 4,096-token context
- SwiGLU, RoPE, Pre-RMSNorm, FlashAttention
- GLUE: 89.0 (close to DeBERTa-v3-large's 91.4)
- MTEB: 51.3 (crushes everything else — ModernBERT-large is 46.9)
- MIT license
- Requires `trust_remote_code=True`
- Almost nobody is using it for domain-specific tasks
Same DAPT + fine-tuning pipeline as ModernBERT-large, with even less VRAM.
### 5.3 Baseline: DeBERTa-v3-large
`microsoft/deberta-v3-large`
- 304M backbone + 131M embedding = ~435M total
- 512-token native context (can push to ~1024)
- Disentangled attention + ELECTRA-style RTD pre-training
- GLUE: **91.4** — still the highest among all encoders
- MIT license
- **Weakness:** no long context support, completely fails at retrieval tasks
Include as baseline to show improvement from (a) long context and (b) DAPT.
### 5.4 Ablation Design
| Experiment | Model | Context | DAPT | SCL | Purpose |
|-----------|-------|---------|------|-----|---------|
| Baseline | DeBERTa-v3-large | 512 | No | No | "Standard" approach per syllabus |
| + Long context | ModernBERT-large | 2048 | No | No | Shows context window benefit |
| + Domain adapt | SEC-ModernBERT-large | 2048 | Yes | No | Shows DAPT benefit |
| + Contrastive | SEC-ModernBERT-large | 2048 | Yes | Yes | Shows SCL benefit |
| Efficiency | NeoBERT (+ DAPT) | 2048 | Yes | Yes | 40% fewer params, comparable? |
| **Ensemble** | SEC-ModernBERT + DeBERTa | mixed | mixed | — | Maximum performance |
The ensemble averages logits from SEC-ModernBERT-large (long context, domain-adapted) and DeBERTa-v3-large (highest raw NLU). Because their attention mechanisms differ architecturally, their errors tend to be less correlated, which is what makes averaging effective.
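The ensembling step itself is simple. The sketch below averages class probabilities rather than raw logits, a common variant that avoids scale mismatch between independently trained models; the mixing weight `w` is an assumption to be tuned on a validation split.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def ensemble_predict(logits_a, logits_b, w=0.5):
    """Weighted average of two models' class probabilities.

    logits_a, logits_b: (n_paragraphs, n_classes) arrays from each model,
    computed on the same paragraphs (each model uses its own tokenizer).
    Returns (predicted labels, averaged probabilities).
    """
    probs = w * softmax(logits_a) + (1 - w) * softmax(logits_b)
    return probs.argmax(axis=-1), probs
```

Raw-logit averaging is the even simpler alternative; comparing both on the holdout set costs nothing once per-model logits are cached.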
### 5.5 Training Framework
- **Encoder fine-tuning:** HuggingFace `transformers` + `Trainer` with `AutoModelForSequenceClassification`
- **DAPT continued pre-training:** HuggingFace `transformers` with `DataCollatorForLanguageModeling`
- **SCL implementation:** Custom training loop or modify Trainer with dual loss
- **Few-shot prototyping:** `SetFit` (sentence-transformers based) for rapid baseline in <30 seconds
**Key reference:** Phil Schmid's ModernBERT fine-tuning tutorial: <https://www.philschmid.de/fine-tune-modern-bert-in-2025>
### 5.6 Domain-Specific Encoder Models (for comparison only)
These exist but are all BERT-base (110M params, 512 context) — architecturally outdated:
| Model | HuggingFace ID | Domain | Params |
|-------|---------------|--------|--------|
| SEC-BERT | `nlpaueb/sec-bert-base` | 260K 10-K filings | 110M |
| SEC-BERT-SHAPE | `nlpaueb/sec-bert-shape` | Same, with number normalization | 110M |
| FinBERT | `ProsusAI/finbert` | Financial sentiment | 110M |
| Legal-BERT | `nlpaueb/legal-bert-base-uncased` | 12GB legal text | 110M |
| SecureBERT | arXiv:2204.02685 | Cybersecurity text | 110M |
Our DAPT approach on a modern architecture (ModernBERT-large or NeoBERT) should outperform all of these, given the context-length and pre-training advantages. Include SEC-BERT as an additional baseline if time permits.
---
## 6. Evaluation & Validation
### 6.1 Required Metrics (from syllabus)
| Metric | Target | Notes |
|--------|--------|-------|
| **Macro-F1** on human holdout | Report per-class and overall | Minimum 1.2K holdout examples |
| **Per-class F1** | Identify weak categories | Expect "None/Other" to be noisiest |
| **Krippendorff's Alpha** | > 0.67 (adequate), > 0.75 (good) | GenAI labels vs. human gold set |
| **Calibration plots** | Reliability diagrams | For probabilistic outputs (softmax) |
| **Robustness splits** | Report by time period, industry, filing size | FY2023 vs FY2024; GICS sector; word count quartiles |
### 6.2 Downstream Validity Tests
These demonstrate that the classifier's predictions correlate with real-world outcomes:
**Test 1 — Breach Prediction (strongest):**
- Do firms with lower specificity scores subsequently appear in breach databases?
- Cross-reference with:
- **Privacy Rights Clearinghouse** (80K+ breaches; Mendeley dataset provides ticker/CIK matching: `doi.org/10.17632/w33nhh3282.1`)
- **VCDB** (8K+ incidents, VERIS schema: `github.com/vz-risk/VCDB`)
- **Board Cybersecurity Incident Tracker** (direct SEC filing links: `board-cybersecurity.com/incidents/tracker`)
- **CISA KEV Catalog** (known exploited vulnerabilities: `cisa.gov/known-exploited-vulnerabilities-catalog`)
**Test 2 — Market Reaction (if time permits):**
- Event study: abnormal returns in [-1, +3] window around 8-K Item 1.05 filing
- Does prior Item 1C disclosure quality predict magnitude of reaction?
- Small sample (~55 incidents) but high signal
- Regression: CAR = f(specificity_score, incident_severity, firm_size, industry)
**Test 3 — Known-Groups Validity (easy, always include):**
- Do regulated industries (financial services under NYDFS, healthcare under HIPAA) produce systematically higher-specificity disclosures?
- Do larger firms (by market cap) have more specific disclosures?
- These are expected results — confirming them validates the measure
**Test 4 — Boilerplate Index (easy, always include):**
- Compute cosine similarity of each company's Item 1C to the industry-median disclosure
- Does our specificity score inversely correlate with this similarity measure?
- This is an independent, construct-free validation of the "uniqueness" dimension
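A self-contained bag-of-words sketch of the boilerplate index follows; it compares a firm's disclosure against its industry peers directly (mean pairwise similarity) rather than against a precomputed industry-median document, and a TF-IDF representation (e.g. sklearn's `TfidfVectorizer`) would be the more discriminative choice in practice. Function names are illustrative.

```python
import math
import re
from collections import Counter


def _vec(text: str) -> Counter:
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two disclosures."""
    va, vb = _vec(a), _vec(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def boilerplate_index(firm_text: str, peer_texts: list[str]) -> float:
    """Mean similarity of a firm's Item 1C to its industry peers'.

    Higher = more boilerplate-like; the validation test expects a negative
    correlation between this index and the model's specificity score.
    """
    return sum(cosine_similarity(firm_text, p) for p in peer_texts) / len(peer_texts)
```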
### 6.3 External Benchmark
Per syllabus: "include an external benchmark approach (i.e., previous best practice)."
- **Board Cybersecurity's 23-feature regex extraction** is the natural benchmark. Their binary (present/absent) feature coding is the prior best practice. Our classifier should capture everything their regex captures plus the quality/specificity dimension they cannot measure.
- **Florackis et al. (2023) cybersecurity risk measure** from Item 1A text is another comparison — different section (1A vs 1C), different methodology (dictionary vs. classifier), different era (pre-rule vs. post-rule).
---
## 7. Release Artifacts
By project end, publish:
1. **HuggingFace Dataset:** Extracted Item 1C paragraphs with labels — first public dataset of its kind
2. **SEC-ModernBERT-large:** Domain-adapted model weights — first SEC-specific ModernBERT
3. **Fine-tuned classifiers:** Content category + specificity models, ready to deploy
4. **Labeling rubric + prompt templates:** Reusable for future SEC disclosure research
5. **Extraction pipeline code:** EDGAR → structured paragraphs → labeled dataset
6. **Evaluation notebook:** All metrics, ablations, validation tests
---
## 8. 3-Week Schedule (6 People)
### Team Roles
| Role | Person(s) | Primary Responsibility |
|------|-----------|----------------------|
| **Data Lead** | Person A | EDGAR extraction pipeline, paragraph segmentation, data cleaning |
| **Data Support** | Person B | 8-K extraction, breach database cross-referencing, dataset QA |
| **Labeling Lead** | Person C | Rubric refinement, GenAI prompt engineering, MMC pipeline orchestration |
| **Annotation** | Person D | Gold set human labeling, inter-rater reliability, active learning review |
| **Model Lead** | Person E | DAPT pre-training, classification fine-tuning, ablation experiments |
| **Eval & Writing** | Person F | Validation tests, metrics computation, final presentation, documentation |
### Week 1: Data + Rubric
| Day | Person A (Data Lead) | Person B (Data Support) | Person C (Labeling Lead) | Person D (Annotation) | Person E (Model Lead) | Person F (Eval & Writing) |
|-----|---------------------|------------------------|-------------------------|----------------------|----------------------|--------------------------|
| **Mon** | Set up EDGAR extraction pipeline (edgar-crawler + sec-edgar-downloader) | Set up 8-K extraction (sec-8k-item105) | Draft labeling rubric v1 from SEC rule | Read SEC rule + Gibson Dunn survey | Download ModernBERT-large, set up training env | Outline evaluation plan, identify breach databases |
| **Tue** | Begin bulk 10-K download (FY2023 cycle) | Extract all 8-K cyber filings (Items 1.05, 8.01, 7.01) | Pilot rubric on 30 paragraphs with Claude Opus | Pilot rubric on same 30 paragraphs independently | Download PleIAs/SEC corpus, prepare DAPT data | Download PRC Mendeley dataset, VCDB, set up cross-ref |
| **Wed** | Continue download (FY2024 cycle), begin Item 1C parsing | Build company metadata table (CIK → ticker → GICS sector → market cap) | Compare pilot labels with Person D, revise rubric boundary rules | Compute initial inter-rater agreement, flag problem areas | Begin DAPT pre-training (SEC-ModernBERT-large, ~2-3 days on 3090) | Map VCDB incidents to SEC filers by name matching |
| **Thu** | Paragraph segmentation pipeline, quality checks | Merge 8-K incidents with Board Cybersecurity Tracker data | Rubric v2 finalized; set up batch API calls for dual annotation | Begin gold set sampling (300-500 paragraphs, stratified) | DAPT continues (monitor loss, checkpoint) | Draft presentation outline |
| **Fri** | **Milestone: Full paragraph corpus ready (~50K+ paragraphs)** | **Milestone: 8-K incident dataset complete** | Launch Stage 1 dual annotation (Sonnet + Gemini Flash) on full corpus | Continue gold set labeling (target: finish 150/300) | DAPT continues | **Milestone: Evaluation framework + breach cross-ref ready** |
### Week 2: Labeling + Training
| Day | Person A | Person B | Person C | Person D | Person E | Person F |
|-----|----------|----------|----------|----------|----------|----------|
| **Mon** | Data cleaning — fix extraction errors, handle edge cases | Assist Person D with gold set labeling (second annotator) | Monitor dual annotation results (should be ~60% complete) | Continue gold set labeling, begin second pass | DAPT finishes; begin DeBERTa-v3-large baseline fine-tuning | Compute gold set inter-rater reliability (Kappa, Alpha) |
| **Tue** | Build train/holdout split logic (stratified by industry, year, specificity) | Continue gold set second-annotator pass | Dual annotation complete → extract disagreements (~17%) | Finish gold set, adjudicate disagreements with Person C | Baseline results in; begin ModernBERT-large (no DAPT) fine-tuning | Analyze gold set confusion patterns, recommend rubric tweaks |
| **Wed** | Final dataset assembly | Assist Person C with judge panel setup | Launch Stage 2 judge panel (Opus + GPT-5 + Gemini Pro) on disagreements | Run MMC pipeline on gold set, compare with human labels | ModernBERT-large done; begin SEC-ModernBERT-large fine-tuning | **Milestone: Gold set validated, Kappa computed** |
| **Thu** | Prepare HuggingFace dataset card | Begin active learning — cluster low-confidence cases | Judge panel results in; assemble final labeled dataset | Human-review ~500 low-confidence cases from active learning | SEC-ModernBERT-large done; begin NeoBERT experiment | Robustness split analysis (by industry, year, filing size) |
| **Fri** | **Milestone: Labeled dataset finalized (~50K paragraphs)** | **Milestone: Active learning pass complete** | QA final labels — spot-check 100 random samples | Assist Person E with evaluation | Begin ensemble experiment (SEC-ModernBERT + DeBERTa) | **Milestone: All baseline + ablation training complete** |
### Week 3: Evaluation + Presentation
| Day | Person A | Person B | Person C | Person D | Person E | Person F |
|-----|----------|----------|----------|----------|----------|----------|
| **Mon** | Publish dataset to HuggingFace | Run breach prediction validation (PRC + VCDB cross-ref) | Write labeling methodology section | Calibration plots for all models | Final ensemble tuning; publish model weights to HuggingFace | Compile all metrics into evaluation tables |
| **Tue** | Write data acquisition section | Run known-groups validity (industry, size effects) | Write GenAI labeling section | Boilerplate index validation (cosine similarity) | Write model strategy section | Draft full results section |
| **Wed** | Code cleanup, README for extraction pipeline | Market reaction analysis if feasible (optional) | Review/edit all written sections | Create figures: confusion matrices, calibration plots | Review/edit model section | Assemble presentation slides |
| **Thu** | **Full team: review presentation, rehearse, polish** | | | | | |
| **Fri** | **Presentation day** | | | | | |
### Critical Path & Dependencies
```
Week 1:
Data extraction (A,B) ──────────────────┐
Rubric design (C,D) ───→ Pilot test ───→ Rubric v2 ──→ GenAI labeling launch (Fri)
DAPT pre-training (E) ──────────────────────────────────→ (continues into Week 2)
Eval framework (F) ─────────────────────────────────────→ (ready for Week 2)
Week 2:
GenAI labeling (C) ───→ Judge panel ───→ Active learning ───→ Final labels (Fri)
Gold set (D + B) ──────────────────────→ Validated (Wed)
Fine-tuning experiments (E) ───→ Baseline → ModernBERT → SEC-ModernBERT → NeoBERT → Ensemble
Metrics (F) ───────────────────→ Robustness splits
Week 3:
Validation tests (B,D,F) ───→ Breach prediction, known-groups, boilerplate index
Writing (all) ──────────────→ Sections → Review → Presentation
Release (A,E) ──────────────→ HuggingFace dataset + model weights
```
---
## 9. Budget
| Item | Cost |
|------|------|
| GenAI labeling — Stage 1 dual annotation (50K × 2 models, batch) | ~$115 |
| GenAI labeling — Stage 2 judge panel (~8.5K × 3 models, batch) | ~$55 |
| Prompt caching savings | -$30 to -$40 |
| SEC EDGAR data | $0 (public domain) |
| Breach databases (PRC open data, VCDB, CISA KEV) | $0 |
| Compute (RTX 3090, already owned) | $0 |
| **Total** | **~$130-170** |
For comparison, human annotation at $0.50/label would cost $25,000+ for single-annotated, $75,000+ for triple-annotated.
---
## 10. Reference Links
### SEC Rule & Guidance
- [SEC Final Rule 33-11216 (PDF)](https://www.sec.gov/files/rules/final/2023/33-11216.pdf)
- [SEC Fact Sheet](https://www.sec.gov/files/33-11216-fact-sheet.pdf)
- [SEC Small Business Compliance Guide](https://www.sec.gov/resources-small-businesses/small-business-compliance-guides/cybersecurity-risk-management-strategy-governance-incident-disclosure)
- [CYD iXBRL Taxonomy Guide (PDF)](https://xbrl.sec.gov/cyd/2024/cyd-taxonomy-guide-2024-09-16.pdf)
### Law Firm Surveys & Analysis
- [Gibson Dunn S&P 100 Survey (Harvard Law Forum)](https://corpgov.law.harvard.edu/2025/01/09/cybersecurity-disclosure-overview-a-survey-of-form-10-k-cybersecurity-disclosures-by-sp-100-companies/)
- [PwC First Wave of 10-K Cyber Disclosures](https://www.pwc.com/us/en/services/consulting/cybersecurity-risk-regulatory/sec-final-cybersecurity-disclosure-rules/sec-10-k-cyber-disclosures.html)
- [Debevoise 8-K Lessons Learned](https://www.debevoisedatablog.com/2024/03/06/cybersecurity-form-8-k-tracker/)
- [Greenberg Traurig 2025 Trends Update](https://www.gtlaw.com/en/insights/2025/2/sec-cybersecurity-disclosure-trends-2025-update-on-corporate-reporting-practices)
- [Known Trends: First Year of 8-K Filings](https://www.knowntrends.com/2025/02/snapshot-the-first-year-of-cybersecurity-incident-filings-on-form-8-k-since-adoption-of-new-rules/)
- [NYU: Lessons Learned from 8-K Reporting](https://wp.nyu.edu/compliance_enforcement/2025/03/25/lessons-learned-one-year-of-form-8-k-material-cybersecurity-incident-reporting/)
### Data Extraction Tools
- [edgar-crawler (GitHub)](https://github.com/lefterisloukas/edgar-crawler)
- [edgartools (GitHub)](https://github.com/dgunning/edgartools)
- [sec-edgar-downloader (PyPI)](https://pypi.org/project/sec-edgar-downloader/)
- [sec-8k-item105 (GitHub)](https://github.com/JMousqueton/sec-8k-item105)
- [SECurityTr8Ker (GitHub)](https://github.com/pancak3lullz/SECurityTr8Ker)
- [SEC EDGAR APIs](https://www.sec.gov/search-filings/edgar-application-programming-interfaces)
- [SEC EDGAR Full-Text Search](https://efts.sec.gov/LATEST/search-index)
### Datasets
- [PleIAs/SEC — 373K 10-K texts (HuggingFace, CC0)](https://huggingface.co/datasets/PleIAs/SEC)
- [EDGAR-CORPUS — 220K filings, sections parsed (HuggingFace, Apache 2.0)](https://huggingface.co/datasets/eloukas/edgar-corpus)
- [Board Cybersecurity 23-Feature Analysis](https://www.board-cybersecurity.com/research/insights/risk-frameworks-security-standards-in-10k-item-1c-cybersecurity-disclosures-through-2024-06-30/)
- [Board Cybersecurity Incident Tracker](https://www.board-cybersecurity.com/incidents/tracker)
- [PRC Mendeley Breach Dataset (with tickers)](http://dx.doi.org/10.17632/w33nhh3282.1)
- [VCDB (GitHub)](https://github.com/vz-risk/VCDB)
- [CISA KEV Catalog](https://www.cisa.gov/known-exploited-vulnerabilities-catalog)
- [zeroshot/cybersecurity-corpus (HuggingFace)](https://huggingface.co/datasets/zeroshot/cybersecurity-corpus)
### Models
- [ModernBERT-large (HuggingFace, Apache 2.0)](https://huggingface.co/answerdotai/ModernBERT-large)
- [ModernBERT-base (HuggingFace, Apache 2.0)](https://huggingface.co/answerdotai/ModernBERT-base)
- [NeoBERT (HuggingFace, MIT)](https://huggingface.co/chandar-lab/NeoBERT)
- [DeBERTa-v3-large (HuggingFace, MIT)](https://huggingface.co/microsoft/deberta-v3-large)
- [SEC-BERT (HuggingFace)](https://huggingface.co/nlpaueb/sec-bert-base)
- [ProsusAI FinBERT (HuggingFace)](https://huggingface.co/ProsusAI/finbert)
- [EvasionBench Eva-4B-V2 (HuggingFace)](https://huggingface.co/FutureMa/Eva-4B-V2)
### Key Papers
- Ringel (2023), "Creating Synthetic Experts with Generative AI" — [SSRN:4542949](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4542949)
- Ludwig et al. (2026), "Extracting Consumer Insight from Text" — [arXiv:2602.15312](https://arxiv.org/abs/2602.15312)
- Ma et al. (2026), "EvasionBench" — [arXiv:2601.09142](https://arxiv.org/abs/2601.09142)
- Florackis et al. (2023), "Cybersecurity Risk" — [SSRN:3725130](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3725130)
- Gururangan et al. (2020), "Don't Stop Pretraining" — [arXiv:2004.10964](https://arxiv.org/abs/2004.10964)
- ModernBERT paper — [arXiv:2412.13663](https://arxiv.org/abs/2412.13663)
- NeoBERT paper — [arXiv:2502.19587](https://arxiv.org/abs/2502.19587)
- ModernBERT vs DeBERTa-v3 comparison — [arXiv:2504.08716](https://arxiv.org/abs/2504.08716)
- Patent domain ModernBERT DAPT — [arXiv:2509.14926](https://arxiv.org/abs/2509.14926)
- SEC filing scaling laws for continued pre-training — [arXiv:2512.12384](https://arxiv.org/abs/2512.12384)
- Gunel et al. (2020), Supervised Contrastive Learning for fine-tuning — [OpenReview](https://openreview.net/forum?id=cu7IUiOhujH)
- Phil Schmid, "Fine-tune classifier with ModernBERT in 2025" — [philschmid.de](https://www.philschmid.de/fine-tune-modern-bert-in-2025)
- Berkman et al. (2018), Cybersecurity disclosure quality scoring
- Li, No, and Boritz (2023), BERT-based classification of cybersecurity disclosures
- Scalable 10-K Analysis with LLMs — [arXiv:2409.17581](https://arxiv.org/abs/2409.17581)
- SecureBERT — [arXiv:2204.02685](https://arxiv.org/abs/2204.02685)
- Gilardi et al. (2023), "ChatGPT Outperforms Crowd-Workers" (PNAS) — [arXiv:2303.15056](https://arxiv.org/abs/2303.15056)
- Pangakis et al. (2023), "Automated Annotation Requires Validation" — [arXiv:2306.00176](https://arxiv.org/abs/2306.00176)
### Methodological Playbook
- [Ringel 2026 Capstone Pipeline Example (ZIP)](http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.zip)
- [Class 21 Exemplary Presentation (PDF)](http://www.ringel.ai/UNC/2026/BUSI488/Class21/Ringel_488-2026_Class21.pdf)

# CAPSTONE: Construct of Interest and Data Sign-off by Dr. D.
**Due:** April 7 by 11:59pm | **Points:** 0 | **Submission:** File upload
---
## Overview
Your capstone project requires you to classify a construct of interest in data at scale.
You must get approval for your construct of interest and data from your instructor.
This is an ungraded assignment. However, it is the prerequisite to your capstone project.
## Goal
Pick a well-documented, theoretically founded construct of interest. Explain why a firm would want to classify this construct at scale.
### Your Construct Must Be:
- **Business-relevant** — addresses a real business decision
- **Theoretically grounded** — anchored in established literature
- **Well documented** — clearly defined in academic or industry sources
- **Observable in text** — detectable in your data source
- **Definable with clear rules** — specific enough for reliable labeling
- **Complex & nuanced** — more than just sentiment (not positive/negative)
You must pick one of the seven provided constructs of interest from the [capstone constructs list](https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf).
## Define Your Construct Precisely
Turn the concept into labels humans can apply consistently.
### You Must Create:
- **Label set** (classes/categories)
- **Clear definitions and decision rules** for each label
- **Borderline cases** — guidance for unclear examples
- **None/Other policy** — required for multi-class (every text must receive exactly one label, so include a None/Other class); not needed for multi-label (a text can simply receive no labels)
- **2-3 example texts** per label (your own examples)
- **Decision:** multi-class (one label per item) vs multi-label (multiple labels can apply)
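The deliverables above can be captured in a small, machine-readable schema that doubles as a sanity check before labeling begins. The sketch below uses an invented placeholder construct and labels (it is not one of the seven provided capstone constructs) purely to illustrate the shape:

```python
# Hypothetical label schema for a multi-class construct of interest.
# The construct name, labels, rules, and examples are illustrative
# placeholders, NOT one of the seven provided capstone constructs.
LABEL_SCHEMA = {
    "construct": "cyber_disclosure_specificity",
    "task_type": "multi-class",  # exactly one label per text
    "labels": {
        "boilerplate": {
            "definition": "Generic risk language with no firm-specific detail.",
            "decision_rule": "No named systems, incidents, roles, or controls.",
            "examples": ["We face risks from cyber attacks like all companies."],
        },
        "specific": {
            "definition": "Names concrete controls, roles, or processes.",
            "decision_rule": "At least one firm-specific control or governance detail.",
            "examples": ["Our CISO reports quarterly to the audit committee."],
        },
        "none_other": {
            "definition": "Text is not about the construct at all.",
            "decision_rule": "Apply when neither label above fits.",
            "examples": ["Revenue grew 4% year over year."],
        },
    },
}


def validate_schema(schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the schema is usable."""
    problems = []
    # Multi-class tasks force a label onto every text, so a catch-all is required.
    if schema["task_type"] == "multi-class" and "none_other" not in schema["labels"]:
        problems.append("multi-class schema needs a None/Other label")
    for name, spec in schema["labels"].items():
        for field in ("definition", "decision_rule", "examples"):
            if not spec.get(field):
                problems.append(f"label '{name}' is missing '{field}'")
    return problems
```

Writing the schema down this way forces the decision rules and borderline guidance to exist before any annotation starts, and the same dict can later seed a labeling prompt.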
### Consider Your Data Source
Before finalizing, ask yourself:
- Do these data, when classified, inform and improve a business decision?
- Can the construct of interest (all its labels/classes) be sufficiently found in these data?
- Are these data abundantly available and do they need to be analyzed frequently and/or at scale to justify building a vertical AI?
## Important Guardrails
- **Choose a data source** that fits your construct and is realistically useful to a firm
- **Use public data or properly de-identified data only** — no sensitive internal company data
- **Pilot test first:** Before committing, do a quick manual pilot on 100–200 texts in the developer platform playground or ChatGPT to confirm your construct appears in the source and that your labels are workable
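The pilot can also be scripted. A minimal sketch that assembles a zero-shot classification prompt from your label definitions is shown below; the schema contents are illustrative placeholders, and the commented-out API call (client library, model name) is an assumption about your setup, not a prescribed configuration:

```python
def build_pilot_prompt(schema: dict, text: str) -> str:
    """Assemble a zero-shot classification prompt from a label schema."""
    lines = [
        "Classify the text into exactly one label for the construct "
        f"'{schema['construct']}'.",
        "",
    ]
    for name, spec in schema["labels"].items():
        lines.append(f"- {name}: {spec['definition']} Rule: {spec['decision_rule']}")
    lines += ["", "Answer with the label name only.", "", f"Text: {text}"]
    return "\n".join(lines)


# Tiny illustrative schema (placeholder construct and labels):
schema = {
    "construct": "cyber_disclosure_specificity",
    "labels": {
        "boilerplate": {
            "definition": "Generic risk language.",
            "decision_rule": "No firm-specific detail.",
        },
        "specific": {
            "definition": "Names concrete controls or roles.",
            "decision_rule": "At least one firm-specific detail.",
        },
    },
}
prompt = build_pilot_prompt(schema, "Our CISO reports quarterly to the board.")

# To run the pilot against an LLM API (requires an API key; the client and
# model name below are assumptions about your environment):
# from openai import OpenAI
# reply = OpenAI().chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": prompt}],
# ).choices[0].message.content
```

Running this over 100–200 sampled texts and eyeballing the replies tells you quickly whether the labels are workable or the decision rules need tightening.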
## Deliverable for This Assignment
**Two-page maximum, double-spaced documentation** containing:
1. **Definition** of your construct of interest and its labels/classes
2. **Sources & Citations** that support your construct of interest (and its classes/labels), demonstrating that it is:
- Theoretically founded
- Well-established in literature
- Meaningful to decision makers
3. **Data Description** explaining:
- What data you will identify it in
- How you will acquire these data
- Why identifying your construct at scale/frequently in these data is valuable (justifies the need for a vertical AI)