
Capstone: Build a Business-Grade Text Classifier

Due: April 23 by 12pm | Points: 35


Team Assignment: Build a Business-Grade Text Classifier

Team size: 5-6 students; at least half must be enrolled in COMP488 or BUSI488.

One-sentence summary: Your team will build an end-to-end system that turns raw text (reviews, filings, reports, speeches, etc.) into reliable business labels, and you'll compare two approaches: genAI labeling vs. a fine-tuned specialist model.

Form Your Team

People > Groups > CAPSTONE : People

  • Each team member must join an EXISTING project group for THEIR SECTION
  • DO NOT create your own (additional) project groups.
  • Each group must satisfy the following:
    • Have at least two students enrolled in COMP488
    • Have at least two students enrolled in BUSI488
  • A team may include up to three students enrolled in COMP488 (or up to three enrolled in BUSI488), conditional on:
    • All other teams having at least two of each (see above).
    • If another team still has fewer than two COMP488 or two BUSI488 students, you cannot join a team that already meets this minimum requirement.
  • ALL TEAM MEMBERS MUST BE IN THE SAME SECTION (1 vs. 2)

Why This Assignment Matters (Business Reality)

Businesses Are Drowning in Text

  • Customer reviews, support chats, emails
  • Social posts, news, press releases
  • 10-K/10-Q filings, earnings calls, ESG reports
  • Policies, regulatory communications, safety notices
  • Research abstracts, patents, job postings

Text Classification Enables Business Value

Text classification is one of the most common, high-ROI uses of AI because it converts messy language into structured signals that businesses can:

  • Count — How many "billing issues" this week?
  • Trend — Are "delivery failures" rising?
  • Segment — Which regions complain about what?
  • Route — Send messages/tickets to the right team automatically
  • Flag risk — Safety/adverse events, compliance issues, reputational threats
  • Inform strategy — Competitor moves, pricing tactics, CX weaknesses
  • Input into (predictive) models — As independent variables

Why Not Manual Reading?

Firms don't just "read it manually" because volume is too high, humans are slow and inconsistent across people/time, and decisions often need speed (minutes/hours, not weeks). Leadership needs dashboards and measurable KPIs.

This will feel like a stretch—and that's the point. You will be supported by structure, templates, and teamwork.

What You Will Produce (End Result)

A working classifier for a well-documented, established, theoretically grounded construct (your choice), plus a business-style memo explaining your approach, results, and recommendation.

Choose a Construct of Interest: Meaningful and Labelable

Your construct should be:

  • Business-relevant — addresses a real business decision
  • Theoretically grounded — anchored in established literature
  • Well documented — clearly defined in academic or industry sources
  • Observable in text — detectable in your data source
  • Definable with clear rules — specific enough for reliable labeling
  • Complex & nuanced — more than just sentiment (not positive/negative)

You must pick one of the seven provided constructs of interest from here: https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf

Potential Public Data Sources

Below are places to obtain public text data for your classifier (downloads and/or official APIs). For each of the seven constructs of interest, there are already some data sources suggested. Below are additional sources to consider. Use sources with clear educational/research access terms whenever possible. Or use your own source and data (but observe guardrails & ethics as outlined below).

Important: Choose a source that fits your construct and is realistically useful to a firm.

Customer Reviews & Product Voice

Amazon Reviews (research datasets)

  • McAuley Lab Amazon datasets
  • Stanford SNAP Amazon dataset page
  • Hugging Face: Amazon Reviews 2023

Yelp Open Dataset

  • Kaggle: Yelp Open Dataset

Corporate Filings & Investor Communications

SEC EDGAR (10-K / 10-Q / 8-K etc.) — Official APIs

  • EDGAR Application Programming Interfaces
  • SEC Developer Resources
  • SEC Disclosure Data API announcement
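As a quick illustration of API access, here is a minimal sketch of building the documented data.sec.gov "submissions" endpoint URL. The zero-padded 10-digit CIK format and the descriptive User-Agent requirement come from the SEC developer pages; the contact string below is a placeholder you must replace with your own:

```python
# Sketch: construct the documented EDGAR "submissions" endpoint URL.
# The CIK must be zero-padded to 10 digits; SEC asks for a descriptive
# User-Agent header (with contact info) on all automated requests.

def edgar_submissions_url(cik: int) -> str:
    """Return the data.sec.gov submissions URL for a given CIK."""
    return f"https://data.sec.gov/submissions/CIK{cik:010d}.json"

# Placeholder contact string -- replace with your team's real contact.
HEADERS = {"User-Agent": "YourTeam capstone@example.edu"}

if __name__ == "__main__":
    # Apple's CIK is 320193; fetching the filing index would look like:
    #   requests.get(edgar_submissions_url(320193), headers=HEADERS).json()
    print(edgar_submissions_url(320193))
```

Keep request rates modest and follow the access policies stated in the SEC developer resources above.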

Regulation, Enforcement, Safety & Compliance

FDA Warning Letters

  • FDA: Warning Letters
  • Data.gov: Warning Letters dataset

U.S. Consumer Product Safety Commission (CPSC) Recalls — API

  • CPSC Recalls API information

Health & Public Policy (Public)

ClinicalTrials.gov (Modernized API)

  • ClinicalTrials.gov Data API

PubMed / NCBI (E-utilities API)

  • NCBI E-utilities documentation

Research & Innovation

arXiv (bulk data)

  • arXiv bulk data help
  • arXiv bulk data on AWS S3

Patents (USPTO + PatentsView)

  • USPTO Open Data Portal: bulk data
  • PatentsView: bulk downloads

Government Documents & Speeches

govinfo API + documentation (Congressional docs, Federal Register, etc.)

  • govinfo API overview
  • govinfo API docs

Congress.gov API

  • Congress.gov API (GPO)

U.S. Presidential speeches (Miller Center data portal)

  • Miller Center: Presidential Speech Data

News / Media Monitoring

GDELT (global news monitoring)

  • GDELT data downloads
  • GDELT 2.0 API documentation (blog)

Common Crawl News dataset (CC-NEWS)

  • Common Crawl: News dataset announcement

Security / Technical Risk

NIST National Vulnerability Database (NVD) — CVE APIs & feeds

  • NVD: Vulnerabilities API
  • NVD: Data feeds

Earnings Calls / Transcripts (public option)

Academic/open dataset option

  • STRUX dataset page

Tips & Rules

  • Tip: Before committing, do a quick pilot on 100-200 texts to confirm your construct appears in the source and that your labels are workable
  • Rule: Use public data or properly de-identified data only. No sensitive internal company data.

The Workflow: What You Must Do (Step-by-Step)

Step 1 — Construct Choice + Business Case

Goal: Pick a well-documented, theoretically founded construct of interest. Explain why a firm would want to classify this construct at scale. https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf

Include:

  • Who is the stakeholder? (CX lead, compliance, product manager, investor relations, etc.)
  • What decision/action will classification enable?
  • What happens if the classifier is wrong? (false positives vs false negatives)
  • Why now? (volume, speed, competitive need)

Deliverable: 1-2 page concept brief

Step 2 — Define the Construct Precisely

Goal: Turn the concept into labels humans can apply consistently.

You must create:

  • Label set (classes)
  • Clear definitions and decision rules
  • Borderline cases: what to do when unclear
  • "None/other" policy IF APPLICABLE (multi-class yes, multi-label no)
  • 2-3 example texts per label (your own)

Decide: multi-class (one label) vs multi-label (multiple labels can apply).

Deliverable: Labeling codebook (PDF)

Step 3 — Identify and Justify the Data Source

Goal: Show that your source actually contains your construct.

Include:

  • Why this source fits the business purpose
  • How a firm would use it regularly (weekly monitoring, quarterly reporting, etc.)
  • Basic access plan (download/API/scrape—follow site rules)

Deliverable: Data source plan + sampling approach

Step 4 — Collect Your Dataset

Targets:

  • 20,000 texts total for train/test
  • 1,200 texts for a locked holdout set

Deliverable: Raw dataset file + collection notebook/script + documentation

Step 5 — Preprocess and Split

You must:

  • Clean text (remove duplicates, empty, obvious spam)
  • Create: train/test totaling 20,000; holdout = 1,200 (lock it in early—don't tune on it)
    • Beware of label imbalance! May need to curate strategically!
  • Report length stats and basic summaries
  • Plan for imbalance (it's normal—just don't ignore it)

Deliverable: Preprocessing notebook + split files + basic descriptive stats
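The cleaning and splitting steps above can be sketched in a few lines of standard-library Python. The function name and the exact duplicate rule are illustrative assumptions, but the fixed seed and the early, locked holdout mirror the requirements:

```python
import random

def clean_and_split(texts, n_holdout=1200, seed=488):
    """Deduplicate, drop empties, then carve off a locked holdout set.

    Targets (20,000 train/test; 1,200 holdout) follow the assignment;
    the case-insensitive exact-duplicate rule is a simple starting point.
    """
    seen, cleaned = set(), []
    for t in texts:
        t = t.strip()
        if t and t.lower() not in seen:   # drop empties and exact duplicates
            seen.add(t.lower())
            cleaned.append(t)
    rng = random.Random(seed)             # fixed seed -> reproducible split
    rng.shuffle(cleaned)
    holdout = cleaned[:n_holdout]         # lock this set; never tune on it
    traintest = cleaned[n_holdout:]
    return traintest, holdout
```

A simple random split does not guarantee label balance; if your pilot shows severe imbalance, curate strategically before locking the holdout.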

Step 6 — Human Label the Holdout

Requirements:

  • The 1,200 holdout must be labeled by humans
  • At least 3 people must label each example
  • Either independent labels + agreement report (Krippendorff's alpha recommended), or discussion-based consensus + documented process

Deliverable: Holdout with 3+ labels per item (or consensus + notes) + reliability report
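If you take the independent-labels route, Krippendorff's alpha for nominal labels can be computed directly. The `krippendorff` PyPI package does this for you; the standard-library sketch below just makes the formula concrete and handles items with a variable number of labelers:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of label lists, one per text; items labeled by
    fewer than two people carry no agreement information and are skipped.
    """
    coincidences = Counter()   # o_ck: coincidence-matrix entries
    counts = Counter()         # n_c: pairable values per category
    n = 0                      # total pairable values
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        n += m
        for i, j in permutations(range(m), 2):  # ordered pairs within a unit
            coincidences[(labels[i], labels[j])] += 1 / (m - 1)
        counts.update(labels)
    observed = sum(v for (c, k), v in coincidences.items() if c != k)
    expected = sum(counts[c] * counts[k]
                   for c in counts for k in counts if c != k)
    if expected == 0:
        return 1.0  # only one category ever used: trivially "perfect"
    return 1.0 - (n - 1) * observed / expected
```

Perfect agreement yields alpha = 1.0; values above roughly 0.8 are commonly treated as reliable, and values near 0 indicate agreement no better than chance.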

Step 7 — Benchmark GenAI Labeling (≥6 Models)

Goal: Treat genAI as alternative "labelers" and compare.

Requirements:

  • Run at least 6 models from at least 3 different suppliers (e.g., OpenAI, Anthropic, Meta, xAI, Google, FireworksAI, Deepseek, Moonshot)
  • Fix prompts to make them comparable across models
  • Track runtime, estimated cost, and reproducibility strategy

Metrics to report:

  • Krippendorff's alpha (vs human labels)
  • F1 (macro and per-class if possible)
  • AUC (where applicable)
  • MCC

Deliverable: Benchmark notebook + results table + cost/time log
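Macro F1 and MCC are both available in scikit-learn (`f1_score(..., average='macro')` and `matthews_corrcoef`); the standard-library sketch below simply spells out the definitions so you can sanity-check your benchmark table (binary MCC shown for simplicity):

```python
import math

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0)
    return sum(f1s) / len(f1s)

def mcc_binary(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Macro averaging weights every class equally, which is why it (together with MCC) is the right headline number under class imbalance.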

Step 8 — Select GenAI Labeling Strategy and Label Train/Test

Choose the best single model or a combination (e.g., majority vote). Then label your 20,000 train/test set.

Deliverable: Labeled train/test dataset + labeling script + total cost/time summary
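A majority-vote combination can be as simple as the sketch below; the tie-breaking rule (falling back to your best single model) is one reasonable choice among several, not a requirement:

```python
from collections import Counter

def majority_label(model_labels, tie_breaker=None):
    """Combine labels from several genAI models by majority vote.

    `model_labels` maps model name -> predicted label for one text.
    Ties fall back to `tie_breaker` (e.g., your best single model's label).
    """
    votes = Counter(model_labels.values())
    top = votes.most_common()
    if len(top) > 1 and top[0][1] == top[1][1] and tie_breaker is not None:
        return tie_breaker  # tied vote: defer to the designated model
    return top[0][0]
```

Keep every individual model's labels in your artifacts (see the deliverables checklist) so the vote can be audited and re-run.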

Step 9 — Fine-Tune a Specialist Model

Goal: Create a smaller, task-specific classifier that can match/exceed genAI labeling quality.

Minimum requirements:

  • Fine-tune at least one pretrained model (RoBERTa or similar)
  • Try at least four configurations (model choice or hyperparameters)
  • Evaluate on holdout (only at the end)

Deliverable: Training notebook + saved model + evaluation on holdout
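One simple way to satisfy the four-configuration minimum is a small grid over base model and learning rate. The model names and values below are illustrative assumptions, not prescribed choices:

```python
from itertools import product

# Hypothetical sweep: 2 base models x 2 learning rates = 4 configurations.
# Model names and hyperparameter values are examples only.
BASE_MODELS = ["roberta-base", "distilroberta-base"]
LEARNING_RATES = [2e-5, 5e-5]

configs = [
    {"model_name": m, "learning_rate": lr, "epochs": 3, "batch_size": 16}
    for m, lr in product(BASE_MODELS, LEARNING_RATES)
]
```

Pick the winning configuration on a validation slice of the train/test data; the holdout is touched exactly once, at the very end.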

Step 10 — Final Comparison: GenAI vs Specialist

Your final analysis must answer:

  • Which is better on holdout and by how much?
  • Which is cheaper per 1,000,000 texts?
  • Which is faster?
  • Which is more consistent and reproducible?
  • What would you recommend a firm deploy, and why?

Deliverable: Final comparison section + recommendation
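The cost and speed questions reduce to simple extrapolations from your logged totals; a minimal sketch (function names are my own):

```python
def cost_per_million(total_cost_usd, n_texts):
    """Extrapolate observed labeling cost to a per-1,000,000-text figure."""
    return total_cost_usd / n_texts * 1_000_000

def texts_per_hour(n_texts, elapsed_seconds):
    """Throughput implied by your logged wall-clock time."""
    return n_texts / elapsed_seconds * 3600
```

For example, if labeling your 20,000 train/test texts cost $5 in API fees, the implied cost is $250 per million texts; run the same arithmetic for the fine-tuned model's inference compute to make the comparison fair.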


What To Submit (Deliverables Checklist)

A) Executive Memo (Executive Style, Max 5 Pages)

  • Construct + why it matters + theoretically grounded and well documented
  • Data source + governance/ethics
  • Label schema overview
  • Results summary: best genAI vs best specialist
  • Cost/time/reproducibility comparison
  • Recommendation for a real firm

B) Technical Appendix (Slides or PDF)

  • Pipeline diagram (data → labels → model → evaluation)
  • Label codebook (or link/appendix)
  • Benchmark table (≥6 genAI models)
  • Fine-tuning experiments + results
  • Error analysis: where does it fail and why?

C) Code + Artifacts (Datasets)

  • Colab notebooks (reproducible)
  • Datasets:
    • Holdout with human labels
    • Train/test with genAI labels
    • Any other data used
    • All labels of all models for each run + majority labels
  • Saved fine-tuned model + inference script (provide link to webspace/Google Drive/Dropbox; do not upload to Canvas!)
  • Cost/time log

D) IGNITE Talk in Class

  • 20 PPTX Slides, 15 sec per slide (automatic transitions), 5 min total
  • Submit your PPTX slides (not PDF, not Google Slides, not other format!) set to auto transitions after 15 sec on Canvas
  • I will have all slides ready to go for you in class
  • Key messages and insights only
  • Every team member must present at least 2 slides
  • There will be a 3 min Q&A after each IGNITE talk

Tips for Success

  • Start simple, then improve: Your first codebook will be imperfect. Iterate.
  • Design labels for reliability: If humans disagree a lot, the model will struggle. Fix definitions before scaling.
  • Lock the holdout early: It counts for 35% of your grade. Don't change direction repeatedly; fix it early and make sure it is representative of what you ultimately want to achieve from a business perspective.
  • Do a small pilot before spending money: Test prompts and schema on 100-200 examples first.
  • Don't hide class imbalance: Imbalance is normal. But it can damage evaluation and training. You must solve this! Use macro F1 and MCC, and document prevalence.
Suggested Team Roles

  • Project lead: Scope, project plan, milestones, quality control
  • Data lead: Collection, cleaning, dataset documentation
  • Labeling lead: Codebook, human labeling workflow, reliability stats
  • GenAI lead: API scripts, benchmarking, cost/time tracking
  • Modeling lead: Fine-tuning, hyperparameters, reproducibility, evaluation
  • Delivery lead: Assemble all outputs, organize write-ups, final quality control

Grading Rubric (100 Points)

| Criterion | Points |
| --- | --- |
| Business framing & construct clarity | 20 |
| Data pipeline quality + documentation | 15 |
| Human labeling process + reliability | 15 |
| GenAI benchmarking rigor | 20 |
| Fine-tuning rigor + evaluation discipline | 20 |
| Final comparison + recommendation quality | 10 |

Minimum Requirements Per Letter Grade:

C (C- to C+)

  • Fine-tuned model with F1 score > 0.80
  • Performance comparison genAI vs. fine-tuned model
  • Labeled datasets
  • Documentation
  • Python notebook(s) to replicate pre-processing, training, and evaluation

B (B- to B+)

All of the above plus at least three of the following:

  • Cost, time, reproducibility analysis
  • Comparison of 6 or more models from at least 3 different suppliers
  • Contemporary data that you collected (not an off-the-shelf dataset)
  • Compelling use-case for your classifier with complete business case

A (A- to A)

All of the above plus at least three of the following:

  • Error-analysis (corner cases, rare or complex texts)
  • Mitigation strategy and implementation to overcome identified model weaknesses
  • Additional baselines (e.g., alternative classification approaches like dictionaries, topic models)
  • Comparison to amateur labels

Guardrails (Ethics & Compliance)

  • Public data or approved de-identified data only
  • Remove or avoid sensitive personal information
  • Document limitations and potential bias
  • If your construct is sensitive (health, safety, harassment), include a brief risk statement and mitigation steps

Estimated Effort

| Task | Hours |
| --- | --- |
| Construct + codebook v1 + data source plan | 4 |
| Data collection + preprocessing + splits | 4 |
| 1.2K human labeling + reliability + codebook v2 | 8 |
| GenAI benchmarking (≥6 models) + choose labeling strategy | 3 |
| Label 20k + fine-tune specialist (2+ configs) | 2 |
| Final evaluation + memo + presentation | 3 |

Total: ~24 hours per student

I will select (and improve) the best 1-2 classifiers to be featured in my Vertical AI Paper, where each team member will be acknowledged for their application of the synthetic expert/specialist approach. I will ask members whether they want to be named; choosing to remain anonymous will not impact your grade.

Getting Started

To help you with your Capstone, I wrote a full pipeline in a Python notebook that covers all the key steps of your Capstone Project, using the example of classifying 10K sentences into business functions. This includes querying genAI via API at scale, creating holdout and training datasets, fine-tuning a pretrained LLM, and evaluating the performance of genAI and your fine-tuned (vertical AI) model.

What it does not do is give you a construct of interest, collect your data, clean and preprocess your data, or draw conclusions and write reports for you.

Python notebook: http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.ipynb

Zip file with outputs: http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.zip (excludes the actual trained vertical AI because it is 1.5GB)

All subfolders and datasets are included. This is a great blueprint for what data you need to deliver with your capstone on a shared drive (provide link to me) or uploaded if sufficiently small (less than 20MB).

The contents of the zip file also show what the expected output looks like (using a multi-label classification problem as the example). You will need to adapt this code to your problem; use genAI (e.g., Claude Opus 4.6) to help. The pipeline gives you a solid base to work from.