
Capstone: Build a Business-Grade Text Classifier

Due: April 23 by 12pm | Points: 35


Team Assignment: Build a Business-Grade Text Classifier

Team size: 5-6 students; at least half must be enrolled in COMP488 or BUSI488.

One-sentence summary: Your team will build an end-to-end system that turns raw text (reviews, filings, reports, speeches, etc.) into reliable business labels, and you'll compare two approaches: genAI labeling vs. a fine-tuned specialist model.

Form Your Team

People > Groups > CAPSTONE : People

  • Each team member must join an EXISTING project group for THEIR SECTION
  • DO NOT create your own (additional) project groups.
  • Each group must satisfy the following:
    • Have at least two students enrolled in COMP488
    • Have at least two students enrolled in BUSI488
  • A team may include up to three students enrolled in COMP488 (or up to three enrolled in BUSI488), conditional on:
    • All other teams having at least two of each (see above).
    • If another team still has fewer than two COMP488 or two BUSI488 students, you cannot join a team that already meets this minimum requirement.
  • ALL TEAM MEMBERS MUST BE IN THE SAME SECTION (1 vs. 2)

Why This Assignment Matters (Business Reality)

Businesses Are Drowning in Text

  • Customer reviews, support chats, emails
  • Social posts, news, press releases
  • 10-K/10-Q filings, earnings calls, ESG reports
  • Policies, regulatory communications, safety notices
  • Research abstracts, patents, job postings

Text Classification Enables Business Value

Text classification is one of the most common, high-ROI uses of AI because it converts messy language into structured signals that businesses can:

  • Count — How many "billing issues" this week?
  • Trend — Are "delivery failures" rising?
  • Segment — Which regions complain about what?
  • Route — Send messages/tickets to the right team automatically
  • Flag risk — Safety/adverse events, compliance issues, reputational threats
  • Inform strategy — Competitor moves, pricing tactics, CX weaknesses
  • Input into (predictive) models — As independent variables

Why Not Manual Reading?

Firms don't just "read it manually" because volume is too high, humans are slow and inconsistent across people/time, and decisions often need speed (minutes/hours, not weeks). Leadership needs dashboards and measurable KPIs.

This will feel like a stretch—and that's the point. You will be supported by structure, templates, and teamwork.

What You Will Produce (End Result)

A working classifier for a well-documented, established, theoretically grounded construct (your choice), plus a business-style memo explaining your approach, results, and recommendation.

Choose a Construct of Interest: Meaningful and Labelable

Your construct should be:

  • Business-relevant — addresses a real business decision
  • Theoretically grounded — anchored in established literature
  • Well documented — clearly defined in academic or industry sources
  • Observable in text — detectable in your data source
  • Definable with clear rules — specific enough for reliable labeling
  • Complex & nuanced — more than just sentiment (not positive/negative)

You must pick one of the seven provided constructs of interest from here: https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf

Potential Public Data Sources

Below are places to obtain public text data for your classifier (downloads and/or official APIs). For each of the seven constructs of interest, there are already some data sources suggested. Below are additional sources to consider. Use sources with clear educational/research access terms whenever possible. Or use your own source and data (but observe guardrails & ethics as outlined below).

Important: Choose a source that fits your construct and is realistically useful to a firm.

Customer Reviews & Product Voice

Amazon Reviews (research datasets)

  • McAuley Lab Amazon datasets
  • Stanford SNAP Amazon dataset page
  • Hugging Face: Amazon Reviews 2023

Yelp Open Dataset

  • Kaggle: Yelp Open Dataset

Corporate Filings & Investor Communications

SEC EDGAR (10-K / 10-Q / 8-K etc.) — Official APIs

  • EDGAR Application Programming Interfaces
  • SEC Developer Resources
  • SEC Disclosure Data API announcement
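As a quick illustration of API access, here is a minimal sketch of building the documented data.sec.gov "submissions" endpoint URL. The zero-padded 10-digit CIK format and the descriptive User-Agent requirement come from the SEC developer pages; the contact string below is a placeholder you must replace with your own:

```python
# Sketch: construct the documented EDGAR "submissions" endpoint URL.
# The CIK must be zero-padded to 10 digits; SEC asks for a descriptive
# User-Agent header (with contact info) on all automated requests.

def edgar_submissions_url(cik: int) -> str:
    """Return the data.sec.gov submissions URL for a given CIK."""
    return f"https://data.sec.gov/submissions/CIK{cik:010d}.json"

# Placeholder contact string -- replace with your team's real contact.
HEADERS = {"User-Agent": "YourTeam capstone@example.edu"}

if __name__ == "__main__":
    # Apple's CIK is 320193; fetching the filing index would look like:
    #   requests.get(edgar_submissions_url(320193), headers=HEADERS).json()
    print(edgar_submissions_url(320193))
```

Keep request rates modest and follow the access policies stated in the SEC developer resources above.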

Regulation, Enforcement, Safety & Compliance

FDA Warning Letters

  • FDA: Warning Letters
  • Data.gov: Warning Letters dataset

U.S. Consumer Product Safety Commission (CPSC) Recalls — API

  • CPSC Recalls API information

Health & Public Policy (Public)

ClinicalTrials.gov (Modernized API)

  • ClinicalTrials.gov Data API

PubMed / NCBI (E-utilities API)

  • NCBI E-utilities documentation

Research & Innovation

arXiv (bulk data)

  • arXiv bulk data help
  • arXiv bulk data on AWS S3

Patents (USPTO + PatentsView)

  • USPTO Open Data Portal: bulk data
  • PatentsView: bulk downloads

Government Documents & Speeches

govinfo API + documentation (Congressional docs, Federal Register, etc.)

  • govinfo API overview
  • govinfo API docs

Congress.gov API

  • Congress.gov API (GPO)

U.S. Presidential speeches (Miller Center data portal)

  • Miller Center: Presidential Speech Data

News / Media Monitoring

GDELT (global news monitoring)

  • GDELT data downloads
  • GDELT 2.0 API documentation (blog)

Common Crawl News dataset (CC-NEWS)

  • Common Crawl: News dataset announcement

Security / Technical Risk

NIST National Vulnerability Database (NVD) — CVE APIs & feeds

  • NVD: Vulnerabilities API
  • NVD: Data feeds

Earnings Calls / Transcripts (public option)

Academic/open dataset option

  • STRUX dataset page

Tips & Rules

  • Tip: Before committing, do a quick pilot on 100-200 texts to confirm your construct appears in the source and that your labels are workable
  • Rule: Use public data or properly de-identified data only. No sensitive internal company data.

The Workflow: What You Must Do (Step-by-Step)

Step 1 — Construct Choice + Business Case

Goal: Pick a well-documented, theoretically founded construct of interest. Explain why a firm would want to classify this construct at scale. https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf

Include:

  • Who is the stakeholder? (CX lead, compliance, product manager, investor relations, etc.)
  • What decision/action will classification enable?
  • What happens if the classifier is wrong? (false positives vs false negatives)
  • Why now? (volume, speed, competitive need)

Deliverable: 1-2 page concept brief

Step 2 — Define the Construct Precisely

Goal: Turn the concept into labels humans can apply consistently.

You must create:

  • Label set (classes)
  • Clear definitions and decision rules
  • Borderline cases: what to do when unclear
  • "None/other" policy IF APPLICABLE (multi-class yes, multi-label no)
  • 2-3 example texts per label (your own)

Decide: multi-class (one label) vs multi-label (multiple labels can apply).

Deliverable: Labeling codebook (PDF)

Step 3 — Identify and Justify the Data Source

Goal: Show that your source actually contains your construct.

Include:

  • Why this source fits the business purpose
  • How a firm would use it regularly (weekly monitoring, quarterly reporting, etc.)
  • Basic access plan (download/API/scrape—follow site rules)

Deliverable: Data source plan + sampling approach

Step 4 — Collect Your Dataset

Targets:

  • 20,000 texts total for train/test
  • 1,200 texts for a locked holdout set

Deliverable: Raw dataset file + collection notebook/script + documentation

Step 5 — Preprocess and Split

You must:

  • Clean text (remove duplicates, empty, obvious spam)
  • Create: train/test totaling 20,000; holdout = 1,200 (lock it in early—don't tune on it)
    • Beware of label imbalance! May need to curate strategically!
  • Report length stats and basic summaries
  • Plan for imbalance (it's normal—just don't ignore it)

Deliverable: Preprocessing notebook + split files + basic descriptive stats
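The cleaning and splitting steps above can be sketched in a few lines of standard-library Python. The function name and the exact duplicate rule are illustrative assumptions, but the fixed seed and the early, locked holdout mirror the requirements:

```python
import random

def clean_and_split(texts, n_holdout=1200, seed=488):
    """Deduplicate, drop empties, then carve off a locked holdout set.

    Targets (20,000 train/test; 1,200 holdout) follow the assignment;
    the case-insensitive exact-duplicate rule is a simple starting point.
    """
    seen, cleaned = set(), []
    for t in texts:
        t = t.strip()
        if t and t.lower() not in seen:   # drop empties and exact duplicates
            seen.add(t.lower())
            cleaned.append(t)
    rng = random.Random(seed)             # fixed seed -> reproducible split
    rng.shuffle(cleaned)
    holdout = cleaned[:n_holdout]         # lock this set; never tune on it
    traintest = cleaned[n_holdout:]
    return traintest, holdout
```

A simple random split does not guarantee label balance; if your pilot shows severe imbalance, curate strategically before locking the holdout.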

Step 6 — Human Label the Holdout

Requirements:

  • The 1,200 holdout must be labeled by humans
  • At least 3 people must label each example
  • Either independent labels + agreement report (Krippendorff's alpha recommended), or discussion-based consensus + documented process

Deliverable: Holdout with 3+ labels per item (or consensus + notes) + reliability report
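If you take the independent-labels route, Krippendorff's alpha for nominal labels can be computed directly. The `krippendorff` PyPI package does this for you; the standard-library sketch below just makes the formula concrete and handles items with a variable number of labelers:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of label lists, one per text; items labeled by
    fewer than two people carry no agreement information and are skipped.
    """
    coincidences = Counter()   # o_ck: coincidence-matrix entries
    counts = Counter()         # n_c: pairable values per category
    n = 0                      # total pairable values
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        n += m
        for i, j in permutations(range(m), 2):  # ordered pairs within a unit
            coincidences[(labels[i], labels[j])] += 1 / (m - 1)
        counts.update(labels)
    observed = sum(v for (c, k), v in coincidences.items() if c != k)
    expected = sum(counts[c] * counts[k]
                   for c in counts for k in counts if c != k)
    if expected == 0:
        return 1.0  # only one category ever used: trivially "perfect"
    return 1.0 - (n - 1) * observed / expected
```

Perfect agreement yields alpha = 1.0; values above roughly 0.8 are commonly treated as reliable, and values near 0 indicate agreement no better than chance.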

Step 7 — Benchmark GenAI Labeling (≥6 Models)

Goal: Treat genAI as alternative "labelers" and compare.

Requirements:

  • Run at least 6 models from at least 3 different suppliers (e.g., OpenAI, Anthropic, Meta, xAI, Google, FireworksAI, Deepseek, Moonshot)
  • Fix prompts to make them comparable across models
  • Track runtime, estimated cost, and reproducibility strategy

Metrics to report:

  • Krippendorff's alpha (vs human labels)
  • F1 (macro and per-class if possible)
  • AUC (where applicable)
  • MCC

Deliverable: Benchmark notebook + results table + cost/time log
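Macro F1 and MCC are both available in scikit-learn (`f1_score(..., average='macro')` and `matthews_corrcoef`); the standard-library sketch below simply spells out the definitions so you can sanity-check your benchmark table (binary MCC shown for simplicity):

```python
import math

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0)
    return sum(f1s) / len(f1s)

def mcc_binary(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Macro averaging weights every class equally, which is why it (together with MCC) is the right headline number under class imbalance.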

Step 8 — Select GenAI Labeling Strategy and Label Train/Test

Choose the best single model or a combination (e.g., majority vote). Then label your 20,000 train/test set.

Deliverable: Labeled train/test dataset + labeling script + total cost/time summary
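A majority-vote combination can be as simple as the sketch below; the tie-breaking rule (falling back to your best single model) is one reasonable choice among several, not a requirement:

```python
from collections import Counter

def majority_label(model_labels, tie_breaker=None):
    """Combine labels from several genAI models by majority vote.

    `model_labels` maps model name -> predicted label for one text.
    Ties fall back to `tie_breaker` (e.g., your best single model's label).
    """
    votes = Counter(model_labels.values())
    top = votes.most_common()
    if len(top) > 1 and top[0][1] == top[1][1] and tie_breaker is not None:
        return tie_breaker  # tied vote: defer to the designated model
    return top[0][0]
```

Keep every individual model's labels in your artifacts (see the deliverables checklist) so the vote can be audited and re-run.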

Step 9 — Fine-Tune a Specialist Model

Goal: Create a smaller, task-specific classifier that can match/exceed genAI labeling quality.

Minimum requirements:

  • Fine-tune at least one pretrained model (RoBERTa or similar)
  • Try at least four configurations (model choice or hyperparameters)
  • Evaluate on holdout (only at the end)

Deliverable: Training notebook + saved model + evaluation on holdout
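One simple way to satisfy the four-configuration minimum is a small grid over base model and learning rate. The model names and values below are illustrative assumptions, not prescribed choices:

```python
from itertools import product

# Hypothetical sweep: 2 base models x 2 learning rates = 4 configurations.
# Model names and hyperparameter values are examples only.
BASE_MODELS = ["roberta-base", "distilroberta-base"]
LEARNING_RATES = [2e-5, 5e-5]

configs = [
    {"model_name": m, "learning_rate": lr, "epochs": 3, "batch_size": 16}
    for m, lr in product(BASE_MODELS, LEARNING_RATES)
]
```

Pick the winning configuration on a validation slice of the train/test data; the holdout is touched exactly once, at the very end.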

Step 10 — Final Comparison: GenAI vs Specialist

Your final analysis must answer:

  • Which is better on holdout and by how much?
  • Which is cheaper per 1,000,000 texts?
  • Which is faster?
  • Which is more consistent and reproducible?
  • What would you recommend a firm deploy, and why?

Deliverable: Final comparison section + recommendation
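The cost and speed questions reduce to simple extrapolations from your logged totals; a minimal sketch (function names are my own):

```python
def cost_per_million(total_cost_usd, n_texts):
    """Extrapolate observed labeling cost to a per-1,000,000-text figure."""
    return total_cost_usd / n_texts * 1_000_000

def texts_per_hour(n_texts, elapsed_seconds):
    """Throughput implied by your logged wall-clock time."""
    return n_texts / elapsed_seconds * 3600
```

For example, if labeling your 20,000 train/test texts cost $5 in API fees, the implied cost is $250 per million texts; run the same arithmetic for the fine-tuned model's inference compute to make the comparison fair.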


What To Submit (Deliverables Checklist)

A) Executive Memo (Executive Style, Max 5 Pages)

  • Construct + why it matters + theoretically grounded and well documented
  • Data source + governance/ethics
  • Label schema overview
  • Results summary: best genAI vs best specialist
  • Cost/time/reproducibility comparison
  • Recommendation for a real firm

B) Technical Appendix (Slides or PDF)

  • Pipeline diagram (data → labels → model → evaluation)
  • Label codebook (or link/appendix)
  • Benchmark table (≥6 genAI models)
  • Fine-tuning experiments + results
  • Error analysis: where does it fail and why?

C) Code + Artifacts (Datasets)

  • Colab notebooks (reproducible)
  • Datasets:
    • Holdout with human labels
    • Train/test with genAI labels
    • Any other data used
    • All labels of all models for each run + majority labels
  • Saved fine-tuned model + inference script (provide link to webspace/Google Drive/Dropbox; do not upload to Canvas!)
  • Cost/time log

D) IGNITE Talk in Class

  • 20 PPTX Slides, 15 sec per slide (automatic transitions), 5 min total
  • Submit your PPTX slides (not PDF, not Google Slides, not other format!) set to auto transitions after 15 sec on Canvas
  • I will have all slides ready to go for you in class
  • Key messages and insights only
  • Every team member must present at least 2 slides
  • There will be a 3 min Q&A after each IGNITE talk

Tips for Success

  • Start simple, then improve: Your first codebook will be imperfect. Iterate.
  • Design labels for reliability: If humans disagree a lot, the model will struggle. Fix definitions before scaling.
  • Lock the holdout early: It counts for 35% of your grade. Don't change direction repeatedly; fix it early and make sure it is representative of what you ultimately want to achieve from a business perspective.
  • Do a small pilot before spending money: Test prompts and schema on 100-200 examples first.
  • Don't hide class imbalance: Imbalance is normal. But it can damage evaluation and training. You must solve this! Use macro F1 and MCC, and document prevalence.
Suggested Team Roles

  • Project lead: Scope, project plan, milestones, quality control
  • Data lead: Collection, cleaning, dataset documentation
  • Labeling lead: Codebook, human labeling workflow, reliability stats
  • GenAI lead: API scripts, benchmarking, cost/time tracking
  • Modeling lead: Fine-tuning, hyperparameters, reproducibility, evaluation
  • Delivery lead: Assemble all outputs, organize write-ups, final quality control

Grading Rubric (100 Points)

| Criterion | Points |
| --- | --- |
| Business framing & construct clarity | 20 |
| Data pipeline quality + documentation | 15 |
| Human labeling process + reliability | 15 |
| GenAI benchmarking rigor | 20 |
| Fine-tuning rigor + evaluation discipline | 20 |
| Final comparison + recommendation quality | 10 |

Minimum Requirements Per Letter Grade:

C (C- to C+)

  • Fine-tuned model with F1 score > 0.80
  • Performance comparison genAI vs. fine-tuned model
  • Labeled datasets
  • Documentation
  • Python notebook(s) to replicate pre-processing, training, and evaluation

B (B- to B+)

All of the above plus at least three of the following:

  • Cost, time, reproducibility analysis
  • Comparison of 6 or more models from at least 3 different suppliers
  • Contemporary data that you collected (not an off-the-shelf dataset)
  • Compelling use-case for your classifier with complete business case

A (A- to A)

All of the above plus at least three of the following:

  • Error-analysis (corner cases, rare or complex texts)
  • Mitigation strategy and implementation to overcome identified model weaknesses
  • Additional baselines (e.g., alternative classification approaches like dictionaries, topic models)
  • Comparison to amateur labels

Guardrails (Ethics & Compliance)

  • Public data or approved de-identified data only
  • Remove or avoid sensitive personal information
  • Document limitations and potential bias
  • If your construct is sensitive (health, safety, harassment), include a brief risk statement and mitigation steps

Estimated Effort

| Task | Hours |
| --- | --- |
| Construct + codebook v1 + data source plan | 4 |
| Data collection + preprocessing + splits | 4 |
| 1.2K human labeling + reliability + codebook v2 | 8 |
| GenAI benchmarking (≥6 models) + choose labeling strategy | 3 |
| Label 20k + fine-tune specialist (2+ configs) | 2 |
| Final evaluation + memo + presentation | 3 |

Total: ~24 hours per student

I will select (and improve) the best 1-2 classifiers to be featured in my Vertical AI Paper, where each team member will be acknowledged for their application of the synthetic expert/specialist approach. I will ask members whether they want to be named; choosing to remain anonymous will not impact your grade.

Getting Started

To help you with your Capstone, I wrote a full pipeline in a Python notebook that covers all the key steps of your Capstone Project, using the example of classifying 10K sentences into business functions. This includes querying genAI via API at scale, creating holdout and training datasets, fine-tuning a pretrained LLM, and evaluating the performance of genAI and your fine-tuned (vertical AI) model.

What it does not do is give you a construct of interest, collect your data, clean and preprocess your data, or draw conclusions and write reports for you.

Python notebook: http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.ipynb

Zip file with outputs: http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.zip (excludes the actual trained vertical AI because it is 1.5GB)

All subfolders and datasets are included. This is a great blueprint for what data you need to deliver with your capstone on a shared drive (provide link to me) or uploaded if sufficiently small (less than 20MB).

The contents of the zip file also show what the expected output looks like (using a multi-label classification problem as the example). You will need to adapt this code to your problem; use genAI (e.g., Claude Opus 4.6) to help. The pipeline gives you a solid base to work from.