Capstone: Build a Business-Grade Text Classifier
Due: April 23 by 12pm | Points: 35
Team size: 5–6 students; at least half must be enrolled in COMP488 or BUSI488.
One-sentence summary: Your team will build an end-to-end system that turns raw text (reviews, filings, reports, speeches, etc.) into reliable business labels—and you’ll compare two approaches: genAI labeling vs a fine-tuned specialist model.
Form Your Team
In Canvas: People > Groups > CAPSTONE
- Each team member must join an EXISTING project group for THEIR SECTION
- DO NOT create your own (additional) project groups.
- Each group must satisfy the following:
- Have at least two students enrolled in COMP488
- Have at least two students enrolled in BUSI488
- A team may have at most three students enrolled in COMP488 and at most three enrolled in BUSI488, conditional on:
- All other teams having at least two of each in their team (see above).
- If another team still has fewer than two COMP488 or two BUSI488 students, you may not join a team that already meets this minimum.
- ALL TEAM MEMBERS MUST BE IN THE SAME SECTION (1 vs. 2)
Why This Assignment Matters (Business Reality)
Businesses Are Drowning in Text
- Customer reviews, support chats, emails
- Social posts, news, press releases
- 10-K/10-Q filings, earnings calls, ESG reports
- Policies, regulatory communications, safety notices
- Research abstracts, patents, job postings
Text Classification Enables Business Value
Text classification is one of the most common, high-ROI uses of AI because it converts messy language into structured signals that businesses can:
- Count — How many "billing issues" this week?
- Trend — Are "delivery failures" rising?
- Segment — Which regions complain about what?
- Route — Send messages/tickets to the right team automatically
- Flag risk — Safety/adverse events, compliance issues, reputational threats
- Inform strategy — Competitor moves, pricing tactics, CX weaknesses
- Input into (predictive) models — As independent variables
Why Not Manual Reading?
Firms don't just "read it manually" because volume is too high, humans are slow and inconsistent across people/time, and decisions often need speed (minutes/hours, not weeks). Leadership needs dashboards and measurable KPIs.
This will feel like a stretch—and that's the point. You will be supported by structure, templates, and teamwork.
What You Will Produce (End Result)
A working classifier for a well-documented, established, theoretically grounded construct (your choice), plus a business-style memo explaining:
- Which construct of interest you selected from the seven options: https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf
- How it was established and where it is theoretically anchored and motivated
- Why the business should care
- How well your classifier works
- What it costs (time + money)
- Whether it's reliable and reproducible
Choose a Construct of Interest: Meaningful and Labelable
Your construct should be:
- Business-relevant — addresses a real business decision
- Theoretically grounded — anchored in established literature
- Well documented — clearly defined in academic or industry sources
- Observable in text — detectable in your data source
- Definable with clear rules — specific enough for reliable labeling
- Complex & nuanced — more than just sentiment (not positive/negative)
You must pick one of the seven provided constructs of interest from here: https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf
Potential Public Data Sources
Below are places to obtain public text data for your classifier (downloads and/or official APIs). Some data sources are already suggested for each of the seven constructs of interest; the sources listed here are additional options to consider. Use sources with clear educational/research access terms whenever possible, or use your own source and data (but observe the guardrails & ethics outlined below).
Important: Choose a source that fits your construct and is realistically useful to a firm.
Customer Reviews & Product Voice
Amazon Reviews (research datasets)
- McAuley Lab Amazon datasets
- Stanford SNAP Amazon dataset page
- Hugging Face: Amazon Reviews 2023
Yelp Open Dataset
- Kaggle: Yelp Open Dataset
Corporate Filings & Investor Communications
SEC EDGAR (10-K / 10-Q / 8-K etc.) — Official APIs
- EDGAR Application Programming Interfaces
- SEC Developer Resources
- SEC Disclosure Data API announcement
Regulation, Enforcement, Safety & Compliance
FDA Warning Letters
- FDA: Warning Letters
- Data.gov: Warning Letters dataset
U.S. Consumer Product Safety Commission (CPSC) Recalls — API
- CPSC Recalls API information
Health & Public Policy (Public)
ClinicalTrials.gov (Modernized API)
- ClinicalTrials.gov Data API
PubMed / NCBI (E-utilities API)
- NCBI E-utilities documentation
Research & Innovation
arXiv (bulk data)
- arXiv bulk data help
- arXiv bulk data on AWS S3
Patents (USPTO + PatentsView)
- USPTO Open Data Portal: bulk data
- PatentsView: bulk downloads
Government Documents & Speeches
govinfo API + documentation (Congressional docs, Federal Register, etc.)
- govinfo API overview
- govinfo API docs
Congress.gov API
- Congress.gov API (GPO)
U.S. Presidential speeches (Miller Center data portal)
- Miller Center: Presidential Speech Data
News / Media Monitoring
GDELT (global news monitoring)
- GDELT data downloads
- GDELT 2.0 API documentation (blog)
Common Crawl News dataset (CC-NEWS)
- Common Crawl: News dataset announcement
Security / Technical Risk
NIST National Vulnerability Database (NVD) — CVE APIs & feeds
- NVD: Vulnerabilities API
- NVD: Data feeds
Earnings Calls / Transcripts (public option)
Academic/open dataset option
- STRUX dataset page
Tips & Rules
- Tip: Before committing, do a quick pilot on 100–200 texts to confirm your construct appears in the source and that your labels are workable
- Rule: Use public data or properly de-identified data only. No sensitive internal company data.
The Workflow: What You Must Do (Step-by-Step)
Step 1 — Construct Choice + Business Case
Goal: Pick a well-documented, theoretically founded construct of interest. Explain why a firm would want to classify this construct at scale. https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf
Include:
- Who is the stakeholder? (CX lead, compliance, product manager, investor relations, etc.)
- What decision/action will classification enable?
- What happens if the classifier is wrong? (false positives vs false negatives)
- Why now? (volume, speed, competitive need)
Deliverable: 1–2 page concept brief
Step 2 — Define the Construct Precisely
Goal: Turn the concept into labels humans can apply consistently.
You must create:
- Label set (classes)
- Clear definitions and decision rules
- Borderline cases: what to do when unclear
- "None/other" policy IF APPLICABLE (multi-class yes, multi-label no)
- 2-3 example texts per label (your own)
Decide: multi-class (one label) vs multi-label (multiple labels can apply).
Deliverable: Labeling codebook (PDF)
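To make the codebook concrete before you write the PDF, here is a minimal sketch of one as a data structure. The construct, labels, decision rules, and examples are all hypothetical placeholders (this is NOT one of the seven constructs); replace every field with your own.

```python
# Hypothetical codebook skeleton; every label, definition, rule, and example
# below is a placeholder for your own construct.
CODEBOOK = {
    "construct": "supply-chain risk disclosure (hypothetical)",
    "task_type": "multi-class",  # exactly one label per text
    "labels": {
        "explicit_risk": {
            "definition": "Names a concrete supply-chain risk and its cause.",
            "decision_rule": "Both a risk AND its cause must be stated.",
            "examples": ["Port congestion in Q3 delayed 40% of our shipments."],
        },
        "vague_risk": {
            "definition": "Alludes to supply-chain trouble without specifics.",
            "decision_rule": "Risk mentioned, but no cause or magnitude given.",
            "examples": ["Our supply chain may face headwinds next year."],
        },
        "none": {
            "definition": "No supply-chain risk content.",
            "decision_rule": "Default when neither rule above applies.",
            "examples": ["Revenue grew 12% year over year."],
        },
    },
    "borderline_policy": "Choose the more specific label; log the case for codebook v2.",
}
```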
Step 3 — Identify and Justify the Data Source
Goal: Show that your source actually contains your construct.
Include:
- Why this source fits the business purpose
- How a firm would use it regularly (weekly monitoring, quarterly reporting, etc.)
- Basic access plan (download/API/scrape—follow site rules)
Deliverable: Data source plan + sampling approach
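If you go the API route, a small pilot pull can double as your access-plan evidence. Here is a minimal sketch against one public API from the source list above (the NIST NVD CVE API); the endpoint and field names follow the NVD 2.0 documentation as I recall it, so verify them, and check rate limits, before scaling up.

```python
# Minimal pilot pull from the NIST NVD CVE API (one of the public sources
# listed above). Endpoint and JSON layout should be verified against the
# current NVD 2.0 docs; unauthenticated requests are rate-limited.
import requests

URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"

resp = requests.get(URL, params={"resultsPerPage": 100}, timeout=30)
resp.raise_for_status()
payload = resp.json()

texts = []
for item in payload.get("vulnerabilities", []):
    cve = item["cve"]
    # Keep the English description as the raw text for the pilot.
    desc = next((d["value"] for d in cve.get("descriptions", [])
                 if d.get("lang") == "en"), None)
    if desc:
        texts.append({"id": cve["id"], "text": desc})

print(f"Collected {len(texts)} pilot texts")
```

For a full collection you would paginate via the API's start-index parameter and cache raw responses so the run is reproducible.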
Step 4 — Collect Your Dataset
Targets:
- 20,000 texts total for train/test
- 1,200 texts for a locked holdout set
Deliverable: Raw dataset file + collection notebook/script + documentation
Step 5 — Preprocess and Split
You must:
- Clean text (remove duplicates, empty, obvious spam)
- Create: train/test totaling 20,000; holdout = 1,200 (lock it in early—don't tune on it)
- Beware of label imbalance: it's normal, but don't ignore it; you may need to curate strategically
- Report length stats and basic summaries
Deliverable: Preprocessing notebook + split files + basic descriptive stats
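A minimal sketch of this step, assuming your collected texts sit in a CSV with a "text" column and that you collected at least 21,200 usable texts (file names and column names are placeholders):

```python
# Minimal clean-and-split sketch; add spam filters specific to your source.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("raw_texts.csv")  # placeholder path

# Clean: drop empty texts and exact duplicates.
df["text"] = df["text"].astype(str).str.strip()
df = df[df["text"].str.len() > 0].drop_duplicates(subset="text")

# Carve out the 1,200-text holdout FIRST; never touch it again until Step 10.
rest, holdout = train_test_split(df, test_size=1200, random_state=42)

# Train/test totaling 20,000 (assumes len(rest) >= 20,000).
train, test = train_test_split(rest.sample(n=20_000, random_state=42),
                               test_size=0.2, random_state=42)

# Basic descriptives (word counts) for the deliverable.
print(train["text"].str.split().str.len().describe())

holdout.to_csv("holdout_LOCKED.csv", index=False)
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
```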
Step 6 — Human Label the Holdout
Requirements:
- The 1,200 holdout must be labeled by humans
- At least 3 people must label each example
- Either independent labels + agreement report (Krippendorff's alpha recommended), or discussion-based consensus + documented process
Deliverable: Holdout with 3+ labels per item (or consensus + notes) + reliability report
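For the reliability report, the `krippendorff` package (pip install krippendorff) computes alpha directly. A minimal sketch, assuming one row per holdout item and one column per human coder (file and column names are placeholders):

```python
# Minimal reliability sketch: Krippendorff's alpha over 3 human coders.
import numpy as np
import pandas as pd
import krippendorff

labels = pd.read_csv("holdout_human_labels.csv")  # columns: coder_1..coder_3
coders = ["coder_1", "coder_2", "coder_3"]

# krippendorff.alpha expects a coders-by-units matrix of numeric codes,
# with np.nan marking any missing ratings.
uniq = [v for v in pd.unique(labels[coders].values.ravel()) if pd.notna(v)]
codes = {lab: i for i, lab in enumerate(uniq)}
matrix = np.array([[codes[v] if pd.notna(v) else np.nan for v in labels[c]]
                   for c in coders], dtype=float)

alpha = krippendorff.alpha(reliability_data=matrix,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.3f}")  # >= 0.80 is a common target
```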
Step 7 — Benchmark GenAI Labeling (≥6 Models)
Goal: Treat genAI as alternative "labelers" and compare.
Requirements:
- Run at least 6 models from at least 3 different suppliers (e.g., OpenAI, Anthropic, Meta, xAI, Google, Fireworks AI, DeepSeek, Moonshot)
- Fix prompts to make them comparable across models
- Track runtime, estimated cost, and reproducibility strategy (a minimal prompt-and-call sketch follows this list)
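To make the fixed-prompt requirement concrete, here is a minimal sketch using the official OpenAI Python SDK; the model name and label set are placeholders (the labels echo the hypothetical Step 2 codebook). Several suppliers on the list above also expose OpenAI-compatible endpoints, so the same code can often be re-pointed via `base_url`; check each provider's docs.

```python
# Minimal fixed-prompt labeling call; send the SAME prompt to every model.
import time
from openai import OpenAI

PROMPT = """You are labeling texts for the construct <YOUR CONSTRUCT>.
Allowed labels: explicit_risk, vague_risk, none.
Reply with exactly one label and nothing else.

Text: {text}"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_text(text: str, model: str = "gpt-4o-mini") -> tuple[str, float]:
    """Return (label, seconds) so runtime can be logged per call."""
    start = time.time()
    resp = client.chat.completions.create(
        model=model,  # placeholder model name
        temperature=0,  # deterministic-ish settings aid reproducibility
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    return resp.choices[0].message.content.strip(), time.time() - start

label, seconds = label_text("Port congestion delayed 40% of our shipments.")
print(label, f"{seconds:.2f}s")
```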
Metrics to report:
- Krippendorff's alpha (vs human labels)
- F1 (macro and per-class if possible)
- AUC (where applicable)
- MCC
Deliverable: Benchmark notebook + results table + cost/time log
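A minimal scoring sketch for the results table, assuming you have one model's labels next to the human labels (file and column names are placeholders). AUC is omitted here because it requires class probabilities, which plain label output does not give you:

```python
# Minimal metrics sketch: one model's labels vs. human holdout labels.
import pandas as pd
from sklearn.metrics import f1_score, matthews_corrcoef
import krippendorff

df = pd.read_csv("holdout_scored.csv")  # columns: human_label, model_label
y_true, y_pred = df["human_label"], df["model_label"]

# Encode string labels as integers for Krippendorff's alpha.
codes = {lab: i for i, lab in enumerate(sorted(set(y_true) | set(y_pred)))}
matrix = [[codes[v] for v in y_true], [codes[v] for v in y_pred]]

print("alpha       :", krippendorff.alpha(reliability_data=matrix,
                                          level_of_measurement="nominal"))
print("macro F1    :", f1_score(y_true, y_pred, average="macro"))
print("per-class F1:", f1_score(y_true, y_pred, average=None))
print("MCC         :", matthews_corrcoef(y_true, y_pred))
```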
Step 8 — Select GenAI Labeling Strategy and Label Train/Test
Choose the best single model or a combination (e.g., majority vote). Then label your 20,000 train/test set.
Deliverable: Labeled train/test dataset + labeling script + total cost/time summary
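A minimal majority-vote sketch across three chosen models (column names are placeholders). The tie-break policy, deferring to your best single model from Step 7, is one defensible choice among several; document whichever you pick:

```python
# Minimal majority-vote labeling across model outputs.
import pandas as pd
from collections import Counter

df = pd.read_csv("train_model_labels.csv")  # one label column per model
model_cols = ["model_a", "model_b", "model_c"]
BEST_SINGLE = "model_a"  # placeholder: your top model from Step 7

def majority(row) -> str:
    counts = Counter(row[m] for m in model_cols)
    label, n = counts.most_common(1)[0]
    # Tie-break: defer to the best single model from the benchmark.
    return label if n > 1 else row[BEST_SINGLE]

df["label"] = df.apply(majority, axis=1)
df.to_csv("train_labeled.csv", index=False)
```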
Step 9 — Fine-Tune a Specialist Model
Goal: Create a smaller, task-specific classifier that can match/exceed genAI labeling quality.
Minimum requirements:
- Fine-tune at least one pretrained model (RoBERTa or similar)
- Try at least four configurations (model choice or hyperparameters)
- Evaluate on holdout (only at the end)
Deliverable: Training notebook + saved model + evaluation on holdout
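A minimal fine-tuning sketch using Hugging Face transformers and datasets. Model choice, hyperparameters, and file names are placeholders standing in for one of your four configurations:

```python
# Minimal RoBERTa fine-tuning sketch (one configuration of your four).
import pandas as pd
from datasets import Dataset
from sklearn.metrics import f1_score, matthews_corrcoef
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

train_df = pd.read_csv("train_labeled.csv")
label_names = sorted(train_df["label"].unique())
label2id = {lab: i for i, lab in enumerate(label_names)}
id2label = {i: lab for lab, i in label2id.items()}
train_df["labels"] = train_df["label"].map(label2id)

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(label_names),
    id2label=id2label, label2id=label2id)

def encode(batch):
    return tok(batch["text"], truncation=True, padding="max_length",
               max_length=256)

ds = Dataset.from_pandas(train_df[["text", "labels"]].reset_index(drop=True))
ds = ds.map(encode, batched=True)
ds = ds.train_test_split(test_size=0.1, seed=42)  # dev split; holdout stays locked

def compute_metrics(eval_pred):
    preds = eval_pred.predictions.argmax(-1)
    return {"macro_f1": f1_score(eval_pred.label_ids, preds, average="macro"),
            "mcc": matthews_corrcoef(eval_pred.label_ids, preds)}

args = TrainingArguments(output_dir="ft_run", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=ds["train"],
                  eval_dataset=ds["test"], compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())

trainer.save_model("specialist_model")
tok.save_pretrained("specialist_model")
```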
Step 10 — Final Comparison: GenAI vs Specialist
Your final analysis must answer:
- Which is better on holdout and by how much?
- Which is cheaper per 1,000,000 texts?
- Which is faster?
- Which is more consistent and reproducible?
- What would you recommend a firm deploy, and why?
Deliverable: Final comparison section + recommendation
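For the cost question, simple arithmetic over your logged numbers is enough. In this sketch every number is an illustrative placeholder; substitute your measured token counts, prices, and throughput:

```python
# Back-of-envelope cost comparison per 1,000,000 texts.
# EVERY number below is an illustrative placeholder.
TEXTS = 1_000_000

# GenAI labeling: cost scales with tokens.
tokens_per_text = 350           # prompt + completion, measured in your runs
price_per_mtok = 0.60           # USD per 1M tokens (placeholder)
genai_cost = TEXTS * tokens_per_text / 1_000_000 * price_per_mtok

# Specialist model: cost scales with GPU time.
texts_per_second = 200          # measured inference throughput (placeholder)
gpu_usd_per_hour = 1.20         # cloud GPU rate (placeholder)
specialist_cost = TEXTS / texts_per_second / 3600 * gpu_usd_per_hour

print(f"genAI      : ${genai_cost:,.2f} per 1M texts")
print(f"specialist : ${specialist_cost:,.2f} per 1M texts")
```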
What To Submit (Deliverables Checklist)
A) Executive Memo (Executive Style, Max 5 Pages)
- Construct + why it matters + theoretically grounded and well documented
- Data source + governance/ethics
- Label schema overview
- Results summary: best genAI vs best specialist
- Cost/time/reproducibility comparison
- Recommendation for a real firm
B) Technical Appendix (Slides or PDF)
- Pipeline diagram (data → labels → model → evaluation)
- Label codebook (or link/appendix)
- Benchmark table (≥6 genAI models)
- Fine-tuning experiments + results
- Error analysis: where does it fail and why?
C) Code + Artifacts (Datasets)
- Colab notebooks (reproducible)
- Datasets:
- Holdout with human labels
- Train/test with genAI labels
- Any other data used
- All labels of all models for each run + majority labels
- Saved fine-tuned model + inference script (provide link to webspace/Google Drive/Dropbox; do not upload to Canvas!); a minimal inference sketch follows this checklist
- Cost/time log
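A minimal inference sketch for the saved specialist model, assuming it was saved to a "specialist_model" folder together with its tokenizer (as in the Step 9 sketch):

```python
# Minimal inference script for the saved fine-tuned model.
from transformers import pipeline

clf = pipeline("text-classification", model="specialist_model")

texts = ["Port congestion delayed 40% of our shipments.",
         "Revenue grew 12% year over year."]
for text, pred in zip(texts, clf(texts, truncation=True)):
    print(f"{pred['label']:>14}  ({pred['score']:.2f})  {text}")
```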
D) IGNITE Talk in Class
- 20 PPTX slides, 15 sec per slide (automatic transitions), 5 min total
- Submit your PPTX slides on Canvas (not PDF, not Google Slides, no other format!), set to auto-transition after 15 sec
- I will have all slide decks loaded and ready to go in class
- Key messages and insights only
- Every team member must present at least 2 slides
- There will be a 3 min Q&A after each IGNITE talk
Learn more:
- https://en.wikipedia.org/wiki/Ignite_(event)
- https://www.ignitetalks.io/
- https://robbiesenbach.com/deliver-successful-ignite-talk/
Tips for Success
- Start simple, then improve: Your first codebook will be imperfect. Iterate.
- Design labels for reliability: If humans disagree a lot, the model will struggle. Fix definitions before scaling.
- Lock the holdout early: It counts for 35% of your grade. Don't change direction repeatedly; lock it early and make sure it is representative of what you ultimately want to achieve from a business perspective.
- Do a small pilot before spending money: Test prompts and schema on 100–200 examples first.
- Don't hide class imbalance: Imbalance is normal, but it can damage evaluation and training, so you must address it. Use macro F1 and MCC, and document prevalence (see the sketch below).
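A minimal prevalence sketch showing why accuracy alone hides imbalance (file and column names are placeholders):

```python
# Document label prevalence, and see how a majority-class "classifier"
# scores well on accuracy while macro F1 exposes it.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

df = pd.read_csv("train_labeled.csv")
print(df["label"].value_counts(normalize=True))  # document this prevalence

majority = df["label"].mode()[0]
y_true, y_pred = df["label"], [majority] * len(df)
print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```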
Team Structure (Recommended Roles)
- Project lead: Scope, project plan, milestones, quality control
- Data lead: Collection, cleaning, dataset documentation
- Labeling lead: Codebook, human labeling workflow, reliability stats
- GenAI lead: API scripts, benchmarking, cost/time tracking
- Modeling lead: Fine-tuning, hyperparameters, reproducibility, evaluation
- Delivery lead: Assemble all outputs, organize write-ups, final quality control
Grading Rubric (100 Points)
| Criterion | Points |
|---|---|
| Business framing & construct clarity | 20 |
| Data pipeline quality + documentation | 15 |
| Human labeling process + reliability | 15 |
| GenAI benchmarking rigor | 20 |
| Fine-tuning rigor + evaluation discipline | 20 |
| Final comparison + recommendation quality | 10 |
Minimum Requirements Per Letter Grade:
C (- to +)
- Fine-tuned model with F1 score > 0.80
- Performance comparison genAI vs. fine-tuned model
- Labeled datasets
- Documentation
- Python notebook(s) to replicate pre-processing, training, and evaluation
B (- to +)
All of the above plus at least three of the following:
- Cost, time, reproducibility analysis
- Comparison of 6 or more models from at least 3 different suppliers
- Contemporary data that you collected (not an off-the-shelf dataset)
- Compelling use-case for your classifier with complete business case
A (- to A)
All of the above plus at least three of the following:
- Error-analysis (corner cases, rare or complex texts)
- Mitigation strategy and implementation to overcome identified model weaknesses
- Additional baselines (e.g., alternative classification approaches like dictionaries, topic models)
- Comparison to amateur labels
Guardrails (Ethics & Compliance)
- Public data or approved de-identified data only
- Remove or avoid sensitive personal information
- Document limitations and potential bias
- If your construct is sensitive (health, safety, harassment), include a brief risk statement and mitigation steps
Estimated Effort
| Task | Hours |
|---|---|
| Construct + codebook v1 + data source plan | 4 |
| Data collection + preprocessing + splits | 4 |
| 1.2K human labeling + reliability + codebook v2 | 8 |
| GenAI benchmarking (≥6 models) + choose labeling strategy | 3 |
| Label 20k + fine-tune specialist (2+ configs) | 2 |
| Final evaluation + memo + presentation | 3 |
Total: ~24 hours per student
Best Work Featured in Vertical AI Paper
I will select (and improve) the best 1–2 classifiers to be featured in my Vertical AI Paper, where each team member will be acknowledged for their application of the synthetic expert/specialist approach. I will ask members whether they want to be named; choosing to remain anonymous will not impact your grade.
Getting Started
To help you with your Capstone, I wrote a full pipeline in a Python notebook that performs all the key steps you need for your Capstone Project, using the example of classifying 10-K sentences into business functions. This includes querying genAI via API at scale, creating holdout and training datasets, fine-tuning a pretrained LLM, and evaluating the performance of genAI and your fine-tuned (vertical AI) model.
What it does not do is give you a construct of interest, collect your data, clean and preprocess your data, or draw conclusions and write reports for you.
Python notebook: http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.ipynb
Zip file with outputs: http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.zip (excludes the actual trained vertical AI because it is 1.5GB)
All subfolders and datasets are included. This is a great blueprint for the data you need to deliver with your capstone, either via a link to a shared drive (provide the link to me) or uploaded directly if sufficiently small (less than 20MB).
The contents of the zip file also show you what the expected output looks like (using a multi-label classification problem as the example). You will need to adapt this code to your problem; use genAI (e.g., Claude Opus 4.6) for this. The pipeline gives you a solid base to work from.