# Capstone: Build a Business-Grade Text Classifier

**Due:** April 23 by 12pm | **Points:** 35

---

## Team Assignment: Build a Business-Grade Text Classifier

**Team size:** 5–6 students; at least half must be enrolled in COMP488 or BUSI488.

**One-sentence summary:** Your team will build an end-to-end system that turns raw text (reviews, filings, reports, speeches, etc.) into reliable business labels—and you'll compare two approaches: genAI labeling vs. a fine-tuned specialist model.

## Form Your Team

**People > Groups > CAPSTONE : People**

- Each team member must join an EXISTING project group for THEIR SECTION.
- DO NOT create your own (additional) project groups.
- Each group must satisfy the following:
  - Have at least two students enrolled in COMP488
  - Have at least two students enrolled in BUSI488
  - A team may have at most three students enrolled in COMP488 and/or BUSI488, conditional on:
    - All other teams having at least two of each in their team (see above).
    - If other teams have fewer than two COMP488 and BUSI488 students, you cannot join a team that already meets this minimum requirement.
- **ALL TEAM MEMBERS MUST BE IN THE SAME SECTION (1 vs. 2)**

## Why This Assignment Matters (Business Reality)

### Businesses Are Drowning in Text

- Customer reviews, support chats, emails
- Social posts, news, press releases
- 10-K/10-Q filings, earnings calls, ESG reports
- Policies, regulatory communications, safety notices
- Research abstracts, patents, job postings

### Text Classification Enables Business Value

Text classification is one of the most common, high-ROI uses of AI because it converts messy language into structured signals that businesses can:

- Count — How many "billing issues" this week?
- Trend — Are "delivery failures" rising?
- Segment — Which regions complain about what?
- Route — Send messages/tickets to the right team automatically
- Flag risk — Safety/adverse events, compliance issues, reputational threats
- Inform strategy — Competitor moves, pricing tactics, CX weaknesses
- Input into (predictive) models — As independent variables

### Why Not Manual Reading?

Firms don't just "read it manually" because volume is too high, humans are slow and inconsistent across people and time, and decisions often need speed (minutes or hours, not weeks). Leadership needs dashboards and measurable KPIs.

This will feel like a stretch—and that's the point. You will be supported by structure, templates, and teamwork.

## What You Will Produce (End Result)

A working classifier for a well-documented, established, theoretically grounded construct (your choice), plus a business-style memo explaining:

- Which construct you selected from the seven options:
  https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf
- How it was established and where it is theoretically anchored and motivated
- Why the business should care
- How well your classifier works
- What it costs (time + money)
- Whether it's reliable and reproducible

## Choose a Construct of Interest: Meaningful and Labelable

Your construct should be:

- **Business-relevant** — addresses a real business decision
- **Theoretically grounded** — anchored in established literature
- **Well documented** — clearly defined in academic or industry sources
- **Observable in text** — detectable in your data source
- **Definable with clear rules** — specific enough for reliable labeling
- **Complex & nuanced** — more than just sentiment (not positive/negative)

You must pick one of the seven provided constructs of interest from here:
https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf

## Potential Public Data Sources

Below are places to obtain public text data for your classifier (downloads and/or official APIs). Some data sources are already suggested for each of the seven constructs of interest; the list below adds further options. Use sources with clear educational/research access terms whenever possible, or use your own source and data (but observe the guardrails & ethics outlined below).

**Important:** Choose a source that fits your construct and is realistically useful to a firm.

### Customer Reviews & Product Voice

**Amazon Reviews (research datasets)**

- McAuley Lab Amazon datasets
- Stanford SNAP Amazon dataset page
- Hugging Face: Amazon Reviews 2023

**Yelp Open Dataset**

- Kaggle: Yelp Open Dataset

### Corporate Filings & Investor Communications

**SEC EDGAR (10-K / 10-Q / 8-K etc.) — Official APIs**

- EDGAR Application Programming Interfaces
- SEC Developer Resources
- SEC Disclosure Data API announcement

### Regulation, Enforcement, Safety & Compliance

**FDA Warning Letters**

- FDA: Warning Letters
- Data.gov: Warning Letters dataset

**U.S. Consumer Product Safety Commission (CPSC) Recalls — API**

- CPSC Recalls API information

### Health & Public Policy (Public)

**ClinicalTrials.gov (Modernized API)**

- ClinicalTrials.gov Data API

**PubMed / NCBI (E-utilities API)**

- NCBI E-utilities documentation

### Research & Innovation

**arXiv (bulk data)**

- arXiv bulk data help
- arXiv bulk data on AWS S3

**Patents (USPTO + PatentsView)**

- USPTO Open Data Portal: bulk data
- PatentsView: bulk downloads

### Government Documents & Speeches

**govinfo API + documentation** (Congressional docs, Federal Register, etc.)

- govinfo API overview
- govinfo API docs

**Congress.gov API**

- Congress.gov API (GPO)

**U.S. Presidential speeches** (Miller Center data portal)

- Miller Center: Presidential Speech Data

### News / Media Monitoring

**GDELT** (global news monitoring)

- GDELT data downloads
- GDELT 2.0 API documentation (blog)

**Common Crawl News dataset** (CC-NEWS)

- Common Crawl: News dataset announcement

### Security / Technical Risk

**NIST National Vulnerability Database (NVD)** — CVE APIs & feeds

- NVD: Vulnerabilities API
- NVD: Data feeds

### Earnings Calls / Transcripts (public option)

**Academic/open dataset option**

- STRUX dataset page

### Tips & Rules

- **Tip:** Before committing, do a quick pilot on 100–200 texts to confirm your construct appears in the source and that your labels are workable.
- **Rule:** Use public data or properly de-identified data only. No sensitive internal company data.

---

## The Workflow: What You Must Do (Step-by-Step)

### Step 1 — Construct Choice + Business Case

**Goal:** Pick a well-documented, theoretically grounded construct of interest. Explain why a firm would want to classify this construct at scale.

https://www.ringel.ai/UNC/2026/BUSI488/Class23/Ringel_488-2026_Capstone_Constructs.pdf

**Include:**

- Who is the stakeholder? (CX lead, compliance, product manager, investor relations, etc.)
- What decision/action will classification enable?
- What happens if the classifier is wrong? (false positives vs. false negatives)
- Why now? (volume, speed, competitive need)

**Deliverable:** 1–2 page concept brief

### Step 2 — Define the Construct Precisely

**Goal:** Turn the concept into labels humans can apply consistently.

**You must create:**

- Label set (classes)
- Clear definitions and decision rules
- Borderline cases: what to do when unclear
- "None/other" policy IF APPLICABLE (multi-class yes, multi-label no)
- 2–3 example texts per label (your own)

**Decide:** multi-class (one label per text) vs. multi-label (multiple labels can apply).

**Deliverable:** Labeling codebook (PDF)

### Step 3 — Identify and Justify the Data Source

**Goal:** Show that your source actually contains your construct.

**Include:**

- Why this source fits the business purpose
- How a firm would use it regularly (weekly monitoring, quarterly reporting, etc.)
- Basic access plan (download/API/scrape—follow site rules)

**Deliverable:** Data source plan + sampling approach

### Step 4 — Collect Your Dataset

**Targets:**

- 20,000 texts total for train/test
- 1,200 texts for a locked holdout set

**Deliverable:** Raw dataset file + collection notebook/script + documentation

### Step 5 — Preprocess and Split

**You must:**

- Clean the text (remove duplicates, empty entries, obvious spam)
- Create train/test splits totaling 20,000; holdout = 1,200 (lock it in early—don't tune on it)
- **Beware of label imbalance!** You may need to curate strategically.
- Report length statistics and basic summaries
- Plan for imbalance (it's normal—just don't ignore it)

**Deliverable:** Preprocessing notebook + split files + basic descriptive stats
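
The cleaning and splitting steps above can be sketched as follows. This is a minimal illustration assuming a pandas DataFrame with a `text` column; the function name, column name, and default sizes are placeholders to adapt to your own dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def clean_and_split(df, text_col="text", seed=42,
                    holdout_size=1200, train_test_size=20_000):
    """Deduplicate, drop empty texts, and carve out a locked holdout.

    Sizes follow the assignment targets (1,200 holdout / 20,000 train+test);
    adapt the column name and sizes to your own data.
    """
    df = df.drop_duplicates(subset=text_col)         # remove exact duplicates
    df = df[df[text_col].str.strip().astype(bool)]   # drop empty/whitespace texts
    df = df.sample(frac=1.0, random_state=seed)      # shuffle reproducibly

    holdout = df.iloc[:holdout_size]                 # lock this set early
    rest = df.iloc[holdout_size:holdout_size + train_test_size]
    train, test = train_test_split(rest, test_size=0.2, random_state=seed)
    return train, test, holdout
```

Fixing the random seed is what makes the holdout "locked": anyone rerunning the notebook gets the same split.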

### Step 6 — Human-Label the Holdout

**Requirements:**

- The 1,200 holdout texts must be labeled by humans
- At least 3 people must label each example
- Either independent labels + agreement report (Krippendorff's alpha recommended), or discussion-based consensus + documented process

**Deliverable:** Holdout with 3+ labels per item (or consensus + notes) + reliability report
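
For the agreement report, an off-the-shelf package (e.g., the `krippendorff` library on PyPI) is the safest choice. To show what the statistic captures, here is a minimal nominal-data implementation; the function name is ours, and you should verify any hand-rolled version against an established implementation before reporting:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of label lists, one per text, e.g. the three raters'
    labels for each holdout item. Items with fewer than two labels are
    skipped, since alpha only uses pairable values.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(range(m), 2):   # all ordered rating pairs
            coincidences[(labels[a], labels[b])] += 1.0 / (m - 1)

    n = sum(coincidences.values())               # total pairable values
    if n <= 1:
        return 1.0
    totals = Counter()                           # marginal frequency per label
    for (c, _), w in coincidences.items():
        totals[c] += w

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k]
                   for c, k in permutations(totals, 2)) / (n - 1)
    if expected == 0:
        return 1.0                               # only one label ever used
    return 1.0 - observed / expected
```

Alpha is 1.0 for perfect agreement and drops toward (or below) 0 as raters disagree more than chance would predict.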

### Step 7 — Benchmark GenAI Labeling (≥6 Models)

**Goal:** Treat genAI models as alternative "labelers" and compare them.

**Requirements:**

- Run at least 6 models from at least 3 different suppliers (e.g., OpenAI, Anthropic, Meta, xAI, Google, FireworksAI, DeepSeek, Moonshot)
- Fix prompts to make them comparable across models
- Track runtime, estimated cost, and reproducibility strategy

**Metrics to report:**

- Krippendorff's alpha (vs. human labels)
- F1 (macro, and per-class if possible)
- AUC (where applicable)
- MCC

**Deliverable:** Benchmark notebook + results table + cost/time log
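
Macro F1 and MCC can be computed with scikit-learn as sketched below, assuming hard labels from each model. (AUC additionally needs scores or probabilities, which genAI labelers often don't return; report it only where applicable. The function name and return structure are illustrative.)

```python
from sklearn.metrics import classification_report, f1_score, matthews_corrcoef

def benchmark_scores(y_human, y_model):
    """Core agreement metrics for one genAI labeler vs. the human holdout.

    Macro F1 averages per-class F1 so rare classes count equally;
    MCC summarizes the full confusion matrix in a single value.
    """
    return {
        "macro_f1": f1_score(y_human, y_model, average="macro"),
        "mcc": matthews_corrcoef(y_human, y_model),
        "per_class": classification_report(y_human, y_model, output_dict=True),
    }
```

Running this once per model gives you the rows of your benchmark table.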

### Step 8 — Select a GenAI Labeling Strategy and Label Train/Test

Choose the best single model or a combination (e.g., majority vote). Then label your 20,000-text train/test set.

**Deliverable:** Labeled train/test dataset + labeling script + total cost/time summary
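
A majority-vote combination can be sketched as below; the label values and the tie-breaking rule (falling back to your best single model's label) are illustrative assumptions, not prescribed choices:

```python
from collections import Counter

def majority_vote(model_labels, tie_breaker=None):
    """Combine several models' labels for one text by majority vote.

    `model_labels` is a list like ["billing", "billing", "delivery"].
    Ties go to `tie_breaker` (e.g., your best single model's label) if it
    is among the tied options; otherwise the first most-common label wins.
    """
    counts = Counter(model_labels).most_common()
    top_count = counts[0][1]
    tied = [label for label, c in counts if c == top_count]
    if len(tied) > 1 and tie_breaker in tied:
        return tie_breaker
    return tied[0]
```

Using an odd number of models reduces (but does not eliminate) ties, so document your tie-breaking rule either way.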

### Step 9 — Fine-Tune a Specialist Model

**Goal:** Create a smaller, task-specific classifier that can match or exceed genAI labeling quality.

**Minimum requirements:**

- Fine-tune at least one pretrained model (RoBERTa or similar)
- Try at least four configurations (model choice or hyperparameters)
- Evaluate on the holdout (only at the end)

**Deliverable:** Training notebook + saved model + evaluation on holdout
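
One lightweight way to organize the four-plus configurations is a small grid that your training loop iterates over. The model names and learning rates below are illustrative placeholders, not prescribed settings:

```python
from itertools import product

def config_grid(models=("roberta-base", "distilroberta-base"),
                learning_rates=(2e-5, 5e-5)):
    """Enumerate fine-tuning configurations (model x learning rate).

    The defaults yield the minimum of four configurations required by the
    assignment; add batch sizes or epochs as extra grid dimensions.
    """
    return [{"model": m, "lr": lr} for m, lr in product(models, learning_rates)]
```

Select the winning configuration on the test split, and only then score it once on the locked holdout.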

### Step 10 — Final Comparison: GenAI vs. Specialist

Your final analysis must answer:

- Which is better on the holdout, and by how much?
- Which is cheaper per 1,000,000 texts?
- Which is faster?
- Which is more consistent and reproducible?
- What would you recommend a firm deploy, and why?

**Deliverable:** Final comparison section + recommendation
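
The per-million-texts figures can be extrapolated from your cost/time log. This sketch assumes simple linear scaling, which ignores rate limits, batching discounts, and fixed setup costs; the function and field names are illustrative:

```python
def cost_per_million(total_cost_usd, total_seconds, n_texts):
    """Extrapolate measured cost and runtime to 1,000,000 texts.

    Inputs come from your own cost/time log for a labeling run of
    `n_texts` texts; linear scaling is a deliberate simplification.
    """
    scale = 1_000_000 / n_texts
    return {
        "usd_per_million": total_cost_usd * scale,
        "hours_per_million": total_seconds * scale / 3600,
    }
```

Computing this once for the genAI strategy and once for the fine-tuned specialist gives the cost row of your final comparison.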

---

## What To Submit (Deliverables Checklist)

### A) Executive Memo (Executive Style, Max 5 Pages)

- Construct + why it matters + how it is theoretically grounded and well documented
- Data source + governance/ethics
- Label schema overview
- Results summary: best genAI vs. best specialist
- Cost/time/reproducibility comparison
- Recommendation for a real firm

### B) Technical Appendix (Slides or PDF)

- Pipeline diagram (data → labels → model → evaluation)
- Label codebook (or link/appendix)
- Benchmark table (≥6 genAI models)
- Fine-tuning experiments + results
- Error analysis: where does it fail and why?

### C) Code + Artifacts (Datasets)

- Colab notebooks (reproducible)
- **Datasets:**
  - Holdout with human labels
  - Train/test with genAI labels
  - Any other data used
  - All labels of all models for each run + majority labels
- Saved fine-tuned model + inference script (provide link to webspace/Google Drive/Dropbox; **do not upload to Canvas!**)
- Cost/time log

### D) IGNITE Talk in Class

- **20 PPTX slides**, 15 sec per slide (automatic transitions), 5 min total
- Submit your PPTX slides (not PDF, not Google Slides, not any other format!) set to auto-transition after 15 sec on Canvas
- I will have all slides ready to go for you in class
- **Key messages and insights only**
- **Every team member must present at least 2 slides**
- There will be a 3 min Q&A after each IGNITE talk

**Learn more:**

- https://en.wikipedia.org/wiki/Ignite_(event)
- https://www.ignitetalks.io/
- https://robbiesenbach.com/deliver-successful-ignite-talk/

---

## Tips for Success

- **Start simple, then improve:** Your first codebook will be imperfect. Iterate.
- **Design labels for reliability:** If humans disagree a lot, the model will struggle. Fix definitions before scaling.
- **Lock the holdout early:** It counts for 35% of your grade. Don't change direction repeatedly—fix it early and make sure it's representative of what you ultimately want to achieve from a business perspective.
- **Do a small pilot before spending money:** Test prompts and schema on 100–200 examples first.
- **Don't hide class imbalance:** Imbalance is normal, but it can damage evaluation and training. **You must address it!** Use macro F1 and MCC, and document prevalence.

## Team Structure (Recommended Roles)

- **Project lead:** Scope, project plan, milestones, quality control
- **Data lead:** Collection, cleaning, dataset documentation
- **Labeling lead:** Codebook, human labeling workflow, reliability stats
- **GenAI lead:** API scripts, benchmarking, cost/time tracking
- **Modeling lead:** Fine-tuning, hyperparameters, reproducibility, evaluation
- **Delivery lead:** Assemble all outputs, organize write-ups, final quality control

## Grading Rubric (100 Points)

| Criterion                                 | Points |
| ----------------------------------------- | ------ |
| Business framing & construct clarity      | 20     |
| Data pipeline quality + documentation     | 15     |
| Human labeling process + reliability      | 15     |
| GenAI benchmarking rigor                  | 20     |
| Fine-tuning rigor + evaluation discipline | 20     |
| Final comparison + recommendation quality | 10     |

### Minimum Requirements Per Letter Grade

#### C (- to +)

- Fine-tuned model with F1 score > 0.80
- Performance comparison of genAI vs. fine-tuned model
- Labeled datasets
- Documentation
- Python notebook(s) to replicate preprocessing, training, and evaluation

#### B (- to +)

All of the above plus **at least three** of the following:

- Cost, time, and reproducibility analysis
- Comparison of 6 or more models from at least 3 different suppliers
- Contemporary data that you collected (not an off-the-shelf dataset)
- Compelling use case for your classifier with a complete business case

#### A (- to A)

All of the above plus **at least three** of the following:

- Error analysis (corner cases, rare or complex texts)
- Mitigation strategy and implementation to overcome identified model weaknesses
- Additional baselines (e.g., alternative classification approaches like dictionaries or topic models)
- Comparison to amateur labels

## Guardrails (Ethics & Compliance)

- Public data or approved de-identified data only
- Remove or avoid sensitive personal information
- Document limitations and potential bias
- If your construct is sensitive (health, safety, harassment), include a brief risk statement and mitigation steps

## Estimated Effort

| Task                                                      | Hours |
| --------------------------------------------------------- | ----- |
| Construct + codebook v1 + data source plan                | 4     |
| Data collection + preprocessing + splits                  | 4     |
| 1.2K human labeling + reliability + codebook v2           | 8     |
| GenAI benchmarking (≥6 models) + choose labeling strategy | 3     |
| Label 20k + fine-tune specialist (2+ configs)             | 2     |
| Final evaluation + memo + presentation                    | 3     |

**Total: ~24 hours per student**

## Best Work Featured in Vertical AI Paper

I will select (and improve) the best 1–2 classifiers to be featured in my Vertical AI paper, where each team member will be acknowledged for their application of the synthetic expert/specialist approach. I will ask members whether they want to be named or not (choosing to remain anonymous will not impact your grade).

## Getting Started

To help you with your Capstone, I wrote a full pipeline in a Python notebook that performs all the key steps you need for your Capstone Project, using the example of classifying 10-K sentences into business functions. This includes querying genAI via API at scale, creating holdout and training datasets, fine-tuning a pretrained LLM, and evaluating the performance of genAI and your fine-tuned (vertical AI) model.

**What it does not do** is give you a construct of interest, collect your data, clean and preprocess your data, or draw conclusions and write reports for you.

**Python notebook:** http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.ipynb

**Zip file with outputs:** http://ringel.ai/UNC/2026/helpers/Ringel_2026_VerticalAI_Capstone_Pipeline_Example.zip (excludes the actual trained vertical AI model because it is 1.5 GB)

All subfolders and datasets are included. This is a great blueprint for what data you need to deliver with your capstone on a shared drive (provide a link to me) or uploaded if sufficiently small (less than 20 MB).

The contents of the zip file also show you what the expected output is (using the example of a multi-label classification problem). You will need to adapt this code to your problem; use genAI (e.g., Claude Opus 4.6) to help. The pipeline gives you a solid base to work from.