decisions for TAPT
commit 313e14fb96 (parent fe5155ab6d)
@@ -178,17 +178,25 @@ bun run py:train dapt --config configs/dapt/modernbert.yaml \
--stage tapt
```

### TAPT Configuration Differences
The TAPT corpus is 72K Item 1C paragraphs (~10M tokens) — 50x smaller than DAPT. This changes the training dynamics:
**Epochs: 5-10 (not 1).** Gururangan et al. (2020) ran TAPT for 100 epochs, but their corpora were 50-500K tokens — 20-200x smaller than ours. We match on total token exposure (~50-100M) rather than epoch count: 5-10 epochs × 10M tokens = 50-100M tokens, comparable to the upper end of their TAPT exposure.
**Whole-word masking (optional).** `DataCollatorForWholeWordMask` masks entire words instead of random subword tokens, naturally emphasizing content words over fragments. Worth trying for TAPT since the model already knows subword patterns from DAPT — TAPT should focus on domain-specific whole words ("CISO", "materiality", "tabletop"). One-line change in `train.py` (sketched below).
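A minimal sketch of that one-line change, assuming the stock Hugging Face collators; the tokenizer checkpoint and masking probability below are placeholders, not values from this repo's config:

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    DataCollatorForWholeWordMask,
)

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")

# Current behavior: mask individual (subword) tokens.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.30  # probability is a placeholder
)

# Whole-word variant: sample masks over whole words, then mask all of their subwords.
# Caveat: the stock collator finds word boundaries via WordPiece "##" prefixes, so
# ModernBERT's BPE tokenizer may need word_ids()-based grouping instead.
collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.30
)
```
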
### What happens

1. Loads the DAPT checkpoint (not the base ModernBERT)
2. Loads 72,045 patched paragraphs from `paragraphs-clean.patched.jsonl`
3. Tokenizes, concatenates, chunks (~10M tokens → ~1,220 sequences at 8192)
4. Trains 5-10 epochs of MLM with different masking each epoch (see the pipeline sketch after this list)
5. Saves to `checkpoints/tapt/modernbert-large/final/`

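For orientation, the steps above map onto a standard Hugging Face masked-LM recipe. The sketch below is illustrative only: paths, batch sizes, the JSONL field name, and the masking probability are assumptions, not the repo's actual `train.py`:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

SEQ_LEN = 8192
DAPT_CKPT = "checkpoints/dapt/modernbert-large/final"   # 1. start from DAPT, not base
model = AutoModelForMaskedLM.from_pretrained(DAPT_CKPT)
tokenizer = AutoTokenizer.from_pretrained(DAPT_CKPT)

# 2. the 72,045 patched Item 1C paragraphs ("text" field name is an assumption)
ds = load_dataset("json", data_files="paragraphs-clean.patched.jsonl", split="train")

# 3. tokenize, concatenate, and chunk into fixed 8192-token sequences
def tokenize(batch):
    return tokenizer(batch["text"], add_special_tokens=False)

def group(batch):
    ids = [tok for doc in batch["input_ids"] for tok in doc]
    n = (len(ids) // SEQ_LEN) * SEQ_LEN
    chunks = [ids[i : i + SEQ_LEN] for i in range(0, n, SEQ_LEN)]
    return {"input_ids": chunks, "attention_mask": [[1] * SEQ_LEN for _ in chunks]}

train_set = (
    ds.map(tokenize, batched=True, remove_columns=ds.column_names)
      .map(group, batched=True)
)

# 4. 5-10 epochs of MLM; the collator re-samples the masks every epoch
args = TrainingArguments(
    output_dir="checkpoints/tapt/modernbert-large",
    num_train_epochs=8,                 # assumed; anywhere in the 5-10 range
    learning_rate=3e-5,                 # assumed
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,      # effective batch of 32 (assumed)
    warmup_ratio=0.05,
    bf16=True,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.30),
)
trainer.train()
trainer.save_model("checkpoints/tapt/modernbert-large/final")  # 5. final checkpoint
```
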
### Expected duration

~1-2 hours (5-10 epochs on ~1,220 sequences = ~190-380 optimizer steps).

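The step count is simple arithmetic over the corpus size. A quick check; the effective batch size of 32 is an assumption chosen to be consistent with the numbers above, not a value read from the config:

```python
tokens = 10_000_000              # ~10M tokens of Item 1C text
seq_len = 8192
effective_batch = 32             # assumption: per-device batch x gradient accumulation

sequences = tokens // seq_len                     # 1,220 packed sequences
steps_per_epoch = sequences // effective_batch    # ~38 optimizer steps per epoch
print(5 * steps_per_epoch, 10 * steps_per_epoch)  # ~190 and ~380 steps
```
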
### Output

@@ -220,10 +228,19 @@ Every 1,000 steps, it also reports:
- `eval_loss` — validation MLM loss
- Perplexity can be computed as `exp(eval_loss)` (the reported loss is natural-log cross-entropy; see the sketch after this list)
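Hugging Face trainers report MLM loss in nats, so the conversion uses `exp` rather than a base-2 power. A two-line check with an illustrative loss value:

```python
import math

eval_loss = 0.62             # illustrative value, mid-range of the expected trajectory
print(math.exp(eval_loss))   # perplexity ≈ 1.86
print(math.log(50280))       # ≈ 10.8, the random-initialization baseline noted below
```
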
**Expected loss trajectory:**
- Starting loss: ~0.80 (the model already knows English — it's only learning SEC-specific patterns)
- Final loss: ~0.55-0.65 (a gentle downward drift, not a dramatic curve)
- For reference, a randomly initialized model would start at ~10.8 (ln(50280 vocab size))
**What to watch for** (a monitoring sketch follows this list):

- `grad_norm` should stay small (0.05-0.15). Healthy = gentle weight updates. Spikes >1.0 = LR too high.
- `learning_rate` ramps up during warmup (first 5% of steps ≈ 93 steps), then decays.
- Loss going *up* after warmup → LR too high or data issue
- Loss stuck flat after 500+ steps → model isn't learning, LR too low
- Loss < 0.3 → possible overfitting (unlikely in 1 epoch of 500M tokens)

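A minimal sketch of a `TrainerCallback` that automates these checks; the class name and wiring are illustrative, and the thresholds simply mirror the bullets above:

```python
from transformers import TrainerCallback

class MLMHealthCheck(TrainerCallback):
    """Warn on the failure modes listed above (illustrative, not part of train.py)."""

    def __init__(self, warmup_steps=93, flat_after=500, flat_delta=0.01):
        self.warmup_steps = warmup_steps
        self.flat_after = flat_after
        self.flat_delta = flat_delta
        self.first_loss = None

    def on_log(self, args, state, control, logs=None, **kwargs):
        if not logs or "loss" not in logs:
            return
        loss, step = logs["loss"], state.global_step
        if self.first_loss is None:
            self.first_loss = loss
        if logs.get("grad_norm", 0.0) > 1.0:
            print(f"[step {step}] grad_norm {logs['grad_norm']:.2f} > 1.0: LR may be too high")
        if step > self.warmup_steps and loss > self.first_loss:
            print(f"[step {step}] loss above its starting value after warmup: LR or data issue")
        if step > self.flat_after and abs(loss - self.first_loss) < self.flat_delta:
            print(f"[step {step}] loss flat after {step} steps: LR may be too low")
        if loss < 0.3:
            print(f"[step {step}] loss < 0.3: check for overfitting or leakage")
```

It would be registered with `Trainer(..., callbacks=[MLMHealthCheck()])`.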
**The DAPT loss number itself matters less than the downstream impact.** DAPT teaches the model SEC vocabulary and co-occurrence patterns ("NIST CSF", "tabletop exercise", "materiality assessment"). Whether the final loss is 0.55 or 0.65 is less important than whether the [CLS] embeddings produce better classification after fine-tuning. The real evaluation is the ablation: base vs +DAPT vs +DAPT+TAPT.
## Artifacts

@@ -741,6 +741,21 @@ Initial training ran at ~47s/step (projected ~56 hours for 1B tokens). Through i
Full procedure, optimization journey, and cloud cost analysis in `docs/DAPT-PROCEDURE.md`.

### Early Training Results
First eval at step 54 (~3% through):
- **Loss: 0.80** — the model already knows English, so loss starts low. For comparison, a randomly initialized model would start at ~10.8. The loss reflects the model's ability to predict masked SEC filing tokens from context.
- **grad_norm: 0.066** — very small, indicating gentle weight updates. Healthy sign.
- **learning_rate: 2.66e-5** — still in warmup phase (first 93 steps, 5% of training; the step arithmetic is sketched after this list).
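Those warmup figures pin down the overall schedule; a quick consistency check using only numbers quoted in this log:

```python
warmup_steps = 93
warmup_ratio = 0.05
total_steps = round(warmup_steps / warmup_ratio)   # ~1,860 optimizer steps in the run

step = 54
print(step / total_steps)    # ≈ 0.03 -> "~3% through"
print(step / warmup_steps)   # ≈ 0.58 -> a bit over halfway through warmup
```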
Expected trajectory: loss drifts from ~0.80 to ~0.55-0.65 over the run. This is not the dramatic loss curve of fine-tuning — DAPT is nudging a capable language model toward SEC-specific vocabulary and co-occurrence patterns, not teaching it a new task from scratch.
### TAPT Planning
The TAPT corpus is 72K Item 1C paragraphs (~10M tokens) — 50x smaller than the DAPT corpus. Following Gururangan et al. (2020), we run multiple epochs to compensate, but match on total token exposure rather than blindly copying their 100-epoch setting (which was calibrated for 50-500K token corpora, 20-200x smaller than ours). 5-10 epochs × 10M = 50-100M total tokens, matching the upper end of their TAPT exposure. Estimated training time: ~1-2 hours.
One planned experiment: **whole-word masking** (`DataCollatorForWholeWordMask`) for TAPT, which masks entire words rather than random subword tokens. Since the model already knows subword patterns from DAPT, TAPT should focus on domain-specific whole words. This is a one-line change.
---

## Cost and Time Ledger