tapt setup

parent: c0273c9e2e
commit: 3292980d33
@ -166,42 +166,72 @@ checkpoints/dapt/modernbert-large/

## Step 2: TAPT

After DAPT completes, continue MLM on the 72K Item 1C paragraphs using the DAPT checkpoint.

### Command

```bash
cd python
bun run py:train dapt --config configs/tapt/modernbert.yaml
```

Equivalent to: `uv run main.py dapt --config configs/tapt/modernbert.yaml`

### TAPT Configuration

**Config file:** `python/configs/tapt/modernbert.yaml`

The TAPT corpus is 72K Item 1C paragraphs (~10M tokens) — 50x smaller than DAPT. This changes several training decisions:

| Parameter | Value | vs. DAPT | Rationale |
|-----------|-------|----------|-----------|
| `max_seq_length` | 512 | 8192 → 512 | Data-driven: paragraphs average 127 tokens (P99=386, 99.6% fit in 512). 8192 would be 98.5% padding. |
| `num_train_epochs` | 5 | 1 → 5 | Match total token exposure: 5 × 10M = 50M ≈ upper bound of Gururangan et al. (2020) TAPT exposure. |
| `whole_word_mask` | true | false → true | Mask entire words, not subword pieces. Model knows subword composition from DAPT; TAPT focuses on domain-specific whole words ("CISO", "materiality", "tabletop"). |
| `per_device_train_batch_size` | 32 | 4 → 32 | Short sequences free VRAM. Tested: 22.7 GB peak with torch.compile at batch=32 (OOM at 48). |
| `gradient_accumulation_steps` | 1 | 8 → 1 | Effective batch = 32 in both cases. |
| `gradient_checkpointing` | false | true → false | Not needed at 512 seq_len. Would add 30-40% overhead for no benefit. |
| `save_strategy` | epoch | steps → epoch | Checkpoint + evaluate after each of 5 epochs. |
| `validation_split` | 0.05 | 0.02 → 0.05 | Larger val split for 50x smaller dataset — need stable eval loss. |
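A quick sanity check on the batch-size and accumulation settings (values from the config; the check itself is illustrative):

```python
# Effective batch = per-device batch x gradient accumulation steps.
# TAPT drops accumulation because batch=32 fits directly in VRAM at seq_len=512.
dapt = {"per_device": 4, "grad_accum": 8}
tapt = {"per_device": 32, "grad_accum": 1}

def effective_batch(cfg: dict) -> int:
    return cfg["per_device"] * cfg["grad_accum"]

print(effective_batch(dapt), effective_batch(tapt))  # 32 32
```

Both stages therefore take the same number of optimizer steps per sample seen; only the memory layout changes.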
### Sequence Length Decision (512 vs. 8192)

DAPT used 8192 to match ModernBERT's pre-training context and exercise all positional embeddings and global attention layers. TAPT uses 512 because:

1. **The data is 512.** Paragraphs average 127 tokens (P99=386). There is no long-range structure to learn.
2. **50M tokens won't cause forgetting.** TAPT's 50M token-exposures are 0.0025% of ModernBERT's ~2T pre-training and 10% of DAPT. The model's long-range patterns are deeply established.
3. **RoPE is position-independent.** Positions 0-511 compute identically at any max_length. Positions 512-8191 remain untouched from DAPT.

### Whole-Word Masking Implementation

Whole-word masking requires `offset_mapping` from the tokenizer to determine word boundaries. This is incompatible with DAPT's concatenate-and-chunk approach (which destroys offset_mapping by merging documents). For TAPT, each paragraph is tokenized individually with truncation, preserving `offset_mapping`. The data collator handles dynamic padding per batch.

Note: with `whole_word_mask=True`, the HuggingFace collator automatically disables random token replacement (`mask_replace_prob=1.0`). All masked positions receive the `[MASK]` token.
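The core idea can be sketched in a few lines (illustrative code, not the project's collator; `whole_word_mask` here is a hypothetical helper, and `word_ids` mimics the subword-to-word alignment a fast HF tokenizer exposes via `word_ids()`):

```python
# Whole-word masking samples *words*, then expands the choice back to every
# subword token of each sampled word, so no word is ever partially masked.
import random

def whole_word_mask(word_ids, mlm_probability=0.30, seed=42):
    """Boolean mask covering entire words; None (special tokens) is never masked."""
    rng = random.Random(seed)
    words = sorted({w for w in word_ids if w is not None})
    n_to_mask = max(1, round(len(words) * mlm_probability))
    chosen = set(rng.sample(words, n_to_mask))
    return [w in chosen for w in word_ids]

# "cybersecurity" tokenized into three subwords -> positions 1-3 share word id 0.
word_ids = [None, 0, 0, 0, 1, 2, None]   # None = [CLS] / [SEP]
mask = whole_word_mask(word_ids)
# word 0 is either fully masked or fully unmasked, never split:
assert all(mask[1:4]) or not any(mask[1:4])
```

Token-level masking, by contrast, could mask `cyber` while leaving `security` visible, making the prediction trivial.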
### What happens

1. Loads the DAPT checkpoint from `checkpoints/dapt/modernbert-large/final/`
2. Loads 72,045 patched paragraphs from `paragraphs-clean.patched.jsonl`
3. Tokenizes each paragraph individually (truncation at 512, with offset_mapping for whole-word masking)
4. Splits 5% validation (~3,602 paragraphs), 95% train (~68,443 paragraphs)
5. Trains 5 epochs of MLM with whole-word masking — different masking each epoch
6. Saves a checkpoint after each epoch; saves the final model to `checkpoints/tapt/modernbert-large/final/`

### Expected duration

~2,138 steps/epoch × 5 epochs = ~10,700 total steps. At seq_len=512 on the 3090 (~0.5-1s/step), estimated 1-3 hours.
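The arithmetic behind those numbers, as a back-of-envelope check (corpus and batch figures from this doc; seconds-per-step is an estimate, not a measurement):

```python
# Step count and wall-clock estimate for TAPT.
paragraphs = 72_045
train_rows = round(paragraphs * 0.95)   # 95% train split -> 68,443
steps_per_epoch = train_rows // 32      # batch=32, grad_accum=1 -> 2,138
total_steps = steps_per_epoch * 5       # 5 epochs -> 10,690

for sec_per_step in (0.5, 1.0):         # assumed RTX 3090 range at seq_len=512
    hours = total_steps * sec_per_step / 3600
    print(f"{sec_per_step:.1f} s/step -> {hours:.1f} h")
```

At 0.5-1.0 s/step this lands at roughly 1.5-3.0 hours, consistent with the estimate above.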
### Resume if interrupted

Re-run the same command — it detects existing checkpoints and resumes automatically.
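A stdlib sketch of what that detection amounts to (HF's Trainer provides the real version via `transformers.trainer_utils.get_last_checkpoint`; the directory naming below is the standard `checkpoint-<step>` layout, and `last_checkpoint` is an illustrative helper):

```python
# Find the most recent checkpoint-* directory to resume from, if any.
import re
from pathlib import Path

def last_checkpoint(output_dir):
    """Return the highest-numbered checkpoint-* directory, or None on a fresh run."""
    ckpts = []
    for p in Path(output_dir).glob("checkpoint-*"):
        m = re.fullmatch(r"checkpoint-(\d+)", p.name)
        if m and p.is_dir():
            ckpts.append((int(m.group(1)), p))
    return max(ckpts)[1] if ckpts else None
```

If a checkpoint is found, training continues from its optimizer and scheduler state rather than starting over.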
### Output

```
checkpoints/tapt/modernbert-large/
  checkpoint-epoch-1/
  checkpoint-epoch-2/
  ...
  final/   <- SEC-cyBERT-large (DAPT + TAPT)
```
@ -247,6 +277,7 @@ Every 1,000 steps, it also reports:

| File | Purpose |
|------|---------|
| `python/configs/dapt/modernbert.yaml` | DAPT config |
| `python/configs/tapt/modernbert.yaml` | TAPT config |
| `python/configs/dapt/neobert.yaml` | NeoBERT config (if needed) |
| `python/main.py` | CLI entrypoint |
| `python/src/dapt/train.py` | Training loop |
@ -745,16 +745,39 @@ Full procedure, optimization journey, and cloud cost analysis in `docs/DAPT-PROC

| Step | Loss | grad_norm | LR | Epoch | Note |
|------|------|-----------|-----|-------|------|
| 54 | 0.7991 | 0.066 | 2.66e-5 | 0.03 | Warmup phase |
| 1280 | 0.7233 | 0.068 | 1.57e-5 | 0.70 | Steady decline |
| 1800 | 0.7253 | 0.073 | 1.48e-6 | 0.97 | LR near zero, loss plateaued |
| **Final** | **0.7250** | **0.043** | **5.7e-8** | **1.00** | **Eval loss: 0.7250, perplexity: 1.65** |

The loss dropped from 0.80 → 0.72 — a gentle 10% decline over one epoch. For comparison, a randomly initialized model would start at ~10.8 (the natural log of the 50,280-token vocabulary). Starting at 0.80 reflects that ModernBERT already knows English; DAPT taught it SEC-specific token co-occurrence patterns ("NIST CSF", "materiality assessment", "tabletop exercise"), not language fundamentals. grad_norm remained stable at 0.04-0.07 throughout with zero instability. Total training time: ~14 hours across two sessions on an RTX 3090 (resumed from checkpoint-1280).
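The ~10.8 baseline is simply the cross-entropy of a uniform guess over the vocabulary:

```python
# Expected MLM loss for a randomly initialized model: -ln(1/V) = ln(V).
import math

vocab_size = 50_280  # ModernBERT vocabulary size, per the text above
uniform_loss = math.log(vocab_size)
print(f"{uniform_loss:.2f}")  # 10.83
```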
The DAPT checkpoint is saved at `checkpoints/dapt/modernbert-large/final/` and is ready for TAPT.

### TAPT Configuration

The TAPT corpus is 72K Item 1C paragraphs (~10M tokens) — 50x smaller than the DAPT corpus. This changes several training decisions vs. DAPT. Config file: `python/configs/tapt/modernbert.yaml`.

| Parameter | DAPT | TAPT | Rationale for change |
|-----------|------|------|---------------------|
| `max_seq_length` | 8192 | 512 | Data-driven: paragraphs average 127 tokens (P99=386, 99.6% fit in 512). Using 8192 would mean 98.5% padding — pure waste. See the seq_len discussion below. |
| `num_train_epochs` | 1 | 5 | Gururangan et al. (2020) ran 100 epochs on 50-500K token TAPT corpora. We match total token exposure: 5 × 10M = 50M tokens ≈ upper bound of their TAPT exposure. |
| `whole_word_mask` | false | true | Masks entire words instead of subword pieces. Prevents trivially solvable masking patterns (e.g., masked `cyber` next to unmasked `security`). The model already knows subword composition from DAPT — TAPT should focus on domain-specific whole words ("CISO", "materiality", "tabletop"). |
| `per_device_train_batch_size` | 4 | 32 | Short sequences free VRAM. Tested: batch=32 uses 22.7 GB with torch.compile (vs. OOM at batch=48). |
| `gradient_accumulation_steps` | 8 | 1 | Effective batch = 32 in both cases. No accumulation needed since batch=32 fits directly. |
| `gradient_checkpointing` | true | false | Not needed at seq_len=512 — activations are small. Gradient checkpointing would slow training 30-40% for no memory benefit. |
| `save_strategy` / `eval_strategy` | steps (256) | epoch | 5 epochs; checkpoint and evaluate after each one. |
| `validation_split` | 0.02 | 0.05 | Larger val split for a 50x smaller dataset — need enough samples for stable eval loss. |

**Sequence length (512 vs. 8192):** The concern with a shorter seq_len is degrading the model's long-range attention capabilities. Three factors make this a non-issue for TAPT:

1. **The data is short.** Paragraphs average 127 tokens. There is no long-range structure to learn — the information simply isn't there.
2. **Scale of exposure.** TAPT is 50M token-exposures (5 epochs × 10M). ModernBERT was pre-trained on ~2T tokens; DAPT added 500M. 50M is 0.0025% of original pre-training — far too small to cause catastrophic forgetting of patterns established over trillions of tokens.
3. **RoPE positions are independent.** ModernBERT uses rotary position embeddings. Positions 0-511 compute identically whether max_length is 512 or 8192. Training at 512 updates the same parameters; positions 512-8191 remain as-is from DAPT, not degraded.

**Whole-word masking and tokenization:** Whole-word masking requires `offset_mapping` from the tokenizer to determine word boundaries. This is incompatible with DAPT's concatenate-and-chunk approach (which destroys offset_mapping by merging documents). TAPT tokenizes each paragraph individually with truncation, preserving offset_mapping. The data collator handles dynamic padding per batch. This is a different code path from DAPT's concatenation, but the data justifies it: paragraphs are natural self-contained units, unlike DAPT's long filings that must be chunked.

**Estimated training time:** ~2,138 steps/epoch × 5 epochs = ~10,700 total steps. At seq_len=512 on the 3090 (~0.5-1s/step), ballpark 1-3 hours.

---
@ -801,7 +824,7 @@ Only nano's portion ($21.24) of the first run was wasted — the gemini and grok

| Stage 2 judge production run (~3-5K paragraphs) | ~1h | ~$20-40 |
| Training data assembly | ~2h | $0 |
| DAPT pre-training (1 epoch) | ~4-8h GPU | $0 (own 3090) |
| TAPT pre-training (5 epochs, WWM) | ~1-3h GPU | $0 |
| Fine-tuning + ablations (7 experiments) | ~12-20h GPU | $0 |
| Full GenAI benchmark on 1,200 holdout (9 models) | ~1h | ~$30-50 |
| Evaluation + comparison + write-up | ~6-8h | $0 |
@ -31,11 +31,8 @@

## What's In Progress

### DAPT Training — Complete

Final eval loss: 0.7250, perplexity: 1.65. Loss: 0.80 → 0.72 over 1 epoch on 500M tokens. ~14h total across 2 sessions on RTX 3090. Checkpoint at `checkpoints/dapt/modernbert-large/final/`.

### Human Labeling (139/1,200)

- 3 of 6 annotators started: 68 + 50 + 21 paragraphs completed
@ -44,13 +41,10 @@ Running on RTX 3090. 500M tokens (newest filings), batch=4, ~27s/step, ~1,950 st

## What's Next (in dependency order)

### 1. TAPT (~1-3h, ready to run)

Continue MLM on 72K Item 1C paragraphs using the DAPT checkpoint. 5 epochs, whole-word masking, seq_len=512, batch=32.

```bash
cd python && bun run py:train dapt --config configs/tapt/modernbert.yaml
```

### 2. Fine-tuning pipeline (no blockers — can build now)
@ -23,7 +23,9 @@ training:

  bf16: true
  gradient_checkpointing: true
  logging_steps: 50
  save_strategy: steps
  save_steps: 256
  eval_strategy: steps
  eval_steps: 256
  save_total_limit: 8
  dataloader_num_workers: 4
python/configs/tapt/modernbert.yaml (new file, 30 lines)
@ -0,0 +1,30 @@

stage: tapt

model:
  name_or_path: ../checkpoints/dapt/modernbert-large/final
  trust_remote_code: false

data:
  corpus_path: ../data/paragraphs/paragraphs-clean.patched.jsonl
  text_field: text
  max_seq_length: 512        # 99.6% of paragraphs fit; mean=127, P99=386
  validation_split: 0.05     # larger val split — small dataset

training:
  output_dir: ../checkpoints/tapt/modernbert-large
  learning_rate: 5.0e-5
  mlm_probability: 0.30
  whole_word_mask: true
  num_train_epochs: 5
  per_device_train_batch_size: 32   # 22.7 GB peak w/ torch.compile at seq_len=512
  gradient_accumulation_steps: 1    # effective batch = 32 (matches DAPT)
  warmup_ratio: 0.05
  weight_decay: 1.0e-5
  bf16: true
  gradient_checkpointing: false     # short sequences, not needed
  logging_steps: 50
  save_strategy: epoch
  eval_strategy: epoch
  save_total_limit: 6               # keep all 5 epoch checkpoints + final
  dataloader_num_workers: 4
  seed: 42
@ -10,6 +10,8 @@ dependencies = [

    "datasets",
    "accelerate",
    "pyyaml",
    "nvidia-cusparselt-cu12>=0.8.1",
    "nvidia-nvshmem-cu12>=3.6.5",
]

[project.optional-dependencies]
@ -33,6 +33,7 @@ class TrainingConfig:

    output_dir: str
    learning_rate: float = 5e-5
    mlm_probability: float = 0.30
    whole_word_mask: bool = False
    num_train_epochs: int = 1
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 8

@ -41,8 +42,10 @@ class TrainingConfig:

    bf16: bool = True
    gradient_checkpointing: bool = True
    logging_steps: int = 50
    save_strategy: str = "steps"   # "steps" or "epoch"
    save_steps: int = 1000         # ignored when save_strategy="epoch"
    eval_strategy: str = "steps"   # "steps" or "epoch"
    eval_steps: int = 1000         # ignored when eval_strategy="epoch"
    save_total_limit: int = 3
    dataloader_num_workers: int = 4
    seed: int = 42
|||||||
@ -62,13 +62,14 @@ def train(config: DAPTConfig) -> None:
|
|||||||
dataset = load_corpus(config.data.corpus_path, config.data.text_field)
|
dataset = load_corpus(config.data.corpus_path, config.data.text_field)
|
||||||
print(f" Raw documents: {len(dataset):,}")
|
print(f" Raw documents: {len(dataset):,}")
|
||||||
|
|
||||||
# Filter tiny documents (cover pages, empty filings)
|
# Filter tiny documents (cover pages, empty filings) — DAPT only
|
||||||
min_chars = 10_000
|
if config.stage == "dapt":
|
||||||
before = len(dataset)
|
min_chars = 10_000
|
||||||
dataset = dataset.filter(lambda x: len(x[config.data.text_field]) >= min_chars)
|
before = len(dataset)
|
||||||
filtered = before - len(dataset)
|
dataset = dataset.filter(lambda x: len(x[config.data.text_field]) >= min_chars)
|
||||||
if filtered > 0:
|
filtered = before - len(dataset)
|
||||||
print(f" Filtered {filtered} docs < {min_chars:,} chars → {len(dataset):,} remaining")
|
if filtered > 0:
|
||||||
|
print(f" Filtered {filtered} docs < {min_chars:,} chars → {len(dataset):,} remaining")
|
||||||
|
|
||||||
# Subsample corpus if max_tokens is set (Ponnock 2025: diminishing
|
# Subsample corpus if max_tokens is set (Ponnock 2025: diminishing
|
||||||
# returns beyond ~250M tokens for SEC DAPT). Takes from the END of
|
# returns beyond ~250M tokens for SEC DAPT). Takes from the END of
|
||||||
@ -88,12 +89,17 @@ def train(config: DAPTConfig) -> None:
|
|||||||
est_tokens = cumulative / chars_per_token
|
est_tokens = cumulative / chars_per_token
|
||||||
print(f" Subsampled to {n - keep_from:,} docs (~{est_tokens / 1e6:.0f}M tokens, newest filings, max_tokens={config.data.max_tokens:,})")
|
print(f" Subsampled to {n - keep_from:,} docs (~{est_tokens / 1e6:.0f}M tokens, newest filings, max_tokens={config.data.max_tokens:,})")
|
||||||
|
|
||||||
print(f" Tokenizing and chunking to {config.data.max_seq_length} tokens...")
|
wwm = config.training.whole_word_mask
|
||||||
|
if wwm:
|
||||||
|
print(f" Tokenizing to {config.data.max_seq_length} tokens (whole-word mask)...")
|
||||||
|
else:
|
||||||
|
print(f" Tokenizing and chunking to {config.data.max_seq_length} tokens...")
|
||||||
chunked = tokenize_and_chunk(
|
chunked = tokenize_and_chunk(
|
||||||
dataset,
|
dataset,
|
||||||
tokenizer,
|
tokenizer,
|
||||||
text_field=config.data.text_field,
|
text_field=config.data.text_field,
|
||||||
max_seq_length=config.data.max_seq_length,
|
max_seq_length=config.data.max_seq_length,
|
||||||
|
whole_word_mask=wwm,
|
||||||
)
|
)
|
||||||
print(f" Training sequences: {len(chunked):,}")
|
print(f" Training sequences: {len(chunked):,}")
|
||||||
|
|
||||||
@ -113,30 +119,34 @@ def train(config: DAPTConfig) -> None:
|
|||||||
tokenizer=tokenizer,
|
tokenizer=tokenizer,
|
||||||
mlm=True,
|
mlm=True,
|
||||||
mlm_probability=config.training.mlm_probability,
|
mlm_probability=config.training.mlm_probability,
|
||||||
|
whole_word_mask=config.training.whole_word_mask,
|
||||||
)
|
)
|
||||||
|
|
||||||
# Training arguments
|
# Training arguments
|
||||||
output_dir = Path(config.training.output_dir)
|
output_dir = Path(config.training.output_dir)
|
||||||
args = TrainingArguments(
|
steps_per_epoch = len(split["train"]) // (
|
||||||
|
config.training.per_device_train_batch_size
|
||||||
|
* config.training.gradient_accumulation_steps
|
||||||
|
)
|
||||||
|
warmup_steps = int(config.training.warmup_ratio * steps_per_epoch)
|
||||||
|
|
||||||
|
training_kwargs: dict = dict(
|
||||||
output_dir=str(output_dir),
|
output_dir=str(output_dir),
|
||||||
learning_rate=config.training.learning_rate,
|
learning_rate=config.training.learning_rate,
|
||||||
num_train_epochs=config.training.num_train_epochs,
|
num_train_epochs=config.training.num_train_epochs,
|
||||||
per_device_train_batch_size=config.training.per_device_train_batch_size,
|
per_device_train_batch_size=config.training.per_device_train_batch_size,
|
||||||
gradient_accumulation_steps=config.training.gradient_accumulation_steps,
|
gradient_accumulation_steps=config.training.gradient_accumulation_steps,
|
||||||
warmup_steps=int(config.training.warmup_ratio * (len(split["train"]) // (config.training.per_device_train_batch_size * config.training.gradient_accumulation_steps))),
|
warmup_steps=warmup_steps,
|
||||||
weight_decay=config.training.weight_decay,
|
weight_decay=config.training.weight_decay,
|
||||||
bf16=config.training.bf16,
|
bf16=config.training.bf16,
|
||||||
gradient_checkpointing=config.training.gradient_checkpointing,
|
gradient_checkpointing=config.training.gradient_checkpointing,
|
||||||
gradient_checkpointing_kwargs={"use_reentrant": False},
|
|
||||||
torch_compile=True,
|
torch_compile=True,
|
||||||
optim="adamw_torch_fused",
|
optim="adamw_torch_fused",
|
||||||
tf32=True,
|
tf32=True,
|
||||||
per_device_eval_batch_size=1,
|
|
||||||
dataloader_persistent_workers=True,
|
dataloader_persistent_workers=True,
|
||||||
logging_steps=config.training.logging_steps,
|
logging_steps=config.training.logging_steps,
|
||||||
save_steps=config.training.save_steps,
|
save_strategy=config.training.save_strategy,
|
||||||
eval_strategy="steps",
|
eval_strategy=config.training.eval_strategy,
|
||||||
eval_steps=config.training.eval_steps,
|
|
||||||
save_total_limit=config.training.save_total_limit,
|
save_total_limit=config.training.save_total_limit,
|
||||||
dataloader_num_workers=config.training.dataloader_num_workers,
|
dataloader_num_workers=config.training.dataloader_num_workers,
|
||||||
seed=config.training.seed,
|
seed=config.training.seed,
|
||||||
@ -145,6 +155,19 @@ def train(config: DAPTConfig) -> None:
|
|||||||
metric_for_best_model="eval_loss",
|
metric_for_best_model="eval_loss",
|
||||||
)
|
)
|
||||||
|
|
||||||
|
if config.training.gradient_checkpointing:
|
||||||
|
training_kwargs["gradient_checkpointing_kwargs"] = {"use_reentrant": False}
|
||||||
|
# Long sequences need small eval batch to avoid OOM
|
||||||
|
training_kwargs["per_device_eval_batch_size"] = 1
|
||||||
|
|
||||||
|
# Only pass step counts when using step-based strategy
|
||||||
|
if config.training.save_strategy == "steps":
|
||||||
|
training_kwargs["save_steps"] = config.training.save_steps
|
||||||
|
if config.training.eval_strategy == "steps":
|
||||||
|
training_kwargs["eval_steps"] = config.training.eval_steps
|
||||||
|
|
||||||
|
args = TrainingArguments(**training_kwargs)
|
||||||
|
|
||||||
trainer = Trainer(
|
trainer = Trainer(
|
||||||
model=model,
|
model=model,
|
||||||
args=args,
|
args=args,
|
||||||
|