diff --git a/docs/DAPT-PROCEDURE.md b/docs/DAPT-PROCEDURE.md index aee5134..30bcc28 100644 --- a/docs/DAPT-PROCEDURE.md +++ b/docs/DAPT-PROCEDURE.md @@ -166,42 +166,72 @@ checkpoints/dapt/modernbert-large/ ## Step 2: TAPT -After DAPT completes, continue MLM on the 72K Item 1C paragraphs specifically. +After DAPT completes, continue MLM on the 72K Item 1C paragraphs using the DAPT checkpoint. ### Command ```bash -bun run py:train dapt --config configs/dapt/modernbert.yaml \ - --model-path ../checkpoints/dapt/modernbert-large/final \ - --data-path ../data/paragraphs/paragraphs-clean.patched.jsonl \ - --output-dir ../checkpoints/tapt/modernbert-large \ - --stage tapt +cd python +bun run py:train dapt --config configs/tapt/modernbert.yaml ``` -### TAPT Configuration Differences +Equivalent to: `uv run main.py dapt --config configs/tapt/modernbert.yaml` -The TAPT corpus is 72K Item 1C paragraphs (~10M tokens) — 50x smaller than DAPT. This changes the training dynamics: +### TAPT Configuration -**Epochs: 5-10 (not 1).** Gururangan et al. (2020) ran TAPT for 100 epochs, but their corpora were 50-500K tokens — 20-200x smaller than ours. We match on total token exposure (~50-100M) rather than epoch count: 5-10 epochs × 10M tokens = 50-100M tokens, comparable to the upper end of their TAPT exposure. +**Config file:** `python/configs/tapt/modernbert.yaml` -**Whole-word masking (optional).** `DataCollatorForWholeWordMasking` masks entire words instead of random subword tokens, naturally emphasizing content words over fragments. Worth trying for TAPT since the model already knows subword patterns from DAPT — TAPT should focus on domain-specific whole words ("CISO", "materiality", "tabletop"). One-line change in `train.py`. +The TAPT corpus is 72K Item 1C paragraphs (~10M tokens) — 50x smaller than DAPT. This changes several training decisions: + +| Parameter | Value | vs. 
DAPT | Rationale | +|-----------|-------|----------|-----------| +| `max_seq_length` | 512 | 8192 → 512 | Data-driven: paragraphs average 127 tokens (P99=386, 99.6% fit in 512). 8192 would be 98.5% padding. | +| `num_train_epochs` | 5 | 1 → 5 | Match total token exposure: 5 × 10M = 50M ≈ upper bound of Gururangan et al. (2020) TAPT exposure. | +| `whole_word_mask` | true | false → true | Mask entire words, not subword pieces. Model knows subword composition from DAPT; TAPT focuses on domain-specific whole words ("CISO", "materiality", "tabletop"). | +| `per_device_train_batch_size` | 32 | 4 → 32 | Short sequences free VRAM. Tested: 22.7 GB peak with torch.compile at batch=32 (OOM at 48). | +| `gradient_accumulation_steps` | 1 | 8 → 1 | Effective batch = 32 in both cases. | +| `gradient_checkpointing` | false | true → false | Not needed at 512 seq_len. Would add 30-40% overhead for no benefit. | +| `save_strategy` | epoch | steps → epoch | Checkpoint + evaluate after each of 5 epochs. | +| `validation_split` | 0.05 | 0.02 → 0.05 | Larger val split for 50x smaller dataset — need stable eval loss. | + +### Sequence Length Decision (512 vs. 8192) + +DAPT used 8192 to match ModernBERT's pre-training context and exercise all positional embeddings and global attention layers. TAPT uses 512 because: + +1. **The data fits in 512.** Paragraphs average 127 tokens (P99=386). There is no long-range structure to learn. +2. **50M tokens won't cause forgetting.** TAPT's 50M token exposure is 0.0025% of ModernBERT's ~2T pre-training and 10% of DAPT. The model's long-range patterns are deeply established. +3. **RoPE is position-independent.** Positions 0-511 compute identically at any max_length. Positions 512-8191 remain untouched from DAPT. + +### Whole-Word Masking Implementation + +Whole-word masking requires `offset_mapping` from the tokenizer to determine word boundaries. 
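The grouping rule can be sketched independently of the HF collator. A toy illustration (hypothetical tokens and offsets, not the project's actual code path), assuming the fast-tokenizer convention that a subword continues the previous word when its character start equals the previous token's end offset:

```python
import random

# Toy subword tokenization of "cybersecurity risk" with character offsets,
# mimicking what a fast tokenizer's offset_mapping provides.
tokens = ["cyber", "security", "risk"]
offsets = [(0, 5), (5, 13), (14, 18)]

def word_groups(offsets):
    """Group token indices into whole words: a token continues the previous
    word when its start offset equals the previous token's end offset."""
    groups = []
    for i, (start, _end) in enumerate(offsets):
        if groups and start == offsets[i - 1][1]:
            groups[-1].append(i)  # continuation subword, same word
        else:
            groups.append([i])    # new word starts here
    return groups

def whole_word_mask(tokens, offsets, mlm_probability=0.30, seed=42):
    """Select whole words, then mask every subword of each selected word."""
    rng = random.Random(seed)
    out = list(tokens)
    for group in word_groups(offsets):
        if rng.random() < mlm_probability:
            for i in group:
                out[i] = "[MASK]"
    return out

print(word_groups(offsets))  # [[0, 1], [2]]: "cyber" + "security" form one word
print(whole_word_mask(tokens, offsets))
```

Both subwords of "cybersecurity" are masked together or not at all; plain subword masking could leave `security` visible next to a masked `cyber`, making the prediction trivial. Either way, the grouping needs per-token character offsets, which is exactly what `offset_mapping` provides.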
This is incompatible with DAPT's concatenate-and-chunk approach (which destroys offset_mapping by merging documents). For TAPT, each paragraph is tokenized individually with truncation, preserving `offset_mapping`. The data collator handles dynamic padding per batch. + +Note: with `whole_word_mask=True`, the HuggingFace collator automatically disables random token replacement (`mask_replace_prob=1.0`). All masked positions receive the `[MASK]` token. ### What happens -1. Loads the DAPT checkpoint (not the base ModernBERT) +1. Loads the DAPT checkpoint from `checkpoints/dapt/modernbert-large/final/` 2. Loads 72,045 patched paragraphs from `paragraphs-clean.patched.jsonl` -3. Tokenizes, concatenates, chunks (~10M tokens → ~1,220 sequences at 8192) -4. Trains 5-10 epochs of MLM with different masking each epoch -5. Saves to `checkpoints/tapt/modernbert-large/final/` +3. Tokenizes each paragraph individually (truncation at 512, with offset_mapping for whole-word masking) +4. Splits 5% validation (~3,602 paragraphs), 95% train (~68,443 paragraphs) +5. Trains 5 epochs of MLM with whole-word masking — different masking each epoch +6. Saves checkpoint after each epoch; saves final model to `checkpoints/tapt/modernbert-large/final/` ### Expected duration -~1-2 hours (5-10 epochs on ~1,220 sequences = ~190-380 optimizer steps). +~2,138 steps/epoch × 5 epochs = ~10,700 total steps. At seq_len=512 on the 3090 (~0.5-1s/step), estimated 1-3 hours. + +### Resume if interrupted + +Re-run the same command — it detects existing checkpoints and resumes automatically. ### Output ``` checkpoints/tapt/modernbert-large/ + checkpoint-epoch-1/ + checkpoint-epoch-2/ + ... 
final/ <- SEC-cyBERT-large (DAPT + TAPT) ``` @@ -247,6 +277,7 @@ Every 1,000 steps, it also reports: | File | Purpose | |------|---------| | `python/configs/dapt/modernbert.yaml` | DAPT config | +| `python/configs/tapt/modernbert.yaml` | TAPT config | | `python/configs/dapt/neobert.yaml` | NeoBERT config (if needed) | | `python/main.py` | CLI entrypoint | | `python/src/dapt/train.py` | Training loop | diff --git a/docs/NARRATIVE.md b/docs/NARRATIVE.md index 49dc365..9ffd935 100644 --- a/docs/NARRATIVE.md +++ b/docs/NARRATIVE.md @@ -745,16 +745,39 @@ Full procedure, optimization journey, and cloud cost analysis in `docs/DAPT-PROC | Step | Loss | grad_norm | LR | Epoch | Note | |------|------|-----------|-----|-------|------| -| 54 | 0.7991 | 0.066 | 2.66e-5 | 0.03 | Still in warmup (first 93 steps) | -| 1280 | 0.7233 | 0.068 | 1.57e-5 | 0.70 | 70% through, steady decline | +| 54 | 0.7991 | 0.066 | 2.66e-5 | 0.03 | Warmup phase | +| 1280 | 0.7233 | 0.068 | 1.57e-5 | 0.70 | Steady decline | +| 1800 | 0.7253 | 0.073 | 1.48e-6 | 0.97 | LR near zero, loss plateaued | +| **Final** | **0.7250** | **0.043** | **5.7e-8** | **1.00** | **Eval loss: 0.7250, perplexity: 1.65** | -The loss dropped 0.076 over ~1,200 steps — a gentle, steady downward drift. For comparison, a randomly initialized model would start at ~10.8 (ln(50280 vocab size)). Starting at 0.80 reflects that ModernBERT already knows English; the model is learning SEC-specific token co-occurrence patterns, not language fundamentals. grad_norm remained stable at ~0.07 throughout, indicating healthy, non-volatile weight updates. +The loss dropped from 0.80 → 0.72 — a gentle 10% decline over one epoch. For comparison, a randomly initialized model would start at ~10.8 (ln(50280 vocab size)). Starting at 0.80 reflects that ModernBERT already knows English; DAPT taught it SEC-specific token co-occurrence patterns ("NIST CSF", "materiality assessment", "tabletop exercise"), not language fundamentals. 
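That ~10.8 baseline is just the cross-entropy of a uniform guess over ModernBERT's 50,280-token vocabulary, which a two-line check confirms:

```python
import math

# A randomly initialized MLM assigns roughly uniform probability to every
# vocabulary entry, so its expected loss is ln(V).
vocab_size = 50_280
print(round(math.log(vocab_size), 2))  # 10.83
```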
grad_norm remained stable at 0.04-0.07 throughout. Total training time: ~14 hours across two sessions on an RTX 3090 (resumed from checkpoint-1280). -### TAPT Planning +The DAPT checkpoint is saved at `checkpoints/dapt/modernbert-large/final/` and is ready for TAPT. -The TAPT corpus is 72K Item 1C paragraphs (~10M tokens) — 50x smaller than the DAPT corpus. Following Gururangan et al. (2020), we run multiple epochs to compensate, but match on total token exposure rather than blindly copying their 100-epoch setting (which was calibrated for 50-500K token corpora, 20-200x smaller than ours). 5-10 epochs × 10M = 50-100M total tokens, matching the upper end of their TAPT exposure. Estimated training time: ~1-2 hours. +### TAPT Configuration -One planned experiment: **whole-word masking** (`DataCollatorForWholeWordMasking`) for TAPT, which masks entire words rather than random subword tokens. Since the model already knows subword patterns from DAPT, TAPT should focus on domain-specific whole words. This is a one-line change. +The TAPT corpus is 72K Item 1C paragraphs (~10M tokens) — 50x smaller than the DAPT corpus. This changes several training decisions vs. DAPT. Config file: `python/configs/tapt/modernbert.yaml`. + +| Parameter | DAPT | TAPT | Rationale for change | +|-----------|------|------|---------------------| +| `max_seq_length` | 8192 | 512 | Data-driven: paragraphs average 127 tokens (P99=386, 99.6% fit in 512). Using 8192 would mean 98.5% padding — pure waste. See seq_len discussion below. | +| `num_train_epochs` | 1 | 5 | Gururangan et al. (2020) ran 100 epochs on 50-500K token TAPT corpora. We match total token exposure: 5 × 10M = 50M tokens ≈ upper bound of their TAPT exposure. | +| `whole_word_mask` | false | true | Masks entire words instead of subword pieces. Prevents trivially solvable masking patterns (e.g., masked `cyber` next to unmasked `security`). 
The model already knows subword composition from DAPT — TAPT should focus on domain-specific whole words ("CISO", "materiality", "tabletop"). | +| `per_device_train_batch_size` | 4 | 32 | Short sequences free VRAM. Tested: batch=32 uses 22.7 GB with torch.compile (vs. OOM at batch=48). | +| `gradient_accumulation_steps` | 8 | 1 | Effective batch = 32 in both cases. No accumulation needed since batch=32 fits directly. | +| `gradient_checkpointing` | true | false | Not needed at seq_len=512 — activations are small. Gradient checkpointing would slow training 30-40% for no memory benefit. | +| `save_strategy` / `eval_strategy` | steps (256) | epoch | 5 epochs; checkpoint and evaluate after each one. | +| `validation_split` | 0.02 | 0.05 | Larger val split for a 50x smaller dataset — need enough samples for stable eval loss. | + +**Sequence length (512 vs. 8192):** The concern with a shorter seq_len is degrading the model's long-range attention capabilities. Three factors make this a non-issue for TAPT: + +1. **The data is short.** Paragraphs average 127 tokens. There is no long-range structure to learn — the information simply isn't there. +2. **Scale of exposure.** TAPT is 50M token-exposures (5 epochs × 10M). ModernBERT was pre-trained on ~2T tokens; DAPT added 500M. 50M is 0.0025% of original pre-training — far too small to cause catastrophic forgetting of patterns established over trillions of tokens. +3. **RoPE positions are independent.** ModernBERT uses rotary position embeddings. Positions 0-511 compute identically whether max_length is 512 or 8192. Training at 512 updates the same parameters; positions 512-8191 remain as-is from DAPT, not degraded. + +**Whole-word masking and tokenization:** Whole-word masking requires `offset_mapping` from the tokenizer to determine word boundaries. This is incompatible with DAPT's concatenate-and-chunk approach (which destroys offset_mapping by merging documents). 
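The per-batch dynamic padding mentioned above can be sketched in a few lines (toy token ids, not the actual HF collator):

```python
def pad_batch(sequences, pad_id=0):
    """Pad a batch of token-id lists to the batch's own longest sequence,
    returning padded ids and an attention mask (1 = real token, 0 = pad)."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

# A batch of short paragraphs pads only to its longest member, not to 512.
ids, mask = pad_batch([[5, 6, 7, 8], [9, 10]])
print(ids)   # [[5, 6, 7, 8], [9, 10, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

With paragraphs averaging 127 tokens, padding per batch rather than to a fixed 512 avoids most of the remaining waste.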
TAPT tokenizes each paragraph individually with truncation, preserving offset_mapping. The data collator handles dynamic padding per batch. This is a different code path from DAPT's concatenation, but the data justifies it: paragraphs are natural self-contained units, unlike DAPT's long filings that must be chunked. + +**Estimated training time:** ~2,138 steps/epoch × 5 epochs = ~10,700 total steps. At seq_len=512 on the 3090 (~0.5-1s/step), ballpark 1-3 hours. --- @@ -801,7 +824,7 @@ Only nano's portion ($21.24) of the first run was wasted — the gemini and grok | Stage 2 judge production run (~3-5K paragraphs) | ~1h | ~$20-40 | | Training data assembly | ~2h | $0 | | DAPT pre-training (1 epoch) | ~4-8h GPU | $0 (own 3090) | -| TAPT pre-training | ~2-3h GPU | $0 | +| TAPT pre-training (5 epochs, WWM) | ~1-3h GPU | $0 | | Fine-tuning + ablations (7 experiments) | ~12-20h GPU | $0 | | Full GenAI benchmark on 1,200 holdout (9 models) | ~1h | ~$30-50 | | Evaluation + comparison + write-up | ~6-8h | $0 | diff --git a/docs/STATUS.md b/docs/STATUS.md index 75a9fbc..168ec23 100644 --- a/docs/STATUS.md +++ b/docs/STATUS.md @@ -31,11 +31,8 @@ ## What's In Progress -### DAPT Training (~13.5h, running) -```bash -cd python && bun run py:train dapt --config configs/dapt/modernbert.yaml -``` -Running on RTX 3090. 500M tokens (newest filings), batch=4, ~27s/step, ~1,950 steps. Checkpoints every 256 steps (~1.9h). Resume-safe — ctrl+c and restart to continue from last checkpoint. +### DAPT Training — Complete +Final eval loss: 0.7250, perplexity: 1.65. Loss: 0.80 → 0.72 over 1 epoch on 500M tokens. ~14h total across 2 sessions on RTX 3090. Checkpoint at `checkpoints/dapt/modernbert-large/final/`. ### Human Labeling (139/1,200) - 3 of 6 annotators started: 68 + 50 + 21 paragraphs completed @@ -44,13 +41,10 @@ Running on RTX 3090. 500M tokens (newest filings), batch=4, ~27s/step, ~1,950 st ## What's Next (in dependency order) -### 1. 
TAPT (~2-3h, blocked on DAPT) -Continue MLM on 72K Item 1C paragraphs using the DAPT checkpoint. +### 1. TAPT (~1-3h, ready to run) +Continue MLM on 72K Item 1C paragraphs using the DAPT checkpoint. 5 epochs, whole-word masking, seq_len=512, batch=32. ```bash -bun run py:train dapt --config configs/dapt/modernbert.yaml \ - --model-path ../checkpoints/dapt/modernbert-large/final \ - --data-path ../data/paragraphs/paragraphs-clean.patched.jsonl \ - --output-dir ../checkpoints/tapt/modernbert-large --stage tapt +cd python && bun run py:train dapt --config configs/tapt/modernbert.yaml ``` ### 2. Fine-tuning pipeline (no blockers — can build now) diff --git a/python/configs/dapt/modernbert.yaml b/python/configs/dapt/modernbert.yaml index c5bf2fb..a3d7ba5 100644 --- a/python/configs/dapt/modernbert.yaml +++ b/python/configs/dapt/modernbert.yaml @@ -23,7 +23,9 @@ training: bf16: true gradient_checkpointing: true logging_steps: 50 + save_strategy: steps save_steps: 256 + eval_strategy: steps eval_steps: 256 save_total_limit: 8 dataloader_num_workers: 4 diff --git a/python/configs/tapt/modernbert.yaml b/python/configs/tapt/modernbert.yaml new file mode 100644 index 0000000..49bcfca --- /dev/null +++ b/python/configs/tapt/modernbert.yaml @@ -0,0 +1,30 @@ +stage: tapt + +model: + name_or_path: ../checkpoints/dapt/modernbert-large/final + trust_remote_code: false + +data: + corpus_path: ../data/paragraphs/paragraphs-clean.patched.jsonl + text_field: text + max_seq_length: 512 # 99.6% of paragraphs fit; mean=127, P99=386 + validation_split: 0.05 # larger val split — small dataset + +training: + output_dir: ../checkpoints/tapt/modernbert-large + learning_rate: 5.0e-5 + mlm_probability: 0.30 + whole_word_mask: true + num_train_epochs: 5 + per_device_train_batch_size: 32 # 22.7 GB peak w/ torch.compile at seq_len=512 + gradient_accumulation_steps: 1 # effective batch = 32 (matches DAPT) + warmup_ratio: 0.05 + weight_decay: 1.0e-5 + bf16: true + gradient_checkpointing: false # 
short sequences, not needed + logging_steps: 50 + save_strategy: epoch + eval_strategy: epoch + save_total_limit: 6 # keep all 5 epoch checkpoints + final + dataloader_num_workers: 4 + seed: 42 diff --git a/python/pyproject.toml b/python/pyproject.toml index f688a45..680e451 100644 --- a/python/pyproject.toml +++ b/python/pyproject.toml @@ -10,6 +10,8 @@ dependencies = [ "datasets", "accelerate", "pyyaml", + "nvidia-cusparselt-cu12>=0.8.1", + "nvidia-nvshmem-cu12>=3.6.5", ] [project.optional-dependencies] diff --git a/python/src/common/config.py b/python/src/common/config.py index 44803ae..c2f7b83 100644 --- a/python/src/common/config.py +++ b/python/src/common/config.py @@ -33,6 +33,7 @@ class TrainingConfig: output_dir: str learning_rate: float = 5e-5 mlm_probability: float = 0.30 + whole_word_mask: bool = False num_train_epochs: int = 1 per_device_train_batch_size: int = 4 gradient_accumulation_steps: int = 8 @@ -41,8 +42,10 @@ class TrainingConfig: bf16: bool = True gradient_checkpointing: bool = True logging_steps: int = 50 - save_steps: int = 1000 - eval_steps: int = 1000 + save_strategy: str = "steps" # "steps" or "epoch" + save_steps: int = 1000 # ignored when save_strategy="epoch" + eval_strategy: str = "steps" # "steps" or "epoch" + eval_steps: int = 1000 # ignored when eval_strategy="epoch" save_total_limit: int = 3 dataloader_num_workers: int = 4 seed: int = 42 diff --git a/python/src/dapt/train.py b/python/src/dapt/train.py index 10d774e..3d3e9f1 100644 --- a/python/src/dapt/train.py +++ b/python/src/dapt/train.py @@ -62,13 +62,14 @@ def train(config: DAPTConfig) -> None: dataset = load_corpus(config.data.corpus_path, config.data.text_field) print(f" Raw documents: {len(dataset):,}") - # Filter tiny documents (cover pages, empty filings) - min_chars = 10_000 - before = len(dataset) - dataset = dataset.filter(lambda x: len(x[config.data.text_field]) >= min_chars) - filtered = before - len(dataset) - if filtered > 0: - print(f" Filtered {filtered} docs < 
{min_chars:,} chars → {len(dataset):,} remaining") # Subsample corpus if max_tokens is set (Ponnock 2025: diminishing # returns beyond ~250M tokens for SEC DAPT). Takes from the END of @@ -88,12 +89,17 @@ def train(config: DAPTConfig) -> None: est_tokens = cumulative / chars_per_token print(f" Subsampled to {n - keep_from:,} docs (~{est_tokens / 1e6:.0f}M tokens, newest filings, max_tokens={config.data.max_tokens:,})") - print(f" Tokenizing and chunking to {config.data.max_seq_length} tokens...") + wwm = config.training.whole_word_mask + if wwm: + print(f" Tokenizing to {config.data.max_seq_length} tokens (whole-word mask)...") + else: + print(f" Tokenizing and chunking to {config.data.max_seq_length} tokens...") chunked = tokenize_and_chunk( dataset, tokenizer, text_field=config.data.text_field, max_seq_length=config.data.max_seq_length, + whole_word_mask=wwm, ) print(f" Training sequences: {len(chunked):,}") @@ -113,30 +119,34 @@ def train(config: DAPTConfig) -> None: tokenizer=tokenizer, mlm=True, mlm_probability=config.training.mlm_probability, + whole_word_mask=config.training.whole_word_mask, ) # Training arguments output_dir = Path(config.training.output_dir) - args = TrainingArguments( + steps_per_epoch = len(split["train"]) // ( + config.training.per_device_train_batch_size + * config.training.gradient_accumulation_steps + ) + # warmup_ratio is a fraction of *total* training steps (the HF convention), + # not of a single epoch — matters now that TAPT runs 5 epochs. + warmup_steps = int( + config.training.warmup_ratio * steps_per_epoch * config.training.num_train_epochs + ) + + training_kwargs: dict = dict( output_dir=str(output_dir), learning_rate=config.training.learning_rate, num_train_epochs=config.training.num_train_epochs,
per_device_train_batch_size=config.training.per_device_train_batch_size, gradient_accumulation_steps=config.training.gradient_accumulation_steps, - warmup_steps=int(config.training.warmup_ratio * (len(split["train"]) // (config.training.per_device_train_batch_size * config.training.gradient_accumulation_steps))), + warmup_steps=warmup_steps, weight_decay=config.training.weight_decay, bf16=config.training.bf16, gradient_checkpointing=config.training.gradient_checkpointing, - gradient_checkpointing_kwargs={"use_reentrant": False}, torch_compile=True, optim="adamw_torch_fused", tf32=True, - per_device_eval_batch_size=1, dataloader_persistent_workers=True, logging_steps=config.training.logging_steps, - save_steps=config.training.save_steps, - eval_strategy="steps", - eval_steps=config.training.eval_steps, + save_strategy=config.training.save_strategy, + eval_strategy=config.training.eval_strategy, save_total_limit=config.training.save_total_limit, dataloader_num_workers=config.training.dataloader_num_workers, seed=config.training.seed, @@ -145,6 +155,19 @@ def train(config: DAPTConfig) -> None: metric_for_best_model="eval_loss", ) + if config.training.gradient_checkpointing: + training_kwargs["gradient_checkpointing_kwargs"] = {"use_reentrant": False} + # Long sequences need small eval batch to avoid OOM + training_kwargs["per_device_eval_batch_size"] = 1 + + # Only pass step counts when using step-based strategy + if config.training.save_strategy == "steps": + training_kwargs["save_steps"] = config.training.save_steps + if config.training.eval_strategy == "steps": + training_kwargs["eval_steps"] = config.training.eval_steps + + args = TrainingArguments(**training_kwargs) + trainer = Trainer( model=model, args=args,