DAPT/TAPT Training Procedure
Date: 2026-03-29
Hardware: NVIDIA RTX 3090 (24GB VRAM), CUDA driver 13.2, PyTorch 2.10.0+cu128
Pre-flight Checklist
| Check | Status |
|---|---|
| PyTorch 2.10.0+cu128, CUDA available | Verified |
| RTX 3090, 25.3 GB VRAM, bf16 supported | Verified |
| CUDA driver 13.2 / runtime 12.8 forward compatible | Verified (GPU matmul test passed) |
| ModernBERT-large loads: 396M params, max_position_embeddings=8192 | Verified |
| Corpus: 14,756 docs, ~1.06B tokens, 15 shards | Verified |
| After <10K filter: 14,568 docs, ~1.056B tokens (0.027% loss) | Verified |
| Tokenize+chunk pipeline: 10 docs -> 85 sequences of 8192 tokens | Verified |
| Config: seq_len=8192, batch=1, grad_accum=32, 1 epoch, lr=5e-5, mlm=0.30 | Set |
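The tokenize+chunk pipeline verified above (10 docs -> 85 sequences) can be sketched in pure Python. This is a hypothetical helper, not the actual implementation in python/src/data/corpus.py: concatenate token ids across documents, then slice into fixed 8192-token sequences, dropping the ragged tail, as in the standard HuggingFace group_texts recipe.

```python
from typing import Iterable

SEQ_LEN = 8192

def chunk_token_stream(
    docs_token_ids: Iterable[list[int]], seq_len: int = SEQ_LEN
) -> list[list[int]]:
    """Concatenate per-document token ids and slice into fixed-length
    sequences. The trailing remainder shorter than seq_len is dropped."""
    flat: list[int] = []
    for ids in docs_token_ids:
        flat.extend(ids)
    n_full = len(flat) // seq_len
    return [flat[i * seq_len : (i + 1) * seq_len] for i in range(n_full)]
```

Because documents are concatenated before chunking, a sequence can span a document boundary; for MLM this is the usual tradeoff accepted in exchange for zero padding waste.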
DAPT Corpus Summary
- 14,568 documents (after filtering 188 cover pages <10K chars)
- ~1.056 billion tokens (ModernBERT tokenizer, 4.72 chars/token)
- ~136K training sequences at seq_len=8192
- Median document: ~73K tokens (347K chars) — 90.6% of docs exceed 8192 tokens
- Cleaned: XBRL data blobs stripped, exhibit listings stripped, URLs removed, F-N page numbers removed
- Source: 14,759 cached 10-K HTML filings, FY2023-FY2025, processed by ts/scripts/dapt-corpus-prep.ts
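The rounded figures in the summary above are mutually consistent, which a quick arithmetic cross-check confirms (numbers are the document's rounded estimates, not pipeline output):

```python
# Cross-check of the corpus summary's rounded figures.
chars_per_token = 4.72        # measured ratio for the ModernBERT tokenizer
median_doc_chars = 347_000    # stated median document size

median_doc_tokens = median_doc_chars / chars_per_token
assert 73_000 <= median_doc_tokens <= 74_000  # ~73K tokens, as stated
```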
Training Configuration
Config file: python/configs/dapt/modernbert.yaml
| Parameter | Value | Rationale |
|---|---|---|
| max_seq_length | 8192 | Match ModernBERT's pre-training context length |
| per_device_train_batch_size | 1 | Memory-limited at 8192 seq_len on 24GB |
| gradient_accumulation_steps | 32 | Effective batch size = 32 |
| num_train_epochs | 1 | Single pass per Gururangan et al. (2020) and Ponnock (2025) |
| learning_rate | 5e-5 | Standard for continued pre-training |
| mlm_probability | 0.30 | ModernBERT's pre-training masking rate |
| warmup_ratio | 0.05 | ~213 warmup steps |
| gradient_checkpointing | true | Required for 8192 seq_len on 24GB |
| bf16 | true | Native RTX 3090 support |
| save_steps | 1000 | Checkpoint every ~1000 steps |
| eval_steps | 1000 | Evaluate every ~1000 steps |
| save_total_limit | 3 | Keep last 3 checkpoints |
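The step and warmup figures in the table follow directly from the config values and the ~136K-sequence estimate (rounded here; the pipeline's exact counts after the validation split will differ slightly):

```python
import math

# Approximate step arithmetic for the DAPT config.
n_sequences = 136_000      # ~136K sequences of 8192 tokens (rounded estimate)
effective_batch = 1 * 32   # per_device_train_batch_size * gradient_accumulation_steps
epochs = 1
warmup_ratio = 0.05

total_steps = math.ceil(n_sequences * epochs / effective_batch)
warmup_steps = int(total_steps * warmup_ratio)

print(total_steps, warmup_steps)  # -> 4250 212
```

This matches the ~4,257 total steps and ~213 warmup steps quoted elsewhere in this document to within rounding.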
Epoch Decision Justification
We train for 1 epoch (single pass over the corpus), following the empirical consensus:
- Gururangan et al. (2020), "Don't Stop Pretraining" (ACL 2020): trained DAPT for "12.5K steps, which amounts to a single pass on each domain dataset" across corpora ranging from 2-8B tokens. A single pass was sufficient for consistent downstream gains across all four domains and eight tasks.
- Ponnock (2025), "The Data Efficiency Frontier of Financial Foundation Models" (arXiv:2512.12384): found that SEC-specific DAPT exhibits diminishing marginal returns beyond ~250M tokens within a single epoch: "Both models exhibit their largest improvements in the early stages of continued pretraining: loss drops noticeably between 50M and 200M tokens, after which the rate of improvement slows." Our ~1B-token corpus is already well past the diminishing-returns threshold.
Additional epochs risk overfitting to the domain corpus without proportional downstream benefit, while general-domain capability remains stable through a single pass.
Sequence Length Decision
ModernBERT was pre-trained with 8192-token context. We match this during DAPT to ensure all positional embedding and attention weights receive gradient updates. At seq_len=2048, the weights for positions 2048-8191 would receive no updates during DAPT.
The tradeoff is memory: batch_size drops from 4 (at 2048) to 1 (at 8192), compensated by gradient_accumulation=32 to maintain effective batch size of 32. Training time is comparable because 4x fewer steps offset the slower per-step time.
For our downstream task (paragraph classification at ~50-400 tokens), the long-context benefit is modest — the primary DAPT benefit is vocabulary and domain language patterns, which transfer at any sequence length. But there is no cost to using 8192, so we preserve the model's full capability.
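The step arithmetic behind that tradeoff, using the corpus summary's rounded ~1.056B-token count (the pipeline's exact sequence counts differ slightly):

```python
CORPUS_TOKENS = 1_056_000_000  # ~1.056B tokens, rounded, from the corpus summary

def optimizer_steps(seq_len: int, effective_batch: int = 32) -> int:
    """One-epoch optimizer steps at a given sequence length,
    with a fixed effective batch of 32 sequences."""
    sequences = CORPUS_TOKENS // seq_len
    return sequences // effective_batch

steps_2048 = optimizer_steps(2048)  # ~16.1K steps
steps_8192 = optimizer_steps(8192)  # ~4.0K steps
print(round(steps_2048 / steps_8192, 2))  # -> 4.0
```

Each 8192-token step processes 4x the tokens of a 2048-token step, so wall-clock time per epoch is roughly unchanged even though the long-context run takes 4x fewer steps.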
Step 1: DAPT
Command
cd python
bun run py:train dapt --config configs/dapt/modernbert.yaml
Equivalent to: uv run main.py dapt --config configs/dapt/modernbert.yaml
What happens
- Loads ModernBERT-large from HuggingFace (cached after first download)
- Loads 14,756 docs from data/dapt-corpus/, filters 188 docs < 10K chars
- Tokenizes all text, concatenates, chunks into ~136K sequences of 8192 tokens
- Splits 2% validation (~2,700 sequences), 98% train (~133K sequences)
- Trains 1 epoch of MLM with 30% masking, bf16, gradient checkpointing
- ~4,257 steps total, logging every 50, checkpoint+eval every 1,000
- Saves final model + tokenizer to checkpoints/dapt/modernbert-large/final/
- Reports final eval loss and perplexity
Expected duration
~4-8 hours on RTX 3090 (depends on actual seconds/step at 8192 with gradient checkpointing).
Resume if interrupted
HuggingFace Trainer auto-saves checkpoints every 1,000 steps. Re-run the same command — it detects existing checkpoints and resumes automatically.
Output
checkpoints/dapt/modernbert-large/
checkpoint-1000/
checkpoint-2000/
checkpoint-3000/
final/ <- final model + tokenizer
config.json
model.safetensors
tokenizer.json
...
Step 2: TAPT
After DAPT completes, continue MLM on the 72K Item 1C paragraphs specifically.
Command
bun run py:train dapt --config configs/dapt/modernbert.yaml \
--model-path ../checkpoints/dapt/modernbert-large/final \
--data-path ../data/paragraphs/paragraphs-clean.patched.jsonl \
--output-dir ../checkpoints/tapt/modernbert-large \
--stage tapt
What happens
- Loads the DAPT checkpoint (not the base ModernBERT)
- Loads 72,045 patched paragraphs from paragraphs-clean.patched.jsonl
- Tokenizes, concatenates, chunks (much smaller corpus — ~10M tokens)
- Trains MLM with same hyperparameters
- Saves to checkpoints/tapt/modernbert-large/final/
Expected duration
~2-3 hours (much smaller corpus).
Output
checkpoints/tapt/modernbert-large/
final/ <- SEC-cyBERT-large (DAPT + TAPT)
Step 3: Ablation Checkpoints
The training pipeline produces clean ablation rows for the paper:
| Model | Checkpoint | Description |
|---|---|---|
| Base | answerdotai/ModernBERT-large | Off-the-shelf, no domain adaptation |
| +DAPT | checkpoints/dapt/modernbert-large/final | After domain pre-training on 14.5K filings |
| +DAPT+TAPT | checkpoints/tapt/modernbert-large/final | After task pre-training on 72K paragraphs |
Each checkpoint can be independently fine-tuned with classification heads to isolate the contribution of each pre-training stage.
Monitoring
During training, the Trainer logs to stderr every 50 steps:
- loss — training MLM loss (cross-entropy on masked tokens)
- learning_rate — current LR (ramps up during warmup, then decays)
- epoch — progress through the epoch
Every 1,000 steps, it also reports:
- eval_loss — validation MLM loss
- Perplexity can be computed as exp(eval_loss), since PyTorch's cross-entropy loss is in nats
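Note that PyTorch's cross-entropy (and hence the Trainer's eval_loss) is measured in nats, so the conversion to perplexity uses the natural exponential:

```python
import math

def perplexity(eval_loss: float) -> float:
    """MLM perplexity from mean cross-entropy in nats (PyTorch's default)."""
    return math.exp(eval_loss)

print(round(perplexity(1.8), 2))  # -> 6.05
```

A final eval_loss in the expected ~1.5-2.0 range corresponds to a perplexity of roughly 4.5-7.4.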
What to watch for:
- Training loss should decrease steadily from ~2.5-3.0 to ~1.5-2.0
- Eval loss should track training loss (if eval loss diverges upward, the model is overfitting — but this is unlikely in 1 epoch)
- If loss spikes or goes to NaN, the learning rate may be too high
Artifacts
| File | Purpose |
|---|---|
| python/configs/dapt/modernbert.yaml | DAPT config |
| python/configs/dapt/neobert.yaml | NeoBERT config (if needed) |
| python/main.py | CLI entrypoint |
| python/src/dapt/train.py | Training loop |
| python/src/data/corpus.py | Corpus loading + tokenization |
| python/src/common/config.py | Typed YAML config |
| ts/scripts/dapt-corpus-prep.ts | Corpus preparation from HTML |
| ts/scripts/dapt-corpus-analytics.ts | Corpus analytics |
| data/dapt-corpus/shard-*.jsonl | Cleaned corpus (15 shards) |