SEC-cyBERT/docs/DAPT-PROCEDURE.md

DAPT/TAPT Training Procedure

Date: 2026-03-29
Hardware: NVIDIA RTX 3090 (24GB VRAM), CUDA driver 13.2, PyTorch 2.10.0+cu128


Pre-flight Checklist

| Check | Status |
| --- | --- |
| PyTorch 2.10.0+cu128, CUDA available | Verified |
| RTX 3090, 25.3 GB VRAM, bf16 supported | Verified |
| CUDA driver 13.2 / runtime 12.8 forward compatible | Verified (GPU matmul test passed) |
| ModernBERT-large loads: 396M params, max_position_embeddings=8192 | Verified |
| Corpus: 14,756 docs, ~1.06B tokens, 15 shards | Verified |
| After <10K filter: 14,568 docs, ~1.056B tokens (0.027% loss) | Verified |
| Tokenize+chunk pipeline: 10 docs -> 85 sequences of 8192 tokens | Verified |
| Config: seq_len=8192, batch=1, grad_accum=32, 1 epoch, lr=5e-5, mlm=0.30 | Set |

DAPT Corpus Summary

  • 14,568 documents (after filtering out 188 cover pages under 10K chars)
  • ~1.056 billion tokens (ModernBERT tokenizer, 4.72 chars/token)
  • ~136K training sequences at seq_len=8192
  • Median document: ~73K tokens (347K chars) — 90.6% of docs exceed 8192 tokens
  • Cleaned: XBRL data blobs stripped, exhibit listings stripped, URLs removed, F-N page numbers removed
  • Source: 14,759 cached 10-K HTML filings, FY2023-FY2025, processed by ts/scripts/dapt-corpus-prep.ts
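The cover-page filter above amounts to a length threshold on raw text. A minimal sketch (the `filter_docs` helper is hypothetical; the real logic lives in the corpus-prep script and the Python corpus loader):

```python
def filter_docs(docs, min_chars=10_000):
    """Drop cover pages and other stubs shorter than min_chars characters.

    `docs` is an iterable of dicts with a "text" field, as stored in the
    data/dapt-corpus/shard-*.jsonl shards.
    """
    return [d for d in docs if len(d["text"]) >= min_chars]


# Toy example: the 20K-char filing body and the exactly-10K doc survive.
docs = [{"text": "x" * 20_000}, {"text": "cover page"}, {"text": "y" * 10_000}]
kept = filter_docs(docs)
```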

Training Configuration

Config file: python/configs/dapt/modernbert.yaml

| Parameter | Value | Rationale |
| --- | --- | --- |
| max_seq_length | 8192 | Match ModernBERT's pre-training context length |
| per_device_train_batch_size | 1 | Memory-limited at 8192 seq_len on 24GB |
| gradient_accumulation_steps | 32 | Effective batch size = 32 |
| num_train_epochs | 1 | Single pass per Gururangan et al. (2020) and Ponnock (2025) |
| learning_rate | 5e-5 | Standard for continued pre-training |
| mlm_probability | 0.30 | ModernBERT's pre-training masking rate |
| warmup_ratio | 0.05 | ~213 warmup steps |
| gradient_checkpointing | true | Required for 8192 seq_len on 24GB |
| bf16 | true | Native RTX 3090 support |
| save_steps | 1000 | Checkpoint every ~1000 steps |
| eval_steps | 1000 | Evaluate every ~1000 steps |
| save_total_limit | 3 | Keep last 3 checkpoints |
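A sketch of what configs/dapt/modernbert.yaml could look like given the table above (field names are illustrative; the typed schema in python/src/common/config.py is authoritative):

```yaml
model_name: answerdotai/ModernBERT-large
max_seq_length: 8192
per_device_train_batch_size: 1
gradient_accumulation_steps: 32
num_train_epochs: 1
learning_rate: 5.0e-5
mlm_probability: 0.30
warmup_ratio: 0.05
gradient_checkpointing: true
bf16: true
save_steps: 1000
eval_steps: 1000
save_total_limit: 3
```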

Epoch Decision Justification

We train for 1 epoch (single pass over the corpus), following the empirical consensus:

  • Gururangan et al. (2020), "Don't Stop Pretraining" (ACL 2020): Trained DAPT for "12.5K steps, which amounts to a single pass on each domain dataset" across corpora ranging from 2-8B tokens. A single pass was sufficient for consistent downstream gains across all four domains and eight tasks.

  • Ponnock (2025), "The Data Efficiency Frontier of Financial Foundation Models" (arXiv:2512.12384): Found that SEC-specific DAPT exhibits diminishing marginal returns beyond ~250M tokens within a single epoch: "Both models exhibit their largest improvements in the early stages of continued pretraining: loss drops noticeably between 50M and 200M tokens, after which the rate of improvement slows." Our ~1B token corpus is already well past the diminishing-returns threshold.

Additional epochs risk overfitting to the domain corpus without proportional downstream benefit, while general-domain capability remains stable through a single pass.

Sequence Length Decision

ModernBERT was pre-trained with an 8192-token context. We match this during DAPT so that the model's long-range attention behavior receives gradient signal across the full context window; at seq_len=2048, token positions 2048-8191 would never be exercised during DAPT.

The tradeoff is memory: batch_size drops from 4 (at 2048) to 1 (at 8192), compensated by gradient_accumulation=32 to maintain effective batch size of 32. Training time is comparable because 4x fewer steps offset the slower per-step time.
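As a sanity check on the step arithmetic (approximate figures, taking the ~136K-sequence count from the corpus summary):

```python
per_device_batch = 1
grad_accum = 32
effective_batch = per_device_batch * grad_accum      # 32 sequences per optimizer step

num_sequences = 136_000                              # approx., from the corpus summary
steps_per_epoch = num_sequences // effective_batch   # ~4,250 (the run reports ~4,257)
warmup_steps = int(0.05 * steps_per_epoch)           # warmup_ratio = 0.05 -> ~212 steps
```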

For our downstream task (paragraph classification at ~50-400 tokens), the long-context benefit is modest — the primary DAPT benefit is vocabulary and domain language patterns, which transfer at any sequence length. But there is no cost to using 8192, so we preserve the model's full capability.

Step 1: DAPT

Command

```shell
cd python
bun run py:train dapt --config configs/dapt/modernbert.yaml
```

Equivalent to: `uv run main.py dapt --config configs/dapt/modernbert.yaml`

What happens

  1. Loads ModernBERT-large from HuggingFace (cached after first download)
  2. Loads 14,756 docs from data/dapt-corpus/, filters out the 188 docs under 10K chars
  3. Tokenizes all text, concatenates, chunks into ~136K sequences of 8192 tokens
  4. Splits 2% validation (~2,700 sequences), 98% train (~133K sequences)
  5. Trains 1 epoch of MLM with 30% masking, bf16, gradient checkpointing
  6. ~4,257 steps total, logging every 50, checkpoint+eval every 1,000
  7. Saves final model + tokenizer to checkpoints/dapt/modernbert-large/final/
  8. Reports final eval loss and perplexity
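Step 3's concatenate-and-chunk grouping is the standard MLM packing recipe. A self-contained sketch (the `group_into_sequences` helper is hypothetical; the real implementation is in python/src/data/corpus.py):

```python
def group_into_sequences(tokenized_docs, seq_len=8192):
    """Concatenate token-id lists from all docs into one stream, then split
    the stream into fixed-length sequences; the trailing remainder that does
    not fill a full sequence is dropped."""
    stream = [tid for doc in tokenized_docs for tid in doc]
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy example with seq_len=4: 10 token ids -> 2 full sequences, 2 ids dropped.
chunks = group_into_sequences([[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]], seq_len=4)
# chunks == [[1, 2, 3, 4], [5, 6, 7, 8]]
```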

Expected duration

~4-8 hours on RTX 3090 (depends on actual seconds/step at 8192 with gradient checkpointing).

Resume if interrupted

HuggingFace Trainer auto-saves checkpoints every 1,000 steps. Re-run the same command — it detects existing checkpoints and resumes automatically.
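The detection picks the highest-numbered checkpoint-N directory in the output dir; a simplified sketch of what transformers' `get_last_checkpoint` does:

```python
import os
import re


def last_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N subdirectory,
    or None if no checkpoint exists (simplified from
    transformers.trainer_utils.get_last_checkpoint)."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best, best_step = None, -1
    for name in os.listdir(output_dir):
        m = pattern.match(name)
        if m and os.path.isdir(os.path.join(output_dir, name)):
            step = int(m.group(1))
            if step > best_step:
                best, best_step = os.path.join(output_dir, name), step
    return best
```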

Output

```
checkpoints/dapt/modernbert-large/
  checkpoint-1000/
  checkpoint-2000/
  checkpoint-3000/
  final/                  <- final model + tokenizer
    config.json
    model.safetensors
    tokenizer.json
    ...
```

Step 2: TAPT

After DAPT completes, continue MLM on the 72K Item 1C paragraphs specifically.

Command

```shell
bun run py:train dapt --config configs/dapt/modernbert.yaml \
  --model-path ../checkpoints/dapt/modernbert-large/final \
  --data-path ../data/paragraphs/paragraphs-clean.patched.jsonl \
  --output-dir ../checkpoints/tapt/modernbert-large \
  --stage tapt
```

What happens

  1. Loads the DAPT checkpoint (not the base ModernBERT)
  2. Loads 72,045 patched paragraphs from paragraphs-clean.patched.jsonl
  3. Tokenizes, concatenates, chunks (much smaller corpus — ~10M tokens)
  4. Trains MLM with same hyperparameters
  5. Saves to checkpoints/tapt/modernbert-large/final/

Expected duration

~2-3 hours (much smaller corpus).

Output

```
checkpoints/tapt/modernbert-large/
  final/                  <- SEC-cyBERT-large (DAPT + TAPT)
```

Step 3: Ablation Checkpoints

The training pipeline produces clean ablation rows for the paper:

| Model | Checkpoint | Description |
| --- | --- | --- |
| Base | answerdotai/ModernBERT-large | Off-the-shelf, no domain adaptation |
| +DAPT | checkpoints/dapt/modernbert-large/final | After domain pre-training on 14.5K filings |
| +DAPT+TAPT | checkpoints/tapt/modernbert-large/final | After task pre-training on 72K paragraphs |

Each checkpoint can be independently fine-tuned with classification heads to isolate the contribution of each pre-training stage.

Monitoring

During training, the Trainer logs to stderr every 50 steps:

  • loss — training MLM loss (cross-entropy on masked tokens)
  • learning_rate — current LR (ramps up during warmup, then decays)
  • epoch — progress through the epoch

Every 1,000 steps, it also reports:

  • eval_loss — validation MLM loss
  • Perplexity can be computed as exp(eval_loss) — the reported loss is mean cross-entropy in nats, so the base is e, not 2

What to watch for:

  • Training loss should decrease steadily from ~2.5-3.0 to ~1.5-2.0
  • Eval loss should track training loss (if eval loss diverges upward, the model is overfitting — but this is unlikely in 1 epoch)
  • If loss spikes or goes to NaN, the learning rate may be too high
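Since the reported loss is mean cross-entropy in nats, the perplexity conversion is a one-liner:

```python
import math


def perplexity(eval_loss):
    """MLM perplexity from mean cross-entropy loss (natural log base)."""
    return math.exp(eval_loss)


# e.g. an eval_loss of 1.8 corresponds to perplexity ~6.05
print(round(perplexity(1.8), 2))
```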

Artifacts

| File | Purpose |
| --- | --- |
| python/configs/dapt/modernbert.yaml | DAPT config |
| python/configs/dapt/neobert.yaml | NeoBERT config (if needed) |
| python/main.py | CLI entrypoint |
| python/src/dapt/train.py | Training loop |
| python/src/data/corpus.py | Corpus loading + tokenization |
| python/src/common/config.py | Typed YAML config |
| ts/scripts/dapt-corpus-prep.ts | Corpus preparation from HTML |
| ts/scripts/dapt-corpus-analytics.ts | Corpus analytics |
| data/dapt-corpus/shard-*.jsonl | Cleaned corpus (15 shards) |