SEC-cyBERT/docs/DAPT-PROCEDURE.md

DAPT/TAPT Training Procedure

Date: 2026-03-29
Hardware: NVIDIA RTX 3090 (24GB VRAM), CUDA driver 13.2, PyTorch 2.10.0+cu128


Pre-flight Checklist

| Check | Status |
| --- | --- |
| PyTorch 2.10.0+cu128, CUDA available | Verified |
| RTX 3090, 25.3 GB VRAM, bf16 supported | Verified |
| CUDA driver 13.2 / runtime 12.8 forward compatible | Verified (GPU matmul test passed) |
| ModernBERT-large loads: 396M params, max_position_embeddings=8192 | Verified |
| Corpus: 14,756 docs, ~1.06B tokens, 15 shards | Verified |
| After <10K filter: 14,568 docs, ~1.056B tokens (0.027% loss) | Verified |
| Tokenize+chunk pipeline: 10 docs -> 85 sequences of 8192 tokens | Verified |
| Config: seq_len=8192, batch=1, grad_accum=32, 1 epoch, lr=5e-5, mlm=0.30 | Set |

DAPT Corpus Summary

  • 14,568 documents (after filtering out 188 cover pages under 10K chars)
  • ~1.056 billion tokens (ModernBERT tokenizer, 4.72 chars/token)
  • ~136K training sequences at seq_len=8192
  • Median document: ~73K tokens (347K chars) — 90.6% of docs exceed 8192 tokens
  • Cleaned: XBRL data blobs stripped, exhibit listings stripped, URLs removed, F-N page numbers removed
  • Source: 14,759 cached 10-K HTML filings, FY2023-FY2025, processed by ts/scripts/dapt-corpus-prep.ts
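The cover-page filter above amounts to a length threshold on raw text. A minimal sketch (the `filter_docs` helper is hypothetical; the real logic lives in the corpus-prep script and the Python corpus loader):

```python
def filter_docs(docs, min_chars=10_000):
    """Drop cover pages and other stubs shorter than min_chars characters.

    `docs` is an iterable of dicts with a "text" field, as stored in the
    data/dapt-corpus/shard-*.jsonl shards.
    """
    return [d for d in docs if len(d["text"]) >= min_chars]


# Toy example: the 20K-char filing body and the exactly-10K doc survive.
docs = [{"text": "x" * 20_000}, {"text": "cover page"}, {"text": "y" * 10_000}]
kept = filter_docs(docs)
```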

Training Configuration

Config file: python/configs/dapt/modernbert.yaml

| Parameter | Value | Rationale |
| --- | --- | --- |
| max_seq_length | 8192 | Match ModernBERT's pre-training context length |
| per_device_train_batch_size | 1 | Memory-limited at 8192 seq_len on 24GB |
| gradient_accumulation_steps | 32 | Effective batch size = 32 |
| num_train_epochs | 1 | Single pass per Gururangan et al. (2020) and Ponnock (2025) |
| learning_rate | 5e-5 | Standard for continued pre-training |
| mlm_probability | 0.30 | ModernBERT's pre-training masking rate |
| warmup_ratio | 0.05 | ~213 warmup steps |
| gradient_checkpointing | true | Required for 8192 seq_len on 24GB |
| bf16 | true | Native RTX 3090 support |
| save_steps | 1000 | Checkpoint every ~1000 steps |
| eval_steps | 1000 | Evaluate every ~1000 steps |
| save_total_limit | 3 | Keep last 3 checkpoints |
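A sketch of what configs/dapt/modernbert.yaml could look like given the table above (field names are illustrative; the typed schema in python/src/common/config.py is authoritative):

```yaml
model_name: answerdotai/ModernBERT-large
max_seq_length: 8192
per_device_train_batch_size: 1
gradient_accumulation_steps: 32
num_train_epochs: 1
learning_rate: 5.0e-5
mlm_probability: 0.30
warmup_ratio: 0.05
gradient_checkpointing: true
bf16: true
save_steps: 1000
eval_steps: 1000
save_total_limit: 3
```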

Epoch Decision Justification

We train for 1 epoch (single pass over the corpus), following the empirical consensus:

  • Gururangan et al. (2020), "Don't Stop Pretraining" (ACL 2020): Trained DAPT for "12.5K steps, which amounts to a single pass on each domain dataset" across corpora ranging from 2-8B tokens. A single pass was sufficient for consistent downstream gains across all four domains and eight tasks.

  • Ponnock (2025), "The Data Efficiency Frontier of Financial Foundation Models" (arXiv:2512.12384): Found that SEC-specific DAPT exhibits diminishing marginal returns beyond ~250M tokens within a single epoch: "Both models exhibit their largest improvements in the early stages of continued pretraining: loss drops noticeably between 50M and 200M tokens, after which the rate of improvement slows." Our ~1B token corpus is already well past the diminishing-returns threshold.

Additional epochs risk overfitting to the domain corpus without proportional downstream benefit, while general-domain capability remains stable through a single pass.

Sequence Length Decision

ModernBERT was pre-trained with an 8192-token context. We match this during DAPT so that the model's long-range attention behavior receives gradient signal across the full context window; at seq_len=2048, token positions 2048-8191 would never be exercised during DAPT.

The tradeoff is memory: batch_size drops from 4 (at 2048) to 1 (at 8192), compensated by gradient_accumulation=32 to maintain effective batch size of 32. Training time is comparable because 4x fewer steps offset the slower per-step time.
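As a sanity check on the step arithmetic (approximate figures, taking the ~136K-sequence count from the corpus summary):

```python
per_device_batch = 1
grad_accum = 32
effective_batch = per_device_batch * grad_accum      # 32 sequences per optimizer step

num_sequences = 136_000                              # approx., from the corpus summary
steps_per_epoch = num_sequences // effective_batch   # ~4,250 (the run reports ~4,257)
warmup_steps = int(0.05 * steps_per_epoch)           # warmup_ratio = 0.05 -> ~212 steps
```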

For our downstream task (paragraph classification at ~50-400 tokens), the long-context benefit is modest — the primary DAPT benefit is vocabulary and domain language patterns, which transfer at any sequence length. But there is no cost to using 8192, so we preserve the model's full capability.

Step 1: DAPT

Command

```shell
cd python
bun run py:train dapt --config configs/dapt/modernbert.yaml
```

Equivalent to: `uv run main.py dapt --config configs/dapt/modernbert.yaml`

What happens

  1. Loads ModernBERT-large from HuggingFace (cached after first download)
  2. Loads 14,756 docs from data/dapt-corpus/, filters out the 188 docs under 10K chars
  3. Tokenizes all text, concatenates, chunks into ~136K sequences of 8192 tokens
  4. Splits 2% validation (~2,700 sequences), 98% train (~133K sequences)
  5. Trains 1 epoch of MLM with 30% masking, bf16, gradient checkpointing
  6. ~4,257 steps total, logging every 50, checkpoint+eval every 1,000
  7. Saves final model + tokenizer to checkpoints/dapt/modernbert-large/final/
  8. Reports final eval loss and perplexity
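Step 3's concatenate-and-chunk grouping is the standard MLM packing recipe. A self-contained sketch (the `group_into_sequences` helper is hypothetical; the real implementation is in python/src/data/corpus.py):

```python
def group_into_sequences(tokenized_docs, seq_len=8192):
    """Concatenate token-id lists from all docs into one stream, then split
    the stream into fixed-length sequences; the trailing remainder that does
    not fill a full sequence is dropped."""
    stream = [tid for doc in tokenized_docs for tid in doc]
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy example with seq_len=4: 10 token ids -> 2 full sequences, 2 ids dropped.
chunks = group_into_sequences([[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]], seq_len=4)
# chunks == [[1, 2, 3, 4], [5, 6, 7, 8]]
```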

Expected duration

~4-8 hours on RTX 3090 (depends on actual seconds/step at 8192 with gradient checkpointing).

Resume if interrupted

HuggingFace Trainer auto-saves checkpoints every 1,000 steps. Re-run the same command — it detects existing checkpoints and resumes automatically.
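The detection picks the highest-numbered checkpoint-N directory in the output dir; a simplified sketch of what transformers' `get_last_checkpoint` does:

```python
import os
import re


def last_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-N subdirectory,
    or None if no checkpoint exists (simplified from
    transformers.trainer_utils.get_last_checkpoint)."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best, best_step = None, -1
    for name in os.listdir(output_dir):
        m = pattern.match(name)
        if m and os.path.isdir(os.path.join(output_dir, name)):
            step = int(m.group(1))
            if step > best_step:
                best, best_step = os.path.join(output_dir, name), step
    return best
```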

Output

```
checkpoints/dapt/modernbert-large/
  checkpoint-1000/
  checkpoint-2000/
  checkpoint-3000/
  final/                  <- final model + tokenizer
    config.json
    model.safetensors
    tokenizer.json
    ...
```

Step 2: TAPT

After DAPT completes, continue MLM on the 72K Item 1C paragraphs specifically.

Command

```shell
bun run py:train dapt --config configs/dapt/modernbert.yaml \
  --model-path ../checkpoints/dapt/modernbert-large/final \
  --data-path ../data/paragraphs/paragraphs-clean.patched.jsonl \
  --output-dir ../checkpoints/tapt/modernbert-large \
  --stage tapt
```

What happens

  1. Loads the DAPT checkpoint (not the base ModernBERT)
  2. Loads 72,045 patched paragraphs from paragraphs-clean.patched.jsonl
  3. Tokenizes, concatenates, chunks (much smaller corpus — ~10M tokens)
  4. Trains MLM with same hyperparameters
  5. Saves to checkpoints/tapt/modernbert-large/final/

Expected duration

~2-3 hours (much smaller corpus).

Output

```
checkpoints/tapt/modernbert-large/
  final/                  <- SEC-cyBERT-large (DAPT + TAPT)
```

Step 3: Ablation Checkpoints

The training pipeline produces clean ablation rows for the paper:

| Model | Checkpoint | Description |
| --- | --- | --- |
| Base | answerdotai/ModernBERT-large | Off-the-shelf, no domain adaptation |
| +DAPT | checkpoints/dapt/modernbert-large/final | After domain pre-training on 14.5K filings |
| +DAPT+TAPT | checkpoints/tapt/modernbert-large/final | After task pre-training on 72K paragraphs |

Each checkpoint can be independently fine-tuned with classification heads to isolate the contribution of each pre-training stage.

Monitoring

During training, the Trainer logs to stderr every 50 steps:

  • loss — training MLM loss (cross-entropy on masked tokens)
  • learning_rate — current LR (ramps up during warmup, then decays)
  • epoch — progress through the epoch

Every 1,000 steps, it also reports:

  • eval_loss — validation MLM loss
  • Perplexity can be computed as exp(eval_loss) — the reported loss is mean cross-entropy in nats, so the base is e, not 2

What to watch for:

  • Training loss should decrease steadily from ~2.5-3.0 to ~1.5-2.0
  • Eval loss should track training loss (if eval loss diverges upward, the model is overfitting — but this is unlikely in 1 epoch)
  • If loss spikes or goes to NaN, the learning rate may be too high
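Since the reported loss is mean cross-entropy in nats, the perplexity conversion is a one-liner:

```python
import math


def perplexity(eval_loss):
    """MLM perplexity from mean cross-entropy loss (natural log base)."""
    return math.exp(eval_loss)


# e.g. an eval_loss of 1.8 corresponds to perplexity ~6.05
print(round(perplexity(1.8), 2))
```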

Artifacts

| File | Purpose |
| --- | --- |
| python/configs/dapt/modernbert.yaml | DAPT config |
| python/configs/dapt/neobert.yaml | NeoBERT config (if needed) |
| python/main.py | CLI entrypoint |
| python/src/dapt/train.py | Training loop |
| python/src/data/corpus.py | Corpus loading + tokenization |
| python/src/common/config.py | Typed YAML config |
| ts/scripts/dapt-corpus-prep.ts | Corpus preparation from HTML |
| ts/scripts/dapt-corpus-analytics.ts | Corpus analytics |
| data/dapt-corpus/shard-*.jsonl | Cleaned corpus (15 shards) |