feat(adastra): port ChargE3Net fine-tuning to AMD MI250X on CINES Adastra #1
Draft
speckhard wants to merge 36 commits into
…stra
Adds an Adastra-side variant of submit_charge3net.sh and a runbook
covering the seven blockers encountered during the port:
- HTTP proxy must be set explicitly (Adastra doesn't auto-export it),
- 30-day scratch purge wipes setup, so $LEMATRHO_ADASTRA_SETUP is
rebuildable from sources,
- pip on Adastra defaults to gorgone.cines.fr (missing boto3 etc);
--index-url https://pypi.org/simple is required,
- huggingface_hub Xet backend silently no-ops the payload fetch, so
raw curl with Authorization: Bearer is used for the dataset,
- --qos=debug is not granted on the team accounts,
- group inode quota on /lus/scratch/CT10/c1816212/ is at the hard cap,
so the submit dir lives on cad16353 scratch while the job is billed
to c1816212 (account and scratch dir are independent dimensions),
- sbatch over SSH defaults WorkDir to $HOME unless cd'd first.
submit_charge3net_adastra.sh mirrors the Jean Zay script (auto-resume
from latest.pt, 50-epoch budget) but with MI250 SLURM headers, ROCm
HIP_VISIBLE_DEVICES alignment, batch_size=8 (HBM2e has 64 GB per GCD
vs A100's 40-80), val_probes=1000, and online W&B (the Adastra proxy
gives us live internet, so the Jean Zay offline-then-sync dance is
unnecessary).
Adds a regression test test_ignores_extra_columns for the dataset
loader: Entalpic/lemat-rho-v1 added Bader analysis columns
(bader_charges, bader_volumes, material_id) which would have broken
_build_parquet_index if it didn't honor the four-column _COLUMNS
allowlist. The test confirms the allowlist still holds.
Reference smoke run: job 4969516 on g1342, May 19 2026.
- 65,239 of 68,549 valid materials loaded from 69 parquet chunks.
- 1,150 training steps in 12 min wall, train L1 down from 29.95 at
step 50 to 5.67 at step 1,000.
- Hit TIMEOUT before completing the epoch (expected: one epoch needs
~150 min at batch=4), no val/test metrics yet; a follow-up 6h job
under the production knobs will produce those.
…uards
Adds tests/test_equivariance.py with 7 structural tests that pin down
the architectural properties needed for ChargE3Net's rotational
equivariance guarantee:
- Production model has 1.9M params (catches drift that would break
loading charge3net_mp.pt).
- atom_irreps_sequence reaches lmax >= 4 (the "higher-order" in the
paper title; a silent drop to lmax=0 would degenerate the model to a
much weaker scalar-only baseline).
- Atom representation includes both even and odd parity components.
- get_irreps(500, lmax=4) returns 10 entries with no zero-multiplicity
irreps (catches a regression that would silently delete some irreps).
- atom_irreps_sequence length matches num_interactions.
- Atom-model cutoff matches the 4.0 A baked into
KdTreeGraphConstructor in LeMatRhoDataset.
- Final irreps are an e3nn o3.Irreps instance (replacing this with a
plain list would silently break equivariance while still producing
output).
A runtime equivariance check (rotate inputs, predict, compare) is the
gold standard but requires a real forward pass at production
hyperparameters that is too slow for a CPU unit test. The structural
tests cover the same property at the architecture level.
Tests autoskip when the sibling AIforGreatGood/charge3net repo is
absent.
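For reference, the runtime check described above (rotate inputs, predict, compare) can be sketched on a toy model. Everything here is hypothetical: toy_density stands in for the real network, and the setup is non-periodic, so a real ChargE3Net check would also have to rotate the lattice.

```python
import numpy as np


def random_rotation(rng: np.random.Generator) -> np.ndarray:
    """Draw a 3x3 proper rotation (det = +1) via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))  # fix the sign ambiguity of QR
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # turn a reflection into a rotation
    return q


def toy_density(atom_pos: np.ndarray, probe_pos: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in model: sum of unit Gaussians centred on atoms."""
    d2 = ((probe_pos[:, None, :] - atom_pos[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2).sum(axis=1)


def max_rotation_error(predict, atom_pos, probe_pos, rot) -> float:
    """Rotate atoms and probes together; a scalar density must not change."""
    before = predict(atom_pos, probe_pos)
    after = predict(atom_pos @ rot.T, probe_pos @ rot.T)
    return float(np.abs(before - after).max())


rng = np.random.default_rng(0)
atoms = rng.normal(size=(4, 3))
probes = rng.normal(size=(16, 3))
rot = random_rotation(rng)
err = max_rotation_error(toy_density, atoms, probes, rot)
```

With a real model, predict would wrap the forward pass, and the tolerance would have to allow for float32 round-off.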
… training
Two changes motivated by job 4969727 (FAILED after 1h47m on the previous
single-GPU submit):
1. Multi-GPU via torch DistributedDataParallel. The paper uses per-GPU
batch=16 across 4 GPUs (effective batch=64). Our previous Adastra
submit was single-GPU batch=8 — 8x smaller effective batch. With the
half-node submit (4 GCDs, 64 CPUs, 128 GB RAM, batch=16 per GCD) the
effective batch now matches the paper.
Implementation:
- New _setup_ddp / _is_ddp / _is_main helpers in train.py read
WORLD_SIZE / RANK / LOCAL_RANK / MASTER_ADDR / MASTER_PORT from
the env (set in the submit script via srun + scontrol show hostname).
- Backend is nccl which routes through RCCL on AMD ROCm builds.
- Model wrapped in DistributedDataParallel after .to(device).
- DistributedSampler injected into the train loader via a new
distributed=True flag on build_dataloaders. Val/test stay
non-distributed; cheap enough at 5% of 65k.
- DistributedSampler.set_epoch called each epoch for proper shuffling.
- All prints and wandb logs gated on is_main (rank 0 only).
- Save and load go through a new _unwrap helper so checkpoints are
interchangeable between single-GPU and DDP runs.
- dist.barrier at end of each epoch to keep ranks in lockstep
during checkpoint saves.
- dist.destroy_process_group at the very end.
2. Wandb soft-fail. wandb.init now sits inside try/except — if the
compute node can't reach api.wandb.ai through the proxy (which is
what killed job 4969727 after 5min of timeouts and 1h47m elapsed
total), the script logs a warning and sets use_wandb=False so
training proceeds with stdout + checkpoints only.
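A minimal sketch of the soft-fail pattern, assuming the init call is injected so the example runs without wandb installed. init_wandb_soft and _unreachable are hypothetical names, not the ones used in train.py.

```python
import logging
from typing import Any, Callable, Optional, Tuple

logger = logging.getLogger(__name__)


def init_wandb_soft(
    init_fn: Callable[..., Any], **kwargs: Any
) -> Tuple[Optional[Any], bool]:
    """Try to start a W&B run; on any failure, warn and disable logging.

    `init_fn` would be `wandb.init` in the training script; it is injected
    here so the sketch has no wandb dependency.
    """
    try:
        return init_fn(**kwargs), True
    except Exception as exc:  # timeouts, auth errors, proxy failures
        logger.warning("wandb.init failed (%s); continuing without W&B", exc)
        return None, False


def _unreachable(**_kwargs: Any) -> Any:
    """Simulates a compute node that cannot reach api.wandb.ai."""
    raise TimeoutError("proxy timed out")


run, use_wandb = init_wandb_soft(_unreachable, project="demo")
```

Every later logging call then checks use_wandb, so a dead proxy costs one warning instead of the whole allocation.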
Submit script (submit_charge3net_adastra.sh) updated for half-node:
--nodes=1 --ntasks-per-node=4
--gpus-per-node=4 --cpus-per-task=16
--mem=125000M --time=06:00:00
plus srun-based DDP launcher that exports RANK/LOCAL_RANK per task,
batch_size=16 per GPU, val_probes=1000, wandb-mode=offline.
Test plan
- pytest tests/ ... 34 passed, 1 failure pre-existing (test_metrics
collection error from src.charge3net path shadowing in pytest;
unrelated, same on main).
- ruff format + check clean on the touched files.
- DDP path not yet exercised end-to-end on Adastra; the immediate
next step is a 6h submission. If the DDP init fails, the
single-GPU code path is still reachable by running without srun.
…om-scratch (TDD)
The submit script now reads LEMATRHO_TRAINING_MODE to switch between
two runs that share all infrastructure (same DDP, same hyperparams,
same dataset, same node layout) but differ in init:
pretrained (default) --ckpt-path charge3net_mp.pt
save-dir charge3net_checkpoints/
WANDB_NAME=pretrained_mp
from_scratch no --ckpt-path (random init)
save-dir charge3net_checkpoints_fromscratch/
WANDB_NAME=from_scratch
Auto-resume from latest.pt is per-mode (the two save-dirs don't
collide), so each arm can be relaunched independently via
sbatch ... submit_charge3net_adastra.sh until val NMAPE plateaus.
Also adds a LEMATRHO_DRY_RUN=1 escape hatch that prints the resolved
train command and exits 0 without sourcing the venv or invoking srun.
Used by the 9 new pytest tests in tests/test_submit_script.py:
- dry-run prints train command
- default mode is pretrained, uses MP checkpoint
- pretrained writes to charge3net_checkpoints (not fromscratch dir)
- from_scratch drops --ckpt-path completely and never references
charge3net_mp.pt
- from_scratch uses a separate save dir
- WANDB_NAME differs between modes
- invalid mode exits non-zero with a clear error
- batch-size 16, val-probes 1000 (paper-matching)
- wandb-mode is offline
TDD: 9 tests RED before the refactor, all GREEN after. Full suite
still 33 passed (data + model + equivariance + submit). ruff format
+ check clean.
Submission examples in the script header and in ADASTRA.md.
PR 1 of a 2-PR stack to land DeepDFT as a baseline for the
ChargE3Net VASP-speedup experiment. This PR adds only the data
adapter; PR 2 will add the training submission (DDP-patched).
What's here:
deepdft_ft/__init__.py empty package marker
deepdft_ft/data.py LeMatRhoDeepDFTDataset adapter
tests/test_deepdft_data.py 11 TDD tests pinning the contract
The adapter reuses charge3net_ft.data's _row_to_atoms_and_density and
_build_parquet_index, then re-shapes the per-sample output into the
dict that DeepDFT's CollateFuncRandomSample expects:
{
"density": np.ndarray (Nx, Ny, Nz),
"atoms": ase.Atoms,
"origin": np.ndarray (3,),
"grid_position": np.ndarray (Nx, Ny, Nz, 3),
"metadata": {"filename": str},
}
_calculate_grid_pos is inlined from upstream DeepDFT/dataset.py so
this adapter has no runtime dependency on the DeepDFT sibling repo
(which keeps the test suite hermetic).
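For orientation, a grid-position helper of this kind can be written in a few lines of numpy. This is an illustrative sketch, not the inlined upstream code; it assumes voxel (i, j, k) sits at fractional coordinate (i/Nx, j/Ny, k/Nz) and that the cell matrix holds lattice vectors as rows.

```python
import numpy as np


def calculate_grid_pos(
    grid_shape: tuple, cell: np.ndarray, origin: np.ndarray
) -> np.ndarray:
    """Cartesian position of every voxel, shape (Nx, Ny, Nz, 3)."""
    frac_axes = [np.arange(n) / n for n in grid_shape]
    # frac[i, j, k] = (i/Nx, j/Ny, k/Nz)
    frac = np.stack(np.meshgrid(*frac_axes, indexing="ij"), axis=-1)
    # Row-vector convention: cartesian = fractional @ cell
    return frac @ cell + origin


cell = np.diag([4.0, 5.0, 6.0])  # orthorhombic toy cell
grid_pos = calculate_grid_pos((8, 10, 12), cell, np.zeros(3))
```

In this toy cell grid_pos[1, 0, 0] is one voxel step along a, i.e. a_lattice / Nx on the first axis.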
Tests pinned (RED then GREEN):
- dataset length matches the count of valid parquet rows
- sample dict has all 5 required keys
- density is a 3D numpy array
- atoms is ase.Atoms with PBC True/True/True
- origin is zeros (matches LeMat-Rho convention)
- grid_position has shape (Nx, Ny, Nz, 3)
- grid_position[0,0,0] = (0,0,0)
- grid_position[1,0,0] = (a_lattice / Nx, 0, 0)
- metadata.filename present and unique per sample
- extra columns (bader_charges, material_id) ignored
- empty parquet dir raises FileNotFoundError
Caching is keyed by absolute parquet path (not file index) so multiple
LeMatRhoDeepDFTDataset instances pointing at different directories
don't collide on fi=0 (which bit me writing the metadata test).
Full LeMat-Rho test suite: 44 passed. Ruff format + check clean.
Next: PR 2 will add deepdft_ft/runner.py (vendored from upstream
DeepDFT + DDP patches) and submit_deepdft_adastra.sh (4-GCD half-node
DDP, PaiNN model variant for equivariance parity with ChargE3Net).
PR 2 of the DeepDFT-on-LeMat-Rho stack (PR 1 was the data adapter).
Closes the gap from "we have a DeepDFT-compatible Dataset" to "we
can sbatch a 4-GCD DDP DeepDFT training run on Adastra".
What's here:
deepdft_ft/runner.py vendored from peterbjorgensen/DeepDFT@main
+ DDP patches + LeMat-Rho parquet auto-detect
+ asap3 stub (no C++ headers on Adastra)
submit_deepdft_adastra.sh half-node 4-GCD DDP submission, PaiNN default,
LEMATRHO_DEEPDFT_VARIANT={painn,schnet} env var,
LEMATRHO_DRY_RUN=1 supported
DDP patches mirror what we did in charge3net_ft/train.py:
- _setup_ddp + _is_main + _unwrap helpers
- DistributedSampler when WORLD_SIZE>1, RandomSampler otherwise
- DistributedDataParallel wrap of the PaiNN/SchNet model
- All logging.info and checkpoint saves gated on rank 0
- Device pinned to cuda:LOCAL_RANK via torch.cuda.set_device
LeMat-Rho parquet auto-detect: if --dataset points at a directory
containing chunk_*.parquet, the runner uses LeMatRhoDeepDFTDataset
(PR 1). Other dataset paths (.tar, .txt, dir of cube/CHGCAR) still
work unchanged — upstream's dataset.DensityData path is preserved.
asap3 stub: upstream DeepDFT imports asap3 at module load. asap3
needs Python.h to build from source which isn't on Adastra (and would
need admin). The stub at the top of runner.py registers a fake asap3
module with a FullNeighborList class that delegates to ASE's
NewPrimitiveNeighborList. Slower than real asap3 but functionally
identical for DeepDFT's call sites. Skipped when real asap3 is
installed.
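The registration step of such a stub can be sketched with the standard library alone. The FullNeighborList body here is a placeholder with an assumed constructor signature; the real stub delegates to ASE, which is left out so the sketch has no third-party imports.

```python
import importlib.util
import sys
import types


class FullNeighborList:
    """Hypothetical placeholder; a real stub would delegate to ASE."""

    def __init__(self, rCut, atoms=None):
        self.rCut = rCut
        self.atoms = atoms

    def get_neighbors(self, index, cutoff=None):
        raise NotImplementedError("delegate to an ASE neighbor list here")


def install_asap3_stub() -> bool:
    """Register a fake `asap3` module unless one is already importable.

    Returns True when the stub was installed, False otherwise.
    """
    # Check sys.modules first: find_spec raises on modules without a spec.
    if "asap3" in sys.modules or importlib.util.find_spec("asap3") is not None:
        return False
    stub = types.ModuleType("asap3")
    stub.FullNeighborList = FullNeighborList
    sys.modules["asap3"] = stub
    return True


install_asap3_stub()
import asap3  # resolves to the stub when the real package is absent
```

The stub must be installed before the first upstream import that touches asap3, which is why it sits at the top of the module.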
Submit script defaults:
- PaiNN model (matches equivariance of ChargE3Net for the comparison)
- batch=2 (DeepDFT's upstream default — they iterate on probes,
not materials, so per-batch counts work differently from ChargE3Net)
- cutoff=4.0, num_interactions=3, node_size=128
- max_steps=1e8 (effectively unbounded; SLURM walltime is the limiter)
- WANDB_NAME=deepdft_painn (or deepdft_schnet)
Verified on Adastra: runner module imports cleanly under the venv311,
asap3 stub kicks in without error, parquet directory detection works.
The actual training run will be submitted next.
8487ae9 to 8d510d2
Root-causes job 4971720's OOM-kill at startup and aligns the DeepDFT
training to the upstream paper's submission settings.
Two changes:
1. submit_deepdft_adastra.sh: switch from half-node DDP (4 GCDs) to
paper-faithful single-GPU (1 GCD on mi250-shared, HIP_VISIBLE_DEVICES=0,
WORLD_SIZE unset). Upstream DeepDFT was trained on 1x RTX 3090 per
pretrained_models/*/submit_script.sh. Single-GPU keeps gradient-step
semantics identical to the paper's batch=2; no LR sweep needed.
Effective hyperparameters are now exactly the upstream PaiNN settings
from pretrained_models/{nmc,qm9,ethylenecarbonate}_painn/commandline_args.txt:
--cutoff 4
--num_interactions 3
--node_size 128
--max_steps 10000000
--use_painn_model
batch_size=2 materials (hardcoded in runner.py)
train_probes=1000 per material (hardcoded)
val_probes=5000 per material (hardcoded)
DDP code paths in runner.py stay in place but only fire when
WORLD_SIZE>1, so a future DDP variant of DeepDFT is one env flip away.
2. deepdft_ft/runner.py: replace upstream's eager validation preload
`val_loader = [b for b in val_loader]` with a comment explaining
why we left it as a streaming DataLoader. Upstream's val sets are
~100 materials (NMC, QM9, ethylenecarbonate subsets) so the preload
is cheap. Our val set is 3,261 materials at 5000 probes each, x4
ranks under DDP, which materialised ~150 GB and OOM-killed job
4971720 at startup before a single training step. Streaming the
val loader is a data-loading detail, not a hyperparameter; the
model math is unchanged.
Test plan:
- 44/44 local tests still pass (no behavioural changes to the data
adapter or submit-script env contract; only the runner internals
and the SLURM headers move).
- New job to be submitted as the next step; will confirm DeepDFT
trains and produces step-level loss in the .out log.
Observation from jobs 4971293 and 4971343: SLURM bumped both to
EXCLUSIVE mode despite us requesting half-node resources. The
--mem=125000M line was roughly half the 256 GB node's memory, which
crosses SLURM's auto-exclusive threshold.
Dropping --mem entirely lets SLURM allocate memory proportional to
our CPU share (64 of 128 logical CPUs -> ~128 GB out of 256 GB). The
other half of the node stays schedulable for other users / jobs.
The currently running jobs 4971293 and 4971343 keep their exclusive
allocations; only future submissions are affected.
Test plan
- 9/9 tests in tests/test_submit_script.py still pass (no memory
assertion).
- Will confirm on next sbatch by inspecting AllocTRES.
Root-causes the OOM that killed jobs 4971293 and 4971343 at
MaxRSS=35 GB per rank (140 GB cumulative across 4 DDP ranks,
exceeding our 125 GB --mem budget).
Three changes, all small:
1. charge3net_ft/data.py: bound _TABLE_CACHE with an LRU eviction
policy capped at _TABLE_CACHE_MAX_CHUNKS=5. OrderedDict gives O(1)
move-to-end on hit and popitem(last=False) on miss-with-eviction.
The previous dict was unbounded, so each DataLoader worker
accumulated every chunk it had ever seen. With ~2 GB per
pyarrow-decompressed chunk (compressed_charge_density JSON strings
inflate 6x) and 32 worker processes (8 per rank x 4 ranks), the
cache alone grew to ~140 GB over 6 h.
2. submit_charge3net_adastra.sh: drop --num-workers from 8 to 2.
Defense in depth on top of the LRU. At LeMat-Rho's 10x10x10 grid
size the DataLoader's data-loading throughput isn't the bottleneck;
2 workers per rank x 4 ranks = 8 total workers is plenty, and
per-rank cache pressure now drops by 4x.
3. tests/test_data.py: TestTableCacheLRU adds three regression tests
(cache size bounded, LRU eviction order is correct, default cap is
within a sensible range). TDD: RED before changes 1+2, GREEN after.
Combined effect: cache pressure on a half-node DDP run drops from
~140 GB to roughly 4 ranks x 2 workers x 5 chunks x 2 GB = 80 GB
worst case, and in practice much less because workers tend to revisit
chunks. Comfortably under the ~128 GB shared-mode default mem.
Full suite: 47 passed (test_metrics.py pre-existing src-shadow
failure unrelated, same on main).
…ack)
PR alpha of 4 for the SALTED-arm basis-expansion benchmark. This PR
lands only the BasisSpec dataclass and its tests. PRs beta/gamma/delta
land the projection layer, the rholearn model wrapper, and the VASP
CHGCAR I/O respectively.
What's here
salted_ft/__init__.py exports BasisSpec, documents the stack
salted_ft/basis.py frozen dataclass with the locked-in
hyperparameters from Phase A4 of the
investigation memo
tests/test_salted_basis.py 19 TDD tests across 5 categories
Design decisions captured by the tests
BasisSpec is frozen, hashable, equality-by-value so it can key
caches and identify metric runs without ambiguity. Mutation raises
FrozenInstanceError.
Validation happens in __post_init__ so a malformed spec raises at
construction time, not deep in a tensor op three PRs from now.
Negative max_l, zero n_radial, nonpositive sigma, nonpositive
cutoff all rejected with clear messages.
Default values match the Phase A4 lockdown verbatim
max_l=4, n_radial=4, sigma=(0.5,1.0,2.0,4.0), cutoff=4.0
n_coeffs_per_atom == 100 from the formula n_radial * (max_l+1)**2.
These numbers picked to match ChargE3Net's cutoff + lmax for a clean
side-by-side comparison.
Shape helpers
n_angular_components -> (max_l + 1)**2
n_coeffs_per_atom -> n_radial * n_angular_components
total_coeffs_shape(n_atoms) -> (n_atoms, n_coeffs_per_atom)
used by downstream PRs for tensor allocation.
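A sketch of what such a spec could look like, using the defaults and formulas listed above. Whether the shape helpers are properties or methods, and the exact error messages, are assumptions.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Tuple


@dataclass(frozen=True)
class BasisSpec:
    max_l: int = 4
    n_radial: int = 4
    sigma: Tuple[float, ...] = (0.5, 1.0, 2.0, 4.0)
    cutoff: float = 4.0

    def __post_init__(self) -> None:
        # Fail at construction time rather than deep inside a tensor op.
        if self.max_l < 0:
            raise ValueError(f"max_l must be >= 0, got {self.max_l}")
        if self.n_radial < 1:
            raise ValueError(f"n_radial must be >= 1, got {self.n_radial}")
        if any(s <= 0 for s in self.sigma):
            raise ValueError(f"sigma must be positive, got {self.sigma}")
        if self.cutoff <= 0:
            raise ValueError(f"cutoff must be positive, got {self.cutoff}")

    @property
    def n_angular_components(self) -> int:
        return (self.max_l + 1) ** 2

    @property
    def n_coeffs_per_atom(self) -> int:
        return self.n_radial * self.n_angular_components

    def total_coeffs_shape(self, n_atoms: int) -> Tuple[int, int]:
        return (n_atoms, self.n_coeffs_per_atom)


spec = BasisSpec()
```

frozen=True together with the default eq=True makes instances hashable by value, which is what lets a spec key a cache.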
Why locking these numbers matters
Every downstream PR (projection, model, I/O) depends on the
coefficient shape. Changing max_l or n_radial later requires
retraining and re-running validation. Pin once, build around it.
Test plan
19/19 tests pass. Ruff format + check clean. No interaction with
Adastra; pure-Python dataclass.
Next: PR beta = salted_ft/projection.py with project_chgcar_to_basis
and reconstruct_grid_from_basis + their tests.
PR beta of 4. The DIY bridge between VASP plane-wave CHGCAR data and
the rholearn/SALTED localized-basis world. Both libraries (SALTED,
rholearn, also Graph2Mat) target localized-basis DFT codes
(FHI-aims, CP2K, PySCF, SIESTA); VASP is plane-wave. So we have to
build this projection layer ourselves regardless of which upstream
we wrap. See the Phase A memo for the analysis.
What's here
salted_ft/projection.py
- _grid_positions(grid_shape, cell) -> (n_grid, 3) Cartesian
- _real_sph_harm(rhat, lmax) -> (..., (lmax+1)^2) real Y_lm
values, hand-rolled for lmax <= 4 (covers the locked default).
Standard SOAP / SALTED component ordering
[Y_00, Y_1{-1}, Y_10, Y_11, Y_2{-2}, ..., Y_44].
- _eval_basis_at_grid(atom, grid, cell, spec) ->
(n_grid, n_coeffs_per_atom) basis-function values with
minimum-image PBC.
- project_chgcar_to_basis(density, atoms, basis_spec)
Orthonormal-approx projection: c_k = <B_k, rho> / <B_k, B_k>.
v1 stand-in for proper overlap-matrix LSQR which lands in PR
gamma. Linear in the input density.
- reconstruct_grid_from_basis(coefficients, atoms, grid_shape,
basis_spec). Literal expansion sum. Linear in the input
coefficients.
tests/test_salted_projection.py
- TestProjectChgcarToBasis (6 tests)
shape, zero->zero, dtype, linearity, additivity, finite.
- TestReconstructGridFromBasis (6 tests)
shape, zero->zero, dtype, linearity, single-atom-l0-peak-at-
atom-position, finite.
- TestProjectionReconstructionRoundtrip (2 tests)
zero-density and zero-coefficient roundtrips. Tight roundtrip
accuracy is intentionally NOT pinned; that lands in PR gamma
when we swap in proper LSQR.
Design notes
PBC: minimum-image via cell inverse. Adequate when 2*cutoff fits
inside the smallest cell vector. For very small cells we'd want
full supercell expansion; out of scope for PR beta.
Numpy-only on purpose. e3nn / torch were tempting for spherical
harmonics but adding them to a projection module mixes concerns:
projection should be a clean reference implementation that runs
on any laptop with numpy.
Test plan
33/33 tests pass (19 from PR alpha + 14 new). Ruff format + check
clean. No Adastra interaction; pure numpy.
Next: PR gamma wraps rholearn's training/inference loop as a
SALTEDModel class, pinned against our LeMat-Rho parquet input
pipeline and reusing charge3net_ft.train's NMAPE/RMSE/NRMSE metrics.
PR gamma of 4. Adds the model wrapper that pairs with the projection
+ reconstruction layer from PR beta. The wrapper has a stub mode so
the surrounding pipeline (predict -> reconstruct -> metric) can be
exercised end-to-end without a trained rholearn checkpoint.
What's here
salted_ft/model.py
SALTEDModel(basis_spec, ckpt_path=None)
* __call__(atoms) -> (n_atoms, n_coeffs_per_atom) float64
coefficients.
* reconstruct_density(atoms, grid_shape) convenience that runs
predict + reconstruct_grid_from_basis in one call.
* Stub mode (ckpt_path=None): deterministic, position-dependent
coefficients seeded by a hash of the positions / numbers /
basis spec. Different atoms in -> different coefficients out;
same atoms in -> same coefficients out (verified by tests).
* Real-rholearn path raises NotImplementedError for now; lands
in a follow-up PR once rholearn is configured on Adastra.
Sibling-repo discovery for rholearn follows the existing
charge3net_ft / deepdft_ft pattern (lazy; only insists when
ckpt_path is set).
salted_ft/projection.py
Wrapped two more matmul sites in np.errstate to silence the same
benign divide/invalid/overflow noise we already suppressed in
_eval_basis_at_grid and _grid_positions.
tests/test_salted_model.py
15 TDD tests across 5 categories:
* Construct: basis_spec stored, default ckpt_path is None.
* Output shape: single-atom, multi-atom, float64 dtype, finite.
* Determinism: same input -> same output; position changes
produce different output (rules out a zero-returning stub).
* Reconstruct density: shape, dtype, finite, equals the
explicit (predict, then reconstruct_grid_from_basis) path.
* Metric integration with charge3net_ft.train's
compute_nmape / compute_rmse / compute_nrmse: finite scalars,
self-similarity gives NMAPE=0 sanity check. Pinned per the
brief: keep metric calculations identical to the ChargE3Net
pipeline.
Test plan
48/48 tests across the salted suite pass (19 basis + 14 projection
+ 15 model). Ruff format + check clean. No Adastra interaction;
pure local Python.
Next: PR delta wraps the CHGCAR I/O via pymatgen so reconstructed
grids can be written to disk for VASP ICHARG=1 single-points. End-to-
end VASP integration test will be gated on the entalsim
StructureVASPSinglePoint maker (separate stack).
PR delta of 4, closes the SALTED scaffold. Adds the boundary between
the predicted-density-tensor world and the VASP-input-file world so a
trained SALTED-arm model can be evaluated end-to-end via paired
SCF runs.
What's here
salted_ft/io.py
write_chgcar(density, atoms, path, n_electrons=None)
Writes a pymatgen Chgcar-compatible file. The n_electrons
argument rescales the density so its integrated value equals
the requested electron count; that is what VASP reads as the
total electron count when starting with ICHARG=1. Without
rescaling VASP silently fixes the count for us at startup,
which would mask part of the speedup we are trying to measure.
Rejects non-3D densities and nonpositive n_electrons with
clear messages.
read_chgcar(path) -> (density, atoms)
The inverse. Converts pymatgen's "density times volume"
storage convention back to plain rho on the grid.
Uses pymatgen.io.ase.AseAtomsAdaptor for the ase.Atoms <->
pymatgen.Structure conversion.
tests/test_salted_io.py
9 TDD tests + 1 placeholder (skipped):
Write: file exists and is nonempty, electron-count rescaling
within 1e-4 relative, non-3D rejected, negative N rejected.
Read: shape preserved, atom species preserved (multiset),
cell preserved within 1e-6.
Roundtrip: density write->read within VASP scientific-notation
precision (rtol 1e-3, atol 1e-5).
End-to-end: SALTEDModel.reconstruct_density piped into
write_chgcar produces a readable file.
VASP hook gate: pytest.importorskip on
entalsim.dft.tasks.single_point, which auto-activates once
Entalpic/entalsim PR #56 lands its PR 2 (StructureVASPSinglePoint
maker).
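The electron-count rescaling and the "density times volume" convention reduce to a few lines of numpy. A sketch under the assumption that density is plain rho on a uniform grid, so the cell integral is mean(rho) * volume; the function names are hypothetical, not the salted_ft.io API.

```python
import numpy as np


def rescale_to_electron_count(
    density: np.ndarray, volume: float, n_electrons: float
) -> np.ndarray:
    """Scale rho so that its integral over the cell equals n_electrons."""
    if density.ndim != 3:
        raise ValueError(f"density must be 3D, got ndim={density.ndim}")
    if n_electrons <= 0:
        raise ValueError(f"n_electrons must be positive, got {n_electrons}")
    integrated = density.mean() * volume  # Riemann sum: sum(rho) * V / n_grid
    return density * (n_electrons / integrated)


def to_chgcar_values(density: np.ndarray, volume: float) -> np.ndarray:
    """CHGCAR stores rho * V rather than rho."""
    return density * volume


def from_chgcar_values(values: np.ndarray, volume: float) -> np.ndarray:
    """Inverse of to_chgcar_values: recover plain rho on the grid."""
    return values / volume


rng = np.random.default_rng(0)
rho = rng.uniform(0.1, 1.0, size=(10, 10, 10))
volume = 64.0
rho_scaled = rescale_to_electron_count(rho, volume, n_electrons=8.0)
```

In-memory the rescaling is exact to round-off; the 1e-4 tolerance in the tests covers the precision lost when the values pass through the CHGCAR text format.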
Test plan
9 passed + 1 skipped (entalsim gate). Full salted suite now
57 passed + 1 skipped across 4 stacked PRs:
PR alpha 19 tests on BasisSpec
PR beta 14 tests on projection / reconstruction
PR gamma 15 tests on SALTEDModel + metric integration
PR delta 10 tests (9+1) on CHGCAR I/O + VASP hook gate
Ruff format + check clean across all 8 source/test files.
The SALTED scaffold is now ready to consume a trained rholearn
checkpoint and produce VASP-ready CHGCARs end-to-end. Next steps
(separate stack): wire rholearn training on Adastra using the
LeMat-Rho parquet adapter; flip the entalsim hook gate to live when
PR 2 of the r2SCAN single-point stack lands.
Phase D1 (projection sanity check on 10 real LeMat-Rho rows) caught a
catastrophic failure mode: the orthonormal-approximation projection
landed in PR beta produced 1068% NMAPE on the basis-set roundtrip
because the Gaussian basis functions overlap heavily (sigma ~= cutoff)
and the per-channel c_k = <B_k, rho> / <B_k, B_k> overcounts
contributions from neighboring basis functions.
Fix: build the full per-structure design matrix B_global of shape
(n_grid, n_atoms * n_coeffs_per_atom) and solve one least-squares
system for all atom coefficients simultaneously. The system is
overdetermined for our 10x10x10 grids (1000 > 4 atoms * 100 coeffs in
the typical LeMat-Rho cell) so lstsq returns the unique
minimum-residual fit.
After: basis-set ceiling on 10 random LeMat-Rho rows is
NMAPE: 8.19% +/- 6.60% (min 2.00%, max 22.67%)
vs
NMAPE: 1068.81% +/- 109.42% (orthonormal-approx)
Well within the 'proceed' band from the plan. Full per-sample numbers
are in the offline CSV at salted_basis_sanity_check.csv (outside the
repo).
Test plan
57/57 tests in tests/test_salted_basis.py + test_salted_projection.py
+ test_salted_model.py + test_salted_io.py still pass with no changes
to test contracts. Linearity, zero-in-zero-out, shape, dtype,
single-atom peak position, all unaffected. LSQR is linear in rho so
the linearity tests hold by construction. Ruff format + check clean.
The previous orthonormal-approx was documented in PR beta's commit as
a 'v1 stand-in' for proper LSQR; this lands the proper version. No API
change.
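The overcounting is easy to reproduce in one dimension. A toy sketch with three overlapping Gaussians on a line; the grid, centers, and sigma are made up for illustration and have nothing to do with the locked BasisSpec.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)
centers = np.array([0.3, 0.5, 0.7])
sigma = 0.2  # comparable to the spacing, so neighbouring functions overlap

# Design matrix: one column per basis function, one row per grid point.
B = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma**2))
c_true = np.array([1.0, 2.0, 3.0])
rho = B @ c_true

# Per-channel ratio <B_k, rho> / <B_k, B_k>: exact only for orthogonal columns.
c_naive = (B.T @ rho) / np.einsum("ik,ik->k", B, B)

# Joint least squares over all coefficients at once.
c_lstsq, *_ = np.linalg.lstsq(B, rho, rcond=None)
```

Because every overlap <B_k, B_j> is positive here, each naive coefficient absorbs its neighbours' weight and overshoots, while the joint solve recovers c_true.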
…irectory)
Phase D2 of the Adastra comparison plan. One-time job to project every
LeMat-Rho parquet row onto the locked SALTED basis, producing a
parallel parquet directory of basis coefficients that downstream
training loops (rholearn, Graph2Mat) consume.
What's here
salted_ft/project_dataset.py
project_chunk(in_path, out_path, basis_spec)
Reads one LeMat-Rho format chunk, runs project_chgcar_to_basis
on every valid row, writes a parallel chunk with this schema:
row_index, material_id, n_atoms, atomic_numbers,
lattice_vectors, n_electrons, grid_shape, coefficients,
basis_set_NMAPE
basis_set_NMAPE column is the per-row reconstruction error from
project + reconstruct roundtrip; lets downstream training know
the basis ceiling per sample.
project_directory(input_dir, output_dir, basis_spec)
Driver that loops over chunk_*.parquet files. Idempotent:
existing nonempty output files are left untouched so an
interrupted run can resume cheaply.
CLI entry point so the Adastra job runs as
uv run python -m salted_ft.project_dataset \
--input-dir ... --output-dir ...
tests/test_salted_project_dataset.py
9 TDD tests across 2 classes covering the contract:
* file written, row count, all required columns present
* per-row coefficient shape is (n_atoms, n_coeffs_per_atom)
* basis_set_NMAPE finite + nonneg per row
* material_id preserved if source has it
* NULL charge_density rows in source are skipped (real
LeMat-Rho has some failed extractions)
* project_directory processes every chunk
* second invocation is a no-op (idempotent resume)
The script uses the LSQR projection landed in commit 22809b9; D1
sanity check (10 random LeMat-Rho rows) showed basis ceiling
8.19% +/- 6.60% NMAPE, well within the proceed band.
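The skip-existing resume logic of the driver can be sketched like this. The per-chunk worker is injected as a callable (the real signature takes a basis_spec instead), and _fake_project_chunk is a stand-in so the example runs without parquet files.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def project_directory(
    input_dir: Path,
    output_dir: Path,
    project_chunk: Callable[[Path, Path], None],
) -> List[Path]:
    """Project every chunk_*.parquet, skipping outputs that already exist.

    Returns the chunks that were actually processed in this invocation.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for in_path in sorted(input_dir.glob("chunk_*.parquet")):
        out_path = output_dir / in_path.name
        if out_path.exists() and out_path.stat().st_size > 0:
            continue  # finished in an earlier run; do not pay for it again
        project_chunk(in_path, out_path)
        processed.append(in_path)
    return processed


def _fake_project_chunk(in_path: Path, out_path: Path) -> None:
    out_path.write_bytes(b"coefficients")


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "in", Path(tmp) / "out"
    src.mkdir()
    for i in range(3):
        (src / f"chunk_{i}.parquet").write_bytes(b"rows")
    first = project_directory(src, dst, _fake_project_chunk)
    second = project_directory(src, dst, _fake_project_chunk)
```

One caveat of a size-based check: a job killed mid-write can leave a nonempty but truncated file, so writing to a temporary name and renaming on success would be a safer variant.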
Test plan
9/9 tests pass on the new file; full salted suite still 66 passed
+ 1 skipped after this. Ruff format + check clean on touched files.
Next: scp + run on Adastra against $SETUP/charge3net_data, expected
~30 min wall on a Genoa CPU node for 65k rows.
Genoa CPU partition, single node, 16 CPUs, 2 h wall (Adastra smoke
test of 1 chunk = 71 s, 69 chunks extrapolate to ~80 min).
Caps OMP_NUM_THREADS / OPENBLAS_NUM_THREADS / MKL_NUM_THREADS to
SLURM_CPUS_ON_NODE so numpy's BLAS-backed lstsq does not oversubscribe
the node (default behavior would spawn one thread per hardware core
regardless of allocation).
Idempotent via project_directory's skip-existing logic, so the job can
be requeued without paying the LSQR cost for chunks already written.
Job 4977567 (LRU OOM fix in place) ran 2h41m and died from a NEW
failure mode: NCCL TCPStore "Broken pipe" on the DDP heartbeat
channel. Trace from .err:
Failed to check the "should dump" flag on TCPStore,
(maybe TCPStore server has shut down too early), with error: Broken pipe
...
srun: error: g1132: tasks 1-3: Terminated
MaxRSS was 14 GB/task -- memory budget healthy, so the LRU fix is
solid. The new bug is inter-rank communication, not memory.
Adds four NCCL env vars to the submit script:
NCCL_TIMEOUT=3600 per-collective timeout
NCCL_ASYNC_ERROR_HANDLING=1 clean shutdown on rank
failure, no cascading hangs
TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800 half-hour heartbeat tolerance
(was the default ~600 sec)
TORCH_NCCL_TRACE_BUFFER_SIZE=1000 larger trace buffer for the
next crash post-mortem
Test plan
9/9 tests in tests/test_submit_script.py still pass.
Resubmit to validate end-to-end. If this still crashes from NCCL,
fallback options are gloo backend or single-GPU runs.
Phase D3 of the Adastra comparison plan. Bridges our SALTED-arm
dense coefficient layout with rholearn's metatensor TensorMap layout
so the training loop in rholearn can consume LeMat-Rho data.
Layout mismatch resolved by this adapter
Our layout (from project_chgcar_to_basis):
atom -> n (radial) -> lambda -> mu
rholearn's layout (from rholearn/utils/convert.py:_get_flat_index):
atom -> lambda -> n (radial) -> mu
The reordering is a single per-atom permutation, independent of
species because our BasisSpec is uniform across all species in v1.
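The permutation itself is short enough to write out. A sketch assuming the dense within-atom index is n * (max_l + 1)**2 + lambda**2 + mu with mu counted from 0 to 2*lambda, and that rholearn orders mu the same way; the function names are hypothetical.

```python
from typing import List


def dense_to_rholearn_perm(max_l: int, n_radial: int) -> List[int]:
    """perm[j] is the dense (n, lambda, mu) index stored at rholearn slot j."""
    n_ang = (max_l + 1) ** 2
    perm = []
    for lam in range(max_l + 1):  # rholearn: lambda outermost
        for n in range(n_radial):  # then the radial index
            for mu in range(2 * lam + 1):  # then mu within the lambda block
                perm.append(n * n_ang + lam**2 + mu)
    return perm


def invert(perm: List[int]) -> List[int]:
    """Inverse permutation: inv[perm[j]] == j."""
    inv = [0] * len(perm)
    for j, i in enumerate(perm):
        inv[i] = j
    return inv


perm = dense_to_rholearn_perm(max_l=4, n_radial=4)
dense = [float(i) for i in range(100)]  # one atom's coefficients
flat = [dense[i] for i in perm]  # dense -> rholearn order
back = [flat[j] for j in invert(perm)]  # rholearn -> dense order
```

For a multi-atom structure the same per-atom permutation is applied to each block of n_coeffs_per_atom entries.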
What's here
salted_ft/rholearn_adapter.py
build_lmax_nmax(basis_spec, species)
Expand uniform BasisSpec into rholearn's per-species lmax / nmax
dicts (the form expected by convert.coeff_vector_ndarray_to_tensormap).
dense_to_rholearn_flat(coeffs, basis_spec, symbols)
rholearn_flat_to_dense(flat, basis_spec, symbols)
The exact permutation between the two layouts. Roundtrip is the
identity; pinned by tests.
dense_to_tensormap(coeffs, basis_spec, symbols, positions, cell, structure_idx)
Full path that calls rholearn's converter. Lazy-imports rholearn
and metatensor so this module is importable without those deps.
tests/test_salted_rholearn_adapter.py
12 TDD tests across 4 classes:
Build lmax/nmax dicts (species coverage, value match, key form,
total coefficient count matches)
dense_to_rholearn_flat (output length, zero-in-zero-out, dtype,
per-atom block ordering)
Roundtrip (single-atom, multi-atom, permutation-is-nontrivial)
Full TensorMap (key names; skipped locally when sibling rholearn
missing -- auto-activates on Adastra)
Test plan
77 passed + 2 skipped across the salted suite (previous 66 passed +
12 new = 78, of which 1 new test skips locally). The 2 skips are
forward-looking gates: one on the entalsim
VASP single-point maker, one on the rholearn sibling repo. Both
auto-activate as soon as their deps are reachable. Ruff format +
check clean.
Next: D4 (rholearn training submit script that reads our projected
coefficients via this adapter, runs the metatensor-based training,
saves checkpoints). Will need a real Adastra job once D2's
projected-coefficient dataset is on disk.
Needed by tests/test_salted_rholearn_adapter.py (the metatensor TensorMap conversion path uses both). Without them the TestDenseToTensorMap class skips locally, which masks integration breaks until they're caught at runtime on Adastra. Prebuilt binary wheels exist on PyPI, no compilation needed.
Mirrors salted_ft's basis module for the Graph2Mat arm of the r2SCAN density-model comparison. point_basis_for_species and basis_table_for_species expand our uniform BasisSpec(max_l=4, n_radial=4, cutoff=4.0) into Graph2Mat PointBasis objects with basis=[4]*5 and basis_convention='spherical'. PointBasis.basis_size is asserted equal to BasisSpec.n_coeffs_per_atom (100) so projected coefficients stay loadable into Graph2Mat density matrices. 10 TDD tests pinning: type/R/basis_size/convention contracts, one entry per l, species independence, and dedup behaviour of the batch table builder.
CINES policy rejects explicit --partition= asks on the Genoa nodes, so SLURM auto-routes based on the resource size. 16 CPUs/task lands in the exclusive queue (long wait); 4 CPUs/task lands in shared and starts almost immediately. The projection is BLAS-LSQR bound and saturates 4 cores per chunk already, so the smaller ask costs no wall time.
Path A of the Graph2Mat plan: keep the same regression target as SALTED (per-atom basis coefficient vectors from salted_ft) and use Graph2Mat as a different backbone over the same target.
graph2mat_ft.projection exposes:
* pack_coeffs_to_point_labels(coeffs, basis_spec, symbols) flattens (N_atoms, n_coeffs_per_atom) into atom-major point_labels.
* unpack_point_labels_to_coeffs is the inverse.
* make_basis_configuration bundles a structure into a graph2mat.BasisConfiguration so the training driver does not have to reach into graph2mat internals.
14 TDD tests pinning shape, dtype preservation, atom-major ordering, within-atom channel order, length-mismatch ValueError guards, and BasisConfiguration point_types indexing into the species basis list.
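The atom-major packing contract described above can be sketched in a few lines of NumPy. This is an illustrative stand-in, not the graph2mat_ft.projection code: the function names pack_coeffs and unpack_coeffs and their simplified signatures (no basis_spec or symbols arguments) are assumptions made for the example.

```python
import numpy as np


def pack_coeffs(coeffs: np.ndarray) -> np.ndarray:
    """Flatten (n_atoms, n_coeffs_per_atom) into an atom-major 1-D vector."""
    if coeffs.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {coeffs.shape}")
    # C-order reshape keeps each atom's channels contiguous:
    # all of atom 0, then all of atom 1, and so on. dtype is preserved.
    return np.ascontiguousarray(coeffs).reshape(-1)


def unpack_coeffs(labels: np.ndarray, n_atoms: int, n_coeffs_per_atom: int) -> np.ndarray:
    """Inverse of pack_coeffs, with a length-mismatch guard."""
    expected = n_atoms * n_coeffs_per_atom
    if labels.shape != (expected,):
        raise ValueError(f"expected {expected} labels, got shape {labels.shape}")
    return labels.reshape(n_atoms, n_coeffs_per_atom)


# Two atoms with three channels each (the real basis has 100 per atom).
coeffs = np.arange(6, dtype=np.float32).reshape(2, 3)
labels = pack_coeffs(coeffs)
restored = unpack_coeffs(labels, n_atoms=2, n_coeffs_per_atom=3)
```

Keeping the length check in the unpack direction means a mismatch between the structure and the label vector surfaces as a ValueError instead of a silently misaligned reshape.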
…mma) Mirrors salted_ft.model.SALTEDModel. Stub mode (ckpt_path=None) returns deterministic per-atom coefficients seeded off positions + numbers + basis_spec via blake2b, so same structure in -> same coefficients out and small perturbations to any atom change the output. ckpt_path != None raises NotImplementedError until D6 wires in the real Graph2Mat backbone, so the failure mode is loud rather than silently returning stub output during benchmarking. reconstruct_density(atoms, grid_shape) is the convenience entry point for the VASP comparison pipeline. Note: salted_ft.model uses int.from_bytes(seed_bytes[:16], ...) which only seeds off atom 0 -- different bug, same shape, but left alone here per the surgical-changes rule. Worth fixing in its own patch. 10 TDD tests pinning shape, dtype, finiteness, determinism, position-dependence, species-dependence, output magnitude, the NotImplementedError gate for ckpt_path, and the reconstruct_density shape contract.
graph2mat_ft.io re-exports read_chgcar / write_chgcar from salted_ft.io so the two arms share a single implementation (including the n_electrons rescaling that VASP ICHARG=1 needs). Tests pin the identity of the re-exports so a future fix in salted_ft.io automatically propagates.
Graph2Mat's native target is D_ab in an atom-centered basis. VASP does not output that; we would have to invent a CHGCAR -> D_ab projection (10^6 x 10^6 dense LSQR per structure, needs matrix-free + neighbor-cutoff and its own quality validation). Multi-week effort, no clear win for the SCF-speedup goal vs the three arms already in flight. The PointBasis adapter, projection helpers, model wrapper and shared IO surface stay in tree as green-tested scaffolding so the arm can be revived (with SIESTA training data, a matrix-free projection, or a vector-output hijack) without rewriting from zero.
scripts/density_model_eval.py loops over a LeMat-Rho-shaped test parquet, runs the selected arm to predict the density on the ground-truth grid, and writes per-row NMAPE / RMSE / NRMSE into an output parquet. Importable for D8 (the comparison-table builder) via evaluate_dataset(...).
Arm coverage in this alpha:
* salted: fully wired through SALTEDModel.reconstruct_density. Stub mode (no ckpt) works; real mode lights up when D6 (SALTED training driver) lands.
* charge3net, deepdft: dispatcher raises NotImplementedError with a TODO pointing at D7-beta (probe batching). Catches a future user feeding a real-arm name and silently getting stub metrics.
* unknown name: ValueError at the boundary.
Metrics are numpy-only on flat or 3D arrays (no probe-padding mask needed because grid eval has no padding).
14 TDD tests pin metric values, dispatcher contract, parquet schema (model, ckpt, material_id, n_atoms, nmape, rmse, nrmse), finiteness, and the --limit smoke-test path.
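For readers unfamiliar with the three metrics, a minimal NumPy sketch follows. The commit message does not spell out the formulas, so the definitions here are assumptions: NMAPE as the summed absolute error over the summed absolute reference density (in percent), and NRMSE as RMSE divided by the range of the reference. The script may normalise differently.

```python
import numpy as np


def _flat(a) -> np.ndarray:
    # Accept flat or 3-D arrays; the metrics do not depend on grid layout.
    return np.asarray(a, dtype=np.float64).ravel()


def nmape(pred, true) -> float:
    """Normalised mean absolute percentage error (assumed definition)."""
    p, t = _flat(pred), _flat(true)
    return float(100.0 * np.abs(p - t).sum() / np.abs(t).sum())


def rmse(pred, true) -> float:
    p, t = _flat(pred), _flat(true)
    return float(np.sqrt(np.mean((p - t) ** 2)))


def nrmse(pred, true) -> float:
    """RMSE divided by the range of the reference (assumed normalisation)."""
    t = _flat(true)
    return rmse(pred, true) / float(t.max() - t.min())


# Synthetic toy grid, for illustration only.
true = np.array([1.0, 2.0, 3.0, 4.0]).reshape(2, 2, 1)
pred = true + 0.3
metrics = {"nmape": nmape(pred, true), "rmse": rmse(pred, true), "nrmse": nrmse(pred, true)}
```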
scripts/density_model_comparison_table.py concatenates one or more D7 per-row eval parquets, groups by the model column, and emits a per-arm summary (n, mean +/- std, median for NMAPE / RMSE / NRMSE). Writes both a CSV (machine-readable) and a GitHub-flavour markdown table (paste-into-PR). build_comparison_table(inputs, csv_path, markdown_path) is importable so a Lightning callback / pipeline step can call it directly without spawning a subprocess. CLI driver provided for ad-hoc use. 10 TDD tests pin: per-arm grouping, mean / std / median values, n_structures count, multi-file-per-arm aggregation (sharded eval), markdown content and header structure, and the CSV + markdown write paths.
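The group-then-summarise step can be illustrated with the standard library alone. This is a sketch of the idea, not the script's implementation: summarise and to_markdown are hypothetical names, rows are plain dicts standing in for parquet rows, the standard deviation is the sample one (an assumption), and the demo values are synthetic.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def summarise(rows, metrics=("nmape", "rmse", "nrmse")):
    """Group per-row eval records by the model column and summarise each metric."""
    by_model = defaultdict(list)
    for row in rows:
        by_model[row["model"]].append(row)
    summary = {}
    for model, group in sorted(by_model.items()):
        entry = {"n_structures": len(group)}
        for name in metrics:
            values = [r[name] for r in group]
            entry[name] = {
                "mean": mean(values),
                # Sample std needs at least two values; report 0.0 otherwise.
                "std": stdev(values) if len(values) > 1 else 0.0,
                "median": median(values),
            }
        summary[model] = entry
    return summary


def to_markdown(summary, metric):
    """Render one metric as a GitHub-flavoured markdown table."""
    lines = ["| model | n | mean | std | median |", "| --- | --- | --- | --- | --- |"]
    for model, entry in summary.items():
        s = entry[metric]
        lines.append(
            f"| {model} | {entry['n_structures']} | {s['mean']:.3f} | {s['std']:.3f} | {s['median']:.3f} |"
        )
    return "\n".join(lines)


# Synthetic rows, as if concatenated from two sharded eval files.
rows = [
    {"model": "salted", "nmape": 1.0},
    {"model": "salted", "nmape": 3.0},
    {"model": "deepdft", "nmape": 2.0},
]
summary = summarise(rows, metrics=("nmape",))
table = to_markdown(summary, "nmape")
```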
The old int.from_bytes(seed_bytes[:16], ...) only consumed the first 16 bytes of positions + numbers + spec, which is two-thirds of atom 0's xyz and nothing else. Perturbing any atom past index 0 produced identical stub coefficients, silently collapsing distinct structures into the same seed. Switch to a blake2b(digest_size=16) hash over the full buffer so every atom contributes. Same fix already in graph2mat_ft.model. Regression test pins the multi-atom case: nudging atom 1 in a two-atom Fe cell must change the predicted coefficients.
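The difference between the two seeding schemes is easy to reproduce. The sketch below is an assumption-laden reconstruction, not the salted_ft code: the buffer layout (float64 positions followed by int64 atomic numbers), the little-endian byte order, and the helper names are all hypothetical, and the basis spec is left out of the buffer for brevity.

```python
import hashlib

import numpy as np


def structure_buffer(positions, numbers) -> bytes:
    # Assumed layout: float64 positions, then int64 atomic numbers.
    pos = np.asarray(positions, dtype=np.float64)
    num = np.asarray(numbers, dtype=np.int64)
    return pos.tobytes() + num.tobytes()


def seed_truncated(buf: bytes) -> int:
    # Old behaviour: only the first 16 bytes (x and y of atom 0) are used.
    return int.from_bytes(buf[:16], "little")


def seed_hashed(buf: bytes) -> int:
    # New behaviour: a 16-byte blake2b digest over the whole buffer.
    return int.from_bytes(hashlib.blake2b(buf, digest_size=16).digest(), "little")


positions = np.array([[0.0, 0.0, 0.0], [1.4, 1.4, 1.4]])
numbers = [26, 26]  # two-atom Fe cell
nudged = positions.copy()
nudged[1, 0] += 0.01  # perturb atom 1 only

buf_a = structure_buffer(positions, numbers)
buf_b = structure_buffer(nudged, numbers)

# Stub coefficients drawn from the hashed seed change when atom 1 moves.
coeffs_a = np.random.default_rng(seed_hashed(buf_a)).standard_normal((2, 100))
coeffs_b = np.random.default_rng(seed_hashed(buf_b)).standard_normal((2, 100))
```

With the truncated seed both buffers map to the same integer, which is the collapse the regression test guards against.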
Wires the charge3net arm in scripts/density_model_eval.py. Builds the full-grid input dict via charge3net's own KdTreeGraphConstructor (so atom + probe edges match training), batches probes through src.utils.predictions.split_batch, and reshapes the concatenated forward output to (Nx, Ny, Nz). predict_density now accepts an optional pre-loaded model so tests inject a mock without going through ChargE3NetWrapper + a real ckpt. The charge3net_ft.model import is forced for its sys.path side effect (adds ../charge3net) so the data utilities resolve even when the caller supplies the model directly. Tests skip cleanly when the charge3net sibling repo is absent (integration-only). Two new mock-model tests pin: full-grid shape contract, value reshape order (constant predictions reproduce a constant grid), and that lowering max_probe_batch increases the forward-pass count. DeepDFT branch still gated behind NotImplementedError (separate forward signature, lands in D7-beta2).
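The batching and reshape contract pinned by the mock-model tests can be sketched without charge3net at all. Everything here is a simplified assumption: predict_grid and CountingConstantModel are hypothetical names, the model is called on a bare array of probe coordinates instead of the real input dict, and plain slicing stands in for split_batch.

```python
import numpy as np


def predict_grid(model, probe_xyz, grid_shape, max_probe_batch):
    """Run probes through the model in chunks and reshape to (Nx, Ny, Nz)."""
    n_probes = int(np.prod(grid_shape))
    if probe_xyz.shape != (n_probes, 3):
        raise ValueError(f"expected {(n_probes, 3)} probes, got {probe_xyz.shape}")
    chunks = []
    for start in range(0, n_probes, max_probe_batch):
        chunks.append(model(probe_xyz[start:start + max_probe_batch]))
    # Probes are generated in C order, so a C-order reshape restores the grid.
    return np.concatenate(chunks).reshape(grid_shape)


class CountingConstantModel:
    """Mock model: returns a constant density and counts forward passes."""

    def __init__(self, value):
        self.value = value
        self.calls = 0

    def __call__(self, xyz):
        self.calls += 1
        return np.full(len(xyz), self.value)


grid_shape = (2, 3, 4)
axes = [np.arange(n) / n for n in grid_shape]  # fractional grid coordinates
probe_xyz = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)

one_pass = CountingConstantModel(0.5)
grid = predict_grid(one_pass, probe_xyz, grid_shape, max_probe_batch=24)

many_pass = CountingConstantModel(0.5)
predict_grid(many_pass, probe_xyz, grid_shape, max_probe_batch=5)
```

Lowering max_probe_batch trades more forward passes for a smaller peak memory footprint per pass, which is the behaviour the third test checks.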
DeepDFT is the upstream code charge3net forked, so the model input-dict format is identical: probe_xyz, num_probes, probe_edges, etc. _deepdft_predict_grid reuses charge3net's data utilities to build the graph and split_batch to batch probes; the DeepDFT-specific bits are:
* sys.path side effect from deepdft_ft.runner (adds ../DeepDFT and stubs asap3 when its C extension is unbuildable, as on Adastra).
* densitymodel.PainnDensityModel(num_interactions=3, node_size=128, cutoff=4.0) by default; toggle use_painn=False for SchNet.
* ckpt loading via torch.load with the "model" key wrapper.
Optional model= injection identical to charge3net so tests can mock the network. Integration test skips when the DeepDFT sibling repo is absent (this machine); runs on Adastra where it lives.
…p (D6)
Path B of the D6 plan: skip the rholearn integration (would need multi-week Adastra-side iteration) and train a small SchNet-style invariant message-passing net directly on D2's per-atom basis coefficients. MSE loss; AdamW; gradient accumulation per batch since per-structure forward is variable size.
Architecture (salted_ft/train_baseline.py):
* Z embedding (nn.Embedding, max_z=120).
* GaussianRBF distance featurisation over neighbours within BasisSpec.cutoff.
* Two SchNet-style cfconv layers.
* Per-atom readout MLP -> BasisSpec.n_coeffs_per_atom.
Caveat: invariant model means l>0 channels of the SALTED basis will be systematically wrong. This is a baseline; upgrade to e3nn/MACE for proper equivariance if it under-performs.
SaltedTrainingDataset joins D2 source (cartesian_site_positions column) and projected coefficients (training targets) by row_index per matching chunk basename, since D2 output does not carry positions.
submit_salted_baseline_adastra.sh: single-GCD MI250 job, 10 epochs, 24h walltime, ROCm env mirrored from the DeepDFT submit.
8 TDD tests pinning: forward output shape, dtype, finiteness, determinism, species-dependence (catches frozen Z embedding), loss-decrease on a synthetic toy, save/load round-trip, and an end-to-end train() call on a synthetic 2-row dataset.
Replaces _rholearn_predict (which only raised NotImplementedError)
with _baseline_predict: lazy-loads the SaltedBaselineModel from
the D6 ckpt format {basis_spec, model: state_dict}, caches it on
the wrapper, and forwards through torch.no_grad(). The result is
cast to float64 to match the stub-mode contract.
Removes the eager _ensure_rholearn_importable() check from
__init__ since the baseline path does not need the rholearn
sibling repo. The rholearn-faithful path was deferred (graph2mat
arm is parked, SALTED arm uses path B); when it comes back as a
follow-up we will dispatch on ckpt format inside _baseline_predict.
Two new tests: round-trip a baseline state_dict through SALTEDModel
and verify the predicted coefficients differ from the stub seed
(so we know the ckpt is actually driving inference), and assert a
clear RuntimeError on a malformed ckpt.
…izes Job 5003891 OOM-killed (CPU RAM) at ~10 min: slurmstepd reported "Detected 1 oom_kill event" with 64 GB budget. Root cause is the data-buffer footprint, not a model or training-loop issue. The upstream-DeepDFT defaults of RotatingPoolData(pool_size=20) + num_workers=4 keep up to 80 full grids in RAM concurrently. For QM9 (~50^3) and MP (~100^3) that is fine. LeMat-Rho's r2SCAN CHGCARs have a long upper tail (200-300^3), and a single 300^3 sample is ~750 MB once density + grid_pos are materialised; a handful of those in the pool blows past 64 GB. Cut pool_size 20 -> 5 and num_workers 4 -> 2. Effective in-RAM grid count drops 80 -> 10. Hyperparameters that affect training quality (batch_size=2 materials, 1000 probes/material, learning rate, etc.) are unchanged. Verified locally: full test suite still green (195 pass).
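The memory argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes one float64 density value plus three float64 coordinates per grid point, which the commit message does not state; under that assumption a 300^3 sample comes to 864 MB, the same order as the ~750 MB quoted above. It also takes the pessimistic case where every pool slot holds a 300^3 grid, whereas the message only needs a handful of them to exceed the budget.

```python
def grid_bytes(n: int, values_per_point: int = 4, bytes_per_value: int = 8) -> int:
    """Bytes for one n^3 sample: density + xyz grid positions (assumed float64)."""
    return n**3 * values_per_point * bytes_per_value


def pool_bytes(pool_size: int, num_workers: int, n: int) -> int:
    """Upper bound when every pool slot of every worker holds an n^3 grid."""
    return pool_size * num_workers * grid_bytes(n)


BUDGET = 64 * 2**30  # 64 GiB job memory limit

before = pool_bytes(pool_size=20, num_workers=4, n=300)  # 80 grids resident
after = pool_bytes(pool_size=5, num_workers=2, n=300)    # 10 grids resident
typical = pool_bytes(pool_size=20, num_workers=4, n=100)  # MP-sized grids

for label, size in [("before", before), ("after", after), ("typical", typical)]:
    print(f"{label}: {size / 1e9:.2f} GB, fits in budget: {size < BUDGET}")
```

The same 80-slot configuration that is harmless for 100^3 grids overshoots the budget once the slots fill with 300^3 grids, while the 10-slot configuration stays well under it.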
…Flow (P4)
For each held-out test row, predicts the density via the chosen arm (salted, charge3net, deepdft) using the existing density_model_eval.predict_density, writes a CHGCAR with the n_electrons rescaling salted_ft.io.write_chgcar applies, and submits the paired baseline + predicted r2SCAN single-point Flow via entalsim.dft.scf_speedup.make_scf_speedup_pair plus entalsim.core.submit.submit_workflow.
Driver is dependency-injectable on the two entalsim callables (make_pair_fn, submit_fn) so its tests pass locally without entalsim installed; the CLI imports them at runtime via lazy imports.
Fail-fast guards at run_experiment call time:
* charge3net or deepdft without --ckpt raises ValueError (those arms with no weights produce random-init predictions and waste HPC time)
* salted without --ckpt is allowed: stub mode is the documented fallback while D6 trained weights are pending
One per-row chgcar directory keyed by (model, material_id) so multiple rows never share a CHGCAR file; make_scf_speedup_pair's prev_dir mechanism receives the right directory.
9 TDD tests pinning: dry-run writes one CHGCAR per row + does not submit; make_pair gets metadata with material_id + arm + experiment; --limit caps rows processed; non-dry-run submits per row with the right project + worker; submitted=True/False flag appears on the returned records; charge3net + deepdft without --ckpt fails fast; salted stub-mode ckpt label propagates; per-row CHGCAR directories are unique.
…4 hardening)
Reviewer flagged two blockers on the multi-hour submit loop:
* a single bad row killed the batch and left already-submitted
Flows on Mongo with no resume path
* no per-row logging meant a row-200 failure left no breadcrumb
for diagnosis
This commit addresses both; a chgcar-dir contract nit is left for a later commit.
Per-row resilience:
* try/except Exception around the prediction + flow-build + submit
body. A failed row records {"error": repr(e), "submitted": False}
and the loop continues with the next row.
Resumable JSONL manifest:
* records stream to {chgcar_dir}/manifest.jsonl by default
(overridable via --manifest) AFTER each row, in finally:, so an
interrupted run leaves an inspectable record.
* --skip-existing reads the manifest at start and skips rows with
submitted=True for THIS model. Failed rows (submitted=False) are
always retried.
Observability:
* tqdm.auto wrapper on df_in.iterrows() with desc=
f"scf_speedup({model_name})" -- visible progress bar without
spamming the log.
* logger.info per row (material_id, arm, n_jobs, submitted) plus
logger.exception on per-row failure for full traceback.
* main() configures basicConfig(level=INFO) so the CLI path emits
logs straight to stderr.
5 new TDD tests:
* TestPerRowResilience: a corrupt positions cell in row 2 of 3
fails that row only; the other two complete normally.
* TestManifest.test_manifest_jsonl_written_after_each_row: 3 rows
-> 3 JSONL lines in the manifest.
* TestManifest.test_manifest_defaults_to_chgcar_dir: implicit
manifest path lands at chgcar_dir/manifest.jsonl.
* TestSkipExisting.test_skip_existing_skips_already_submitted_rows:
pre-populated manifest with submitted=True skips that row.
* TestSkipExisting.test_skip_existing_does_not_skip_failed_rows:
submitted=False rows are retried, not skipped.
14 / 14 tests green; full suite green (204+ tests).
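The resilience and resume behaviour described in this commit can be sketched as follows. This is a simplified illustration under stated assumptions, not the driver itself: run and load_submitted are hypothetical names, rows are plain dicts, and submit_fn stands in for the whole prediction + flow-build + submit body.

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("scf_speedup")


def load_submitted(manifest: Path, model: str) -> set:
    """Material IDs already submitted for this model, per the manifest."""
    done = set()
    if manifest.exists():
        for line in manifest.read_text().splitlines():
            if not line.strip():
                continue
            rec = json.loads(line)
            if rec.get("model") == model and rec.get("submitted"):
                done.add(rec["material_id"])
    return done


def run(rows, model, manifest: Path, submit_fn, skip_existing=False):
    done = load_submitted(manifest, model) if skip_existing else set()
    records = []
    for row in rows:
        material_id = row["material_id"]
        if material_id in done:
            continue  # failed rows are not in `done`, so they are retried
        record = {"model": model, "material_id": material_id, "submitted": False}
        try:
            submit_fn(row)
            record["submitted"] = True
        except Exception as exc:  # one bad row must not kill the batch
            record["error"] = repr(exc)
            logger.exception("row %s failed", material_id)
        finally:
            # Append after every row so an interrupted run leaves a record.
            with manifest.open("a") as fh:
                fh.write(json.dumps(record) + "\n")
        records.append(record)
    return records


def flaky_submit(row):
    if row["material_id"] == "b":
        raise ValueError("corrupt positions cell")


manifest = Path(tempfile.mkdtemp()) / "manifest.jsonl"
rows = [{"material_id": m} for m in ("a", "b", "c")]
first = run(rows, "salted", manifest, flaky_submit)
second = run(rows, "salted", manifest, lambda row: None, skip_existing=True)
```

In the demo the first pass records one failure out of three rows, and the second pass with skip_existing touches only the failed row.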
Reviewer flagged two items worth addressing.
LeMaterial#4 CHGCAR directory layout
* was: chgcar_root / f"{model}__{material_id}/CHGCAR"
* now: chgcar_root / model / material_id / CHGCAR
* the flat layout would have been ambiguous for synthesised IDs containing the separator (e.g. "oqmd__1234"). Nested avoids that entirely and is also more ls-friendly when sweeping models.
* new test test_chgcar_layout_is_nested_by_model_then_material_id asserts the path tail.
LeMaterial#8 Test-data realism
* the existing _toy_parquet uses 2-atom H2 cells with grid_shape=(4,4,4) and n_electrons=2.0 -- a missing n_electrons rescale, a positions-reshape bug, or a grid/atom mismatch would all pass silently.
* new TestRealisticRow.test_5_atom_asymmetric_grid_unequal_n_electrons exercises an FeO4 row with grid_shape=(8,10,12) and n_electrons=12.5 != sum(Z). Catches mutations on the reshape and rescale paths.
16 / 16 tests green; full suite green.
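The separator ambiguity in the flat layout is easy to see with pathlib. The helper names flat_path and nested_path below are hypothetical and only mirror the two layouts quoted in the commit message.

```python
from pathlib import Path


def flat_path(root: Path, model: str, material_id: str) -> Path:
    # Old layout: model and material_id joined by "__" in one directory name.
    return root / f"{model}__{material_id}" / "CHGCAR"


def nested_path(root: Path, model: str, material_id: str) -> Path:
    # New layout: one directory level per key, so no separator is needed.
    return root / model / material_id / "CHGCAR"


root = Path("chgcar_root")
flat = flat_path(root, "salted", "oqmd__1234")
nested = nested_path(root, "salted", "oqmd__1234")

# Splitting the flat directory name yields three pieces, not two, so the
# boundary between model and material_id has to be guessed.
flat_pieces = flat.parent.name.split("__")

# The nested layout recovers both keys directly from the path components.
model, material_id = nested.parts[-3], nested.parts[-2]
```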
Summary
Stacked on PR LeMaterial#8. Adds an Adastra-side variant of the ChargE3Net fine-tuning pipeline (NVIDIA A100 on Jean Zay → AMD MI250X on Adastra/CINES) without touching charge3net_ft/. Same training code, same dataset layout; only the submit script + setup runbook differ.
What's in this PR
| File | What |
| --- | --- |
| submit_charge3net_adastra.sh | MI250 SLURM headers, ROCm HIP_VISIBLE_DEVICES alignment, batch_size=8 (vs A100's 4, since MI250X has 64 GB HBM2e per GCD), val_probes=1000, online W&B (Adastra proxy gives live internet), auto-resume from latest.pt. Submit dir defaults to cad16353 scratch, account billed to c1816212. |
| ADASTRA.md | Setup runbook covering the port blockers listed below. |
| tests/test_data.py | test_ignores_extra_columns regression test for the Bader-analysis columns that Entalpic/lemat-rho-v1 added (bader_charges, bader_volumes, material_id). |
Port blockers solved
| Symptom | Root cause | Fix |
| --- | --- | --- |
| pip install returns HTTP 000 | HTTP_PROXY is not exported automatically on Adastra | HTTP_PROXY=http://proxy-l-adastra.cines.fr:3128 (+ HTTPS, lowercase); now in ~/.bashrc on Adastra |
| Setup missing from scratch | 30-day scratch purge | $LEMATRHO_ADASTRA_SETUP is now rebuildable from sources |
| pip install boto3 times out | pip defaults to gorgone.cines.fr (missing boto3) | pip install --index-url https://pypi.org/simple ... for non-torch deps |
| snapshot_download reports 100% but cache is empty | huggingface_hub Xet backend silently no-ops the payload fetch | Raw curl with Authorization: Bearer per file (3.5 GB in 16 s with xargs -P 8) |
| sbatch: You are not allowed to ask for a qos | --qos=debug not granted on team accounts | Drop --qos; default works with 6 h MaxWall |
| Job ends with exit code 0:53 (signal 53 = prolog failure), no log files | c1816212 group inode quota at hard cap (Ali owns ~85% of 1.1M files) | Submit from cad16353 scratch with --account=c1816212 (active window). Account and scratch dir are independent in SLURM. |
| .out lands in $HOME | sbatch over SSH without a prior cd defaults WorkDir=$HOME | cd $WORK_DIR && sbatch ... in the submit script |
Reference smoke run
Job 4969516 on g1342, 2026-05-19. Loaded 65,239 / 68,549 valid materials from 69 parquet chunks. 1,150 training steps in 12 min wall, train L1 down from 29.95 (step 50) → 5.67 (step 1,000). Hit TIMEOUT before completing the epoch (expected: one epoch ≈ 150 min at the debug-run knobs); no val/test metrics yet. A 6 h job under the production knobs in this script is the next step.
Test plan