
feat(adastra): port ChargE3Net fine-tuning to AMD MI250X on CINES Adastra #1

Draft
speckhard wants to merge 36 commits into feat/charge3net from feat/charge3net-adastra

@speckhard

Summary

Stacked on PR LeMaterial#8. Adds an Adastra-side variant of the ChargE3Net fine-tuning pipeline (NVIDIA A100 on Jean Zay → AMD MI250X on Adastra/CINES) without touching charge3net_ft/. Same training code, same dataset layout; only the submit script + setup runbook differ.

What's in this PR

| File | What |
| --- | --- |
| `submit_charge3net_adastra.sh` | MI250 SLURM headers, ROCm HIP_VISIBLE_DEVICES alignment, batch_size=8 (vs A100's 4, since MI250X has 64 GB HBM2e per GCD), val_probes=1000, online W&B (Adastra proxy gives live internet), auto-resume from latest.pt. Submit dir defaults to cad16353 scratch, account billed to c1816212. |
| `ADASTRA.md` | Step-by-step setup (proxy, venv, dataset transfer) + a gotchas table covering the seven port blockers. |
| `tests/test_data.py` | New `test_ignores_extra_columns` regression test for the Bader-analysis columns that Entalpic/lemat-rho-v1 added (bader_charges, bader_volumes, material_id). |

Port blockers solved

| # | Symptom | Cause | Fix |
| --- | --- | --- | --- |
| 1 | pip install returns HTTP 000 | Adastra doesn't auto-set HTTP_PROXY | Export HTTP_PROXY=http://proxy-l-adastra.cines.fr:3128 (+ HTTPS, lowercase); now in ~/.bashrc on Adastra |
| 2 | April setup vanished | CINES 30-day scratch purge | Setup tree under $LEMATRHO_ADASTRA_SETUP is now rebuildable from sources |
| 3 | pip install boto3 times out | Adastra's pip prefers gorgone.cines.fr (missing boto3) | pip install --index-url https://pypi.org/simple ... for non-torch deps |
| 4 | snapshot_download reports 100% but cache is empty | HF Xet backend silently no-ops on Adastra | Raw curl with Authorization: Bearer per file (3.5 GB in 16 s with xargs -P 8) |
| 5 | sbatch: You are not allowed to ask for a qos | --qos=debug not granted on team accounts | Omit --qos; default works with 6 h MaxWall |
| 6 | Exit code 0:53 (signal 53 = prolog failure), no log files | c1816212 group inode quota at hard cap (Ali owns ~85% of 1.1M files) | Cross-account setup: submit dir on cad16353 scratch (390 k headroom), --account=c1816212 (active window). Account and scratch dir are independent in SLURM. |
| 7 | sbatch .out lands in $HOME | sbatch over SSH without cd defaults WorkDir=$HOME | cd $WORK_DIR && sbatch ... in the submit script |

Reference smoke run

Job 4969516 on g1342, 2026-05-19. Loaded 65,239 / 68,549 valid materials from 69 parquet chunks. 1,150 training steps in 12 min wall, train L1 down from 29.95 (step 50) → 5.67 (step 1,000). Hit TIMEOUT before completing the epoch (expected: one epoch ≈ 150 min at the debug-run knobs); no val/test metrics yet. A 6 h job under the production knobs in this script is the next step.

Test plan

  • `pytest tests/test_data.py -v` — 11/11 pass including the new `test_ignores_extra_columns`
  • `ruff format` + `ruff check` — clean
  • Manual smoke run on Adastra (job 4969516) — pipeline trains end-to-end on AMD MI250X
  • Real-data 6 h run with production knobs (`batch=8`, `val_probes=1000`, online W&B) — follow-up, not in this PR

speckhard added 6 commits May 20, 2026 13:26
…stra

Adds an Adastra-side variant of submit_charge3net.sh and a runbook
covering the seven blockers encountered during the port:
- HTTP proxy must be set explicitly (Adastra doesn't auto-export it),
- 30-day scratch purge wipes setup, so $LEMATRHO_ADASTRA_SETUP is
  rebuildable from sources,
- pip on Adastra defaults to gorgone.cines.fr (missing boto3 etc);
  --index-url https://pypi.org/simple is required,
- huggingface_hub Xet backend silently no-ops the payload fetch, so
  raw curl with Authorization: Bearer is used for the dataset,
- --qos=debug is not granted on the team accounts,
- group inode quota on /lus/scratch/CT10/c1816212/ is at the hard cap,
  so the submit dir lives on cad16353 scratch while the job is billed
  to c1816212 (account and scratch dir are independent dimensions),
- sbatch over SSH defaults WorkDir to $HOME unless cd'd first.

submit_charge3net_adastra.sh mirrors the Jean Zay script (auto-resume
from latest.pt, 50-epoch budget) but with MI250 SLURM headers, ROCm
HIP_VISIBLE_DEVICES alignment, batch_size=8 (HBM2e has 64 GB per GCD
vs A100's 40-80), val_probes=1000, and online W&B (the Adastra proxy
gives us live internet, so the Jean Zay offline-then-sync dance is
unnecessary).

Adds a regression test test_ignores_extra_columns for the dataset
loader: Entalpic/lemat-rho-v1 added Bader analysis columns
(bader_charges, bader_volumes, material_id) which would have broken
_build_parquet_index if it didn't honor the four-column _COLUMNS
allowlist. The test confirms the allowlist still holds.

Reference smoke run: job 4969516 on g1342, May 19 2026. 65,239 of
68,549 valid materials loaded from 69 parquet chunks. 1,150 training
steps in 12 min wall, train L1 down from 29.95 at step 50 to 5.67
at step 1,000. Hit TIMEOUT before completing the epoch (expected:
one epoch needs ~150 min at batch=4), no val/test metrics yet; a
follow-up 6h job under the production knobs will produce those.
…uards

Adds tests/test_equivariance.py with 7 structural tests that pin down the
architectural properties needed for ChargE3Net's rotational equivariance
guarantee:

- Production model has 1.9M params (catches drift that would break loading
  charge3net_mp.pt).
- atom_irreps_sequence reaches lmax >= 4 (the "higher-order" in the paper
  title; a silent drop to lmax=0 would degenerate the model to a much
  weaker scalar-only baseline).
- Atom representation includes both even and odd parity components.
- get_irreps(500, lmax=4) returns 10 entries with no zero-multiplicity
  irreps (catches a regression that would silently delete some irreps).
- atom_irreps_sequence length matches num_interactions.
- Atom-model cutoff matches the 4.0 A baked into KdTreeGraphConstructor in
  LeMatRhoDataset.
- Final irreps are an e3nn o3.Irreps instance (replacing this with a plain
  list would silently break equivariance while still producing output).

A runtime equivariance check (rotate inputs, predict, compare) is the gold
standard but requires a real forward pass at production hyperparameters
that is too slow for a CPU unit test. The structural tests cover the same
property at the architecture level.
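
For reference, the runtime check described above (rotate inputs, predict, compare) can be sketched on a toy model. This is only an illustration under stated assumptions: `toy_density` is a hypothetical stand-in (a sum of Gaussians around each atom), not ChargE3Net, and only a single rotation about the z axis is tried.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def toy_density(atoms, probe):
    """Hypothetical stand-in model: sum of unit Gaussians centred on each atom."""
    return sum(math.exp(-math.dist(atom, probe) ** 2) for atom in atoms)


def check_rotation_invariance(model, atoms, probes, angle, tol=1e-9):
    """Rotate atoms and probes together; a scalar density must not change."""
    rotated_atoms = [rotate_z(a, angle) for a in atoms]
    for probe in probes:
        before = model(atoms, probe)
        after = model(rotated_atoms, rotate_z(probe, angle))
        if abs(before - after) > tol:
            return False
    return True


atoms = [(0.0, 0.0, 0.0), (1.2, 0.3, -0.4)]
probes = [(0.5, 0.5, 0.5), (-1.0, 0.2, 0.7)]
ok = check_rotation_invariance(toy_density, atoms, probes, angle=0.73)
```

A model that reads raw coordinates (for example one returning the probe's x value) fails this check, which is the kind of regression the structural tests try to catch without a forward pass.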

Tests autoskip when the sibling AIforGreatGood/charge3net repo is absent.
… training

Two changes motivated by job 4969727 (FAILED after 1h47m on the previous
single-GPU submit):

1. Multi-GPU via torch DistributedDataParallel. The paper uses per-GPU
   batch=16 across 4 GPUs (effective batch=64). Our previous Adastra
   submit was single-GPU batch=8 — 8x smaller effective batch. With the
   half-node submit (4 GCDs, 64 CPUs, 128 GB RAM, batch=16 per GCD) the
   effective batch now matches the paper.

   Implementation:
   - New _setup_ddp / _is_ddp / _is_main helpers in train.py read
     WORLD_SIZE / RANK / LOCAL_RANK / MASTER_ADDR / MASTER_PORT from
     the env (set in the submit script via srun + scontrol show hostname).
   - Backend is nccl which routes through RCCL on AMD ROCm builds.
   - Model wrapped in DistributedDataParallel after .to(device).
   - DistributedSampler injected into the train loader via a new
     distributed=True flag on build_dataloaders. Val/test stay
     non-distributed; cheap enough at 5% of 65k.
   - DistributedSampler.set_epoch called each epoch for proper shuffling.
   - All prints and wandb logs gated on is_main (rank 0 only).
   - Save and load go through a new _unwrap helper so checkpoints are
     interchangeable between single-GPU and DDP runs.
   - dist.barrier at end of each epoch to keep ranks in lockstep
     during checkpoint saves.
   - dist.destroy_process_group at the very end.

2. Wandb soft-fail. wandb.init now sits inside try/except — if the
   compute node can't reach api.wandb.ai through the proxy (which is
   what killed job 4969727 after 5min of timeouts and 1h47m elapsed
   total), the script logs a warning and sets use_wandb=False so
   training proceeds with stdout + checkpoints only.
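
A minimal sketch of the soft-fail pattern from item 2, assuming nothing about the real train.py beyond the try/except described above. `init_fn` is a hypothetical stand-in for `wandb.init`, and `_unreachable` imitates a compute node that cannot reach the API.

```python
import logging

logger = logging.getLogger("train")


def init_tracking_soft(init_fn, **kwargs):
    """Call `init_fn`; on any failure, warn and report tracking as disabled.

    Returns (run, use_wandb). `init_fn` stands in for wandb.init here.
    """
    try:
        run = init_fn(**kwargs)
    except Exception as exc:  # timeouts, auth errors, proxy failures
        logger.warning("tracking init failed (%s); continuing without it", exc)
        return None, False
    return run, True


def _unreachable(**kwargs):
    # Hypothetical failure used only for this demonstration.
    raise TimeoutError("api unreachable")


run, use_wandb = init_tracking_soft(_unreachable, project="demo")
```

Every later logging call would then be gated on `use_wandb`, so a tracking outage costs metrics but not the training run.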

Submit script (submit_charge3net_adastra.sh) updated for half-node:
   --nodes=1 --ntasks-per-node=4
   --gpus-per-node=4 --cpus-per-task=16
   --mem=125000M  --time=06:00:00
plus srun-based DDP launcher that exports RANK/LOCAL_RANK per task,
batch_size=16 per GPU, val_probes=1000, wandb-mode=offline.

Test plan
- pytest tests/ ... 34 passed, 1 failure pre-existing (test_metrics
  collection error from src.charge3net path shadowing in pytest;
  unrelated, same on main).
- ruff format + check clean on the touched files.
- DDP path not yet exercised end-to-end on Adastra; the immediate
  next step is a 6h submission. If the DDP init fails, the
  single-GPU code path is still reachable by running without srun.
…om-scratch (TDD)

The submit script now reads LEMATRHO_TRAINING_MODE to switch between
two runs that share all infrastructure (same DDP, same hyperparams,
same dataset, same node layout) but differ in init:

  pretrained   (default)  --ckpt-path charge3net_mp.pt
                          save-dir charge3net_checkpoints/
                          WANDB_NAME=pretrained_mp
  from_scratch            no --ckpt-path (random init)
                          save-dir charge3net_checkpoints_fromscratch/
                          WANDB_NAME=from_scratch

Auto-resume from latest.pt is per-mode (the two save-dirs don't
collide), so each arm can be relaunched independently via
sbatch ... submit_charge3net_adastra.sh until val NMAPE plateaus.

Also adds a LEMATRHO_DRY_RUN=1 escape hatch that prints the resolved
train command and exits 0 without sourcing the venv or invoking srun.
Used by the 9 new pytest tests in tests/test_submit_script.py:
  - dry-run prints train command
  - default mode is pretrained, uses MP checkpoint
  - pretrained writes to charge3net_checkpoints (not fromscratch dir)
  - from_scratch drops --ckpt-path completely and never references
    charge3net_mp.pt
  - from_scratch uses a separate save dir
  - WANDB_NAME differs between modes
  - invalid mode exits non-zero with a clear error
  - batch-size 16, val-probes 1000 (paper-matching)
  - wandb-mode is offline

TDD: 9 tests RED before the refactor, all GREEN after. Full suite
still 33 passed (data + model + equivariance + submit). ruff format
+ check clean.
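
The mode switch can be sketched in Python as follows. The real logic lives in the shell submit script; this is an illustrative re-statement, and the exact flag spellings other than `--ckpt-path` (`--save-dir`, `--batch-size`, `--val-probes`, `--wandb-mode`) are assumptions inferred from the description above.

```python
VALID_MODES = ("pretrained", "from_scratch")


def resolve_train_args(env):
    """Build the training argument list and W&B run name from an env mapping."""
    mode = env.get("LEMATRHO_TRAINING_MODE", "pretrained")
    if mode not in VALID_MODES:
        raise ValueError(
            f"invalid LEMATRHO_TRAINING_MODE={mode!r}; expected one of {VALID_MODES}"
        )
    # Shared knobs: identical for both arms (flag spellings are assumed).
    args = ["--batch-size", "16", "--val-probes", "1000", "--wandb-mode", "offline"]
    if mode == "pretrained":
        # Only the pretrained arm initialises from the MP checkpoint.
        args += ["--ckpt-path", "charge3net_mp.pt"]
        args += ["--save-dir", "charge3net_checkpoints"]
        wandb_name = "pretrained_mp"
    else:
        # Random init: no --ckpt-path at all, and a separate save dir.
        args += ["--save-dir", "charge3net_checkpoints_fromscratch"]
        wandb_name = "from_scratch"
    return args, wandb_name


scratch_args, scratch_name = resolve_train_args(
    {"LEMATRHO_TRAINING_MODE": "from_scratch"}
)
```

Keeping the resolution in one pure function is what makes a dry-run mode cheap to test: the tests only need to inspect the returned argument list.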

Submission examples in the script header and in ADASTRA.md.
PR 1 of a 2-PR stack to land DeepDFT as a baseline for the
ChargE3Net VASP-speedup experiment. This PR adds only the data
adapter; PR 2 will add the training submission (DDP-patched).

What's here:
  deepdft_ft/__init__.py         empty package marker
  deepdft_ft/data.py             LeMatRhoDeepDFTDataset adapter
  tests/test_deepdft_data.py     11 TDD tests pinning the contract

The adapter reuses charge3net_ft.data's _row_to_atoms_and_density and
_build_parquet_index, then re-shapes the per-sample output into the
dict that DeepDFT's CollateFuncRandomSample expects:

  {
      "density":       np.ndarray (Nx, Ny, Nz),
      "atoms":         ase.Atoms,
      "origin":        np.ndarray (3,),
      "grid_position": np.ndarray (Nx, Ny, Nz, 3),
      "metadata":      {"filename": str},
  }

_calculate_grid_pos is inlined from upstream DeepDFT/dataset.py so
this adapter has no runtime dependency on the DeepDFT sibling repo
(which keeps the test suite hermetic).
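
As an illustration of the grid-position convention pinned by the tests below, here is a pure-Python sketch for a single grid index. It assumes the cell stores lattice vectors as rows and the origin is zero; the function name is hypothetical and this is not the inlined upstream code.

```python
def grid_position(index, grid_shape, cell):
    """Cartesian position of grid point (i, j, k), origin at zero.

    `cell` holds the three lattice vectors as rows; point (i, j, k) sits at
    fractional coordinates (i/Nx, j/Ny, k/Nz).
    """
    fractions = [i / n for i, n in zip(index, grid_shape)]
    return tuple(
        sum(fractions[axis] * cell[axis][component] for axis in range(3))
        for component in range(3)
    )


cell = [[4.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 6.0]]
first = grid_position((0, 0, 0), (10, 10, 10), cell)
step = grid_position((1, 0, 0), (10, 10, 10), cell)
```

For this orthorhombic example, `step` is one tenth of the first lattice vector, matching the `(a_lattice / Nx, 0, 0)` contract.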

Tests pinned (RED then GREEN):
  - dataset length matches the count of valid parquet rows
  - sample dict has all 5 required keys
  - density is a 3D numpy array
  - atoms is ase.Atoms with PBC True/True/True
  - origin is zeros (matches LeMat-Rho convention)
  - grid_position has shape (Nx, Ny, Nz, 3)
  - grid_position[0,0,0] = (0,0,0)
  - grid_position[1,0,0] = (a_lattice / Nx, 0, 0)
  - metadata.filename present and unique per sample
  - extra columns (bader_charges, material_id) ignored
  - empty parquet dir raises FileNotFoundError

Caching is keyed by absolute parquet path (not file index) so multiple
LeMatRhoDeepDFTDataset instances pointing at different directories
don't collide on fi=0 (which bit me writing the metadata test).

Full LeMat-Rho test suite: 44 passed. Ruff format + check clean.

Next: PR 2 will add deepdft_ft/runner.py (vendored from upstream
DeepDFT + DDP patches) and submit_deepdft_adastra.sh (4-GCD half-node
DDP, PaiNN model variant for equivariance parity with ChargE3Net).
PR 2 of the DeepDFT-on-LeMat-Rho stack (PR 1 was the data adapter).
Closes the gap from "we have a DeepDFT-compatible Dataset" to "we
can sbatch a 4-GCD DDP DeepDFT training run on Adastra".

What's here:
  deepdft_ft/runner.py            vendored from peterbjorgensen/DeepDFT@main
                                  + DDP patches + LeMat-Rho parquet auto-detect
                                  + asap3 stub (no C++ headers on Adastra)
  submit_deepdft_adastra.sh       half-node 4-GCD DDP submission, PaiNN default,
                                  LEMATRHO_DEEPDFT_VARIANT={painn,schnet} env var,
                                  LEMATRHO_DRY_RUN=1 supported

DDP patches mirror what we did in charge3net_ft/train.py:
- _setup_ddp + _is_main + _unwrap helpers
- DistributedSampler when WORLD_SIZE>1, RandomSampler otherwise
- DistributedDataParallel wrap of the PaiNN/SchNet model
- All logging.info and checkpoint saves gated on rank 0
- Device pinned to cuda:LOCAL_RANK via torch.cuda.set_device
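
A sketch of how such helpers might read the launcher environment, without importing torch. The `DdpEnv` class, `read_ddp_env` and `sampler_kind` are hypothetical names for illustration; only the variable names (WORLD_SIZE, RANK, LOCAL_RANK) and the rank-0 gating come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DdpEnv:
    world_size: int
    rank: int
    local_rank: int

    @property
    def is_ddp(self):
        return self.world_size > 1

    @property
    def is_main(self):
        # Rank 0 owns logging and checkpoint writes.
        return self.rank == 0

    @property
    def device(self):
        return f"cuda:{self.local_rank}"


def read_ddp_env(env):
    """Parse launcher variables; missing ones fall back to single-process."""
    return DdpEnv(
        world_size=int(env.get("WORLD_SIZE", "1")),
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
    )


def sampler_kind(ddp):
    """Name of the sampler used in each case, as described above."""
    return "DistributedSampler" if ddp.is_ddp else "RandomSampler"


single = read_ddp_env({})
worker = read_ddp_env({"WORLD_SIZE": "4", "RANK": "2", "LOCAL_RANK": "2"})
```

With an empty environment everything degrades to the single-GPU path, which is why leaving WORLD_SIZE unset is enough to disable DDP.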

LeMat-Rho parquet auto-detect: if --dataset points at a directory
containing chunk_*.parquet, the runner uses LeMatRhoDeepDFTDataset
(PR 1). Other dataset paths (.tar, .txt, dir of cube/CHGCAR) still
work unchanged — upstream's dataset.DensityData path is preserved.

asap3 stub: upstream DeepDFT imports asap3 at module load. asap3
needs Python.h to build from source which isn't on Adastra (and would
need admin). The stub at the top of runner.py registers a fake asap3
module with a FullNeighborList class that delegates to ASE's
NewPrimitiveNeighborList. Slower than real asap3 but functionally
identical for DeepDFT's call sites. Skipped when real asap3 is
installed.
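
The stub-registration idea can be sketched with the standard library alone. This is an assumption-laden illustration: `install_stub_if_missing` is a hypothetical helper, and the placeholder `FullNeighborList` below only stores its arguments instead of delegating to ASE as the real stub does.

```python
import importlib
import sys
import types


def install_stub_if_missing(name, attributes):
    """Register a stand-in module under `name` unless the real one imports.

    Returns True when the stub was installed, False when the module exists.
    """
    try:
        importlib.import_module(name)
        return False
    except ImportError:
        stub = types.ModuleType(name)
        for key, value in attributes.items():
            setattr(stub, key, value)
        sys.modules[name] = stub
        return True


class FullNeighborList:
    """Placeholder carrying the class name the stub exposes (body illustrative)."""

    def __init__(self, cutoff, atoms=None):
        self.cutoff = cutoff
        self.atoms = atoms


installed = install_stub_if_missing("asap3", {"FullNeighborList": FullNeighborList})
```

Because the stub is placed in `sys.modules` before the vendored code runs, a later `import asap3` resolves to it without any change to the importing module.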

Submit script defaults:
- PaiNN model (matches equivariance of ChargE3Net for the comparison)
- batch=2 (DeepDFT's upstream default — they iterate on probes,
  not materials, so per-batch counts work differently from ChargE3Net)
- cutoff=4.0, num_interactions=3, node_size=128
- max_steps=1e8 (effectively unbounded; SLURM walltime is the limiter)
- WANDB_NAME=deepdft_painn (or deepdft_schnet)

Verified on Adastra: runner module imports cleanly under the venv311,
asap3 stub kicks in without error, parquet directory detection works.
The actual training run will be submitted next.
@speckhard force-pushed the feat/charge3net-adastra branch from 8487ae9 to 8d510d2 on May 20, 2026 11:27
speckhard added 23 commits May 20, 2026 13:28
Root-causes job 4971720's OOM-kill at startup and aligns the DeepDFT
training to the upstream paper's submission settings.

Two changes:

1. submit_deepdft_adastra.sh: switch from half-node DDP (4 GCDs) to
   paper-faithful single-GPU (1 GCD on mi250-shared, HIP_VISIBLE_DEVICES=0,
   WORLD_SIZE unset). Upstream DeepDFT was trained on 1x RTX 3090 per
   pretrained_models/*/submit_script.sh. Single-GPU keeps gradient-step
   semantics identical to the paper's batch=2; no LR sweep needed.

   Effective hyperparameters are now exactly the upstream PaiNN settings
   from pretrained_models/{nmc,qm9,ethylenecarbonate}_painn/commandline_args.txt:
     --cutoff 4
     --num_interactions 3
     --node_size 128
     --max_steps 10000000
     --use_painn_model
     batch_size=2 materials (hardcoded in runner.py)
     train_probes=1000 per material (hardcoded)
     val_probes=5000 per material (hardcoded)

   DDP code paths in runner.py stay in place but only fire when
   WORLD_SIZE>1, so a future DDP variant of DeepDFT is one env flip away.

2. deepdft_ft/runner.py: replace upstream's eager validation preload
   `val_loader = [b for b in val_loader]` with a comment explaining
   why we left it as a streaming DataLoader. Upstream's val sets are
   ~100 materials (NMC, QM9, ethylenecarbonate subsets) so the preload
   is cheap. Our val set is 3,261 materials at 5000 probes each, x4
   ranks under DDP, which materialised ~150 GB and OOM-killed job
   4971720 at startup before a single training step. Streaming the
   val loader is a data-loading detail, not a hyperparameter; the
   model math is unchanged.

Test plan:
- 44/44 local tests still pass (no behavioural changes to the data
  adapter or submit-script env contract; only the runner internals
  and the SLURM headers move).
- New job to be submitted as the next step; will confirm DeepDFT
  trains and produces step-level loss in the .out log.
Observation from jobs 4971293 and 4971343: SLURM bumped both to
EXCLUSIVE mode despite us requesting half-node resources. The
--mem=125000M line was roughly half the 256 GB node's memory, which
crosses SLURM's auto-exclusive threshold.

Dropping --mem entirely lets SLURM allocate memory proportional to
our CPU share (64 of 128 logical CPUs -> ~128 GB out of 256 GB).
The other half of the node stays schedulable for other users / jobs.

The currently running jobs 4971293 and 4971343 keep their exclusive
allocations; only future submissions are affected.

Test plan
- 9/9 tests in tests/test_submit_script.py still pass (no memory
  assertion).
- Will confirm on next sbatch by inspecting AllocTRES.
Root-causes the OOM that killed jobs 4971293 and 4971343 at MaxRSS=35 GB
per rank (140 GB cumulative across 4 DDP ranks, exceeding our 125 GB
--mem budget).

Three changes, all small:

1. charge3net_ft/data.py: bound _TABLE_CACHE with an LRU eviction
   policy capped at _TABLE_CACHE_MAX_CHUNKS=5. OrderedDict gives O(1)
   move-to-end on hit and popitem(last=False) on miss-with-eviction.
   The previous dict was unbounded, so each DataLoader worker
   accumulated every chunk it had ever seen. With ~2 GB per
   pyarrow-decompressed chunk (compressed_charge_density JSON
   strings inflate 6x) and 32 worker processes (8 per rank x 4 ranks),
   the cache alone grew to ~140 GB over 6 h.

2. submit_charge3net_adastra.sh: drop --num-workers from 8 to 2.
   Defense in depth on top of the LRU. At LeMat-Rho's 10x10x10 grid
   size the DataLoader's data-loading throughput isn't the
   bottleneck; 2 workers per rank x 4 ranks = 8 total workers is
   plenty, and per-rank cache pressure now drops by 4x.

3. tests/test_data.py: TestTableCacheLRU adds three regression tests
   (cache size bounded, LRU eviction order is correct, default cap
   is within a sensible range). TDD: RED before changes 1+2, GREEN
   after.

Combined effect: cache pressure on a half-node DDP run drops from
~140 GB to roughly 4 ranks x 2 workers x 5 chunks x 2 GB = 80 GB
worst case, and in practice much less because workers tend to revisit
chunks. Comfortably under the ~128 GB shared-mode default mem.
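
A minimal sketch of the bounded cache, assuming only what is stated above (an OrderedDict, move-to-end on hit, `popitem(last=False)` on eviction). The class name `LRUTableCache` and the `loader` callback are hypothetical; the real code uses a module-level `_TABLE_CACHE`.

```python
from collections import OrderedDict


class LRUTableCache:
    """Keep at most `max_chunks` loaded tables, evicting the least recently used."""

    def __init__(self, max_chunks=5):
        if max_chunks < 1:
            raise ValueError("max_chunks must be at least 1")
        self.max_chunks = max_chunks
        self._tables = OrderedDict()

    def get(self, key, loader):
        if key in self._tables:
            self._tables.move_to_end(key)  # hit: mark as most recently used
            return self._tables[key]
        table = loader(key)  # miss: load, then evict the oldest if over the cap
        self._tables[key] = table
        if len(self._tables) > self.max_chunks:
            self._tables.popitem(last=False)
        return table

    def keys(self):
        return list(self._tables)


cache = LRUTableCache(max_chunks=2)
for path in ["chunk_0", "chunk_1", "chunk_0", "chunk_2"]:
    cache.get(path, loader=lambda p: f"table for {p}")
# chunk_1 was least recently used when chunk_2 arrived, so it is evicted.

worst_case_gb = 4 * 2 * 5 * 2  # ranks x workers x chunks x ~2 GB per chunk
```

The worst-case line restates the arithmetic above: each DataLoader worker holds its own copy of the cache, so the bound scales with ranks times workers.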

Full suite: 47 passed (test_metrics.py pre-existing src-shadow
failure unrelated, same on main).
…ack)

PR alpha of 4 for the SALTED-arm basis-expansion benchmark. This PR
lands only the BasisSpec dataclass and its tests. PRs beta/gamma/delta
land the projection layer, the rholearn model wrapper, and the VASP
CHGCAR I/O respectively.

What's here

  salted_ft/__init__.py         exports BasisSpec, documents the stack
  salted_ft/basis.py            frozen dataclass with the locked-in
                                hyperparameters from Phase A4 of the
                                investigation memo
  tests/test_salted_basis.py    19 TDD tests across 5 categories

Design decisions captured by the tests

  BasisSpec is frozen, hashable, equality-by-value so it can key
  caches and identify metric runs without ambiguity. Mutation raises
  FrozenInstanceError.

  Validation happens in __post_init__ so a malformed spec raises at
  construction time, not deep in a tensor op three PRs from now.
  Negative max_l, zero n_radial, nonpositive sigma, nonpositive
  cutoff all rejected with clear messages.

  Default values match the Phase A4 lockdown verbatim
    max_l=4, n_radial=4, sigma=(0.5,1.0,2.0,4.0), cutoff=4.0
  n_coeffs_per_atom == 100 from the formula n_radial * (max_l+1)**2.
  These numbers were picked to match ChargE3Net's cutoff + lmax for a clean
  side-by-side comparison.

  Shape helpers
    n_angular_components -> (max_l + 1)**2
    n_coeffs_per_atom    -> n_radial * n_angular_components
    total_coeffs_shape(n_atoms) -> (n_atoms, n_coeffs_per_atom)
  used by downstream PRs for tensor allocation.
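
A sketch of what such a frozen, validated dataclass could look like. Field names and defaults follow the text above; the use of properties, the `ValueError` type and the message wording are assumptions, not the contents of salted_ft/basis.py.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class BasisSpec:
    max_l: int = 4
    n_radial: int = 4
    sigma: tuple = (0.5, 1.0, 2.0, 4.0)
    cutoff: float = 4.0

    def __post_init__(self):
        # Fail at construction time rather than deep inside a tensor op.
        if self.max_l < 0:
            raise ValueError(f"max_l must be >= 0, got {self.max_l}")
        if self.n_radial < 1:
            raise ValueError(f"n_radial must be >= 1, got {self.n_radial}")
        if any(s <= 0 for s in self.sigma):
            raise ValueError(f"sigma values must be positive, got {self.sigma}")
        if self.cutoff <= 0:
            raise ValueError(f"cutoff must be positive, got {self.cutoff}")

    @property
    def n_angular_components(self):
        return (self.max_l + 1) ** 2

    @property
    def n_coeffs_per_atom(self):
        return self.n_radial * self.n_angular_components

    def total_coeffs_shape(self, n_atoms):
        return (n_atoms, self.n_coeffs_per_atom)


spec = BasisSpec()
shape = spec.total_coeffs_shape(4)
```

With the defaults, `n_coeffs_per_atom` is 4 * (4 + 1)**2 = 100, and `frozen=True` together with the default `eq=True` makes instances hashable, so a spec can be used as a cache key.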

Why locking these numbers matters

  Every downstream PR (projection, model, I/O) depends on the
  coefficient shape. Changing max_l or n_radial later requires
  retraining and re-running validation. Pin once, build around it.

Test plan

  19/19 tests pass. Ruff format + check clean. No interaction with
  Adastra; pure-Python dataclass.

Next: PR beta = salted_ft/projection.py with project_chgcar_to_basis
and reconstruct_grid_from_basis + their tests.
PR beta of 4. The DIY bridge between VASP plane-wave CHGCAR data and
the rholearn/SALTED localized-basis world. These libraries (SALTED,
rholearn, and also Graph2Mat) target localized-basis DFT codes
(FHI-aims, CP2K, PySCF, SIESTA); VASP is plane-wave. So we have to
build this projection layer ourselves regardless of which upstream
we wrap. See the Phase A memo for the analysis.

What's here

  salted_ft/projection.py
    - _grid_positions(grid_shape, cell) -> (n_grid, 3) Cartesian
    - _real_sph_harm(rhat, lmax) -> (..., (lmax+1)^2) real Y_lm
      values, hand-rolled for lmax <= 4 (covers the locked default).
      Standard SOAP / SALTED component ordering
      [Y_00, Y_1{-1}, Y_10, Y_11, Y_2{-2}, ..., Y_44].
    - _eval_basis_at_grid(atom, grid, cell, spec) ->
      (n_grid, n_coeffs_per_atom) basis-function values with
      minimum-image PBC.
    - project_chgcar_to_basis(density, atoms, basis_spec)
        Orthonormal-approx projection: c_k = <B_k, rho> / <B_k, B_k>.
        v1 stand-in for proper overlap-matrix LSQR which lands in PR
        gamma. Linear in the input density.
    - reconstruct_grid_from_basis(coefficients, atoms, grid_shape,
        basis_spec). Literal expansion sum. Linear in the input
        coefficients.

  tests/test_salted_projection.py
    - TestProjectChgcarToBasis (6 tests)
        shape, zero->zero, dtype, linearity, additivity, finite.
    - TestReconstructGridFromBasis (6 tests)
        shape, zero->zero, dtype, linearity, single-atom-l0-peak-at-
        atom-position, finite.
    - TestProjectionReconstructionRoundtrip (2 tests)
        zero-density and zero-coefficient roundtrips. Tight roundtrip
        accuracy is intentionally NOT pinned; that lands in PR gamma
        when we swap in proper LSQR.

Design notes

  PBC: minimum-image via cell inverse. Adequate when 2*cutoff fits
  inside the smallest cell vector. For very small cells we'd want
  full supercell expansion; out of scope for PR beta.
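
The minimum-image step can be sketched without numpy as follows. This is an illustration, not the module's code: it assumes lattice vectors are stored as rows of `cell`, wraps fractional coordinates to the nearest image, and inherits the same limitation noted above for small or strongly skewed cells.

```python
def _inverse_3x3(m):
    """Inverse of a 3x3 matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("cell is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def _vec_times_matrix(v, m):
    """Row vector times 3x3 matrix."""
    return [sum(v[k] * m[k][j] for k in range(3)) for j in range(3)]


def minimum_image(delta, cell):
    """Map a Cartesian displacement to its nearest periodic image."""
    frac = _vec_times_matrix(delta, _inverse_3x3(cell))  # Cartesian -> fractional
    wrapped = [x - round(x) for x in frac]  # shift each component into [-0.5, 0.5]
    return _vec_times_matrix(wrapped, cell)  # fractional -> Cartesian


cell = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
wrapped = minimum_image([9.0, 0.5, -6.0], cell)
```

In the cubic example a displacement of 9 along x wraps to about -1, and -6 along z wraps to about 4.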

  Numpy-only on purpose. e3nn / torch were tempting for spherical
  harmonics but adding them to a projection module mixes concerns:
  projection should be a clean reference implementation that runs
  on any laptop with numpy.

Test plan

  33/33 tests pass (19 from PR alpha + 14 new). Ruff format + check
  clean. No Adastra interaction; pure numpy.

Next: PR gamma wraps rholearn's training/inference loop as a
SALTEDModel class, pinned against our LeMat-Rho parquet input
pipeline and reusing charge3net_ft.train's NMAPE/RMSE/NRMSE metrics.
PR gamma of 4. Adds the model wrapper that pairs with the projection
+ reconstruction layer from PR beta. The wrapper has a stub mode so
the surrounding pipeline (predict -> reconstruct -> metric) can be
exercised end-to-end without a trained rholearn checkpoint.

What's here

  salted_ft/model.py
    SALTEDModel(basis_spec, ckpt_path=None)
      * __call__(atoms) -> (n_atoms, n_coeffs_per_atom) float64
        coefficients.
      * reconstruct_density(atoms, grid_shape) convenience that runs
        predict + reconstruct_grid_from_basis in one call.
      * Stub mode (ckpt_path=None): deterministic, position-dependent
        coefficients seeded by a hash of the positions / numbers /
        basis spec. Different atoms in -> different coefficients out;
        same atoms in -> same coefficients out (verified by tests).
      * Real-rholearn path raises NotImplementedError for now; lands
        in a follow-up PR once rholearn is configured on Adastra.
    Sibling-repo discovery for rholearn follows the existing
    charge3net_ft / deepdft_ft pattern (lazy; only insists when
    ckpt_path is set).
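
One way to get deterministic, input-dependent stub output is sketched below. It is an assumption about the approach, not the code in salted_ft/model.py: `stub_coefficients` and `spec_tag` are hypothetical names, and the hash and random generator are illustrative choices.

```python
import hashlib
import random


def stub_coefficients(positions, numbers, n_coeffs_per_atom=100,
                      spec_tag="max_l=4,n_radial=4"):
    """Deterministic placeholder coefficients, one row per atom."""
    payload = repr((
        [tuple(round(x, 8) for x in p) for p in positions],
        list(numbers),
        spec_tag,
    )).encode()
    # hashlib gives the same digest in every process, unlike built-in hash().
    seed = int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 1.0) for _ in range(n_coeffs_per_atom)]
        for _ in positions
    ]


base = stub_coefficients([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)], [8, 8])
again = stub_coefficients([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)], [8, 8])
moved = stub_coefficients([(0.0, 0.0, 0.1), (1.0, 1.0, 1.0)], [8, 8])
```

Seeding from a content hash is what lets the pipeline tests assert both reproducibility and sensitivity to the input without a trained checkpoint.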

  salted_ft/projection.py
    Wrapped two more matmul sites in np.errstate to silence the same
    benign divide/invalid/overflow noise we already suppressed in
    _eval_basis_at_grid and _grid_positions.

  tests/test_salted_model.py
    15 TDD tests across 5 categories:
      * Construct: basis_spec stored, default ckpt_path is None.
      * Output shape: single-atom, multi-atom, float64 dtype, finite.
      * Determinism: same input -> same output; position changes
        produce different output (rules out a zero-returning stub).
      * Reconstruct density: shape, dtype, finite, equals the
        explicit (predict, then reconstruct_grid_from_basis) path.
      * Metric integration with charge3net_ft.train's
        compute_nmape / compute_rmse / compute_nrmse: finite scalars,
        self-similarity gives NMAPE=0 sanity check. Pinned per the
        brief: keep metric calculations identical to the ChargE3Net
        pipeline.

Test plan

  48/48 tests across the salted suite pass (19 basis + 14 projection
  + 15 model). Ruff format + check clean. No Adastra interaction;
  pure local Python.

Next: PR delta wraps the CHGCAR I/O via pymatgen so reconstructed
grids can be written to disk for VASP ICHARG=1 single-points. End-to-
end VASP integration test will be gated on the entalsim
StructureVASPSinglePoint maker (separate stack).
PR delta of 4, closes the SALTED scaffold. Adds the boundary between
the predicted-density-tensor world and the VASP-input-file world so a
trained SALTED-arm model can be evaluated end-to-end via paired
SCF runs.

What's here

  salted_ft/io.py
    write_chgcar(density, atoms, path, n_electrons=None)
      Writes a pymatgen Chgcar-compatible file. The n_electrons
      argument rescales the density so its integrated value equals
      the requested electron count; that is what VASP reads as the
      total electron count when starting with ICHARG=1. Without
      rescaling VASP silently fixes the count for us at startup,
      which would mask part of the speedup we are trying to measure.
      Rejects non-3D densities and nonpositive n_electrons with
      clear messages.
    read_chgcar(path) -> (density, atoms)
      The inverse. Converts pymatgen's "density times volume"
      storage convention back to plain rho on the grid.
    Uses pymatgen.io.ase.AseAtomsAdaptor for the ase.Atoms <->
    pymatgen.Structure conversion.
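
The electron-count rescaling can be illustrated on a flat list of grid values. This sketch assumes the integral is approximated as the mean density times the cell volume; the function names are hypothetical, the second guard is added only for this sketch, and the real write_chgcar works on 3D arrays through pymatgen.

```python
def integrate(density, cell_volume):
    """Approximate the integral of rho over the cell on a uniform grid."""
    return sum(density) / len(density) * cell_volume


def rescale_to_electron_count(density, cell_volume, n_electrons):
    """Scale grid densities so their integral equals `n_electrons`."""
    if n_electrons <= 0:
        raise ValueError("n_electrons must be positive")
    current = integrate(density, cell_volume)
    if current <= 0:
        # Guard added for this sketch; avoids dividing by zero.
        raise ValueError("density integrates to a nonpositive electron count")
    factor = n_electrons / current
    return [value * factor for value in density]


rho = [0.1, 0.2, 0.3, 0.4]
scaled = rescale_to_electron_count(rho, cell_volume=8.0, n_electrons=6.0)
```

Here the original density integrates to 2 electrons, so every grid value is multiplied by 3 to reach the requested 6.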

  tests/test_salted_io.py
    9 TDD tests + 1 placeholder (skipped):
      Write: file exists and is nonempty, electron-count rescaling
      within 1e-4 relative, non-3D rejected, negative N rejected.
      Read: shape preserved, atom species preserved (multiset),
      cell preserved within 1e-6.
      Roundtrip: density write->read within VASP scientific-notation
      precision (rtol 1e-3, atol 1e-5).
      End-to-end: SALTEDModel.reconstruct_density piped into
      write_chgcar produces a readable file.
      VASP hook gate: pytest.importorskip on
      entalsim.dft.tasks.single_point, which auto-activates once
      Entalpic/entalsim PR #56 lands its PR 2 (StructureVASPSinglePoint
      maker).

Test plan

  9 passed + 1 skipped (entalsim gate). Full salted suite now
  57 passed + 1 skipped across 4 stacked PRs:
    PR alpha  19 tests on BasisSpec
    PR beta   14 tests on projection / reconstruction
    PR gamma  15 tests on SALTEDModel + metric integration
    PR delta  10 tests (9+1) on CHGCAR I/O + VASP hook gate
  Ruff format + check clean across all 8 source/test files.

The SALTED scaffold is now ready to consume a trained rholearn
checkpoint and produce VASP-ready CHGCARs end-to-end. Next steps
(separate stack): wire rholearn training on Adastra using the
LeMat-Rho parquet adapter; flip the entalsim hook gate to live when
PR 2 of the r2SCAN single-point stack lands.
Phase D1 (projection sanity check on 10 real LeMat-Rho rows) caught a
catastrophic failure mode: the orthonormal-approximation projection
that landed in PR beta produced 1068% NMAPE on the basis-set roundtrip
because the Gaussian basis functions overlap heavily (sigma ~= cutoff)
and the per-channel c_k = <B_k, rho> / <B_k, B_k> overcounts
contributions from neighboring basis functions.

Fix: build the full per-structure design matrix B_global of shape
(n_grid, n_atoms * n_coeffs_per_atom) and solve one least-squares
system for all atom coefficients simultaneously. The system is
overdetermined for our 10x10x10 grids (1000 > 4 atoms * 100 coeffs in
the typical LeMat-Rho cell) so lstsq returns the unique
minimum-residual fit.
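
The overcounting is easy to reproduce on a two-function toy problem. This sketch is not the projection code: it uses two wide 1D Gaussians with made-up centres and widths, and it solves the joint fit through the 2x2 normal equations rather than a general lstsq call.

```python
import math


def gaussian(center, sigma, xs):
    return [math.exp(-((x - center) ** 2) / (2.0 * sigma**2)) for x in xs]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


xs = [i / 49 for i in range(50)]
b1 = gaussian(0.3, 0.5, xs)  # two heavily overlapping basis functions
b2 = gaussian(0.7, 0.5, xs)
true_c = (1.0, 2.0)
rho = [true_c[0] * p + true_c[1] * q for p, q in zip(b1, b2)]

# Per-channel projection: treats the basis as orthogonal, so the overlap
# between b1 and b2 is counted in both coefficients.
naive = (dot(b1, rho) / dot(b1, b1), dot(b2, rho) / dot(b2, b2))

# Joint least squares via the 2x2 normal equations G c = r.
g11, g12, g22 = dot(b1, b1), dot(b1, b2), dot(b2, b2)
r1, r2 = dot(b1, rho), dot(b2, rho)
det = g11 * g22 - g12 * g12
joint = ((g22 * r1 - g12 * r2) / det, (g11 * r2 - g12 * r1) / det)
```

The joint solve recovers the true coefficients (1, 2) to floating-point accuracy, while the per-channel estimate inflates both because each coefficient absorbs the other function's contribution.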

After: basis-set ceiling on 10 random LeMat-Rho rows is

  NMAPE: 8.19% +/- 6.60%  (min 2.00%, max 22.67%)

vs

  NMAPE: 1068.81% +/- 109.42%  (orthonormal-approx)

Well within the 'proceed' band from the plan. Full per-sample numbers
are in the offline CSV at
salted_basis_sanity_check.csv (outside the repo).

Test plan

  57/57 tests in tests/test_salted_basis.py + test_salted_projection.py
  + test_salted_model.py + test_salted_io.py still pass with no
  changes to test contracts. Linearity, zero-in-zero-out, shape, dtype,
  single-atom peak position, all unaffected. LSQR is linear in rho so
  the linearity tests hold by construction.

  Ruff format + check clean.

The previous orthonormal-approx was documented in PR beta's commit as
a 'v1 stand-in' for proper LSQR; this lands the proper version. No
API change.
…irectory)

Phase D2 of the Adastra comparison plan. One-time job to project every
LeMat-Rho parquet row onto the locked SALTED basis, producing a
parallel parquet directory of basis coefficients that downstream
training loops (rholearn, Graph2Mat) consume.

What's here

  salted_ft/project_dataset.py
    project_chunk(in_path, out_path, basis_spec)
      Reads one LeMat-Rho format chunk, runs project_chgcar_to_basis
      on every valid row, writes a parallel chunk with this schema:
        row_index, material_id, n_atoms, atomic_numbers,
        lattice_vectors, n_electrons, grid_shape, coefficients,
        basis_set_NMAPE
      basis_set_NMAPE column is the per-row reconstruction error from
      project + reconstruct roundtrip; lets downstream training know
      the basis ceiling per sample.
    project_directory(input_dir, output_dir, basis_spec)
      Driver that loops over chunk_*.parquet files. Idempotent:
      existing nonempty output files are left untouched so an
      interrupted run can resume cheaply.
    CLI entry point so the Adastra job runs as
      uv run python -m salted_ft.project_dataset \
          --input-dir  ... --output-dir ...

  tests/test_salted_project_dataset.py
    9 TDD tests across 2 classes covering the contract:
      * file written, row count, all required columns present
      * per-row coefficient shape is (n_atoms, n_coeffs_per_atom)
      * basis_set_NMAPE finite + nonneg per row
      * material_id preserved if source has it
      * NULL charge_density rows in source are skipped (real
        LeMat-Rho has some failed extractions)
      * project_directory processes every chunk
      * second invocation is a no-op (idempotent resume)
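
The skip-existing resume logic can be sketched as below. This is an illustration under assumptions: `project_directory_sketch` takes a per-chunk callable instead of a basis spec, and the demo "projection" just copies bytes so the example runs without parquet files.

```python
import tempfile
from pathlib import Path


def project_directory_sketch(input_dir, output_dir, project_chunk):
    """Run `project_chunk(in_path, out_path)` for each chunk lacking output.

    Returns the names of the chunks that were actually processed.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for in_path in sorted(input_dir.glob("chunk_*.parquet")):
        out_path = output_dir / in_path.name
        if out_path.exists() and out_path.stat().st_size > 0:
            continue  # already written by an earlier run
        project_chunk(in_path, out_path)
        processed.append(in_path.name)
    return processed


def _copy_chunk(in_path, out_path):
    # Stand-in for the real per-chunk projection.
    out_path.write_bytes(in_path.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "in", Path(tmp) / "out"
    src.mkdir()
    for i in range(3):
        (src / f"chunk_{i}.parquet").write_bytes(b"placeholder")
    first_run = project_directory_sketch(src, dst, _copy_chunk)
    second_run = project_directory_sketch(src, dst, _copy_chunk)
```

The first call processes all three chunks and the second processes none, which is the behaviour the idempotence test pins.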

The script uses the LSQR projection landed in commit 22809b9; D1
sanity check (10 random LeMat-Rho rows) showed basis ceiling
8.19% +/- 6.60% NMAPE, well within the proceed band.

Test plan

  9/9 tests pass on the new file; full salted suite still 66 passed
  + 1 skipped after this. Ruff format + check clean on touched files.

Next: scp + run on Adastra against $SETUP/charge3net_data, expected
~30 min wall on a Genoa CPU node for 65k rows.
Genoa CPU partition, single node, 16 CPUs, 2 h wall (Adastra smoke
test of 1 chunk = 71 s, 69 chunks extrapolate to ~80 min).

Caps OMP_NUM_THREADS / OPENBLAS_NUM_THREADS / MKL_NUM_THREADS to
SLURM_CPUS_ON_NODE so numpy's BLAS-backed lstsq does not over-
subscribe the node (default behavior would spawn one thread per
hardware core regardless of allocation).
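
For illustration, the same cap expressed in Python rather than in the submit script. This is a sketch under assumptions: the function name cap_blas_threads and the fallback of one thread when SLURM_CPUS_ON_NODE is unset are not from the source, and the cap is only expected to take effect if it runs before numpy is first imported, since BLAS libraries typically read these variables when they load.

```python
import os

BLAS_THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def cap_blas_threads(env=None, fallback="1"):
    """Pin the BLAS thread-count variables to the SLURM CPU allocation.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    The fallback value is an assumption for runs outside SLURM.
    """
    env = os.environ if env is None else env
    n_cpus = env.get("SLURM_CPUS_ON_NODE", fallback)
    for var in BLAS_THREAD_VARS:
        env[var] = n_cpus
    return int(n_cpus)
```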

Idempotent via project_directory's skip-existing logic, so the job
can be requeued without paying the LSQR cost for chunks already
written.
Job 4977567 (LRU OOM fix in place) ran 2h41m and died from a NEW
failure mode: NCCL TCPStore "Broken pipe" on the DDP heartbeat
channel. Trace from .err:

  Failed to check the "should dump" flag on TCPStore,
  (maybe TCPStore server has shut down too early), with error: Broken pipe
  ...
  srun: error: g1132: tasks 1-3: Terminated

MaxRSS was 14 GB/task -- memory budget healthy, so the LRU fix is
solid. The new bug is inter-rank communication, not memory.

Adds four NCCL env vars to the submit script:
  NCCL_TIMEOUT=3600                       per-collective timeout
  NCCL_ASYNC_ERROR_HANDLING=1             clean shutdown on rank
                                          failure, no cascading hangs
  TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800   half-hour heartbeat tolerance
                                          (was the default ~600 sec)
  TORCH_NCCL_TRACE_BUFFER_SIZE=1000       larger trace buffer for the
                                          next crash post-mortem

Test plan
  9/9 tests in tests/test_submit_script.py still pass.
  Resubmit to validate end-to-end. If this still crashes from NCCL,
  fallback options are gloo backend or single-GPU runs.
Phase D3 of the Adastra comparison plan. Bridges our SALTED-arm
dense coefficient layout with rholearn's metatensor TensorMap layout
so the training loop in rholearn can consume LeMat-Rho data.

Layout mismatch resolved by this adapter

    Our layout (from project_chgcar_to_basis):
        atom -> n (radial) -> lambda -> mu
    rholearn's layout (from rholearn/utils/convert.py:_get_flat_index):
        atom -> lambda -> n (radial) -> mu

The reordering is a single per-atom permutation, independent of
species because our BasisSpec is uniform across all species in v1.
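
A minimal numpy sketch of that permutation, for a uniform basis with n_radial radial functions for every lambda from 0 to max_l. The function names here are hypothetical (they are not the adapter's API), and the sketch assumes the mu order inside each (n, lambda) block is identical in both layouts, which the description above does not state explicitly.

```python
import numpy as np


def n_major_to_l_major_perm(max_l, n_radial):
    """Index array mapping n -> lambda -> mu order to lambda -> n -> mu."""
    per_n = (max_l + 1) ** 2  # coefficients in one radial shell
    perm = []
    for lam in range(max_l + 1):
        lam_offset = lam * lam  # sum of (2l + 1) over l < lam
        for n in range(n_radial):
            start = n * per_n + lam_offset
            perm.extend(range(start, start + 2 * lam + 1))
    return np.asarray(perm, dtype=np.int64)


def dense_to_l_major(coeffs, perm):
    """(n_atoms, n_coeffs) in n-major order -> flat lambda-major vector."""
    return np.asarray(coeffs)[:, perm].reshape(-1)


def l_major_to_dense(flat, perm, n_atoms):
    """Inverse of dense_to_l_major."""
    flat = np.asarray(flat)
    out = np.empty((n_atoms, perm.size), dtype=flat.dtype)
    out[:, perm] = flat.reshape(n_atoms, perm.size)
    return out
```

Because the same index array is applied to every atom, the roundtrip is the identity by construction, which is the property the adapter's tests pin.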

What's here

  salted_ft/rholearn_adapter.py
    build_lmax_nmax(basis_spec, species)
      Expand uniform BasisSpec into rholearn's per-species lmax / nmax
      dicts (the form expected by convert.coeff_vector_ndarray_to_tensormap).
    dense_to_rholearn_flat(coeffs, basis_spec, symbols)
    rholearn_flat_to_dense(flat, basis_spec, symbols)
      The exact permutation between the two layouts. Roundtrip is the
      identity; pinned by tests.
    dense_to_tensormap(coeffs, basis_spec, symbols, positions, cell, structure_idx)
      Full path that calls rholearn's converter. Lazy-imports rholearn
      and metatensor so this module is importable without those deps.

  tests/test_salted_rholearn_adapter.py
    12 TDD tests across 4 classes:
      Build lmax/nmax dicts (species coverage, value match, key form,
                             total coefficient count matches)
      dense_to_rholearn_flat (output length, zero-in-zero-out, dtype,
                              per-atom block ordering)
      Roundtrip (single-atom, multi-atom, permutation-is-nontrivial)
      Full TensorMap (key names; skipped locally when sibling rholearn
                      missing -- auto-activates on Adastra)

Test plan

  77 passed + 2 skipped across the salted suite (previous 66 passed
  + 1 skipped, plus 12 new of which 1 skips locally). The 2 skips are
  forward-looking gates: one on the entalsim VASP single-point maker,
  one on the rholearn sibling repo. Both auto-activate as soon as
  their deps are reachable. Ruff format + check clean.

Next: D4 (rholearn training submit script that reads our projected
coefficients via this adapter, runs the metatensor-based training,
saves checkpoints). Will need a real Adastra job once D2's
projected-coefficient dataset is on disk.
Needed by tests/test_salted_rholearn_adapter.py (the metatensor
TensorMap conversion path uses both). Without them the
TestDenseToTensorMap class skips locally, which masks integration
breaks until they're caught at runtime on Adastra.

Pure-Python binary wheels exist on PyPI, no compilation needed.
Mirrors salted_ft's basis module for the Graph2Mat arm of the r2SCAN
density-model comparison. point_basis_for_species and
basis_table_for_species expand our uniform BasisSpec(max_l=4,
n_radial=4, cutoff=4.0) into Graph2Mat PointBasis objects with
basis=[4]*5 and basis_convention='spherical'. PointBasis.basis_size
is asserted equal to BasisSpec.n_coeffs_per_atom (100) so projected
coefficients stay loadable into Graph2Mat density matrices.
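
The size check can be reproduced with plain arithmetic. The sketch below does not touch Graph2Mat; the helper names are hypothetical and only restate the counting implied by BasisSpec(max_l=4, n_radial=4).

```python
def n_coeffs_per_atom(max_l, n_radial):
    """n_radial radial functions for each l, each with 2l + 1 values of m."""
    return n_radial * sum(2 * l + 1 for l in range(max_l + 1))


def radial_counts_per_l(max_l, n_radial):
    """One entry per l, matching the basis=[4]*5 form quoted above."""
    return [n_radial] * (max_l + 1)
```

For max_l=4 and n_radial=4 this gives 4 * (1 + 3 + 5 + 7 + 9) = 100 coefficients per atom.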

10 TDD tests pinning: type/R/basis_size/convention contracts,
one entry per l, species independence, and dedup behaviour of the
batch table builder.
CINES policy rejects explicit --partition= asks on the Genoa nodes,
so SLURM auto-routes based on the resource size. 16 CPUs/task lands
in the exclusive queue (long wait); 4 CPUs/task lands in shared and
starts almost immediately. The projection is BLAS-LSQR bound and
saturates 4 cores per chunk already, so the smaller ask costs no
wall time.
Path A of the Graph2Mat plan: keep the same regression target as
SALTED (per-atom basis coefficient vectors from salted_ft) and use
Graph2Mat as a different backbone over the same target.

graph2mat_ft.projection exposes:

* pack_coeffs_to_point_labels(coeffs, basis_spec, symbols) flattens
  (N_atoms, n_coeffs_per_atom) into atom-major point_labels.
* unpack_point_labels_to_coeffs is the inverse.
* make_basis_configuration bundles a structure into a
  graph2mat.BasisConfiguration so the training driver does not have
  to reach into graph2mat internals.

14 TDD tests pinning shape, dtype preservation, atom-major
ordering, within-atom channel order, length-mismatch ValueError
guards, and BasisConfiguration point_types indexing into the
species basis list.
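
A rough sketch of the atom-major packing and its inverse, including the length-mismatch guard. The names pack_atom_major and unpack_atom_major are hypothetical stand-ins with simplified signatures; they are not the functions in graph2mat_ft.projection.

```python
import numpy as np


def pack_atom_major(coeffs, n_coeffs_per_atom):
    """Flatten (n_atoms, n_coeffs_per_atom) so atom 0's block comes first."""
    coeffs = np.asarray(coeffs)
    if coeffs.ndim != 2 or coeffs.shape[1] != n_coeffs_per_atom:
        raise ValueError(
            f"expected shape (n_atoms, {n_coeffs_per_atom}), got {coeffs.shape}"
        )
    # C-order reshape keeps the within-atom channel order and the dtype.
    return coeffs.reshape(-1)


def unpack_atom_major(labels, n_atoms, n_coeffs_per_atom):
    """Inverse of pack_atom_major."""
    labels = np.asarray(labels)
    if labels.size != n_atoms * n_coeffs_per_atom:
        raise ValueError(
            f"expected {n_atoms * n_coeffs_per_atom} labels, got {labels.size}"
        )
    return labels.reshape(n_atoms, n_coeffs_per_atom)
```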
…mma)

Mirrors salted_ft.model.SALTEDModel. Stub mode (ckpt_path=None)
returns deterministic per-atom coefficients seeded off positions +
numbers + basis_spec via blake2b, so same structure in -> same
coefficients out and small perturbations to any atom change the
output. ckpt_path != None raises NotImplementedError until D6 wires
in the real Graph2Mat backbone, so the failure mode is loud rather
than silently returning stub output during benchmarking.

reconstruct_density(atoms, grid_shape) is the convenience entry
point for the VASP comparison pipeline.

Note: salted_ft.model uses int.from_bytes(seed_bytes[:16], ...),
which only seeds off atom 0. That is a separate bug of the same
kind, left alone here per the surgical-changes rule. Worth fixing
in its own patch.

10 TDD tests pinning shape, dtype, finiteness, determinism,
position-dependence, species-dependence, output magnitude, the
NotImplementedError gate for ckpt_path, and the reconstruct_density
shape contract.
graph2mat_ft.io re-exports read_chgcar / write_chgcar from
salted_ft.io so the two arms share a single implementation
(including the n_electrons rescaling that VASP ICHARG=1 needs).
Tests pin the identity of the re-exports so a future fix in
salted_ft.io automatically propagates.
Graph2Mat's native target is D_ab in an atom-centered basis. VASP
does not output that; we would have to invent a CHGCAR -> D_ab
projection (10^6 x 10^6 dense LSQR per structure, needs
matrix-free + neighbor-cutoff and its own quality validation).
Multi-week effort, no clear win for the SCF-speedup goal vs the
three arms already in flight.

The PointBasis adapter, projection helpers, model wrapper and
shared IO surface stay in tree as green-tested scaffolding so the
arm can be revived (with SIESTA training data, a matrix-free
projection, or a vector-output hijack) without rewriting from
zero.
scripts/density_model_eval.py loops over a LeMat-Rho-shaped test
parquet, runs the selected arm to predict the density on the
ground-truth grid, and writes per-row NMAPE / RMSE / NRMSE into an
output parquet. Importable for D8 (the comparison-table builder)
via evaluate_dataset(...).

Arm coverage in this alpha:

* salted: fully wired through SALTEDModel.reconstruct_density.
  Stub mode (no ckpt) works; real mode lights up when D6 (SALTED
  training driver) lands.
* charge3net, deepdft: dispatcher raises NotImplementedError with
  a TODO pointing at D7-beta (probe batching). This stops a future
  user from passing a real-arm name and silently getting stub metrics.
* unknown name: ValueError at the boundary.

Metrics are numpy-only on flat or 3D arrays (no probe-padding mask
needed because grid eval has no padding). 14 TDD tests pin metric
values, dispatcher contract, parquet schema (model, ckpt,
material_id, n_atoms, nmape, rmse, nrmse), finiteness, and the
--limit smoke-test path.
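
For reference, one possible numpy-only form of the three metrics. The exact definitions used in scripts/density_model_eval.py are not spelled out above, so everything here is an assumption: NMAPE is taken as 100 * sum|pred - ref| / sum|ref|, and NRMSE as RMSE divided by the RMS of the reference, which is only one of several common normalisations.

```python
import numpy as np


def nmape(pred, ref):
    """Normalised mean absolute percentage error (assumed definition)."""
    pred = np.asarray(pred, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    return float(100.0 * np.abs(pred - ref).sum() / np.abs(ref).sum())


def rmse(pred, ref):
    """Root-mean-square error over all grid points."""
    pred = np.asarray(pred, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    return float(np.sqrt(np.mean((pred - ref) ** 2)))


def nrmse(pred, ref):
    """RMSE divided by the RMS of the reference (assumed normalisation)."""
    ref = np.asarray(ref, dtype=np.float64)
    return rmse(pred, ref) / float(np.sqrt(np.mean(ref**2)))
```

All three reduce over every element, so they accept flat or 3D arrays alike as long as pred and ref share a shape.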
scripts/density_model_comparison_table.py concatenates one or more
D7 per-row eval parquets, groups by the model column, and emits a
per-arm summary (n, mean +/- std, median for NMAPE / RMSE /
NRMSE). Writes both a CSV (machine-readable) and a GitHub-flavour
markdown table (paste-into-PR).

build_comparison_table(inputs, csv_path, markdown_path) is
importable so a Lightning callback / pipeline step can call it
directly without spawning a subprocess. CLI driver provided for
ad-hoc use.
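
The group-and-summarise step can be sketched with the standard library alone. This is not build_comparison_table: it works on a list of dicts instead of parquet files, the function names are hypothetical, and it uses the population standard deviation, which may differ from the convention the script uses.

```python
import statistics
from collections import defaultdict


def summarise(rows, metric="nmape"):
    """Group per-row records by their 'model' field and summarise one metric."""
    by_model = defaultdict(list)
    for row in rows:
        by_model[row["model"]].append(row[metric])
    return {
        model: {
            "n": len(values),
            "mean": statistics.fmean(values),
            "std": statistics.pstdev(values),  # population std: an assumption
            "median": statistics.median(values),
        }
        for model, values in sorted(by_model.items())
    }


def to_markdown(summary, metric="nmape"):
    """Render the summary as a GitHub-flavoured markdown table."""
    lines = [
        f"| model | n | {metric} mean +/- std | {metric} median |",
        "| --- | --- | --- | --- |",
    ]
    for model, s in summary.items():
        lines.append(
            f"| {model} | {s['n']} | {s['mean']:.2f} +/- {s['std']:.2f} "
            f"| {s['median']:.2f} |"
        )
    return "\n".join(lines)
```

Rows from several sharded eval files for the same arm can simply be concatenated before calling summarise, since grouping is by the model field alone.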

10 TDD tests pin: per-arm grouping, mean / std / median values,
n_structures count, multi-file-per-arm aggregation (sharded eval),
markdown content and header structure, and the CSV + markdown
write paths.
The old int.from_bytes(seed_bytes[:16], ...) only consumed the
first 16 bytes of positions + numbers + spec, which is two-thirds
of atom 0's xyz and nothing else. Perturbing any atom past index
0 produced identical stub coefficients, silently collapsing
distinct structures into the same seed.

Switch to a blake2b(digest_size=16) hash over the full buffer so
every atom contributes. Same fix already in graph2mat_ft.model.

Regression test pins the multi-atom case: nudging atom 1 in a
two-atom Fe cell must change the predicted coefficients.
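
A small sketch contrasting the two seeding schemes. The buffer layout (float64 positions, then int64 atomic numbers, then a spec string), the little-endian byte order, and the function names are assumptions made for illustration; only the seed_bytes[:16] truncation and blake2b(digest_size=16) come from the description above.

```python
import hashlib

import numpy as np


def _seed_buffer(positions, numbers, spec_tag):
    """Concatenate positions + numbers + spec into one byte string."""
    return (
        np.ascontiguousarray(positions, dtype=np.float64).tobytes()
        + np.ascontiguousarray(numbers, dtype=np.int64).tobytes()
        + spec_tag.encode("utf-8")
    )


def truncated_seed(positions, numbers, spec_tag):
    """Old behaviour: only the first 16 bytes (x and y of atom 0) matter."""
    return int.from_bytes(_seed_buffer(positions, numbers, spec_tag)[:16], "little")


def hashed_seed(positions, numbers, spec_tag):
    """New behaviour: hash the whole buffer so every atom contributes."""
    digest = hashlib.blake2b(
        _seed_buffer(positions, numbers, spec_tag), digest_size=16
    ).digest()
    return int.from_bytes(digest, "little")
```

Either integer can be handed to np.random.default_rng to draw stub coefficients; only the hashed variant changes when an atom other than atom 0 moves.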
Wires the charge3net arm in scripts/density_model_eval.py. Builds
the full-grid input dict via charge3net's own
KdTreeGraphConstructor (so atom + probe edges match training),
batches probes through src.utils.predictions.split_batch, and
reshapes the concatenated forward output to (Nx, Ny, Nz).
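
The batch-then-reshape step, reduced to a self-contained sketch. It does not use charge3net's split_batch or graph constructor: predict_fn is a hypothetical callable that maps an array of flat probe indices to one density value per probe, and a C-order reshape is assumed.

```python
import numpy as np


def predict_full_grid(predict_fn, grid_shape, max_probe_batch):
    """Evaluate predict_fn on every grid point in batches, then reshape.

    predict_fn (hypothetical) takes a 1D array of flat probe indices and
    returns a 1D array of the same length.
    """
    n_probes = int(np.prod(grid_shape))
    probe_ids = np.arange(n_probes)
    chunks = []
    for start in range(0, n_probes, max_probe_batch):
        batch = probe_ids[start:start + max_probe_batch]
        chunks.append(np.asarray(predict_fn(batch)))
    # Concatenation order equals probe order, so a C-order reshape
    # puts each value back on its (ix, iy, iz) grid point.
    return np.concatenate(chunks).reshape(grid_shape)
```

Lowering max_probe_batch trades more forward passes for a smaller peak memory footprint per pass.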

predict_density now accepts an optional pre-loaded model so tests
inject a mock without going through ChargE3NetWrapper + a real
ckpt. The charge3net_ft.model import is forced for its sys.path
side effect (adds ../charge3net) so the data utilities resolve
even when the caller supplies the model directly.

Tests skip cleanly when the charge3net sibling repo is absent
(integration-only). Two new mock-model tests pin: full-grid shape
contract, value reshape order (constant predictions reproduce a
constant grid), and that lowering max_probe_batch increases the
forward-pass count.

DeepDFT branch still gated behind NotImplementedError (separate
forward signature, lands in D7-beta2).
speckhard added 7 commits May 26, 2026 15:30
DeepDFT is the upstream code charge3net forked, so the model
input-dict format is identical: probe_xyz, num_probes,
probe_edges, etc. _deepdft_predict_grid reuses charge3net's data
utilities to build the graph and split_batch to batch probes; the
DeepDFT-specific bits are:

* sys.path side effect from deepdft_ft.runner (adds ../DeepDFT
  and stubs asap3 when its C extension is unbuildable, as on
  Adastra).
* densitymodel.PainnDensityModel(num_interactions=3, node_size=128,
  cutoff=4.0) by default; toggle use_painn=False for SchNet.
* ckpt loading via torch.load with the "model" key wrapper.

Optional model= injection identical to charge3net so tests can
mock the network. Integration test skips when the DeepDFT sibling
repo is absent (this machine); runs on Adastra where it lives.
…p (D6)

Path B of the D6 plan: skip the rholearn integration (would need
multi-week Adastra-side iteration) and train a small
SchNet-style invariant message-passing net directly on D2's
per-atom basis coefficients. MSE loss; AdamW; gradient
accumulation per batch since per-structure forward is variable
size.

Architecture (salted_ft/train_baseline.py):
* Z embedding (nn.Embedding, max_z=120).
* GaussianRBF distance featurisation over neighbours within
  BasisSpec.cutoff.
* Two SchNet-style cfconv layers.
* Per-atom readout MLP -> BasisSpec.n_coeffs_per_atom.

Caveat: invariant model means l>0 channels of the SALTED basis
will be systematically wrong. This is a baseline; upgrade to
e3nn/MACE for proper equivariance if it under-performs.

SaltedTrainingDataset joins D2 source (cartesian_site_positions
column) and projected coefficients (training targets) by
row_index per matching chunk basename, since D2 output does not
carry positions.

submit_salted_baseline_adastra.sh: single-GCD MI250 job,
10 epochs, 24h walltime, ROCm env mirrored from the DeepDFT
submit.

8 TDD tests pinning: forward output shape, dtype, finiteness,
determinism, species-dependence (catches frozen Z embedding),
loss-decrease on a synthetic toy, save/load round-trip, and an
end-to-end train() call on a synthetic 2-row dataset.
Replaces _rholearn_predict (which only raised NotImplementedError)
with _baseline_predict: lazy-loads the SaltedBaselineModel from
the D6 ckpt format {basis_spec, model: state_dict}, caches it on
the wrapper, and forwards through torch.no_grad(). The result is
cast to float64 to match the stub-mode contract.

Removes the eager _ensure_rholearn_importable() check from
__init__ since the baseline path does not need the rholearn
sibling repo. The rholearn-faithful path was deferred (graph2mat
arm is parked, SALTED arm uses path B); when it comes back as a
follow-up we will dispatch on ckpt format inside _baseline_predict.

Two new tests: round-trip a baseline state_dict through SALTEDModel
and verify the predicted coefficients differ from the stub seed
(so we know the ckpt is actually driving inference), and assert a
clear RuntimeError on a malformed ckpt.
…izes

Job 5003891 OOM-killed (CPU RAM) at ~10 min: slurmstepd reported
"Detected 1 oom_kill event" with 64 GB budget. Root cause is the
data-buffer footprint, not a model or training-loop issue. The
upstream-DeepDFT defaults of RotatingPoolData(pool_size=20) +
num_workers=4 keep up to 80 full grids in RAM concurrently. For
QM9 (~50^3) and MP (~100^3) that is fine. LeMat-Rho's r2SCAN
CHGCARs have a long upper tail (200-300^3), and a single 300^3
sample is ~750 MB once density + grid_pos are materialised; a
handful of those in the pool blows past 64 GB.

Cut pool_size 20 -> 5 and num_workers 4 -> 2. Effective in-RAM
grid count drops 80 -> 10. Hyperparameters that affect training
quality (batch_size=2 materials, 1000 probes/material, learning
rate, etc.) are unchanged.

Verified locally: full test suite still green (195 pass).
…Flow (P4)

For each held-out test row, predicts the density via the chosen
arm (salted, charge3net, deepdft) using the existing
density_model_eval.predict_density, writes a CHGCAR with the
n_electrons rescaling salted_ft.io.write_chgcar applies, and
submits the paired baseline + predicted r2SCAN single-point
Flow via entalsim.dft.scf_speedup.make_scf_speedup_pair plus
entalsim.core.submit.submit_workflow.

Driver is dependency-injectable on the two entalsim callables
(make_pair_fn, submit_fn) so its tests pass locally without
entalsim installed; the CLI imports them at runtime via lazy
imports.

Fail-fast guards at run_experiment call time:
* charge3net or deepdft without --ckpt raises ValueError (those
  arms with no weights produce random-init predictions and waste
  HPC time)
* salted without --ckpt is allowed — stub mode is the documented
  fallback while D6 trained weights are pending
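
A minimal sketch of such a guard. The function name, the "stub" label it returns, and the unknown-arm check are illustrative assumptions; only the rule itself (charge3net and deepdft need a checkpoint, salted does not) comes from the list above.

```python
ARMS_REQUIRING_CKPT = frozenset({"charge3net", "deepdft"})
KNOWN_ARMS = ARMS_REQUIRING_CKPT | {"salted"}


def check_arm_and_ckpt(model, ckpt):
    """Validate the (arm, checkpoint) pair before any HPC time is spent.

    Returns a label for the checkpoint; "stub" is a hypothetical marker
    for the salted arm running without trained weights.
    """
    if model not in KNOWN_ARMS:
        raise ValueError(f"unknown arm: {model!r}")
    if model in ARMS_REQUIRING_CKPT and ckpt is None:
        raise ValueError(f"{model} requires --ckpt; refusing random-init weights")
    return "stub" if ckpt is None else str(ckpt)
```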

One per-row chgcar directory keyed by (model, material_id) so
multiple rows never share a CHGCAR file; make_scf_speedup_pair's
prev_dir mechanism receives the right directory.

9 TDD tests pinning: dry-run writes one CHGCAR per row + does
not submit; make_pair gets metadata with material_id + arm +
experiment; --limit caps rows processed; non-dry-run submits per
row with the right project + worker; submitted=True/False flag
appears on the returned records; charge3net + deepdft without
--ckpt fails fast; salted stub-mode ckpt label propagates;
per-row CHGCAR directories are unique.
…4 hardening)

Reviewer flagged two blockers on the multi-hour submit loop:
* a single bad row killed the batch and left already-submitted
  Flows on Mongo with no resume path
* no per-row logging meant a row-200 failure left no breadcrumb
  for diagnosis

This commit addresses both; a chgcar-dir contract nit follows later.

Per-row resilience:
* try/except Exception around the prediction + flow-build + submit
  body. A failed row records {"error": repr(e), "submitted": False}
  and the loop continues with the next row.

Resumable JSONL manifest:
* records stream to {chgcar_dir}/manifest.jsonl by default
  (overridable via --manifest) AFTER each row, in finally:, so an
  interrupted run leaves an inspectable record.
* --skip-existing reads the manifest at start and skips rows with
  submitted=True for THIS model. Failed rows (submitted=False) are
  always retried.
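
The per-row try/except, the finally-block manifest write, and the skip-existing filter fit together roughly as below. This is a sketch, not the driver's code: run_rows, load_submitted, process_row, and the record fields other than "error" and "submitted" are hypothetical.

```python
import json
from pathlib import Path


def load_submitted(manifest_path, model):
    """Material ids already recorded with submitted=True for this model."""
    done = set()
    path = Path(manifest_path)
    if not path.exists():
        return done
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("model") == model and record.get("submitted") is True:
            done.add(record["material_id"])
    return done


def run_rows(rows, model, manifest_path, process_row, skip_existing=False):
    """Process rows one by one; a failing row is recorded, not fatal."""
    done = load_submitted(manifest_path, model) if skip_existing else set()
    records = []
    for row in rows:
        material_id = row["material_id"]
        if material_id in done:
            continue
        record = {"model": model, "material_id": material_id, "submitted": False}
        try:
            record["submitted"] = bool(process_row(row))
        except Exception as exc:
            record["error"] = repr(exc)
        finally:
            # Append after every row so an interrupted run stays inspectable.
            with open(manifest_path, "a", encoding="utf-8") as fh:
                fh.write(json.dumps(record) + "\n")
            records.append(record)
    return records
```

Because failed rows are written with submitted=False, a rerun with skip_existing=True picks them up again while leaving successful rows alone.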

Observability:
* tqdm.auto wrapper on df_in.iterrows() with desc=
  f"scf_speedup({model_name})" -- visible progress bar without
  spamming the log.
* logger.info per row (material_id, arm, n_jobs, submitted) plus
  logger.exception on per-row failure for full traceback.
* main() configures basicConfig(level=INFO) so the CLI path emits
  logs straight to stderr.

5 new TDD tests:
* TestPerRowResilience: a corrupt positions cell in row 2 of 3
  fails that row only; the other two complete normally.
* TestManifest.test_manifest_jsonl_written_after_each_row: 3 rows
  -> 3 JSONL lines in the manifest.
* TestManifest.test_manifest_defaults_to_chgcar_dir: implicit
  manifest path lands at chgcar_dir/manifest.jsonl.
* TestSkipExisting.test_skip_existing_skips_already_submitted_rows:
  pre-populated manifest with submitted=True skips that row.
* TestSkipExisting.test_skip_existing_does_not_skip_failed_rows:
  submitted=False rows are retried, not skipped.

14 / 14 tests green; full suite green (204+ tests).
Reviewer flagged two further items worth addressing.

#4 CHGCAR directory layout
* was: chgcar_root / f"{model}__{material_id}/CHGCAR"
* now: chgcar_root / model / material_id / CHGCAR
* the flat layout would have been ambiguous for synthesised IDs
  containing the separator (e.g. "oqmd__1234"). Nested avoids that
  entirely and is also more ls-friendly when sweeping models.
* new test test_chgcar_layout_is_nested_by_model_then_material_id
  asserts the path tail.
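
To make the ambiguity concrete, a short pathlib sketch. Both helper names are hypothetical; the two layouts are the "was" and "now" forms listed above.

```python
from pathlib import Path


def flat_dir_name(model, material_id):
    """Old flat layout: one directory name joined with a double underscore."""
    return f"{model}__{material_id}"


def chgcar_path(chgcar_root, model, material_id):
    """New nested layout: chgcar_root / model / material_id / CHGCAR."""
    return Path(chgcar_root) / model / material_id / "CHGCAR"
```

With the flat form, ("salted", "oqmd__1234") and ("salted__oqmd", "1234") map to the same directory name, so the pair cannot be recovered by splitting on the separator. The nested form keeps model and material_id as separate path components.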

#8 Test-data realism
* the existing _toy_parquet uses 2-atom H2 cells with
  grid_shape=(4,4,4) and n_electrons=2.0 -- a missing n_electrons
  rescale, a positions-reshape bug, or a grid/atom mismatch would
  all pass silently.
* new TestRealisticRow.test_5_atom_asymmetric_grid_unequal_n_electrons
  exercises an FeO4 row with grid_shape=(8,10,12) and n_electrons=12.5
  != sum(Z). Catches mutations on the reshape and rescale paths.
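
A sketch of the kind of rescale such a test exercises. It relies on the VASP convention that CHGCAR grid values are density times cell volume, so their mean over the grid equals the electron count. The function name is hypothetical, and whether salted_ft.io.write_chgcar rescales in exactly this way is an assumption.

```python
import numpy as np


def rescale_to_n_electrons(chgcar_values, n_electrons):
    """Scale CHGCAR-convention grid values so they integrate to n_electrons.

    chgcar_values holds density * cell volume on the grid, so the grid
    mean is the current electron count.
    """
    values = np.asarray(chgcar_values, dtype=np.float64)
    current = values.mean()
    if current <= 0:
        raise ValueError("grid must carry a positive total charge")
    return values * (n_electrons / current)
```

On an asymmetric grid such as (8, 10, 12) with n_electrons=12.5, a skipped rescale or a transposed reshape shows up as a wrong mean or a wrong shape, which a cubic 4x4x4 grid with n_electrons equal to sum(Z) would not reveal.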

16 / 16 tests green; full suite green.