Architecture

MULTIFID-TH has two complementary halves: a pair of ETL pipelines that turn raw MOOSE outputs into ML-ready Zarr, and a generic training framework that consumes those Zarr stores through a small set of adapters. This page wires them together with the diagrams you need to navigate the codebase.

Two ETLs feed one trainer

        flowchart LR
    subgraph ETL_grid["cases/moose_grid (run_etl.py)"]
        Aex["MOOSE .e + CSV probes"]
        Aetl["ExodusDataSource → MOOSEDataTransformation → MOOSEZarrSink"]
        Aex --> Aetl
    end

    subgraph ETL_alpha["cases/alpha_d (run_etl.py)"]
        Bex["MOOSE .e + case_metadata.txt"]
        Betl["AlphaDSource → AlphaDTransformation → AlphaDZarrSink"]
        Bex --> Betl
    end

    Aetl --> Az["{sim_name}.zarr<br/>mesh · fields · grid · probes · norm_stats"]
    Betl --> Bz["{case}.zarr<br/>features [50,10] · targets [50,1] · sample_weight · metadata"]

    Az --> Train["train.py / evaluate.py<br/>(generic framework)"]
    Bz --> Train

    Train -->|"grid"| FNO["FNO · AFNO · Pix2Pix"]
    Train -->|"graph"| MGN["MeshGraphNet"]
    Train -->|"pointwise"| MLP["MLP (FullyConnected)"]
    Train -->|"profile"| Conv["Conv1DProfile"]

The orchestrator on either side is physicsnemo_curator.etl.ETLOrchestrator, which Hydra wires together from a single etl.yaml via _target_ keys. Both ETLs share the same Source → Transformation → Sink convention; only the inputs and per-step transforms differ.
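The Source → Transformation → Sink convention can be illustrated with a minimal, standard-library-only sketch. The class names (`ListSource`, `ScaleTransformation`, `MemorySink`), the method names, and the `run_pipeline` helper are hypothetical stand-ins chosen for illustration; they are not the `physicsnemo_curator` API and not the classes used by either case.

```python
from typing import Any, Dict, Iterable, List


class ListSource:
    """Hypothetical source: yields raw records (stands in for reading MOOSE outputs)."""

    def __init__(self, records: List[Dict[str, Any]]):
        self.records = records

    def read(self) -> Iterable[Dict[str, Any]]:
        yield from self.records


class ScaleTransformation:
    """Hypothetical transformation: rescales one field of each record."""

    def __init__(self, field: str, factor: float):
        self.field = field
        self.factor = factor

    def apply(self, record: Dict[str, Any]) -> Dict[str, Any]:
        out = dict(record)  # copy so the source record is left untouched
        out[self.field] = record[self.field] * self.factor
        return out


class MemorySink:
    """Hypothetical sink: collects records in memory (stands in for a Zarr writer)."""

    def __init__(self) -> None:
        self.written: List[Dict[str, Any]] = []

    def write(self, record: Dict[str, Any]) -> None:
        self.written.append(record)


def run_pipeline(source, transformations, sink):
    """Orchestrator stand-in: read each record, apply every transform in order, write."""
    for record in source.read():
        for transformation in transformations:
            record = transformation.apply(record)
        sink.write(record)
    return sink


sink = run_pipeline(
    ListSource([{"sim": "a", "T": 300.0}, {"sim": "b", "T": 350.0}]),
    [ScaleTransformation("T", 0.5)],
    MemorySink(),
)
print(sink.written)
```

In this sketch, swapping one case for another means swapping the three objects passed to `run_pipeline`; the loop itself does not change, which is the property the shared convention is meant to provide.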

See:

cases/moose_grid/ → MOOSE Grid case page and the ETL pipeline internals.
cases/alpha_d/ → Alpha-D case page and the Alpha-D surrogate tutorial.

Training framework: registry → adapter → dataset → runner

        flowchart TB
    subgraph Reg["Model registry (training/models/__init__.py)"]
        R1["mlp → pointwise"]
        R2["fno · afno · pix2pix → grid"]
        R3["meshgraphnet → graph"]
        R4["conv1d_profile → profile"]
    end

    subgraph Ad["Adapters (training/adapters.py)"]
        A1["GridAdapter<br/>GridPairDataset"]
        A2["GraphAdapter<br/>GraphPairDataset"]
        A3["PointwiseAdapter<br/>TabularPairDataset"]
        A4["ProfileAdapter<br/>AlphaDProfileDataset"]
    end

    subgraph Run["Runner (training/runner.py)"]
        Rn["train() / evaluate()<br/>splits · LR schedule · early stop · run_meta.json"]
    end

    Exp["Experiment hooks<br/>training_step / eval_step"] --> Rn

    R1 --> A3
    R2 --> A1
    R3 --> A2
    R4 --> A4
    A1 --> Rn
    A2 --> Rn
    A3 --> Rn
    A4 --> Rn

Between an FNO run and an MLP run, the only framework component that changes is the adapter, which the model's registry entry selects. The runner stays the same.
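The dispatch in the diagram above can be sketched as two lookups. The model-to-kind and kind-to-adapter pairs are taken from the diagram; the dictionary names, `resolve_adapter`, and `run` are hypothetical and only illustrate the idea, not the actual contents of training/models/__init__.py or training/runner.py.

```python
# Model name -> data kind, as listed in the registry diagram.
MODEL_TO_KIND = {
    "mlp": "pointwise",
    "fno": "grid",
    "afno": "grid",
    "pix2pix": "grid",
    "meshgraphnet": "graph",
    "conv1d_profile": "profile",
}

# Data kind -> adapter name, as listed in the adapters diagram.
KIND_TO_ADAPTER = {
    "grid": "GridAdapter",
    "graph": "GraphAdapter",
    "pointwise": "PointwiseAdapter",
    "profile": "ProfileAdapter",
}


def resolve_adapter(model_name: str) -> str:
    """Return the adapter name for a registered model (hypothetical helper)."""
    try:
        kind = MODEL_TO_KIND[model_name]
    except KeyError:
        known = ", ".join(sorted(MODEL_TO_KIND))
        raise ValueError(f"unknown model {model_name!r}; known: {known}") from None
    return KIND_TO_ADAPTER[kind]


def run(model_name: str) -> str:
    """Runner stand-in: the same code path regardless of which model is chosen."""
    adapter = resolve_adapter(model_name)
    return f"train({model_name}) via {adapter}"


for name in ("fno", "mlp", "meshgraphnet", "conv1d_profile"):
    print(run(name))
```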

Key files:

Registry — training.models (src/training/models/__init__.py).
Adapters — training.adapters.
Datasets — training.datasets, training.datasets_tabular, and the case-owned cases.alpha_d.datasets.profile.AlphaDProfileDataset.
Runner — training.runner.
Experiments — base training.experiment.Experiment; case-specific overrides live with the case (e.g. cases.alpha_d.experiment.AlphaDExperiment).

HPO orchestration

        flowchart LR
    Cfg["train_*.yaml<br/>(hpo section present)"] --> Gate{"hpo=null<br/>on CLI?"}
    Gate -- "yes" --> Direct["train() once<br/>on full pool"]
    Gate -- "no" --> Split["Hold out test set<br/>Split training pool<br/>→ inner train · validation"]
    Split --> Loop["Optuna study<br/>(TPE / pruner)"]
    Loop --> Trial["Trial: build model · train inner · score on val"]
    Trial --> Loop
    Loop --> Best["best_params.json<br/>best_config.yaml<br/>optimization_history.png · ..."]
    Best --> Retrain{"retrain_best?"}
    Retrain -- "true" --> Final["Retrain on full training pool<br/>evaluate on held-out test"]
    Retrain -- "false" --> End["End (HPO artifacts only)"]
    Direct --> Final

The held-out test set is never used during HPO trials. After optimization, when retrain_best is true, the best configuration is retrained on the full training pool and evaluated on the held-out test set as usual; otherwise the run ends with the HPO artifacts only. See the Hyperparameter Optimization guide for the full search-space format and artifact reference.
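The data discipline in this flow can be sketched without Optuna. The example below substitutes a plain random search for the Optuna study and a toy objective for real training, so everything here (`split_cases`, `toy_score`, `search`, the split fractions, the `lr` range) is an illustrative assumption. The point it shows is only that trials see the inner train and validation splits, while the test cases are set aside until the end.

```python
import random


def split_cases(cases, test_frac=0.2, val_frac=0.25, seed=0):
    """Hold out a test set, then split the remaining pool into inner train / validation."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_frac))
    test, pool = shuffled[:n_test], shuffled[n_test:]
    n_val = max(1, int(len(pool) * val_frac))
    val, inner_train = pool[:n_val], pool[n_val:]
    return inner_train, val, test, pool


def toy_score(lr, train_cases, val_cases):
    """Placeholder objective. A real trial would build a model, train it on
    train_cases, and return its validation loss on val_cases."""
    return (lr - 0.01) ** 2


def search(cases, n_trials=20, seed=0):
    """Random-search stand-in for the Optuna study; the test split is never scored."""
    inner_train, val, test, pool = split_cases(cases, seed=seed)
    rng = random.Random(seed)
    best_lr, best_score = None, float("inf")
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, -1)  # log-uniform sample
        score = toy_score(lr, inner_train, val)
        if score < best_score:
            best_lr, best_score = lr, score
    # The final retrain would use the whole pool; evaluation would use the test split.
    return {"best_lr": best_lr, "retrain_on": pool, "evaluate_on": test}


result = search([f"sim_{i:02d}" for i in range(20)])
print(len(result["retrain_on"]), len(result["evaluate_on"]))
```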

Per-case folder layout

Every case-specific concern lives in a self-contained src/cases/<case>/ folder, kept out of the generic training core.

        flowchart TB
    SRC["src/"] --> CASES["cases/"]
    SRC --> TRAIN["training/<br/>(generic: registry · adapters · runner · experiment)"]
    SRC --> DS["dataset/<br/>(MOOSEDataset public API)"]
    SRC --> ENT["train.py · evaluate.py"]

    CASES --> MG["moose_grid/"]
    CASES --> AD["alpha_d/"]
    CASES --> CP["case_pressure_drop/"]

    MG --> MGc["configs/<br/>etl_base · etl · train_fno"]
    MG --> MGe["etl/<br/>data_sources · transformations"]
    MG --> MGr["run_etl.py"]

    AD --> ADc["configs/<br/>train_mlp · train_conv1d · etl · pycaret"]
    AD --> ADds["datasets/<br/>profile.py"]
    AD --> ADe["etl/<br/>source · transform · sink"]
    AD --> ADp["physics/<br/>baseline · targets"]
    AD --> ADx["experiment.py<br/>feature_data · metrics · transforms"]
    AD --> ADr["run_etl.py · train.py"]

    CP --> CPc["configs/<br/>case_pressure_drop"]
    CP --> CPm["data · modeling · feature_selection · plotting"]
    CP --> CPr["run_case_pressure_drop.py"]

For newcomers: pick a case folder, open its README (alpha-D) or top-level entry script, and follow the imports outward. The training core does not import from cases/*; coupling flows in one direction only.
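A one-directional dependency rule like this can be checked mechanically. The following is a rough sketch, not a tool that ships with the repository: `find_case_imports` is a hypothetical helper that parses every Python file under a package directory and reports absolute imports whose top-level package is `cases`. The demo runs against a throwaway directory with made-up files, not against the real source tree.

```python
import ast
import tempfile
from pathlib import Path


def find_case_imports(package_dir: Path, forbidden_root: str = "cases"):
    """Return (file name, line, module) for each absolute import of forbidden_root."""
    violations = []
    for py_file in sorted(package_dir.rglob("*.py")):
        tree = ast.parse(py_file.read_text(), filename=str(py_file))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                if name.split(".")[0] == forbidden_root:
                    violations.append((py_file.name, node.lineno, name))
    return violations


# Demo on a temporary stand-in for src/training/ with one offending file.
with tempfile.TemporaryDirectory() as tmp:
    training = Path(tmp) / "training"
    training.mkdir()
    (training / "runner.py").write_text("import json\n")
    (training / "bad.py").write_text("from cases.alpha_d import experiment\n")
    violations = find_case_imports(training)

print(violations)
```

Note that a static scan of this kind only sees literal import statements; imports resolved at run time from a dotted path in a config file would not be detected.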

`run_meta.json` round-trip

        sequenceDiagram
    autonumber
    participant T as train.py
    participant R as runner.train()
    participant FS as Filesystem
    participant E as evaluate.py
    participant Rev as runner.evaluate()

    T->>R: load config · build adapter · build dataset · split
    R->>FS: write model.mdlus
    R->>FS: write run_meta.json<br/>(dataset args · split sims · adapter · model entrypoint · params)
    Note over FS: checkpoint + meta land together

    E->>FS: read run_meta.json (next to ckpt)
    E->>Rev: reconstruct dataset · split · target transform · model
    Rev->>FS: read model.mdlus
    Rev-->>E: per-field MSE / RMSE on held-out test cases
    E->>FS: (optional) metrics.json · plots

Writing run_meta.json next to the checkpoint is the single invariant that lets evaluate.py reproduce training conditions exactly without re-passing every flag. Don’t move or rename run_meta.json without updating both train.py and evaluate.py.
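The round-trip can be sketched with the standard library alone. The file names model.mdlus and run_meta.json come from the diagram above; the helper names (`write_run`, `load_meta`) and the JSON keys are hypothetical placeholders for the categories the diagram lists (dataset args, split, adapter, model entrypoint, params), and the checkpoint is a dummy byte string rather than a real model.

```python
import json
import tempfile
from pathlib import Path


def write_run(run_dir: Path, meta: dict) -> Path:
    """Training side: write the checkpoint and its metadata into the same folder."""
    run_dir.mkdir(parents=True, exist_ok=True)
    ckpt = run_dir / "model.mdlus"
    ckpt.write_bytes(b"placeholder checkpoint bytes")  # not a real checkpoint
    (run_dir / "run_meta.json").write_text(json.dumps(meta, indent=2))
    return ckpt


def load_meta(ckpt: Path) -> dict:
    """Evaluation side: locate run_meta.json next to the checkpoint and parse it."""
    meta_path = ckpt.parent / "run_meta.json"
    if not meta_path.exists():
        raise FileNotFoundError(f"expected {meta_path} next to {ckpt.name}")
    return json.loads(meta_path.read_text())


# Hypothetical metadata; the real key names are not specified on this page.
meta_in = {
    "dataset_args": {"data_dir": "data/zarr"},
    "split": {"train": ["sim_01", "sim_02"], "test": ["sim_03"]},
    "adapter": "grid",
    "model_entrypoint": "fno",
    "params": {"lr": 0.001},
}

with tempfile.TemporaryDirectory() as tmp:
    ckpt = write_run(Path(tmp) / "run_001", meta_in)
    meta_out = load_meta(ckpt)

print(meta_out["split"]["test"])
```

Because the evaluation side derives the metadata path from the checkpoint path, moving only one of the two files breaks the lookup, which is why the two entry points have to change together.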

Vendored PhysicsNeMo fallback

training/__init__.py exposes import_physicsnemo_module / import_physicsnemo_attr. If physicsnemo is not installed, those helpers add the checked-out submodule at physicsnemo/ to sys.path and retry the import. Training code therefore works in both the etl-dev image (no PhysicsNeMo) and the etl / etl-gpu / etl-ngc images (PhysicsNeMo installed from PyPI or NGC); in the fallback case, the physicsnemo submodule must be initialized.
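The try-then-fall-back pattern described here can be sketched as follows. `import_with_fallback` is a hypothetical helper written for illustration; the actual signatures and behavior of import_physicsnemo_module and import_physicsnemo_attr are not shown on this page and may differ (for example, in where the path is placed on sys.path). The demo uses a throwaway module in a temporary directory instead of PhysicsNeMo.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_with_fallback(module_name: str, fallback_dir: Path):
    """Import module_name; if that fails, add fallback_dir to sys.path and retry once."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        if not fallback_dir.is_dir():
            raise  # nothing to fall back to, e.g. the submodule was never initialized
        path = str(fallback_dir)
        if path not in sys.path:
            # Appending (rather than prepending) is an assumption of this sketch.
            sys.path.append(path)
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# Demo: a module that is not installed but exists in a "vendored" directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "fake_vendored_pkg.py").write_text("VALUE = 42\n")
    module = import_with_fallback("fake_vendored_pkg", Path(tmp))
    value = module.VALUE

print(value)
```

When the package is already installed, the first import succeeds and the fallback directory is never consulted, so the same call works in both kinds of environment.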