ETL Pipeline¶
The ETL pipeline converts raw MOOSE simulation outputs (Exodus + CSV) into compressed Zarr stores that are ready for ML training.
Architecture¶
The pipeline follows the PhysicsNeMo Curator ETL pattern:
ExodusDataSource ──► MOOSEDataTransformation ──► MOOSEZarrSink
│ │
│ reads .e + CSV writes .zarr │
└──────────────────────────────────────────────────┘
ETLOrchestrator (multiprocessing)
Component |
Class |
File |
|---|---|---|
Source |
|
|
CSV reader |
|
|
Transform |
|
|
Sink |
|
|
Schemas |
|
|
Pipeline stages¶
Stage 1 — ExodusDataSource¶
Reads one Exodus .e file and its associated CSV line-probe files.
Exodus extraction:
Node coordinates (
coordx,coordy, optionallycoordz) →coords [N, D]Element connectivity (
connect1) — converted from 1-indexed to 0-indexed →connectivity [E, K]Time values (
time_whole) →time_steps [T]Element variable names (decoded from
name_elem_var) →field_namesElement variable arrays (
vals_elem_var{i}eb1) →fields [T, E, F]
CSV co-reading (CSVProbeSource):
Looks for files matching {sim_name}_out_{probe_name}_{timestep:04d}.csv in the
configured data directory. Each probe file is read into a [Np, C] array where
Np is the number of probe points and C is the number of columns
(TKE, TKED, id, pressure, vel_x, vel_y, x, y, z).
The Exodus and CSV files use independent filename prefixes and are not required to match.
Stage 2 — MOOSEDataTransformation¶
Three sub-steps applied in order:
2a. Per-field z-score normalization
normalized = (field - mean) / (std + eps)
Statistics are computed per field across all time steps and elements.
mean and std are stored in the output Zarr for later denormalization.
2b. Graph edge construction
For each element, all pairs of its nodes are connected in both directions (undirected graph). Duplicate edges (shared between adjacent elements) are deduplicated.
For a quad mesh with E elements and K=4 nodes per element this produces
C(4,2) × 2 × E candidate edges before deduplication.
2c. Regular-grid interpolation
Element centroids (mean of node positions) are scattered onto a uniform
Nx × Ny grid using scipy.interpolate.griddata (linear method, fill_value=0).
The default resolution is 64 × 64.
Stage 3 — MOOSEZarrSink¶
Writes one Zarr store per simulation with Blosc/zstd compression (level 3, shuffle enabled). Chunk sizes target ~1 MB per chunk.
Output Zarr layout¶
{sim_name}.zarr/
├── mesh/
│ ├── coords float32 [N, D] node coordinates
│ ├── connectivity int32 [E, K] element→node (0-indexed)
│ ├── edge_src int32 [M] graph edge source nodes
│ └── edge_dst int32 [M] graph edge destination nodes
├── fields/
│ ├── pressure float32 [T, E] normalized element pressure
│ ├── vel_x float32 [T, E] normalized x-velocity
│ └── vel_y float32 [T, E] normalized y-velocity
├── probes/
│ └── {probe_name} float32 [Np, C] CSV line-probe values (raw)
├── grid/
│ ├── x float32 [Nx] column x-coordinates
│ ├── y float32 [Ny] row y-coordinates
│ ├── pressure float32 [T,Nx,Ny] interpolated pressure
│ ├── vel_x float32 [T,Nx,Ny] interpolated x-velocity
│ └── vel_y float32 [T,Nx,Ny] interpolated y-velocity
└── metadata/
├── time_steps float32 [T]
├── attrs: field_names (JSON), probe_columns (JSON), sim_name
└── norm_stats/
├── pressure/ attrs: mean, std
├── vel_x/ attrs: mean, std
└── vel_y/ attrs: mean, std
Configuration¶
The pipeline is configured via Hydra. The shared base template lives at
src/cases/moose_grid/configs/etl_base.yaml and the canonical lid-driven
variant at src/cases/moose_grid/configs/etl.yaml. Any key can be overridden on the CLI.
python cases/moose_grid/run_etl.py \
etl.source.input_dir=../data \
etl.source.data_dir=../data \
etl.sink.output_dir=../data/processed \
etl.processing.num_processes=8 \
etl.transformations.moose_transform.grid_nx=128 \
etl.transformations.moose_transform.grid_ny=128 \
etl.sink.compression_level=6
Full config reference¶
Key |
Default |
Description |
|---|---|---|
|
|
Worker processes for parallel file processing |
|
(required) |
Directory containing Exodus |
|
(required) |
Directory containing CSV probe files |
|
|
Grid columns for CNN output |
|
|
Grid rows for CNN output |
|
|
Stability epsilon for normalization |
|
(required) |
Output directory for Zarr stores |
|
|
Overwrite existing Zarr stores |
|
|
Blosc compression level (1–9) |
|
|
Blosc codec name |
|
|
Target chunk size in MB |
Verified output (lid-driven-segregated_out)¶
mesh/coords (10201, 2) 101×101 nodes on a unit square
mesh/connectivity (10000, 4) 10 000 quad elements
mesh/edge_src/dst (80400,) graph edges (deduped, both directions)
fields/* (2, 10000) 2 time steps × 10 000 elements
grid/* (2, 64, 64) 2 time steps × 64×64 interpolated grid