Case Distribution Analysis

This guide covers the analyze_case_distribution.py preprocessing tool, which previews how simulation cases are distributed across the three design parameters (Dr, Re, Lr) before training.

Why it matters

The alpha-D and case-level pressure-drop surrogates generalise only as well as their training support permits. Bins with few training samples produce unreliable predictions regardless of model capacity. Running this tool up-front answers:

Do I have enough cases in each Dr / Re / Lr bin?
Which bins will my min_Dr / exclude_cases filter remove?
Does a recorded train / test split cover every bin?

It inspects the Zarr directory (and optionally a run_meta.json) and prints coloured tables so under-supported regions are obvious.

How it works

data/flow_contraction_expansion/parametric_study/processed/
  Re_*__Dr_*__Lr_*.zarr        <-- discovered by the tool
                                    (parsed from the case name)

data/models/.../run_meta.json   <-- optional; when provided,
                                    Train/Test columns are populated
                                    from split.train_sims / test_sims
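As a rough illustration of the discovery step, the sketch below turns case names of the form Re_*__Dr_*__Lr_* into numeric parameters. It assumes that `p` stands in for the decimal point (as in Dr_0p05), and the helper names parse_case_name and discover_cases are hypothetical; the tool's own parser may behave differently.

```python
import re
from pathlib import Path

# Assumed naming convention: Re_<number>__Dr_<number>__Lr_<number>,
# where 'p' replaces the decimal point (e.g. 0p05 -> 0.05).
_CASE_RE = re.compile(
    r"^Re_(?P<Re>[0-9p]+)__Dr_(?P<Dr>[0-9p]+)__Lr_(?P<Lr>[0-9p]+)$"
)


def parse_case_name(stem):
    """Parse a case stem into a dict of floats keyed by 'Re', 'Dr', 'Lr'."""
    match = _CASE_RE.match(stem)
    if match is None:
        raise ValueError(f"unrecognised case name: {stem!r}")
    return {
        key: float(value.replace("p", "."))
        for key, value in match.groupdict().items()
    }


def discover_cases(zarr_dir):
    """Map every matching *.zarr store under zarr_dir to its parameters."""
    stores = sorted(Path(zarr_dir).glob("Re_*__Dr_*__Lr_*.zarr"))
    return {store.stem: parse_case_name(store.stem) for store in stores}
```

For example, parse_case_name("Re_7722__Dr_0p05__Lr_0p052") yields Re = 7722.0, Dr = 0.05 and Lr = 0.052.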

Support thresholds (based on the train count when a split is provided, or the total count otherwise):

Marker	Case count	Meaning
`✗ none`	0	Bin will not be learned at all
`⚠ very low`	1–2	Extreme extrapolation risk
`⚠ low`	3–9	Generalisation in this bin is unreliable
`◦ ok`	10–29	Usable but watch for drift
`✓ good`	≥ 30	Adequate support
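The thresholds can be read as a simple cascade. The following is a minimal sketch of that reading, not the tool's implementation; the function names classify_support and support_count are made up for illustration.

```python
from typing import Optional


def support_count(total: int, train: Optional[int] = None) -> int:
    """Use the train count when a split is available, else the total count."""
    return total if train is None else train


def classify_support(count: int) -> str:
    """Map a case count to the support marker from the table above."""
    if count <= 0:
        return "✗ none"
    if count < 3:
        return "⚠ very low"
    if count < 10:
        return "⚠ low"
    if count < 30:
        return "◦ ok"
    return "✓ good"
```

Note that a bin with 12 cases in total but only 2 in the training split is classified on the 2, so it is reported as ⚠ very low once a split is supplied.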

Quick start

From inside the container:

cd src && python analyze_case_distribution.py \
    --run-meta ../data/models/case_pressure_drop/run_meta.json

From the host with Apptainer:

apptainer exec multifid-th-gpu.sif bash -c \
    'cd src && python analyze_case_distribution.py \
        --run-meta ../data/models/case_pressure_drop/run_meta.json'

Usage examples

Inspect the raw Zarr directory (before training)

cd src && python analyze_case_distribution.py \
    --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed

Reports the whole dataset in a single Total column. Use this to check the raw simulation inventory.

Preview filters that will be applied during training

cd src && python analyze_case_distribution.py \
    --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed \
    --min-Dr 0.333

Mirrors the filtering logic in TabularPairDataset. Useful when deciding the data.min_Dr value in cases/alpha_d/configs/train_mlp.yaml: run it with different thresholds and see which bins disappear.

Exclude specific problematic cases

cd src && python analyze_case_distribution.py \
    --zarr-dir ../data/flow_contraction_expansion/parametric_study/processed \
    --exclude Re_11927__Dr_0p05__Lr_0p052 \
    --exclude Re_7722__Dr_0p05__Lr_0p052

--exclude can be repeated to drop any number of case names. Each name must match a Zarr store's stem exactly (the store name without the .zarr suffix); partial names and wildcards are not matched.
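The two filters (--min-Dr and --exclude) can be pictured as below. This is an illustrative sketch under the assumption that cases are held as a name-to-parameters mapping; it is not the code in TabularPairDataset, and filter_cases is a hypothetical name. The third case name in the sample data is invented for the example.

```python
def filter_cases(cases, min_dr=None, exclude=()):
    """Return the cases that survive the min-Dr and exclude filters."""
    excluded = set(exclude)
    kept = {}
    for name, params in cases.items():
        if name in excluded:
            continue  # exact name match only
        if min_dr is not None and params["Dr"] < min_dr:
            continue  # Dr strictly below the threshold is dropped
        kept[name] = params
    return kept


cases = {
    "Re_11927__Dr_0p05__Lr_0p052": {"Re": 11927.0, "Dr": 0.05, "Lr": 0.052},
    "Re_7722__Dr_0p05__Lr_0p052": {"Re": 7722.0, "Dr": 0.05, "Lr": 0.052},
    "Re_7722__Dr_0p5__Lr_0p052": {"Re": 7722.0, "Dr": 0.5, "Lr": 0.052},  # invented
}

after_min_dr = filter_cases(cases, min_dr=0.333)
after_exclude = filter_cases(cases, exclude=["Re_7722__Dr_0p05__Lr_0p052"])
```

Here after_min_dr keeps only the Dr = 0.5 case, while after_exclude keeps the two cases that were not named.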

Inspect a prior train / test split

cd src && python analyze_case_distribution.py \
    --run-meta ../data/models/case_pressure_drop/run_meta.json

Populates Train and Test columns from the recorded split and classifies Support on the train count. If --zarr-dir is omitted, the tool reads data.zarr_dir from run_meta.json.
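For orientation, reading the recorded split might look like the sketch below. It assumes that split.train_sims and split.test_sims are lists of case names and that data.zarr_dir is a path string; the exact layout of run_meta.json is not shown in this guide, and load_split is a hypothetical helper.

```python
import json
from pathlib import Path


def load_split(run_meta_path, zarr_dir=None):
    """Return (train names, test names, zarr_dir) from a run_meta.json file.

    An explicit zarr_dir takes precedence; otherwise fall back to the
    data.zarr_dir entry recorded in the file.
    """
    meta = json.loads(Path(run_meta_path).read_text())
    split = meta.get("split", {})
    train = set(split.get("train_sims", []))
    test = set(split.get("test_sims", []))
    if zarr_dir is None:
        zarr_dir = meta["data"]["zarr_dir"]
    return train, test, zarr_dir
```

Cases present in the Zarr directory but in neither set would then count towards Total only.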

Restrict to a subset of axes

cd src && python analyze_case_distribution.py \
    --run-meta ../data/models/case_pressure_drop/run_meta.json \
    --axes Dr

Useful when you only care about one parameter (e.g. diagnosing poor performance at large Dr).

CLI reference

Flag	Default	Description
`--zarr-dir`	`data.zarr_dir` from `run_meta.json`, if provided	Directory of processed `*.zarr` case stores
`--run-meta`	`null`	`run_meta.json` to read a recorded train / test split
`--min-Dr`	`null`	Drop cases whose `Dr` is below this value
`--exclude`	`[]`	Case name to exclude (repeatable)
`--axes`	`Dr Re Lr`	Which parameter axes to report

At least one of --zarr-dir or --run-meta must be provided.
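To make the flag semantics concrete, here is a hedged argparse sketch that mirrors the table above. It is not the tool's actual parser: the choice of action="append" for --exclude and nargs="+" for --axes is an assumption consistent with the usage examples, and build_parser and parse_args are illustrative names.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="analyze_case_distribution.py")
    parser.add_argument("--zarr-dir", default=None,
                        help="directory of processed *.zarr case stores")
    parser.add_argument("--run-meta", default=None,
                        help="run_meta.json with a recorded train / test split")
    parser.add_argument("--min-Dr", dest="min_Dr", type=float, default=None,
                        help="drop cases whose Dr is below this value")
    parser.add_argument("--exclude", action="append", default=[],
                        help="case name to exclude (repeatable)")
    parser.add_argument("--axes", nargs="+", choices=("Dr", "Re", "Lr"),
                        default=["Dr", "Re", "Lr"],
                        help="which parameter axes to report")
    return parser


def parse_args(argv=None):
    """Parse argv and enforce that a data source was given."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.zarr_dir is None and args.run_meta is None:
        parser.error("at least one of --zarr-dir or --run-meta must be provided")
    return args
```

With this sketch, calling parse_args([]) exits with a usage error, matching the rule stated above.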

Output sections

Header panel

Summarises the total case count, the Zarr directory in use, and the train / test split (when a run_meta.json is supplied).

Per-axis distribution tables

One table per axis (Dr, Re, Lr). Columns:

Axis value (e.g. Dr = 0.900) – rounded to 3 decimals, except Re which is shown as an integer.
Train / Test – present only when --run-meta is provided.
Total – the number of cases in the bin, counting train, test, and any other cases found in the Zarr directory.
Support – coloured marker classifying the training support.

Bins flagged ⚠ very low or ✗ none are likely to show outsized evaluation errors. Cross-reference them with your evaluation-metrics review workflow to confirm.
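The per-axis tables boil down to counting cases per rounded axis value. The sketch below shows one way to do that under the rounding rule described above (3 decimals, integers for Re); count_by_axis is a hypothetical name and does not reflect the signature of the tool's own binning function.

```python
from collections import Counter


def count_by_axis(cases, axis, train=frozenset(), test=frozenset()):
    """Count train / test / total cases per value of one axis.

    cases maps case name -> {'Dr': ..., 'Re': ..., 'Lr': ...}.
    """
    def bin_key(params):
        value = params[axis]
        # Re is reported as an integer, the other axes to 3 decimals.
        return int(round(value)) if axis == "Re" else round(value, 3)

    total, n_train, n_test = Counter(), Counter(), Counter()
    for name, params in cases.items():
        key = bin_key(params)
        total[key] += 1
        if name in train:
            n_train[key] += 1
        if name in test:
            n_test[key] += 1
    return {
        key: {"train": n_train[key], "test": n_test[key], "total": total[key]}
        for key in sorted(total)
    }
```

Each row of the result corresponds to one table row; the Support marker would then be derived from the train count when a split is supplied, or from the total otherwise.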

Typical workflow

Before the first ETL→training pass, run with --zarr-dir only to see the raw simulation inventory. Look for bins with fewer than 3 cases and decide whether to gather more simulations or drop them.
Before each HPO run, run with --zarr-dir plus the --min-Dr and --exclude values from your config. Confirm you still have ◦ ok or better support in every bin you care about.
After a training run, run with --run-meta to verify the stratified split gave each bin at least one train and one test case.
When diagnosing a worst-case list (see evaluate_case_pressure_drop.py output), look up the failing cases’ Dr / Re / Lr in this table. If they land in a ⚠ low-support bin, the fix is more data, not a different model.
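The post-training check in the workflow above (every bin has at least one train and one test case) can be automated along these lines. This is a sketch only: it assumes a per-axis table held as a dict of value -> counts, and uncovered_bins is a made-up helper, not part of the tool.

```python
def uncovered_bins(table):
    """List bins that lack a train case, a test case, or both.

    table maps an axis value to {'train': int, 'test': int, ...}.
    Returns (value, [missing sides]) pairs in table order.
    """
    problems = []
    for value, row in table.items():
        missing = [side for side in ("train", "test") if row[side] == 0]
        if missing:
            problems.append((value, missing))
    return problems


# Illustrative counts, not real results.
dr_table = {
    0.05: {"train": 4, "test": 1},
    0.333: {"train": 2, "test": 0},
    0.9: {"train": 0, "test": 0},
}
gaps = uncovered_bins(dr_table)
```

An empty result means the split covers every bin on that axis; any entry points at a bin to revisit before trusting evaluation metrics there.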

Adding a new axis

The axis set is currently hard-coded to ("Dr", "Re", "Lr") to match the case-name convention (Re_*__Dr_*__Lr_*). If you add a new design parameter to the simulation campaign:

Extend the case-name pattern in the ETL.
Update parse_case_params in src/case_pressure_drop/distribution.py to extract the new key.
Add the key to the AXES tuple and to the axis index maps inside bin_by.
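One way to keep these three steps in sync is to derive both the name pattern and the index map from a single tuple. The sketch below illustrates that idea only; it is not the structure of distribution.py. The key `Xx` is a placeholder for a new parameter, appending it to the end of the case name is an assumption, and all function names are hypothetical.

```python
import re

NAME_ORDER = ("Re", "Dr", "Lr")        # order of keys in the case name
NEW_NAME_ORDER = NAME_ORDER + ("Xx",)  # 'Xx' is a placeholder key


def build_case_pattern(axes):
    """Build a regex for names like Re_<v>__Dr_<v>__Lr_<v> from the axis tuple."""
    body = "__".join(rf"{axis}_(?P<{axis}>[0-9p]+)" for axis in axes)
    return re.compile(rf"^{body}$")


def parse_case_name(stem, axes):
    """Parse a stem into floats, assuming 'p' encodes the decimal point."""
    match = build_case_pattern(axes).match(stem)
    if match is None:
        raise ValueError(f"unrecognised case name: {stem!r}")
    return {axis: float(match.group(axis).replace("p", ".")) for axis in axes}


def axis_index(axes):
    """Map each axis key to its position, e.g. for column lookups."""
    return {axis: position for position, axis in enumerate(axes)}


params = parse_case_name("Re_100__Dr_0p5__Lr_0p1__Xx_2p5", NEW_NAME_ORDER)
```

With this layout, adding a parameter means editing one tuple; names written under the old convention then fail loudly instead of being silently mis-binned.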