feature_selection
Shared feature-analysis helpers used by alpha-D and case-pressure-drop.
Generic feature-analysis data container.
Case-specific feature loaders (e.g.
cases.alpha_d.feature_data.load_feature_matrix) materialise feature
matrices as this dataclass; the generic
feature_selection.pycaret_selection consumes it. The dataclass itself
is alpha-D-agnostic — it just carries the per-row feature matrix, the
target, the case grouping, and bookkeeping for the case-level CV split.
- class feature_selection.data.FeatureAnalysisData(X, y, groups, feature_names, target_name, case_ids, rows_per_case, local_velocity_normalization)
Bases: object
- groups: numpy.ndarray
PyCaret-based feature selection (generic, case-driven).
Callers pass a FeatureAnalysisData (whose construction stays
case-side, so the case enforces its own ALLOWLIST upstream) together
with the post-selection allowlist to validate against.
V1 contract
- The DataFrame handed to PyCaret is built exclusively from a FeatureAnalysisData instance. PyCaret never reads Zarr directly, so the caller-supplied allowlist remains the single guard against target-adjacent columns leaking back into the selected feature set.
- setup() locks polynomial_features, feature_interaction, pca, and group_features off. Those settings would synthesize columns that TabularPairDataset cannot reproduce from the Zarr stores, which silently breaks the handoff to PhysicsNeMo training via data.input_columns_file.
- Case-level train/test split runs before setup(). Row-level holdout inside PyCaret would place rows from the same case into both splits (rows inside a case are spatially correlated, so that leaks). The pre-split test frame is passed through via test_data.
- Inside setup(), fold_strategy='groupkfold' with fold_groups='case_id' keeps internal CV group-safe.
- Output selected_features.txt is one name per line, no header, drop-in for data.input_columns_file in the MLP training config.
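The "locked off" rule in the contract above could be guarded with a small check like the one below. This is only a sketch: the four setting names come from the contract text, but the helper name check_locked_settings, the LOCKED_OFF constant, and the choice of ValueError are assumptions for illustration and are not taken from the library.

```python
# Hypothetical guard; only the four setting names come from the contract text.
LOCKED_OFF = ("polynomial_features", "feature_interaction", "pca", "group_features")


def check_locked_settings(pycaret_cfg):
    """Raise ValueError if the config tries to enable a locked setting.

    A setting counts as enabled when its value is truthy, so False, None,
    and a missing key all pass.
    """
    enabled = [name for name in LOCKED_OFF if pycaret_cfg.get(name)]
    if enabled:
        raise ValueError(f"settings must stay off under the v1 contract: {enabled}")
```

Running such a check before setup() would surface a bad config early, instead of at the later handoff to training.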
- feature_selection.pycaret_selection.build_dataframe(data)
Materialize FeatureAnalysisData as a pandas DataFrame.
Columns: feature_names + [target_name, case_id]. The case_id column is used for GroupKFold inside PyCaret and for the case-level holdout split, and is always dropped from the final selection artifact.
- feature_selection.pycaret_selection.case_level_split(df, *, case_id_col, test_ratio, seed)
Split a row-level DataFrame into train/test without crossing cases.
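Splitting by case rather than by row can be sketched as follows. The example works on a plain list of per-row case ids so that it runs without pandas, whereas the documented function takes a DataFrame. The name split_case_ids, the rounding rule, and the policy of holding out at least one case but never all of them are assumptions, not the library's behaviour.

```python
import random


def split_case_ids(case_ids, test_ratio, seed):
    """Assign whole cases to train or test; return two lists of row indices."""
    unique_cases = sorted(set(case_ids))
    rng = random.Random(seed)  # local RNG, so the split is reproducible
    rng.shuffle(unique_cases)
    # Assumed policy: hold out at least one case, but never every case.
    n_test = min(max(1, round(len(unique_cases) * test_ratio)), len(unique_cases) - 1)
    test_cases = set(unique_cases[:n_test])
    train_rows = [i for i, c in enumerate(case_ids) if c not in test_cases]
    test_rows = [i for i, c in enumerate(case_ids) if c in test_cases]
    return train_rows, test_rows
```

Because membership is decided per case, every row of a given case lands on the same side of the split. That is the property the contract relies on.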
- feature_selection.pycaret_selection.enforce_allowlist(selected, allowlist)
Raise if any selected feature is outside the caller’s allowlist.
Kept as a top-level function so the v1 contract is unit-testable without importing PyCaret. The caller supplies the allowlist; this keeps the library case-agnostic.
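A check of this kind needs nothing beyond the standard library, which is what makes it testable without PyCaret. The sketch below is illustrative only: the documentation says the function raises but does not name the exception, so ValueError is an assumption, as is the function name enforce_allowlist_sketch.

```python
def enforce_allowlist_sketch(selected, allowlist):
    """Raise ValueError if any selected name is missing from the allowlist."""
    allowed = set(allowlist)
    # Keep the offending names in their original order for a readable message.
    offenders = [name for name in selected if name not in allowed]
    if offenders:
        raise ValueError(f"features outside the allowlist: {offenders}")
```

A unit test can then pass a selection with one stray name and assert that the call raises.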
- feature_selection.pycaret_selection.run_pycaret_selection(data, *, pycaret_cfg, output_dir, allowlist)
Run the v1 PyCaret selection path and write artifacts.
allowlist is the set of permitted post-selection feature names; callers supply it (typically the case-side ALLOWLIST constant) so this library stays alpha-D-agnostic.
- feature_selection.pycaret_selection.write_selected_features(path, selected, *, allowlist=None)
Write selected_features.txt.
One name per line, no header, no blank lines, trailing newline. Drop-in for data.input_columns_file in the training config. When allowlist is supplied (typically the case-side ALLOWLIST), the file is rejected if any name falls outside it.
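The file format described here (one name per line, no header, no blank lines, trailing newline) could be produced along the lines below. This is a sketch under assumptions: the name write_selected_features_sketch, the use of ValueError, and the rejection of blank names are illustrative choices, not the library's implementation.

```python
from pathlib import Path


def write_selected_features_sketch(path, selected, *, allowlist=None):
    """Write one feature name per line; validate before touching the file."""
    names = [name.strip() for name in selected]
    if any(not name for name in names):
        raise ValueError("blank feature names would create blank lines")
    if allowlist is not None:
        allowed = set(allowlist)
        outside = [name for name in names if name not in allowed]
        if outside:
            raise ValueError(f"features outside the allowlist: {outside}")
    # Ending every name with a newline gives the trailing newline for free.
    Path(path).write_text("".join(f"{name}\n" for name in names), encoding="utf-8")
```

Validation runs before the write, so a rejected selection leaves no partial file behind.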
Run manifest for feature-analysis reproducibility.
Captures config, dataset identity (zarr paths + mtimes hash), sklearn /
numpy versions, git SHA (best-effort), and seeds. Written next to the
report as manifest.json.
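A manifest with the fields listed above might be assembled roughly as follows. All function names, the JSON key names, and the use of SHA-256 over path and mtime pairs are assumptions made for illustration; the module's actual layout is not shown in this page.

```python
import hashlib
import json
import os
import subprocess
from importlib import metadata
from pathlib import Path


def dataset_fingerprint(zarr_paths):
    """Hash each store path together with its modification time."""
    digest = hashlib.sha256()
    for p in sorted(str(p) for p in zarr_paths):
        digest.update(p.encode("utf-8"))
        digest.update(str(os.stat(p).st_mtime_ns).encode("ascii"))
    return digest.hexdigest()


def package_version(name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def git_sha():
    """Best-effort: None when git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None


def write_manifest(output_dir, *, config, zarr_paths, seeds):
    """Write manifest.json into output_dir and return its path."""
    manifest = {
        "config": config,
        "dataset": {
            "zarr_paths": sorted(str(p) for p in zarr_paths),
            "mtimes_hash": dataset_fingerprint(zarr_paths),
        },
        "versions": {
            "scikit-learn": package_version("scikit-learn"),
            "numpy": package_version("numpy"),
        },
        "git_sha": git_sha(),
        "seeds": seeds,
    }
    path = Path(output_dir) / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n", encoding="utf-8")
    return path
```

The git lookup returns None on failure rather than raising, so a missing repository does not block the report from being written.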