feature_selection

Shared feature-analysis helpers used by alpha-D and case-pressure-drop.

Generic feature-analysis data container.

Case-specific feature loaders (e.g. cases.alpha_d.feature_data.load_feature_matrix) materialise feature matrices as this dataclass, and the generic feature_selection.pycaret_selection module consumes it. The dataclass itself is alpha-D-agnostic: it carries only the per-row feature matrix, the target, the case grouping, and bookkeeping for the case-level CV split.

class feature_selection.data.FeatureAnalysisData(X, y, groups, feature_names, target_name, case_ids, rows_per_case, local_velocity_normalization)

Bases: object

X: numpy.ndarray
y: numpy.ndarray
groups: numpy.ndarray
feature_names: list[str]
target_name: str
case_ids: list[str]
rows_per_case: list[int]
local_velocity_normalization: bool
property n_cases: int

PyCaret-based feature selection (generic, case-driven).

Callers pass a FeatureAnalysisData (whose construction stays case-side, so the case enforces its own ALLOWLIST upstream) together with the post-selection allowlist to validate against.

V1 contract

  • The DataFrame handed to PyCaret is built exclusively from a FeatureAnalysisData instance. PyCaret never reads Zarr directly, so the caller-supplied allowlist remains the single guard against target-adjacent columns leaking back into the selected feature set.

  • setup() locks polynomial_features, feature_interaction, pca, and group_features off. Those settings would synthesize columns that TabularPairDataset cannot reproduce from the Zarr stores, which silently breaks the handoff to PhysicsNeMo training via data.input_columns_file.

  • The case-level train/test split runs before setup(). A row-level holdout inside PyCaret would place rows from the same case in both splits; because rows within a case are spatially correlated, that leaks information from train to test. The pre-split test frame is passed to setup() via test_data.

  • Inside setup(), fold_strategy='groupkfold' with fold_groups='case_id' keeps internal CV group-safe.

  • The output file selected_features.txt holds one feature name per line with no header, so it is a drop-in for data.input_columns_file in the MLP training config.
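The setup() constraints in the contract above can be expressed as a small pre-flight check on the keyword arguments. This is a minimal sketch, not the library's code: check_setup_kwargs is a hypothetical helper, the setting names are copied from the contract text, and the choice of ValueError is an assumption.

```python
# Hypothetical guard; the names below mirror the contract text only.
LOCKED_OFF = ("polynomial_features", "feature_interaction", "pca", "group_features")
REQUIRED = {"fold_strategy": "groupkfold", "fold_groups": "case_id"}


def check_setup_kwargs(kwargs: dict) -> None:
    """Raise ValueError if kwargs would violate the v1 contract."""
    # Any truthy value means a column-synthesizing setting was switched on.
    enabled = [name for name in LOCKED_OFF if kwargs.get(name)]
    if enabled:
        raise ValueError(f"column-synthesizing settings must stay off: {enabled}")
    # Internal CV must be grouped by case.
    for key, expected in REQUIRED.items():
        if kwargs.get(key) != expected:
            raise ValueError(f"{key} must be {expected!r}, got {kwargs.get(key)!r}")
    # The case-level holdout is produced before setup() and passed through.
    if kwargs.get("test_data") is None:
        raise ValueError("test_data must carry the pre-split case-level test frame")


# object() stands in for the pre-split test DataFrame.
ok = {"fold_strategy": "groupkfold", "fold_groups": "case_id", "test_data": object()}
check_setup_kwargs(ok)  # passes silently
```

Running a check like this before any PyCaret call keeps the contract testable without importing PyCaret.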

feature_selection.pycaret_selection.build_dataframe(data)

Materialize FeatureAnalysisData as a pandas DataFrame.

Columns: feature_names + [target_name, case_id]. The case_id column is used for GroupKFold inside PyCaret and for the case-level holdout split, and is always dropped from the final selection artifact.
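The column layout can be illustrated with a short sketch. FeatureData is a hypothetical stand-in that carries only a subset of the FeatureAnalysisData fields, build_dataframe_sketch is not the library function, and it is assumed here that groups holds one case label per row. The feature and target names are placeholders.

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd


@dataclass
class FeatureData:
    """Hypothetical stand-in with a subset of FeatureAnalysisData's fields."""

    X: np.ndarray            # shape (n_rows, n_features)
    y: np.ndarray            # shape (n_rows,)
    groups: np.ndarray       # assumed: one case label per row
    feature_names: list[str]
    target_name: str


def build_dataframe_sketch(data: FeatureData) -> pd.DataFrame:
    """Lay out columns as feature_names + [target_name, 'case_id']."""
    df = pd.DataFrame(data.X, columns=data.feature_names)
    df[data.target_name] = data.y
    df["case_id"] = data.groups
    return df


data = FeatureData(
    X=np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    y=np.array([0.1, 0.2, 0.3]),
    groups=np.array(["case_a", "case_a", "case_b"]),
    feature_names=["f0", "f1"],
    target_name="target",
)
df = build_dataframe_sketch(data)

# case_id and the target never belong in the selection artifact.
feature_columns = df.drop(columns=["target", "case_id"]).columns.tolist()
```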

feature_selection.pycaret_selection.case_level_split(df, *, case_id_col, test_ratio, seed)

Split a row-level DataFrame into train/test without crossing cases.
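The idea behind a case-level split is to shuffle and partition the unique case IDs, then assign each row according to its case. The sketch below shows this on a plain list of per-row case IDs rather than a DataFrame. split_rows_by_case is a hypothetical helper, and the rounding rule for the number of test cases is an assumption, not the library's documented behaviour.

```python
import random


def split_rows_by_case(case_ids, *, test_ratio, seed):
    """Return (train_rows, test_rows) index lists; no case appears in both."""
    unique = sorted(set(case_ids))  # sort first so the shuffle depends only on seed
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Assumed rounding rule: aim for at least one test case,
    # but always keep at least one case for training.
    n_test = min(max(1, round(len(unique) * test_ratio)), len(unique) - 1)
    test_cases = set(unique[:n_test])
    train_rows = [i for i, c in enumerate(case_ids) if c not in test_cases]
    test_rows = [i for i, c in enumerate(case_ids) if c in test_cases]
    return train_rows, test_rows


row_case_ids = ["a", "a", "b", "b", "b", "c", "d", "d"]
train_rows, test_rows = split_rows_by_case(row_case_ids, test_ratio=0.25, seed=0)
```

With a DataFrame, the same index lists could be applied through df.iloc to obtain the train and test frames.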

feature_selection.pycaret_selection.enforce_allowlist(selected, allowlist)

Raise if any selected feature is outside the caller’s allowlist.

Kept as a top-level function so the v1 contract is unit-testable without importing PyCaret. The caller supplies the allowlist; this keeps the library case-agnostic.

Return type:

None
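A check of this kind needs nothing beyond the standard library, which is what makes it testable without PyCaret. The following is a minimal sketch: the exception type (ValueError), the message format, and the feature names are assumptions chosen for illustration.

```python
def enforce_allowlist_sketch(selected, allowlist):
    """Raise ValueError naming every selected feature outside the allowlist."""
    allowed = set(allowlist)
    rejected = [name for name in selected if name not in allowed]
    if rejected:
        raise ValueError(f"features outside allowlist: {rejected}")


ALLOWLIST = {"re_local", "wall_distance", "curvature"}  # placeholder names

enforce_allowlist_sketch(["re_local", "curvature"], ALLOWLIST)  # passes, returns None

try:
    enforce_allowlist_sketch(["re_local", "pressure_drop"], ALLOWLIST)
except ValueError as exc:
    message = str(exc)  # names only the offending feature
```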

feature_selection.pycaret_selection.run_pycaret_selection(data, *, pycaret_cfg, output_dir, allowlist)

Run the v1 PyCaret selection path and write artifacts.

allowlist is the set of permitted post-selection feature names; callers supply it (typically the case-side ALLOWLIST constant) so this library stays alpha-D-agnostic.

Return type:

dict[str, Any]

feature_selection.pycaret_selection.write_selected_features(path, selected, *, allowlist=None)

Write selected_features.txt.

One name per line, no header, no blank lines, trailing newline. Drop-in for data.input_columns_file in the training config. When allowlist is supplied (typically the case-side ALLOWLIST), the file is rejected if any name falls outside it.

Return type:

None
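The file format described above is simple enough to sketch directly. write_selected_features_sketch is a hypothetical re-implementation: it assumes the allowlist check happens before anything is written and that a violation raises ValueError. Neither detail is confirmed by the reference text.

```python
import tempfile
from pathlib import Path


def write_selected_features_sketch(path, selected, *, allowlist=None):
    """Write one feature name per line: no header, no blank lines."""
    names = [name.strip() for name in selected]
    if any(not name for name in names):
        raise ValueError("empty feature name would produce a blank line")
    if allowlist is not None:
        allowed = set(allowlist)
        outside = [name for name in names if name not in allowed]
        if outside:
            # Assumed behaviour: refuse before touching the file.
            raise ValueError(f"features outside allowlist: {outside}")
    # Every name ends with a newline, so the file has a trailing newline.
    Path(path).write_text("".join(f"{name}\n" for name in names), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "selected_features.txt"
    write_selected_features_sketch(out, ["f0", "f1"], allowlist={"f0", "f1", "f2"})
    written = out.read_text(encoding="utf-8")
```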

Run manifest for feature-analysis reproducibility.

Captures config, dataset identity (zarr paths + mtimes hash), sklearn / numpy versions, git SHA (best-effort), and seeds. Written next to the report as manifest.json.

feature_selection.manifest.build_manifest(*, config, zarr_dir, feature_names, target_name, n_rows, n_cases, seeds, repo_root=None)

Assemble a manifest dict describing the current run.

Return type:

dict[str, Any]
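Two of the manifest fields, the dataset identity hash and the best-effort git SHA, can be sketched with the standard library. The helper names, the SHA-256 choice, the exact hashing recipe, and the manifest keys below are assumptions for illustration. The reference text only states that Zarr paths and mtimes are hashed and that the git lookup is best-effort.

```python
import hashlib
import json
import subprocess
import tempfile
from pathlib import Path


def dataset_fingerprint(zarr_dir):
    """Hash the relative path and mtime of every entry under zarr_dir."""
    root = Path(zarr_dir)
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*")):  # sorted for a stable hash
        rel = path.relative_to(root).as_posix()
        digest.update(f"{rel}:{path.stat().st_mtime_ns}\n".encode("utf-8"))
    return digest.hexdigest()


def git_sha(repo_root=None):
    """Best-effort commit SHA; None when git or the repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_root, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "case_a.zarr"  # placeholder store name
    store.mkdir()
    (store / ".zgroup").write_text("{}", encoding="utf-8")
    manifest = {  # hypothetical keys
        "dataset_hash": dataset_fingerprint(tmp),
        "git_sha": git_sha(),
        "seeds": {"split": 0},
    }
    manifest_text = json.dumps(manifest, indent=2, sort_keys=True)
```

Hashing paths and mtimes rather than file contents keeps the fingerprint cheap for large stores, at the cost of changing whenever files are copied or touched.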

feature_selection.manifest.write_manifest(manifest, output_dir)

Write the manifest to output_dir as manifest.json.

Return type:

Path