Overview — ML pipeline (brief)
- Upload a labelled dataset (CSV) that contains the target column and features.
- Preprocess: handle missing values, convert types, encode categorical columns, scale numeric features.
- Feature engineering: create derived features the model benefits from (example: ratios, log transforms).
- Balance classes if needed (this app uses SMOTE).
- Train candidate models, evaluate with cross-validation, select the best model, and inspect its metrics and confusion matrix.
- Upload unlabelled test data and run predictions using the same preprocessing pipeline.
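The steps above can be wired together in a few lines. This is a minimal end-to-end sketch, assuming scikit-learn, imbalanced-learn, and xgboost are installed; 'my_data.csv' is a placeholder path and the default hyperparameters are illustrative, not the app's actual settings. Putting SMOTE inside the pipeline ensures it is applied to training folds only:
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from xgboost import XGBClassifier

# Load a labelled dataset and split off the target column.
df = pd.read_csv('my_data.csv')
y = LabelEncoder().fit_transform(df['koi_disposition'])
X = df.drop(columns=['koi_disposition']).select_dtypes('number')

# Impute, scale, oversample (train folds only), then fit the model.
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('model', XGBClassifier()),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1_macro')
print('macro F1 per fold:', scores)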
Required / recommended features
The pipeline expects features similar to the list below. Not every dataset will use the exact same names — mapping/renaming columns is common.
- koi_disposition (target: e.g. CANDIDATE, FALSE POSITIVE, CONFIRMED)
- koi_period
- koi_time0bk
- koi_impact
- koi_duration
- koi_depth
- koi_ror
- koi_srho
- koi_prad
- koi_sma
- koi_incl
- koi_num_transits
- koi_steff
- koi_slogg
- koi_smet
- koi_srad
- koi_smass
- koi_kepmag, koi_gmag, koi_rmag, koi_imag, koi_zmag
- koi_jmag, koi_hmag, koi_kmag
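Before uploading, you can quickly check which expected columns are missing. A small sketch; 'my_data.csv' is a placeholder and the EXPECTED list should be extended with the full set of names above:
import pandas as pd

# Columns the pipeline expects (extend with the rest of the list above).
EXPECTED = ['koi_disposition', 'koi_period', 'koi_time0bk', 'koi_impact',
            'koi_duration', 'koi_depth', 'koi_ror', 'koi_prad', 'koi_srad']

df = pd.read_csv('my_data.csv')
missing = [c for c in EXPECTED if c not in df.columns]
print('Missing columns:', missing if missing else 'none')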
Column-name differences
If your dataset uses different column names, rename them to the ones above before uploading. Common examples:
- period → koi_period
- depth → koi_depth
- duration → koi_duration
- prad → koi_prad
Cleaning & preprocessing checklist
- Remove duplicate rows and obvious corrupt records.
- Ensure the target column (koi_disposition) exists and has consistent labels.
- Numeric columns: impute missing values (median recommended), clip extreme outliers (IQR rule) or winsorize.
- Categorical columns: fill missing values with the mode or the token 'UNKNOWN', then encode (LabelEncoder or one-hot).
- Scale numeric features (StandardScaler) before training models that expect scaled inputs.
- Create the simple derived features used by the app: planets_to_star_radius_ratio = koi_prad / koi_srad, log_period = log1p(koi_period), and depth_to_duration = koi_depth / koi_duration.
- Split a hold-out test set or use cross-validation (the app uses stratified CV plus SMOTE for class imbalance). A combined preprocessing sketch follows this checklist.
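The numeric, categorical, and derived-feature steps above can be combined in one short script. This is a sketch under a few assumptions: columns are identified by dtype, 'my_data.csv' is a placeholder, and 1.5 is the conventional IQR multiplier. In a real pipeline, fit the scaler on the training split only:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv('my_data.csv').drop_duplicates()

num_cols = df.select_dtypes(include='number').columns
cat_cols = df.select_dtypes(exclude='number').columns.drop('koi_disposition', errors='ignore')

# Numeric columns: median imputation, then clip outliers with the 1.5*IQR rule.
for c in num_cols:
    df[c] = df[c].fillna(df[c].median())
    q1, q3 = df[c].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[c] = df[c].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Categorical columns: fill missing values with 'UNKNOWN', then label-encode.
for c in cat_cols:
    df[c] = LabelEncoder().fit_transform(df[c].fillna('UNKNOWN').astype(str))

# Derived features used by the app (computed from unscaled values).
df['planets_to_star_radius_ratio'] = df['koi_prad'] / df['koi_srad']
df['log_period'] = np.log1p(df['koi_period'])
df['depth_to_duration'] = df['koi_depth'] / df['koi_duration']

# Scale the original numeric features.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])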
Quick data-mapping example
If your CSV has different names, you can prepare a small mapping script before upload (example using pandas):
import pandas as pd

# Load the original file and rename columns to the expected koi_* names.
df = pd.read_csv('my_data.csv')
rename_map = {'period': 'koi_period', 'depth': 'koi_depth',
              'duration': 'koi_duration', 'prad': 'koi_prad'}
df = df.rename(columns=rename_map)

# Save the renamed copy for upload.
df.to_csv('my_data_renamed.csv', index=False)
Training tips
- Keep a small validation split or use stratified folds to preserve class distribution.
- Track accuracy, precision, recall, and F1 (macro), and review the confusion matrix.
- Use class balancing (SMOTE) if classes are imbalanced; monitor for overfitting.
- For the XGBoost model used in this app, tune n_estimators, max_depth, and learning_rate (a short tuning sketch follows these tips).
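A minimal tuning sketch, assuming imbalanced-learn and xgboost are installed and that X and y are the preprocessed features and encoded labels from the earlier steps; the grid values are illustrative, not the app's actual settings:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# SMOTE inside the pipeline is applied to training folds only.
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', XGBClassifier()),
])

param_grid = {
    'model__n_estimators': [100, 300],
    'model__max_depth': [4, 6, 8],
    'model__learning_rate': [0.05, 0.1],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring='f1_macro')
search.fit(X, y)
print(search.best_params_, search.best_score_)
After selecting parameters, refit on a training split and inspect the confusion matrix with sklearn.metrics.confusion_matrix to watch for overfitting on the minority classes.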
Recommended CSV formatting
- Comma-separated, UTF-8, with a header row.
- Use consistent missing value markers (empty cells are fine).
- Include the target column for labelled uploads.
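As a quick check, pandas produces a compliant file by default; a one-line sketch, assuming df is your prepared DataFrame:
# Comma-separated, UTF-8, header row, empty cells for missing values.
df.to_csv('upload_ready.csv', index=False, encoding='utf-8')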