Overview — ML pipeline (brief)
- Upload a labelled dataset (CSV) that contains the target column and features.
- Preprocess: handle missing values, convert types, encode categorical columns, scale numeric features.
- Feature engineering: create derived features the model benefits from (example: ratios, log transforms).
- Balance classes if needed (this app uses SMOTE).
- Train candidate models, evaluate with cross-validation, select the best model, and inspect its metrics and confusion matrix.
- Upload unlabelled test data and run predictions using the same preprocessing pipeline.
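The steps above can be wired together in a few lines. This is a minimal end-to-end sketch, assuming scikit-learn, imbalanced-learn, and xgboost are installed; 'my_data.csv' is a placeholder path and the default hyperparameters are illustrative, not the app's actual settings. Putting SMOTE inside the pipeline ensures it is applied to training folds only:
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from xgboost import XGBClassifier

# Load a labelled dataset and split off the target column.
df = pd.read_csv('my_data.csv')
y = LabelEncoder().fit_transform(df['koi_disposition'])
X = df.drop(columns=['koi_disposition']).select_dtypes('number')

# Impute, scale, oversample (train folds only), then fit the model.
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('model', XGBClassifier()),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1_macro')
print('macro F1 per fold:', scores)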
Required / recommended features
The pipeline expects features similar to the list below. Not every dataset will use the exact same names — mapping/renaming columns is common.
- koi_disposition (target: e.g. CANDIDATE, FALSE POSITIVE, CONFIRMED)
- koi_period
- koi_time0bk
- koi_impact
- koi_duration
- koi_depth
- koi_ror
- koi_srho
- koi_prad
- koi_sma
- koi_incl
- koi_num_transits
- koi_steff
- koi_slogg
- koi_smet
- koi_srad
- koi_smass
- koi_kepmag, koi_gmag, koi_rmag, koi_imag, koi_zmag
- koi_jmag, koi_hmag, koi_kmag
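Before uploading, you can quickly check which expected columns are missing. A small sketch; 'my_data.csv' is a placeholder and the EXPECTED list should be extended with the full set of names above:
import pandas as pd

# Columns the pipeline expects (extend with the rest of the list above).
EXPECTED = ['koi_disposition', 'koi_period', 'koi_time0bk', 'koi_impact',
            'koi_duration', 'koi_depth', 'koi_ror', 'koi_prad', 'koi_srad']

df = pd.read_csv('my_data.csv')
missing = [c for c in EXPECTED if c not in df.columns]
print('Missing columns:', missing if missing else 'none')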
Column-name differences
If your dataset uses different column names, rename them to the ones above before uploading. Common examples:
- period → koi_period
- depth → koi_depth
- duration → koi_duration
- prad → koi_prad
Cleaning & preprocessing checklist
- Remove duplicate rows and obvious corrupt records.
- Ensure the target column (koi_disposition) exists and has consistent labels.
- Numeric columns: impute missing values (median recommended), clip extreme outliers (IQR rule) or winsorize.
- Categorical columns: fill missing values with the mode or the token 'UNKNOWN', then encode (LabelEncoder or one-hot).
- Scale numeric features (StandardScaler) before training models that expect scaled inputs.
- Create the simple derived features used by the app: planets_to_star_radius_ratio = koi_prad / koi_srad, log_period = log1p(koi_period), and depth_to_duration = koi_depth / koi_duration.
- Split a hold-out test set or use cross-validation (the app uses stratified CV plus SMOTE for class imbalance). A combined preprocessing sketch follows this checklist.
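The numeric, categorical, and derived-feature steps above can be combined in one short script. This is a sketch under a few assumptions: columns are identified by dtype, 'my_data.csv' is a placeholder, and 1.5 is the conventional IQR multiplier. In a real pipeline, fit the scaler on the training split only:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv('my_data.csv').drop_duplicates()

num_cols = df.select_dtypes(include='number').columns
cat_cols = df.select_dtypes(exclude='number').columns.drop('koi_disposition', errors='ignore')

# Numeric columns: median imputation, then clip outliers with the 1.5*IQR rule.
for c in num_cols:
    df[c] = df[c].fillna(df[c].median())
    q1, q3 = df[c].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[c] = df[c].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Categorical columns: fill missing values with 'UNKNOWN', then label-encode.
for c in cat_cols:
    df[c] = LabelEncoder().fit_transform(df[c].fillna('UNKNOWN').astype(str))

# Derived features used by the app (computed from unscaled values).
df['planets_to_star_radius_ratio'] = df['koi_prad'] / df['koi_srad']
df['log_period'] = np.log1p(df['koi_period'])
df['depth_to_duration'] = df['koi_depth'] / df['koi_duration']

# Scale the original numeric features.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])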
Quick data-mapping example
If your CSV has different names, you can prepare a small mapping script before upload (example using pandas):
import pandas as pd

# Load the original file and rename columns to the expected koi_* names.
df = pd.read_csv('my_data.csv')
rename_map = {'period': 'koi_period', 'depth': 'koi_depth',
              'duration': 'koi_duration', 'prad': 'koi_prad'}
df = df.rename(columns=rename_map)

# Save the renamed copy for upload.
df.to_csv('my_data_renamed.csv', index=False)
Training tips
- Keep a small validation split or use stratified folds to preserve class distribution.
- Track accuracy, precision, recall, and F1 (macro), and review the confusion matrix.
- Use class balancing (SMOTE) if classes are imbalanced; monitor for overfitting.
- For the XGBoost model used in this app, tune n_estimators, max_depth, and learning_rate (a short tuning sketch follows these tips).
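A minimal tuning sketch, assuming imbalanced-learn and xgboost are installed and that X and y are the preprocessed features and encoded labels from the earlier steps; the grid values are illustrative, not the app's actual settings:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# SMOTE inside the pipeline is applied to training folds only.
pipe = Pipeline([
    ('smote', SMOTE(random_state=42)),
    ('model', XGBClassifier()),
])

param_grid = {
    'model__n_estimators': [100, 300],
    'model__max_depth': [4, 6, 8],
    'model__learning_rate': [0.05, 0.1],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring='f1_macro')
search.fit(X, y)
print(search.best_params_, search.best_score_)
After selecting parameters, refit on a training split and inspect the confusion matrix with sklearn.metrics.confusion_matrix to watch for overfitting on the minority classes.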
Recommended CSV formatting
- Comma-separated, UTF-8, with a header row.
- Use consistent missing value markers (empty cells are fine).
- Include the target column for labelled uploads.
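As a quick check, pandas produces a compliant file by default; a one-line sketch, assuming df is your prepared DataFrame:
# Comma-separated, UTF-8, header row, empty cells for missing values.
df.to_csv('upload_ready.csv', index=False, encoding='utf-8')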