Stargazer AI — Dataset & Training Guide

Quick reference for preparing data and training the model


Overview — ML pipeline (brief)

  1. Upload a labelled dataset (CSV) that contains the target column and features.
  2. Preprocess: handle missing values, convert types, encode categorical columns, scale numeric features.
  3. Feature engineering: create derived features that help the model (for example, ratios or log transforms).
  4. Balance classes if needed (SMOTE is used here).
  5. Train candidate models, evaluate them with cross-validation, then select the best model and inspect its metrics and confusion matrix.
  6. Upload unlabelled test data and run predictions using the same preprocessing pipeline (a minimal end-to-end sketch follows this list).
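
The sketch below ties steps 2 through 6 together using scikit-learn and imbalanced-learn. The file names, the koi_* feature list, the 'koi_disposition' target name, and the RandomForest choice are illustrative assumptions, not fixed parts of the pipeline.

# Minimal end-to-end sketch: preprocess, oversample with SMOTE, cross-validate,
# then predict on unlabelled data with the same fitted pipeline.
# Assumes scikit-learn and imbalanced-learn are installed; names are examples.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

train = pd.read_csv('train_labelled.csv')             # hypothetical file name
feature_cols = ['koi_period', 'koi_depth', 'koi_duration', 'koi_prad']
X, y = train[feature_cols], train['koi_disposition']  # assumed target column name

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),      # handle missing values
    ('scale', StandardScaler()),                       # scale numeric features
    ('smote', SMOTE(random_state=42)),                 # balance classes (applied to training folds only)
    ('model', RandomForestClassifier(random_state=42)),
])

# Cross-validated score for this candidate model.
scores = cross_val_score(pipe, X, y, cv=5, scoring='f1_macro')
print('CV f1_macro:', scores.mean())

# Fit on all labelled data, then apply the same preprocessing to unlabelled data.
pipe.fit(X, y)
test = pd.read_csv('test_unlabelled.csv')              # hypothetical file name
preds = pipe.predict(test[feature_cols])

Fitting the pipeline once and reusing it for prediction is what keeps preprocessing identical between training and test data (step 6).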

Required / recommended features

The pipeline expects Kepler KOI-style (koi_*) features such as koi_period, koi_depth, koi_duration, and koi_prad, plus the labelled target column for training data. Not every dataset will use the exact same names; mapping/renaming columns is common.
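
If you want to confirm a CSV has the expected columns before uploading, a quick check like the following helps; the required-column set here is only the subset used in the mapping example and should be adjusted to your dataset.

# Verify the upload has the expected koi_* columns (the list is illustrative).
import pandas as pd

expected = {'koi_period', 'koi_depth', 'koi_duration', 'koi_prad'}
df = pd.read_csv('my_data.csv')
missing = expected - set(df.columns)
if missing:
    print('Missing columns:', sorted(missing))
else:
    print('All expected columns present.')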

Column-name differences

If your dataset uses different column names, rename them to the expected koi_* names before uploading. Common examples (period → koi_period, depth → koi_depth, duration → koi_duration, prad → koi_prad) appear in the mapping script below.

Cleaning & preprocessing checklist

  1. Handle missing values (impute or drop as appropriate).
  2. Convert columns to the correct types.
  3. Encode categorical columns.
  4. Scale numeric features.
  5. Balance classes if needed (SMOTE is used here).
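
A rough pandas/scikit-learn illustration of this checklist; the file name, the example column, and the fill/encoding strategies are assumptions to adapt to your data.

# Checklist illustration; column names and strategies are assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('my_data_renamed.csv')

# Convert types first so numeric columns really are numeric (example column).
df['koi_depth'] = pd.to_numeric(df['koi_depth'], errors='coerce')

# Handle missing values: impute numeric columns with the median.
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Encode any remaining categorical feature columns (keep the target column
# out of these transformations in practice).
cat_cols = df.select_dtypes(include='object').columns
df = pd.get_dummies(df, columns=list(cat_cols))

# Scale numeric features.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])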

Quick data-mapping example

If your CSV has different names, you can prepare a small mapping script before upload (example using pandas):

# Map non-standard column names to the koi_* names the pipeline expects.
import pandas as pd

df = pd.read_csv('my_data.csv')
rename_map = {'period': 'koi_period', 'depth': 'koi_depth',
              'duration': 'koi_duration', 'prad': 'koi_prad'}
df = df.rename(columns=rename_map)
df.to_csv('my_data_renamed.csv', index=False)

Training tips

Balance classes before training if the target is imbalanced (SMOTE is used here), compare candidate models with cross-validation rather than a single train/test split, and inspect per-class metrics and the confusion matrix before selecting the best model.
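
The sketch below shows one way to compare candidate models with cross-validation and inspect a confusion matrix for the best one; the candidate list, scoring metric, file name, and column names are assumptions.

# Compare candidate models with cross-validation, then inspect a confusion matrix.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

train = pd.read_csv('train_labelled.csv')              # hypothetical file name
feature_cols = ['koi_period', 'koi_depth', 'koi_duration', 'koi_prad']
X, y = train[feature_cols], train['koi_disposition']   # assumed target column name

candidates = {
    'logreg': make_pipeline(SimpleImputer(strategy='median'), StandardScaler(),
                            LogisticRegression(max_iter=1000)),
    'random_forest': make_pipeline(SimpleImputer(strategy='median'),
                                   RandomForestClassifier(random_state=42)),
}

best_name, best_score = None, -1.0
for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5, scoring='f1_macro').mean()
    print(f'{name}: f1_macro={score:.3f}')
    if score > best_score:
        best_name, best_score = name, score

# Cross-validated predictions give an honest confusion matrix for the best model.
y_pred = cross_val_predict(candidates[best_name], X, y, cv=5)
print(confusion_matrix(y, y_pred))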

Recommended CSV formatting

Use a plain CSV with a header row containing the expected column names and one observation per row. Training data must include the target column; test data can be unlabelled.
