🌧️ Project Overview

This project was developed for the Kaggle “Binary Prediction with a Rainfall Dataset” competition. The goal was to predict whether rainfall occurs from a set of environmental features. My final solution achieved a public-leaderboard ROC AUC of 99.95, placing it among the top scores in the competition.


🔍 Problem Scope

Generate predictions for the full test set: the probability of rain conditional on the available regressors (e.g. temperature, wind speed, cloud cover, and humidity).
40% of the test set is used to compute the public score after each submission; the remaining 60% is held out until the end of the competition, when the private score is computed. It is unknown which samples belong to each split.
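A submission therefore pairs every test-set id with a predicted rain probability. A minimal sketch of the expected file (the column names `id` and `rainfall` and the toy values are assumptions, not taken from the competition files):

```python
import pandas as pd

# Toy ids and probabilities; in practice the probabilities would come
# from model.predict_proba(X_test)[:, 1] on the real test set.
test_ids = [0, 1, 2]
rain_proba = [0.91, 0.12, 0.77]

submission = pd.DataFrame({"id": test_ids, "rainfall": rain_proba})
submission.to_csv("submission.csv", index=False)
```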


🛠️ Technical Approach

✅ Feature Engineering

  • Normalized key predictors and engineered meaningful ratios and deltas.
  • Created visualizations to validate distributional consistency between training and test data.
  • Analyzed correlations with the target to prioritize high-signal features.
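The ratio/delta idea above can be sketched as follows; the column names are hypothetical stand-ins for the competition's weather features, not the actual schema:

```python
import pandas as pd

# Toy frame with hypothetical weather columns
df = pd.DataFrame({
    "humidity": [87.0, 62.0, 95.0],
    "cloud":    [88.0, 45.0, 99.0],
    "maxtemp":  [21.0, 25.5, 18.0],
    "mintemp":  [16.5, 19.0, 15.0],
})

# Delta: daily temperature range
df["temp_range"] = df["maxtemp"] - df["mintemp"]
# Ratio: humidity relative to cloud cover (guard against division by zero)
df["humidity_cloud_ratio"] = df["humidity"] / df["cloud"].clip(lower=1e-6)
# Min-max normalization of a key predictor
df["humidity_norm"] = (df["humidity"] - df["humidity"].min()) / (
    df["humidity"].max() - df["humidity"].min()
)
```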

✅ Modeling

  • Trained two high-performing models:
    • Random Forest Classifier
    • XGBoost Classifier, using GPU acceleration via CUDA for faster tuning and inference
  • Used Optuna for Bayesian hyperparameter optimization with custom search spaces.

✅ Validation Strategy

  • Applied grouped K-fold and randomly shuffled K-fold splits to check that the model generalized beyond any single partition of the data.
  • Carefully monitored overfitting using out-of-fold performance.
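Out-of-fold monitoring means every training row is scored exactly once by a model that never saw it. A minimal shuffled K-fold sketch on synthetic data (a grouped variant would pass group labels to `GroupKFold` instead):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Each row receives exactly one out-of-fold prediction
oof = np.zeros(len(y))
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
    model = RandomForestClassifier(n_estimators=100, random_state=1)
    model.fit(X[train_idx], y[train_idx])
    oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]

# A single AUC over all out-of-fold predictions tracks overfitting
oof_auc = roc_auc_score(y, oof)
```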

🚀 Deployment

To demonstrate the model’s predictions interactively and reproducibly, I:

  • Built a Streamlit app that allows exploration of predictions and feature importances.
  • Packaged the entire app using Docker to simplify reproducibility.
  • Hosted the application on a private cloud server accessible here:

👉 Live App
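A Dockerfile along these lines is typically enough to containerize a Streamlit app; the file names, entry point (`app.py`), and port are assumptions, not the project's actual layout:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```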


📊 Visual Preview

Screenshot of the Rainfall App

🧠 Lessons Learned

  • GPU acceleration can be very valuable even for structured data when doing intensive feature selection and hyperparameter optimization.
  • Cross-validation remains one of the most reliable tools in machine learning.
  • Random Forest and XGBoost remain very powerful tools for binary classification.

