Rainfall Prediction – Kaggle Competition
🌧️ Project Overview
This project was developed for the Kaggle “Binary Prediction with a Rainfall Dataset” competition. The goal was to predict the occurrence of rainfall using a set of environmental features. My final solution achieved a public ROC AUC score of 0.9995, placing it among the top scores in the competition.
🔍 Problem Scope
Generate predictions for the full test set: the probability of rain conditional on the available regressors (e.g., temperature, wind speed, cloud cover, and humidity).
60% of the test set is used only at the end of the competition, when the private score is computed; the remaining 40% is used to calculate the public score shown with each submission. Which samples enter the public score computation is not disclosed.
🛠️ Technical Approach
✅ Feature Engineering
- Normalized key predictors and engineered meaningful ratios and deltas.
- Created visualizations to validate distributional consistency between training and test data.
- Analyzed correlations with the target to prioritize high-signal features.
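The feature-engineering steps above can be sketched as follows. The column names and the specific ratios/deltas here are illustrative assumptions, not the exact transformations used in the final solution:

```python
import pandas as pd

# Hypothetical environmental features of the kind found in the competition data.
df = pd.DataFrame({
    "temperature": [21.3, 18.9, 25.1, 16.4],
    "humidity":    [88.0, 92.0, 60.0, 95.0],
    "cloud":       [75.0, 90.0, 20.0, 98.0],
    "windspeed":   [12.5, 30.2, 8.1, 22.4],
    "rainfall":    [1, 1, 0, 1],  # binary target
})

# Engineered ratios and deltas (illustrative)
df["humidity_cloud_ratio"] = df["humidity"] / (df["cloud"] + 1e-6)
df["temp_humidity_delta"] = df["temperature"] - df["humidity"] / 5.0

# Normalize key predictors with a z-score
for col in ["temperature", "humidity", "windspeed"]:
    df[col + "_z"] = (df[col] - df[col].mean()) / df[col].std()

# Rank features by absolute correlation with the target to find high-signal ones
corr = df.corr(numeric_only=True)["rainfall"].drop("rainfall").abs()
print(corr.sort_values(ascending=False))
```

The same correlation ranking can be run on the training set only, then the distributions of the top features compared against the test set to validate consistency.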
✅ Modeling
- Trained two high-performing models:
  - Random Forest Classifier
  - XGBoost Classifier, using GPU acceleration via CUDA for faster tuning and inference
- Used Optuna for Bayesian hyperparameter optimization with custom search spaces.
✅ Validation Strategy
- Applied both grouped and randomized K-fold splits to check that the model generalized consistently across folds.
- Carefully monitored overfitting using out-of-fold performance.
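The out-of-fold monitoring above can be sketched as follows, here with a Random Forest on synthetic data for illustration. Each sample is scored by a model that never saw it during training, so the resulting AUC is an honest generalization estimate:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Collect out-of-fold predictions across the K splits
oof = np.zeros(len(y))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in cv.split(X, y):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

oof_auc = roc_auc_score(y, oof)
print("OOF ROC AUC:", oof_auc)
```

A large gap between training AUC and this out-of-fold AUC is the overfitting signal to watch for.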
🚀 Deployment
To demonstrate the model’s predictions interactively and reproducibly, I:
- Built a Streamlit app that allows exploration of predictions and feature importances.
- Packaged the entire app using Docker to simplify reproducibility.
- Hosted the application on a private cloud server accessible here:
👉 Live App
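A minimal Dockerfile for serving a Streamlit app of this kind might look like the sketch below. The file names (`requirements.txt`, `app.py`) are illustrative assumptions, not necessarily those in the repo:

```dockerfile
# Minimal image for the Streamlit prediction explorer (file names illustrative)
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0"]
```

Binding to `0.0.0.0` is what makes the app reachable from outside the container once port 8501 is published.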
📊 Visual Preview

🧠 Lessons Learned
- GPU acceleration can be very valuable even for structured data when doing intensive feature selection and hyperparameter optimization.
- Cross-validation remains one of the most reliable tools in ML for estimating generalization.
- Random Forest and XGBoost remain very powerful tools for binary classification on tabular data.
📁 Links
- 📈 Kaggle Competition Page
- 🐍 GitHub Repo: https://github.com/jairgs/kaggle-rainfall-prediction