Churn Prediction Using Voice Data and Customer Transactions
🏢 Background
Cemex, one of Mexico’s largest industrial companies with global operations, needed to anticipate customer churn using underutilized call audio alongside customer transaction data.
🎯 Objective
Identify high-risk customers by analyzing phone conversation audio and historical transaction patterns, then combine insights into a predictive model for churn detection.
🔊 Audio Data Processing
Cemex’s call center was the primary order channel, and years of audio logs had never been leveraged. The steps included:
- Built an async pipeline around deep-learning speech-to-text APIs, tracking each request and pairing it with its response because of high API latency.
- Translated Spanish transcripts with neural machine translation services to keep the dataset consistent.
- Extracted features with NLP models to detect:
  - Sentiment
  - Tone
  - Conversation duration, frequency, and resolution indicators
- Mapped those features back to unique customer profiles.
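The request tracking and response pairing described above can be sketched as follows. This is a minimal illustration and not the project's actual code: `transcribe` is a hypothetical stand-in for the real speech-to-text call, and `TranscriptionJob`, `run_batch`, and the concurrency limit are assumptions made for the example.

```python
import asyncio
import uuid
from dataclasses import dataclass


@dataclass
class TranscriptionJob:
    request_id: str   # tracking key generated when the request is submitted
    customer_id: str
    audio_path: str


async def transcribe(job: TranscriptionJob) -> tuple[str, str]:
    """Hypothetical stand-in for a slow speech-to-text API call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return job.request_id, f"transcript of {job.audio_path}"


async def run_batch(calls, max_concurrency: int = 4) -> dict[str, list[str]]:
    semaphore = asyncio.Semaphore(max_concurrency)

    # Register every request under its own ID before sending anything.
    jobs: dict[str, TranscriptionJob] = {}
    for customer_id, audio_path in calls:
        job = TranscriptionJob(uuid.uuid4().hex, customer_id, audio_path)
        jobs[job.request_id] = job

    async def bounded(job: TranscriptionJob) -> tuple[str, str]:
        async with semaphore:  # cap the number of in-flight requests
            return await transcribe(job)

    by_customer: dict[str, list[str]] = {}
    # Responses arrive in completion order, not submission order.
    for finished in asyncio.as_completed([bounded(j) for j in jobs.values()]):
        request_id, text = await finished
        job = jobs[request_id]  # pair the response with its original request
        by_customer.setdefault(job.customer_id, []).append(text)
    return by_customer


calls = [("C001", "a.wav"), ("C002", "b.wav"), ("C001", "c.wav")]
transcripts = asyncio.run(run_batch(calls))
```

Keying every response by a request ID, rather than relying on arrival order, is what keeps transcripts attached to the right customer when slow calls finish out of order.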
🧾 Transactional Data & Churn Labeling
- Aggregated multi-year transactional records (order frequency, product mix, payment patterns).
- Engineered time-based features and behavioral segments.
- Defined a churn label based on long inactivity periods or termination signals in the customer relationship.
- Merged behavioral and voice-derived data per customer.
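A minimal pandas sketch of the labeling and merge steps is shown here. The 180-day inactivity cutoff, the snapshot date, the column names, and the toy data are all assumptions for illustration; the actual churn definition also used termination signals, which are not modeled in this example.

```python
import pandas as pd

INACTIVITY_DAYS = 180  # assumed cutoff, not the project's actual value

# Toy order history and voice-derived features (hypothetical columns).
orders = pd.DataFrame({
    "customer_id": ["C001", "C001", "C002", "C003"],
    "order_date": pd.to_datetime(
        ["2023-01-10", "2023-11-20", "2023-02-01", "2023-12-05"]
    ),
})
voice = pd.DataFrame({
    "customer_id": ["C001", "C002"],
    "mean_sentiment": [0.4, -0.3],
})
snapshot = pd.Timestamp("2023-12-31")  # date the label is computed for

# One row per customer: order count and most recent order date.
features = orders.groupby("customer_id").agg(
    order_count=("order_date", "size"),
    last_order=("order_date", "max"),
).reset_index()

# Label as churned when the customer has been inactive past the cutoff.
features["days_inactive"] = (snapshot - features["last_order"]).dt.days
features["churned"] = (features["days_inactive"] > INACTIVITY_DAYS).astype(int)

# Left join keeps customers with no recorded calls (voice features become NaN).
dataset = features.merge(voice, on="customer_id", how="left")
```

A left join is used so that customers without any call audio stay in the training set; how their missing voice features are handled downstream is a separate modeling decision.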
🤖 Predictive Modeling
- Trained classification models to predict churn risk:
  - Balanced classes using oversampling/undersampling techniques.
  - Applied cross-validation, probability calibration, and Optuna-based hyperparameter tuning.
  - Compared multiple models (e.g., XGBoost, Random Forest, Logistic Regression).
  - Calibrated probability thresholds for actionable insights.
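The calibration and threshold steps can be illustrated with scikit-learn. This is a sketch under stated assumptions, not the delivered model: the data is synthetic, Logistic Regression stands in for whichever model was selected, resampling and Optuna tuning are left out for brevity, and maximizing F1 is only one possible rule for choosing a threshold.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged customer table (roughly 10% churners).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Cross-validated sigmoid (Platt) calibration around the base classifier.
model = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="sigmoid", cv=5
)
model.fit(X_train, y_train)
churn_proba = model.predict_proba(X_val)[:, 1]

# Choose the threshold on held-out data; here, the one that maximizes F1.
precision, recall, thresholds = precision_recall_curve(y_val, churn_proba)
# precision and recall have one more entry than thresholds, so drop the last.
denominator = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
f1 = 2 * precision[:-1] * recall[:-1] / denominator
best_threshold = float(thresholds[np.argmax(f1)])

high_risk = churn_proba >= best_threshold  # customers to flag for follow-up
```

In practice the threshold would more likely be driven by the cost of a retention action versus the cost of a lost customer, so the F1 rule here is a placeholder for that business decision.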
🛠️ Deliverables
- Reusable prediction pipeline for inference on new customer data.
- Final churn risk scores for each customer.
- A lightweight dashboard with filters for:
  - Churn probability range
  - Segment analysis
  - Voice/NLP signal contributions
- Documentation for retraining and model deployment.
⚙️ Technologies Used
- NLP & Audio: `nltk`, `textblob`, etc.
- Translation: IBM Cloud, AI & Watson
- Data Science: pandas, scikit-learn, XGBoost
- Pipeline orchestration: Python3
- Dashboard: Power BI
📌 Impact
- Enabled Cemex to act early on potential churn, particularly among clients with unresolved call issues or sentiment shifts.
- Surfaced new insights from an untapped audio dataset, transforming voice data into strategic intelligence.
Feel free to contact me if your organization is exploring voice data or customer behavior analytics.