🏢 Background

Cemex, one of Mexico’s largest industrial companies with global operations needed to anticipate customer churn using underutilized audio data sources and other customer transactional data.


🎯 Objective

Identify high-risk customers by analyzing phone conversation audio and historical transaction patterns, then combine insights into a predictive model for churn detection.


🔊 Audio Data Processing

Cemex’s call center was the primary order channel, and years of audio logs had never been leveraged. The steps included:

  • Async pipeline for deep speech-to-text APIs, carefully handling request tracking and response pairing due to high latency.
  • Translation of Spanish transcripts using neural machine translation services to ensure consistency across the dataset.
  • Feature extraction using NLP models to detect:
    • Sentiment
    • Tone
    • Conversation duration, frequency, resolution indicators
  • Mapping those features back to unique customer profiles.

🧾 Transactional Data & Churn Labeling

  • Aggregated multi-year transactional records (order frequency, product mix, payment patterns).
  • Engineered time-based features and behavioral segments.
  • Defined a churn label based on long inactivity periods or termination signals in the customer relationship.
  • Merged behavioral and voice-derived data per customer.

🤖 Predictive Modeling

  • Trained classification models to predict churn risk:
    • Balanced classes using oversampling/undersampling techniques.
    • Applied cross-validation, probability calibration, and Optuna-based hyperparameter tuning.
  • Compared multiple models (e.g., XGBoost, Random Forest, Logistic Regression).
  • Calibrated probability thresholds for actionable insights.

🛠️ Deliverables

  • Reusable prediction pipeline for inference on new customer data.
  • Final churn risk scores for each customer.
  • A lightweight dashboard with filters for:
    • Churn probability range
    • Segment analysis
    • Voice/NLP signal contributions
  • Documentation for retraining and model deployment.

⚙️ Technologies Used

  • NLP & Audio: `nltk`, `textblob`, etc.
  • Translation: IBM Cloud, AI & Watson,
  • Data Science: pandas, scikit-learn, XGBoost
  • Pipeline orchestration: Python3
  • Dashboard: Power BI

📌 Impact

  • Enabled Cemex to act early on potential churn, particularly in clients with unresolved call issues or sentiment shifts.
  • Surfaced new insights from an untapped audio dataset, transforming voice data into strategic intelligence.

Feel free to contact me if your organization is exploring voice data or customer behavior analytics.