Customer churn isn’t just a marketing problem - it’s a business survival issue. In competitive industries like e-commerce, losing one customer often means losing several revenue streams, especially when one account can represent multiple users.

This post is a breakdown of my churn prediction capstone project for the postgraduate data science program at UT Austin - also tied to my master’s in data science at Deakin U. The project was closed-source, so I can’t release the full notebook, but I’ll walk you through everything I did including code snippets, results, charts, what I learned, and where this project fits in my larger journey into machine learning and MLOps.


🧩 The Problem: Why Churn Matters So Much

The e-commerce company in question faced intense competitive pressure and a rising churn rate. Each lost account didn’t just represent a single user - it impacted groups of customers. That meant multiplicative revenue loss.

Our goals were:

  • Predict churn at the account level
  • Identify top drivers of churn
  • Segment customers for targeted retention
  • Recommend cost-effective, revenue-positive strategies

🧪 Dataset Overview & Initial Processing

The dataset had ~11,000 rows and 19 columns, combining both categorical and numerical variables. A few key points:

  • Heavy class imbalance: Only ~17% of the accounts had churned
  • Several columns had missing values which were handled via KNN imputation
  • Mix of behavioral data (logins, payments, complaints), service scores, and engagement metrics

Churn Distribution Chart


🔍 Exploratory Data Analysis (EDA)

Key univariate and bivariate findings:

  • Device Used: Desktop users churned more (~20%) than mobile (~16%)
  • Revenue Patterns: Lower revenue/month was linked to higher churn
  • Complaints: Past complaints were strongly correlated with churn
  • Tenure: Shorter tenure customers churned more
  • Payment Method: Debit and UPI users churned more than Credit Card users

Churn Rate Details

Interestingly, some expected predictors (like cashback or service score) didn’t show strong correlation individually - indicating multi-factorial churn behavior.


📊 Clustering for Segmentation

Rather than just modeling churn, we wanted actionable segmentation. Using the elbow method and KMeans, we identified 4 customer clusters.

Each cluster showed different churn profiles, behavior patterns, and required tailored business strategies.

Clustering Elbow Chart


📌 Cluster Highlights & Strategic Takeaways

Churn Rate by Cluster

Cluster 1 – Large Segment, High Churn (~20%)

  • Male-dominated, married, mobile-first
  • Mid-to-high revenue, but frequent complaints
  • Strategy: Premium customer care, non-discount loyalty programs, debit-card reward partnerships

Cluster 3 – Smallest, but Worst Churn (~21%)

  • Newer users with high cashback dependency
  • Low service scores, high complaints, revenue decline
  • Strategy: Early engagement triggers, reduce cashback reliance, introduce tiered perks

Cluster 0 – Mid-Tier Churn (~13%)

  • HNI and Super users, higher spenders
  • Credit card and UPI users, moderate complaints
  • Strategy: VIP retention plans, gamified reward systems, re-engagement emails

Cluster 2 – Best Segment (~5% Churn)

  • Long-tenure, high-revenue customers
  • Highest satisfaction, lowest complaints
  • Strategy: Just keep them happy! Introduce premium perks and incentivize digital payments

🤖 ML Model Evaluation & Results

This wasn’t just a segmentation project - I evaluated 6 different machine learning models for churn prediction:

ModelTest AccuracyPrecisionRecallF1-ScoreAUC-ROCOverfitting
Decision Tree0.93120.77320.83640.80350.8934Yes
Random Forest0.97420.93970.90500.92200.9936No
XGBoost0.96940.91890.89710.90790.9908No
AdaBoost0.87080.59570.72300.65320.9025Yes
Naïve Bayes0.71270.33860.74140.46480.7770No
SVM0.77440.41220.79950.54400.8526No
Tuned RF0.97650.98220.87600.92610.9960No
  • Random Forest & XGBoost were the top performers with high precision and AUC scores, and no signs of overfitting, so I ended up tuning the random forest further.

  • I also experimented with SMOTE for handling class imbalance, but performance dropped significantly across models so oversampling isn’t always the solution.

ROC & Precision Recall Curve ROC & Precision Recall Curve SMOTE


🔍 Feature Importance (XGBoost)

Top predictors of churn from the model:

Feature Importance Bar Chart

  • Tenure and Complaints were the most influential features.
  • Other strong indicators: customer care interaction history, cashback dependency, and service score.

🎯 Business Outcomes & Learnings

This wasn’t just a modeling exercise - it was a strategy recommendation engine.

Key takeaways:

  • Churn is complex: No one variable dominates, but multiple small signals do.
  • Clustering works: It lets us build segment-specific retention tactics.
  • Complaints matter: Past complaints + low satisfaction are gold for early churn signals.
  • Cashback ≠ loyalty: High discount reliance often signals at-risk behavior.

🧭 Business Recommendations by Cluster

Cluster 1 (Large Segment, High Churn ~20%)

  • Premium support: Faster complaint resolution and priority escalation.
  • Loyalty rewards without discounts: e.g., early access perks.
  • Debit-card partner offers & wallet payment nudges.
  • Predictive churn alerts + personalized retention bonuses.

Cluster 3 (Smallest, Highest Churn ~21%)

  • High-touch onboarding: Loyalty bonuses in first 3–6 months.
  • Reduce cashback reliance: Use membership perks instead.
  • VIP service model for high revenue churners.
  • Improve care response time: Add real-time channels.

Cluster 2 (Lowest Churn ~5%)

  • Maintain delight: Premium memberships, early sales access.
  • Encourage digital payments with UPI/wallet incentives.
  • Reward repeat purchases via tiered cashback.

Cluster 0 (Moderate Churn ~13%)

  • HNI support: Personalized account managers.
  • Gamified engagement: Spend & Earn reward mechanics.
  • Proactive re-engagement emails.
  • Introduce premium subscription loyalty tier.

🚀 What’s Next?

This project helped me:

  • Think beyond models and into business strategy
  • Practice handling imbalanced data
  • Understand how to build insights*with both EDA and clustering

It’s part of a broader set of real-world projects I’ll be tackling throughout my second year of the master’s program - with more blogs and eventually open-sourced templates (when possible).

If you’re interested in just nerding out - feel free to connect or drop a note.


Thanks for reading! 🧠📊