With ever-growing competition in telecommunications markets, operators increasingly have to rely on business intelligence to offer the right incentives. We calculate customer churn (predicting which customers are at risk of leaving a company) over a certain period of time using a Bayesian network in the Spark framework. We believe that our analysis techniques will allow telecom operators to better understand the social behavior of their customers, and potentially provide major insights for designing effective incentives.


As mobile telecom penetration increases, the focus is shifting from customer acquisition to customer retention. To maintain profitability, telecom service providers must control churn, the loss of subscribers who switch from one carrier to another. In order to retain customers, operators have to offer the right incentives, adopt the right marketing strategies, and place their network assets appropriately. The call detail record (CDR) is an important data source for quantifying and analyzing user experience and behavior. A CDR contains various details pertaining to each call: who called whom, when the call was made, how long it lasted, and so on. We use the Spark framework to analyze the CDR data, extract the required features, and calculate the customer churn rate. Predicting in advance which users may leave makes it more likely that early action can be taken to prevent churn and minimize losses. The CDR data can also be represented as a graph, which gives telecom operators business insights for designing strategies.

Problem Description:

The problem is to predict, for each user, the probability of leaving the network provider within a fixed period such as the next month or quarter.

A sample of the available data is given below. Once this problem is solved satisfactorily, global questions about user behavior, such as which 10% of users are most likely to leave in the next quarter, can be answered.
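As a minimal sketch of such a global query, assuming per-user churn probabilities have already been computed, the most at-risk users can be selected by ranking. The function name `top_fraction`, the user IDs, and the probabilities are hypothetical and serve only as illustration.

```python
import math


def top_fraction(churn_probability, fraction=0.10):
    """Return the user IDs with the highest churn probability.

    churn_probability: dict mapping user ID -> probability of leaving.
    fraction: share of users to return (0.10 means the top 10%).
    """
    if not churn_probability:
        return []
    # Always return at least one user for a non-empty input.
    k = max(1, math.ceil(len(churn_probability) * fraction))
    ranked = sorted(churn_probability.items(),
                    key=lambda item: item[1], reverse=True)
    return [user for user, _ in ranked[:k]]


# Hypothetical probabilities for ten users (user0 .. user9).
scores = {
    f"user{i}": p
    for i, p in enumerate(
        [0.05, 0.40, 0.12, 0.91, 0.33, 0.08, 0.27, 0.64, 0.19, 0.02]
    )
}
most_at_risk = top_fraction(scores)  # top 10% of ten users: one user
```

With ten users and the default fraction, the call returns the single user with the highest probability.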

Design Approach:

CDR data analysis involves a large amount of data. The Hadoop Distributed File System (HDFS) is chosen to store the vast amount of CDR data because it provides an easy and flexible storage platform.

We read the raw CDR data from HDFS and perform the customer churn analysis in Spark using the prediction model described below.
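To make the feature-extraction step concrete, here is a minimal sketch written in plain Python rather than Spark, so that it runs without a cluster. The record fields (`caller`, `callee`, `start`, `duration_s`) and the chosen features are assumptions for illustration; the actual CDR schema and feature set are not specified here. In Spark, the same aggregation would be expressed as a group-by on the caller.

```python
from collections import defaultdict

# Hypothetical CDR rows: who called whom, when, and for how long (seconds).
cdr_rows = [
    {"caller": "A", "callee": "B", "start": "2016-01-03T10:15:00", "duration_s": 120},
    {"caller": "A", "callee": "C", "start": "2016-01-03T11:02:00", "duration_s": 60},
    {"caller": "A", "callee": "B", "start": "2016-01-04T09:30:00", "duration_s": 30},
    {"caller": "B", "callee": "A", "start": "2016-01-04T18:45:00", "duration_s": 300},
]


def per_user_features(rows):
    """Aggregate simple per-caller features from CDR rows."""
    calls = defaultdict(int)
    total_duration = defaultdict(int)
    contacts = defaultdict(set)
    for row in rows:
        caller = row["caller"]
        calls[caller] += 1
        total_duration[caller] += row["duration_s"]
        contacts[caller].add(row["callee"])
    return {
        user: {
            "call_count": calls[user],
            "total_duration_s": total_duration[user],
            "distinct_contacts": len(contacts[user]),
        }
        for user in calls
    }


features = per_user_features(cdr_rows)
```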


Sample Data:

Call Details Dataset:

Customer Details Dataset:


Algorithm Model:

The inference engine is modeled as a Bayesian network, which is generally more complex than a neural network. The nodes of the Bayesian network represent the subscribers to the phone network, and the network is a directed graph. Two nodes Ni and Nj are connected by two edges (Ni -> Nj and Nj -> Ni) if the users influence each other in their choice of home network. They need not even be talking to each other: for example, two residential neighbors may never speak over the phone, yet still influence each other's choice of network. The graph is directed because the influence in each direction may not be equal. Each edge is assigned a weight; the weight of the edge Ni -> Nj represents the probability of Ni leaving the phone network because Nj has chosen to leave.

Fix a node, say Ni. The sum of the weights of all the edges going out of Ni should be 1. There can also be self-connecting edges (a node connected to itself); the weight of the self-connecting edge represents the probability of Ni leaving the network for their own reasons. The probabilities refer to the user changing networks within a fixed period, say a month or a quarter, and the approach can be extended to calculate the probabilities for any given period. Each user (node) may have several attributes. For any particular node, the influences are determined by the attributes of the nodes to which it is connected. How the users' attributes determine these influences, and hence the weights, is calculated using a statistical or machine learning approach. The weights can be adjusted by periodically recalculating the influences.
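The constraint that each node's outgoing weights sum to 1 can be sketched as a normalization step. The snippet below assumes that raw, non-negative influence scores are already available for every edge (how they are derived from user attributes is left open here); the dictionary layout, node names, and scores are hypothetical.

```python
def normalize_outgoing(raw_scores):
    """Scale each node's outgoing edge scores so that they sum to 1.

    raw_scores[ni][nj] is a non-negative score for the edge ni -> nj;
    raw_scores[ni][ni] is the self-connecting edge.
    """
    weights = {}
    for node, edges in raw_scores.items():
        total = sum(edges.values())
        if total <= 0:
            raise ValueError(f"node {node!r} has no positive outgoing score")
        weights[node] = {target: score / total
                         for target, score in edges.items()}
    return weights


# Hypothetical raw scores. Influence is not symmetric:
# the edge N3 -> N1 is scored differently from N1 -> N3.
raw = {
    "N1": {"N1": 2.0, "N2": 1.0, "N3": 1.0},
    "N2": {"N2": 3.0, "N1": 1.0},
    "N3": {"N3": 1.0, "N1": 4.0},
}
weights = normalize_outgoing(raw)
```

Storing the graph as a dictionary of dictionaries keeps the two directions of an edge separate, which matches the directed nature of the network.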

At any point in time, each node has a probability assigned to it representing the current probability that the user will leave (change) the network. For example, suppose node Ni is connected to nodes Nj1, Nj2, …, NjX. If all of them are in the same network, the current probability of Ni leaving the network equals the weight of the self-connecting edge of Ni. If one of them, say Nj2, decides to leave the network, the current probability of Ni leaving becomes the sum of the weight of the self-connecting edge of Ni and the weight of the edge Ni -> Nj2. If Nj2 comes back to the original network, the weight of the edge Ni -> Nj2 is subtracted from the current probability of Ni.
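The additive rule in this example can be sketched in a few lines. The weights, the node names, and the function name `current_leave_probability` are hypothetical; only the rule itself (self-connecting edge weight plus the weights of edges to users who have left) comes from the description above.

```python
def current_leave_probability(node, weights, departed):
    """Current probability that `node` leaves, using the additive rule.

    weights[node][other] is the weight of the edge node -> other;
    weights[node][node] is the self-connecting edge.
    departed is the set of users who have already left the network.
    """
    edges = weights[node]
    own_reasons = edges.get(node, 0.0)
    # Add the edge weight for every connected user who has left.
    influenced = sum(w for other, w in edges.items()
                     if other != node and other in departed)
    return own_reasons + influenced


# Hypothetical weights for Ni; they sum to 1 across outgoing edges.
weights = {"Ni": {"Ni": 0.2, "Nj1": 0.3, "Nj2": 0.5}}

p_all_stay = current_leave_probability("Ni", weights, departed=set())
p_nj2_left = current_leave_probability("Ni", weights, departed={"Nj2"})
```

Because the probability is recomputed from the current set of departed users, a user returning to the network is handled by removing them from `departed`, which has the same effect as subtracting the edge weight.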

A more sophisticated approach to calculating the probability of each user leaving is to take into account the probabilities of the users (out vertices) to which they are connected. When calculating the probability for a user, instead of just adding the weights of the edges, one takes the sum of the products of each edge weight with the probability of its associated out vertex. One starts with any vertex and repeats this procedure until the solution for the whole graph stabilizes.

This gives a better prediction, but it is computationally intensive and the procedure needs to converge. Under suitable conditions on the weights, a little linear algebra shows that there is a unique solution for the entire graph.
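One way to picture the iterative procedure is as a fixed-point iteration. The sketch below rests on assumptions that are not part of the model as described: it separates each user's own-reasons term (`base`) from the influence weights on edges to other users (`influence`), it updates all users at once rather than one vertex at a time, and it assumes each user's influence weights sum to less than 1 so that the iteration is guaranteed to converge. All names and numbers are hypothetical.

```python
def iterate_churn(base, influence, tol=1e-9, max_iter=1000):
    """Iterate p[i] = base[i] + sum_j influence[i][j] * p[j] until stable.

    base[i]: own-reasons term for user i (hypothetical values).
    influence[i][j]: weight of the edge i -> j, excluding the self edge.
    Assumes each row of `influence` sums to less than 1 (a contraction).
    """
    p = dict(base)  # start from the own-reasons term
    for _ in range(max_iter):
        new_p = {
            i: base[i] + sum(w * p[j] for j, w in influence.get(i, {}).items())
            for i in base
        }
        change = max(abs(new_p[i] - p[i]) for i in base)
        p = new_p
        if change < tol:
            return p
    raise RuntimeError("iteration did not converge")


# Hypothetical three-user graph.
base = {"N1": 0.10, "N2": 0.20, "N3": 0.05}
influence = {
    "N1": {"N2": 0.3, "N3": 0.2},
    "N2": {"N1": 0.4},
    "N3": {"N1": 0.5, "N2": 0.1},
}
probabilities = iterate_churn(base, influence)
```

Under the stated assumption the update shrinks differences between successive estimates, so the loop stops at the same values regardless of the starting point; without it, convergence has to be checked case by case.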

Machine Learning:

As mentioned above, the probability associated with an edge depends on how much the out vertex influences the in vertex, which in turn is based on the correlation of the features associated with the users that the vertices represent. There may be seasonal trends in these correlations, so time series analysis may be needed to track how the correlations change and to update the probabilities associated with the edges accordingly.

Based on the above prediction model, the prediction results are stored in HDFS/HBase. Apart from calculating customer churn from the CDR data, we represent the entire CDR data as a graph in the Titan graph database, where it can be explored with visualization tools.


Benefits of the Proposed Solution:

  • Faster execution of computations by using Spark framework



We are developing a solution based on a prediction model that we hope will give the best results in predicting customer churn in advance. The implementation is done in the Apache Spark framework for faster computation. The customer churn results calculated from the available CDR data are stored in highly available HDFS. The raw CDR data, available as a graph in the Titan graph database, can be explored with a visualization tool.

Copyright © 2016. airis.DATA. All Rights Reserved.
