Here are some of our recent machine learning papers.
Machine Learning and Spark by Kristina Rogale Plazonic
On Demand Risk Management by Ravi Kore
Graph Analysis of CDR (Coming Soon)
Twitter Sentiment Analysis in Apache Spark (Coming Soon)
Reference Architecture
Reference Applications
Data Pipeline Acceleration for VaR / On-Demand Risk Management
We chose Apache Spark as the distributed computation framework for the VaR (Value at Risk) solution, running on top of the Hadoop file system (HDFS). We are developing the application in Scala to make the most of multi-core CPUs. Reference data for the current date is stored as a cached RDD so that it can be reused across multiple runs, which may be launched by different users. To share it, we use the in-memory file system from Apache Ignite and store the data as a shared RDD. We also cache the scenario files generated for previous runs, and we store any scenario files generated for other dates on HDFS in Parquet format. Position data will be broadcast to all nodes. By doing this, we aim to reduce the network I/O for lookups and joins on the position data.
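The effect of broadcasting position data can be illustrated outside Spark. The following is a minimal plain-Python sketch, not the Scala/Spark implementation described above: each partition of scenario rows is joined against its own local copy of the positions, so only small per-partition results need to be combined. The instrument ids, quantities, scenario rows, and the simple empirical-quantile VaR are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

# (scenario id, instrument id, simulated price change); hypothetical layout
ScenarioRow = Tuple[int, str, float]

# Hypothetical position data: instrument id -> quantity held.
POSITIONS: Dict[str, float] = {"AAA": 100.0, "BBB": -50.0}

# Hypothetical scenario rows, already split into two partitions.
PARTITIONS: List[List[ScenarioRow]] = [
    [(0, "AAA", -1.0), (0, "BBB", 2.0), (1, "AAA", 0.5)],
    [(1, "BBB", -1.0), (2, "AAA", -2.0), (2, "BBB", 0.0)],
]


def partial_pnl(partition: List[ScenarioRow],
                positions: Dict[str, float]) -> Dict[int, float]:
    """Join one partition against its local copy of the positions."""
    pnl: Dict[int, float] = {}
    for scenario_id, instrument, price_change in partition:
        quantity = positions.get(instrument)
        if quantity is None:
            continue  # no position held in this instrument
        pnl[scenario_id] = pnl.get(scenario_id, 0.0) + quantity * price_change
    return pnl


def merge_pnl(partials: List[Dict[int, float]]) -> Dict[int, float]:
    """Sum the per-partition results into one P&L per scenario."""
    total: Dict[int, float] = {}
    for partial in partials:
        for scenario_id, value in partial.items():
            total[scenario_id] = total.get(scenario_id, 0.0) + value
    return total


def value_at_risk(pnl_by_scenario: Dict[int, float], confidence: float) -> float:
    """Loss at the given confidence level, using a simple empirical quantile."""
    losses = sorted(-value for value in pnl_by_scenario.values())
    index = min(len(losses) - 1, int(confidence * len(losses)))
    return losses[index]


pnl = merge_pnl([partial_pnl(p, POSITIONS) for p in PARTITIONS])
var_95 = value_at_risk(pnl, 0.95)
```

In Spark, the local copy would typically be a broadcast variable read inside a map over the scenario RDD, which avoids shuffling the larger dataset for the join.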
Recommendation Engine / Click-Stream Data and Predictions for Advertising
Our application predicts the probability that a user will click a specific advertisement. We work with a large click-stream dataset provided by Kaggle, which contains a large volume of user, ad, and click records. We need to prepare a feature set from the raw data and then train the model on it.
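As a rough illustration of the two stages mentioned above (feature preparation, then training), here is a minimal plain-Python sketch. It assumes a hypothetical event layout with only `user` and `ad` fields, encodes them with the hashing trick, and fits a logistic regression by stochastic gradient descent. The technique, field names, and toy data are assumptions made for illustration, not the method or the Kaggle schema used in the application.

```python
import math
import zlib
from collections import defaultdict
from typing import Dict, List

NUM_BUCKETS = 2 ** 20  # size of the hashed feature space (assumed)


def hash_features(event: Dict[str, str]) -> List[int]:
    """Map raw categorical fields to sparse feature indices (hashing trick)."""
    indices = {
        zlib.crc32(f"{name}={value}".encode("utf-8")) % NUM_BUCKETS
        for name, value in event.items()
    }
    return sorted(indices)


def predict(weights: Dict[int, float], indices: List[int]) -> float:
    """Click probability under a logistic model with binary features."""
    score = sum(weights[i] for i in indices)
    return 1.0 / (1.0 + math.exp(-score))


def train(events: List[Dict[str, str]], labels: List[int],
          epochs: int = 50, learning_rate: float = 0.1) -> Dict[int, float]:
    """Fit the weights with plain stochastic gradient descent on log loss."""
    weights: Dict[int, float] = defaultdict(float)
    for _ in range(epochs):
        for event, label in zip(events, labels):
            indices = hash_features(event)
            error = predict(weights, indices) - label
            for i in indices:
                weights[i] -= learning_rate * error
    return weights


# Toy data: in this made-up sample, ad "a1" is always clicked and "a2" never is.
events = [
    {"user": "u1", "ad": "a1"},
    {"user": "u1", "ad": "a2"},
    {"user": "u2", "ad": "a1"},
    {"user": "u2", "ad": "a2"},
]
labels = [1, 0, 1, 0]

weights = train(events, labels)
p_clicked = predict(weights, hash_features({"user": "u1", "ad": "a1"}))
p_ignored = predict(weights, hash_features({"user": "u1", "ad": "a2"}))
```

Hashing keeps the feature space at a fixed size regardless of how many distinct users and ads appear, which is one common way to handle high-cardinality click-stream fields.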
Utilizing Call Data Records (CDR) to Build Graphs
1. Ingesting the CDR data into CDAP using streams: I obtained sample CDR data containing call details such as source number, destination number, and call duration. I created streams in CDAP and pull the data from HDFS into those streams.
2. Creating graphs with the available data in CDAP: Using the data from the streams, I created graphs in Titan. Vertices are mobile numbers, with additional properties such as details about the caller. Edges carry properties such as call duration and whether the call was incoming, outgoing, or missed, and each edge has a direction relative to its vertices (in or out). I created a Spark workflow in CDAP to perform this.
3. Visualization of the graphs: To view the graph in a user-friendly manner, I explored the Rexster server REST API, which provides visualization of graphs created in Titan. I am now able to view the graphs in the UI and explore the graph details.
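The graph model described in the steps above can be sketched in plain Python, independent of CDAP and Titan. The CSV layout, the column names, and the `status` values below are hypothetical; the sketch only shows how call records become vertices (mobile numbers) and directed edges that carry call properties.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical CDR sample; real CDR files usually carry many more fields.
SAMPLE_CDR = """source,destination,duration_seconds,status
5550001,5550002,120,completed
5550001,5550003,0,missed
5550002,5550001,45,completed
"""

Edge = Dict[str, object]


def build_graph(cdr_text: str) -> Tuple[Set[str], Dict[str, List[Edge]], Dict[str, List[Edge]]]:
    """Build a directed property graph from CDR rows.

    Returns the vertex set plus outgoing and incoming edge lists per vertex.
    """
    vertices: Set[str] = set()
    out_edges: Dict[str, List[Edge]] = defaultdict(list)
    in_edges: Dict[str, List[Edge]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(cdr_text)):
        edge: Edge = {
            "source": row["source"],
            "destination": row["destination"],
            "duration_seconds": int(row["duration_seconds"]),
            "status": row["status"],
        }
        vertices.update((row["source"], row["destination"]))
        out_edges[row["source"]].append(edge)      # "out" direction for the caller
        in_edges[row["destination"]].append(edge)  # "in" direction for the callee
    return vertices, out_edges, in_edges


vertices, out_edges, in_edges = build_graph(SAMPLE_CDR)

# Example query: total seconds of calls placed by one number.
total_outgoing = sum(e["duration_seconds"] for e in out_edges["5550001"])
missed_calls = [e for e in out_edges["5550001"] if e["status"] == "missed"]
```

Keeping separate outgoing and incoming edge lists mirrors the in/out direction on each vertex, so queries such as "who called this number" do not require scanning every edge.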