Here are some of our recent machine learning papers.


Machine Learning and Spark by Kristina Rogale Plazonic

On Demand Risk Management by Ravi Kore

What is Machine Learning?

Graph Analysis of CDR (Coming Soon)

Twitter Sentiment Analysis in Apache Spark (Coming Soon)


Reference Architecture


Reference Applications

Data Pipeline Acceleration for VaR / On-Demand Risk Management

Apache Spark was chosen as the distributed computation framework for implementing the VaR (Value at Risk) solution on top of the Hadoop file system (HDFS). We are developing the application in Scala to make full use of multi-core CPUs. Reference data for the current date is stored as a cached RDD so that it can be reused across multiple runs, which may be launched by different users. For this we use the in-memory file system from Apache Ignite and store the reference data as a shared RDD. We also cache the scenario files generated for previous runs, and we store scenario files generated for other dates on HDFS in Parquet format. Position data will be broadcast to all nodes, with the aim of reducing network I/O for lookups and joins on the position data.
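The broadcast idea can be illustrated without a cluster. The following is a minimal plain-Python sketch, not the application's Scala code: the positions dictionary stands in for a Spark broadcast variable, the partitions list stands in for scenario data spread across worker nodes, and historical-simulation VaR with a simple integer tail index is an assumed methodology. All names and numbers are hypothetical.

```python
from typing import Dict, List

# Hypothetical position data: instrument -> quantity held.
# In Spark, a small table like this would be shipped once per node
# as a broadcast variable instead of being joined over the network.
positions: Dict[str, float] = {"AAA": 100.0, "BBB": -50.0}

# Hypothetical scenarios: each maps instrument -> price change.
# The outer list mimics the partitions of a distributed data set.
partitions: List[List[Dict[str, float]]] = [
    [{"AAA": -1.0, "BBB": 0.5}, {"AAA": 0.25, "BBB": -0.5}],
    [{"AAA": -2.0, "BBB": -1.0}, {"AAA": 1.0, "BBB": 0.5}],
    [{"AAA": 0.0, "BBB": 2.0}],
]


def partition_pnl(scenarios, broadcast_positions):
    """Value every scenario in one partition using only a local lookup."""
    return [
        sum(broadcast_positions[inst] * move for inst, move in scenario.items())
        for scenario in scenarios
    ]


def value_at_risk(pnl, confidence_pct):
    """Historical-simulation VaR (one simple convention): the loss at the
    (100 - confidence_pct) percent tail of the sorted P&L values."""
    ordered = sorted(pnl)  # worst outcome first
    index = (100 - confidence_pct) * len(ordered) // 100
    return -ordered[index]


# "Map" step per partition, then gather all results in one place.
all_pnl = [p for part in partitions for p in partition_pnl(part, positions)]
var_80 = value_at_risk(all_pnl, confidence_pct=80)
```

Because every partition reads positions locally, no shuffle or join is needed to price a scenario, which is the network saving the broadcast is meant to achieve.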


Recommendation Engine / Click-Stream Data and Predictions for Advertising

Our application predicts the probability that a user will click a specific advertisement. We work with a large click-stream dataset provided by Kaggle, which contains a large amount of user, ad, and click data collected from click streams. We need to prepare a feature set from the raw data and then train the system on it.
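The two steps named here, feature preparation and training, can be sketched in a few lines. The example below is only an illustration under stated assumptions: the source does not say which model is used, so logistic regression trained by stochastic gradient descent is an assumed choice, the one-hot user and ad features are a hypothetical feature set, and the handful of events is made-up data rather than the Kaggle dataset.

```python
import math
from collections import defaultdict

# Hypothetical raw click-stream rows: (user_id, ad_id, clicked).
raw_events = [
    ("u1", "ad_a", 1), ("u1", "ad_b", 0), ("u2", "ad_a", 1),
    ("u2", "ad_b", 0), ("u3", "ad_a", 1), ("u3", "ad_b", 1),
    ("u1", "ad_a", 1), ("u2", "ad_b", 0), ("u3", "ad_a", 0),
]


def to_features(user_id, ad_id):
    """Feature preparation: one-hot encode user and ad as sparse feature names."""
    return ["bias", "user=" + user_id, "ad=" + ad_id]


def predict(weights, features):
    """Logistic model: sigmoid of the summed weights of the active features."""
    score = sum(weights[f] for f in features)
    return 1.0 / (1.0 + math.exp(-score))


def train(events, epochs=200, learning_rate=0.1):
    """Fit the weights with plain stochastic gradient descent on log loss."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for user_id, ad_id, clicked in events:
            features = to_features(user_id, ad_id)
            error = predict(weights, features) - clicked
            for f in features:
                # Each active feature has value 1, so the gradient is the error.
                weights[f] -= learning_rate * error
    return weights


weights = train(raw_events)
p_ad_a = predict(weights, to_features("u1", "ad_a"))  # click probability for ad_a
p_ad_b = predict(weights, to_features("u1", "ad_b"))  # click probability for ad_b
```

In this toy data ad_a is clicked far more often than ad_b, so the trained model assigns it the higher click probability for the same user.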

Utilizing Call Data Records (CDR) to Build Graphs 

1. Ingesting the CDR data into CDAP using streams – I obtained sample CDR data containing call details such as source number, destination number, and call duration. I created streams in CDAP and am pulling the data from HDFS into the streams.

2. Creating graphs from the available data in CDAP – With the data from the streams, I created graphs in Titan with mobile numbers as vertices (with additional properties such as details about the caller) and calls as edges (with properties such as call duration and whether the call was incoming, outgoing, or missed), each edge having a direction (in or out) relative to its vertices. I created a Spark workflow in CDAP to perform this.

3. Visualizing the graphs – To view the graph in a user-friendly manner, I explored the Rexster server REST API, which visualizes graphs created in Titan. I am now able to view the graphs in a UI and explore the graph details.
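The graph model described in step 2 can be sketched with plain data structures. This is a hypothetical illustration only: it does not use the Titan, CDAP, or Rexster APIs, the dictionaries merely stand in for Titan vertices and edges, and the record layout and phone numbers are invented for the example.

```python
from collections import defaultdict

# Hypothetical CDR rows: (source number, destination number,
# call duration in seconds, call status).
cdr_records = [
    ("555-0101", "555-0102", 120, "answered"),
    ("555-0102", "555-0103", 0, "missed"),
    ("555-0101", "555-0103", 45, "answered"),
]

vertices = {}                  # phone number -> vertex properties
out_edges = defaultdict(list)  # caller -> outgoing call edges
in_edges = defaultdict(list)   # callee -> incoming call edges

for source, destination, duration, status in cdr_records:
    # One vertex per mobile number; further caller details could be
    # added to the property dictionary.
    vertices.setdefault(source, {"number": source})
    vertices.setdefault(destination, {"number": destination})
    # One directed edge per call, carrying the call properties.
    edge = {"from": source, "to": destination,
            "duration": duration, "status": status}
    out_edges[source].append(edge)
    in_edges[destination].append(edge)


def total_outgoing_duration(number):
    """Sum the duration of all calls placed by one number."""
    return sum(edge["duration"] for edge in out_edges[number])
```

Keeping separate outgoing and incoming edge lists mirrors the in and out directions mentioned above, so a query such as total_outgoing_duration("555-0101") only has to walk one adjacency list.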

5 Independence Way   Princeton, NJ 08540   609.281.5030
Copyright © 2016. airis.DATA. All Rights Reserved.

Parquet, Avro, Kafka, Apache Hadoop, Apache Spark and the Apache Spark Logo are trademarks of the Apache Software Foundation.