Organon, our lab, is where we experiment and innovate: an environment where our architects, scientists, and developers engineer creative solutions to real problems, and a sandbox for solution incubation, benchmarking, and partner product integration and evaluation.
We use large public data sets for our reference apps. Our scientists take part in Kaggle competitions to build best-of-breed solutions with the latest data science tools and libraries, such as XGBoost.
We are constantly enhancing our lab with the latest open source projects and servers. A dedicated task force works on improving performance and scaling in our cluster so we can provide you with the in-depth knowledge you need. We participate in the Spark mailing lists to learn and to help with issues, and we have found and reported several bugs that have since been resolved.
Our lab also enables us to host workshops and meetups with access to real Hadoop, NoSQL, and Apache Spark clusters. We develop complex reference applications and fully functional machine learning demos on the clusters in our state-of-the-art lab environment.
Our reference architecture utilizes the latest proven technologies in machine learning, big data, and graph analytics. To fit your environment, we support more than one best-of-breed option wherever possible. For ingestion routing we can support Kafka or Amazon Kinesis, depending on your environment. We can work with files in HDFS or S3 in all the major formats, including ORC, Avro, Parquet, Sequence, and CSV, as well as HBase data. We support Apache Ignite for in-memory files and multiple machine learning libraries. For graph databases, we have worked with Titan and Neo4j. We work with a number of key-value stores, time series databases, and NewSQL and NoSQL databases. We run Spark solutions on Apache Mesos, on EC2, in the cloud, on Databricks, in standalone mode, and on Apache YARN from Cloudera or Hortonworks.
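As a minimal sketch of how several of the file formats above are read through one Spark API, the following assumes hypothetical paths and a Spark session; the Avro reader additionally requires the spark-avro package on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: one SparkSession reading several of the formats listed above.
// All paths are illustrative placeholders, not real datasets.
object FormatReaders {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("format-readers")
      .getOrCreate()

    val parquetDf = spark.read.parquet("hdfs:///data/events.parquet")
    val orcDf     = spark.read.orc("hdfs:///data/events.orc")
    // Requires the spark-avro package to be available.
    val avroDf    = spark.read.format("avro").load("s3a://bucket/events.avro")
    val csvDf     = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("s3a://bucket/events.csv")

    // The same DataFrame API applies regardless of the source format.
    parquetDf.printSchema()
    spark.stop()
  }
}
```

Once loaded, all of these sources expose the same DataFrame operations, which is what makes supporting multiple formats practical.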
Data Pipeline Acceleration for VaR / On-Demand Risk Management:
Apache Spark was chosen as the distributed computation framework to implement the VaR solution over the Hadoop file system. We are developing the application in Scala to get maximum utilization of multi-core CPUs. We store reference data for the current date as a cached RDD so that it can be reused across multiple runs launched by different users. We use the in-memory file system from Apache Ignite and store the data as a shared RDD. We also cache the scenario files generated for previous runs, and we store scenario files generated for other dates on HDFS in Parquet format. We broadcast position data to all the nodes, which reduces network I/O for lookups and joins on the position data.
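The caching and broadcast pattern described above can be sketched as follows. The paths, record layouts, and the toy P&L calculation are illustrative assumptions, not the actual application; scenario files use plain text here to keep the sketch short, whereas the real pipeline stores them in Parquet.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of caching reference data and broadcasting position data.
object VarRunSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("var-run"))

    // Reference data for the current date, cached once and reused by
    // multiple runs launched by different users.
    val referenceData = sc.textFile("hdfs:///var/refdata/current").cache()

    // Position data is small enough to broadcast to every node, avoiding
    // network I/O for lookups and joins.
    val positions: Map[String, Double] = sc
      .textFile("hdfs:///var/positions/current")
      .map { line =>
        val Array(instrument, qty) = line.split(',')
        instrument -> qty.toDouble
      }
      .collectAsMap()
      .toMap
    val positionsBc = sc.broadcast(positions)

    // Scenario shocks for the current run.
    val scenarios = sc.textFile("hdfs:///var/scenarios/current")

    // Map-side lookup against the broadcast positions: no shuffle needed.
    val pnl = scenarios.map { line =>
      val Array(instrument, shock) = line.split(',')
      val qty = positionsBc.value.getOrElse(instrument, 0.0)
      instrument -> qty * shock.toDouble
    }

    pnl.saveAsTextFile("hdfs:///var/output/current")
    sc.stop()
  }
}
```

Broadcasting turns what would be a shuffle-heavy join into a local hash-map lookup on each executor, which is the I/O reduction the paragraph above refers to.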
Recommendation Engine / Click-Stream Data and Predictions for Advertising
Our application predicts the probability that a user will click a specific advertisement. We work with a large click-stream dataset provided by Kaggle, which includes a huge amount of user, ad, and click data collected from clickstreams. We prepare a feature set from the raw data and then use it to train the model.
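One common way to turn raw categorical click-stream fields into a trainable feature set is feature hashing followed by logistic regression. The sketch below assumes a simple comma-separated layout (label first, then categorical fields), which is an assumption for illustration, not the actual Kaggle schema.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Sketch: hash categorical click-stream fields into a fixed-size sparse
// vector and train a logistic regression CTR model with Spark MLlib.
object CtrTrainingSketch {
  val NumFeatures = 1 << 18 // size of the hashed feature space

  def hashFeatures(fields: Array[String]): Vector = {
    // Hash "position=value" so the same value in different columns
    // lands in different buckets; indices must be sorted and distinct.
    val indices = fields.zipWithIndex.map { case (value, i) =>
      math.abs((i + "=" + value).hashCode) % NumFeatures
    }.distinct.sorted
    Vectors.sparse(NumFeatures, indices, Array.fill(indices.length)(1.0))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ctr-training"))

    val data = sc.textFile("hdfs:///ctr/train.csv").map { line =>
      val cols = line.split(',')
      LabeledPoint(cols.head.toDouble, hashFeatures(cols.tail))
    }.cache()

    val model = new LogisticRegressionWithLBFGS().run(data)
    model.clearThreshold() // emit raw click probabilities instead of 0/1
    sc.stop()
  }
}
```

Feature hashing keeps the vector dimension fixed regardless of how many distinct user, ad, and device values appear, which matters when the raw clickstream is very large.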
Utilizing Call Data Records (CDR) to Build Graphs
Our application ingests the CDR data into CDAP using streams. The CDR data contains call details such as the source number, destination number, and call duration. We load the data from CDR delimited files stored in HDFS.
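A minimal sketch of loading delimited CDR records from HDFS and building a call graph with GraphX follows. The field order (source number, destination number, call duration) comes from the description above; the comma delimiter, paths, and the sample aggregation are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Sketch: build a call graph from delimited CDR files on HDFS.
object CdrGraphSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cdr-graph"))

    val records = sc.textFile("hdfs:///cdr/input")

    // One edge per call: caller -> callee, weighted by call duration.
    val edges = records.map { line =>
      val Array(src, dst, duration) = line.split(',')
      Edge(src.toLong, dst.toLong, duration.toLong)
    }

    val callGraph = Graph.fromEdges(edges, defaultValue = 0L)

    // Example query: total inbound call time per subscriber.
    val inboundSeconds = callGraph.aggregateMessages[Long](
      ctx => ctx.sendToDst(ctx.attr),
      _ + _
    )
    inboundSeconds.saveAsTextFile("hdfs:///cdr/inbound-seconds")
    sc.stop()
  }
}
```

Representing calls as weighted edges makes graph queries such as per-subscriber call volume or community detection straightforward once the CDR data has been ingested.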