Machine Learning with Apache Spark

  • Spark scales to hundreds of millions or billions of training examples
    • In general, “more training data beats a fancier algorithm”
  • Many third-party ML libraries run on Spark, for example H2O and Deeplearning4j
  • Spark has borrowed the best features of many different systems
    • DataFrames from R
    • Pipelines from scikit-learn
    • Spark Packages, modeled on the package repositories of R, Perl, and others
    • Java/Scala for enterprise use
    • An interactive shell, as in R and Python, but in Scala (plus notebooks)
  • Spark also unifies several other areas useful in data science:
    • You can plug in many different data sources
    • You can use SQL in Spark SQL for data extraction
    • You can do graph analysis

One area of machine learning we have worked on is record matching.

We augment rules engines with ensembles of decision trees, such as
random forests or gradient boosted machines.

  • Automatic rule discovery.
  • Generalizes better to previously unseen examples.
  • You can reuse existing rules (as features).
  • Use confidence scores to prioritize examples for human review.
  • Active learning
    • Humans and the machine work together.
    • Humans are presented only with ambiguous and informative examples.
    • Predictions are recomputed as humans continue labeling.
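
The ideas in the list above (rules reused as features, ensemble confidence scores, and sending ambiguous examples to humans) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not our production code and not Spark API calls: the three rules, the record fields, and the four hand-written "trees" are all hypothetical, and a real random forest would learn its trees from labeled pairs rather than have them written by hand.

```python
# Hypothetical existing rules: each compares two records and returns True or False.
def same_email(a, b):
    return a["email"].lower() == b["email"].lower()

def same_last_name(a, b):
    return a["last"].lower() == b["last"].lower()

def same_zip(a, b):
    return a["zip"] == b["zip"]

RULES = [same_email, same_last_name, same_zip]

def rule_features(a, b):
    """Reuse the rules as features: one 0/1 value per rule."""
    return [int(rule(a, b)) for rule in RULES]

# Hand-written stand-ins for learned decision trees (assumption for illustration).
# Each takes the feature vector [email, last_name, zip] and votes 1 (match) or 0.
TREES = [
    lambda f: 1 if f[0] else 0,
    lambda f: 1 if f[1] and f[2] else 0,
    lambda f: 1 if f[0] or (f[1] and f[2]) else 0,
    lambda f: 1 if f[1] else 0,
]

def match_confidence(features):
    """Fraction of trees voting 'match', used as the confidence score."""
    votes = [tree(features) for tree in TREES]
    return sum(votes) / len(votes)

def review_queue(pairs, budget):
    """Return the `budget` pairs whose confidence is closest to 0.5 (most ambiguous)."""
    scored = [(pair_id, match_confidence(rule_features(a, b))) for pair_id, a, b in pairs]
    scored.sort(key=lambda item: abs(item[1] - 0.5))
    return scored[:budget]

# Made-up candidate pairs.
r1 = {"email": "ann@example.com", "last": "Lee", "zip": "08540"}
r2 = {"email": "ANN@example.com", "last": "lee", "zip": "08540"}
r3 = {"email": "ann@example.com", "last": "Smith", "zip": "10001"}
r4 = {"email": "bob@example.com", "last": "Jones", "zip": "94105"}

pairs = [("p1", r1, r2), ("p2", r1, r3), ("p3", r1, r4)]
queue = review_queue(pairs, budget=1)
print(queue)
```

In this toy data, the pair that agrees only on email splits the votes evenly, so it is the one a human would be asked to label first. In an active-learning loop, the new label would be added to the training set and the scores recomputed.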


Data acquisition is the process of sampling signals that measure real world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer.

A training set is a labeled dataset used to train machine learning models.

Model training is the process of using a training set and a machine learning model to identify the values of the model's parameters.
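
As a minimal sketch of what "identifying the values of the parameters" means, the example below fits a straight line y = w·x + b to a tiny labeled dataset using the closed-form least-squares solution. The data and the function name `fit_line` are made up for illustration; models trained on Spark are usually far larger, but the principle is the same: the training set determines the parameter values.

```python
# Made-up training set: inputs xs with labels ys, generated by y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def fit_line(xs, ys):
    """Least-squares estimate of slope w and intercept b for y = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w = covariance / variance
    b = mean_y - w * mean_x
    return w, b

w, b = fit_line(xs, ys)
print(w, b)
```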

Model testing uses the trained model to predict outcomes on a new, labeled dataset that was not used for training. The true labels are compared with the model's predictions to measure the quality of the model.
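
The comparison of true and predicted labels can be sketched as follows. The two label lists are invented for illustration, and accuracy, precision, and recall are just three common choices of quality measure; which one matters most depends on the application.

```python
# Made-up test set: true labels and the labels a trained model predicted (1 = match).
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

def evaluate(actual, predicted):
    """Compare true and predicted binary labels and return common quality measures."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

metrics = evaluate(actual, predicted)
print(metrics)
```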



Copyright © 2016. airis.DATA. All Rights Reserved.
