Scala + Spark Programming Workshop

Tuesday, Mar 22, 2016, 7:00 PM

NJ Hadoop – Big Data Meetup
3525 Quakerbridge Road, Suite 903, Hamilton, NJ 08619

Speakers:
Rajiv Singla, Sr. Data Engineer, airisDATA
Timothy Spann, Sr. Solution Architect, airisDATA

Agenda – Recurring sessions
The following will be covered over many sessions for folks to better learn Scala and Spark.

Scala
1. Overview
2. Basics
3. Functions
4. Collections
5. Case Classes and Pattern Matching
6. Type Parameterization

Spark…

Spark History Server:   http://localhost:18080/

 

 

 

by Srinivasarao Daruna

A glimpse of what we are trying to achieve in the features job (statistics from one of the difficult steps, which took the most time to complete):

Some interesting statistics:

  • A left outer join involving 13+ billion rows.
  • 13,111,464,843 rows (13+ billion) with 38 columns on the left-hand side.
  • The data size comes to 597 GB.
  • The spilled data comes to 3.9 TB.

GC Configuration Options:

The winner of the garbage-collection comparison is Garbage First (G1). The final configuration we worked out is shown below; note that both JVM flags must be passed in a single property value, since a second assignment to spark.executor.extraJavaOptions would overwrite the first:

spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=400

We experimented with many GC pause-time targets, and 400 milliseconds proved the best choice for this use case.
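As a minimal sketch, the same options can also be set programmatically when the context is built (the application name here is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// Both GC flags go into one property value, mirroring the configuration above.
val conf = new SparkConf()
  .setAppName("features-job") // hypothetical name
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:MaxGCPauseMillis=400")
val sc = new SparkContext(conf)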

 

Garbage Collection Framework Suggestion:

Smaller data – go for Parallel GC (the default used in Spark).

Huge data with heavy memory management – go for G1, using the configuration mentioned above.

 

Physical Plan for the SQL Join:

What is an in-memory tabular scan?

A Tungsten in-memory tabular scan pushes the predicates and column projections down to Parquet.
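One way to see the pushdown (a sketch with a hypothetical path and column names) is to print the physical plan and look for the pushed filters on the Parquet scan:

// Requires an existing SparkContext `sc`; path and columns are assumptions.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val events = sqlContext.read.parquet("/tmp/events.parquet")
// The filter and the column selection appear as pushed filters and pruned
// columns on the Parquet scan in the output of explain().
events.filter(events("ts") > 100).select("event").explain()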

What is a Tungsten Exchange?

A shuffle in Tungsten.

Why does it sort? Without the sort, the join would not know where each key resides; with sorted inputs it can make intelligent decisions based on the sort order instead of searching everywhere.

In a normal flow, a stage boundary occurs when an RDD shuffles; in the Tungsten project, a stage boundary occurs at a Tungsten Exchange.

Two types of joins are helpful here: broadcast hash join and sort-merge join. A broadcast hash join is only useful when one side is small enough to fit in memory, ideally under 20–50 MB.
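As a hedged sketch (largeDF, smallDF, and the join key are hypothetical names), the broadcast variant can be requested explicitly in Spark 1.6, and explain() confirms which join was chosen:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// Hint Spark to broadcast the small side; the plan should then show
// BroadcastHashJoin instead of SortMergeJoin.
def joinWithHint(largeDF: DataFrame, smallDF: DataFrame): DataFrame =
  largeDF.join(broadcast(smallDF), Seq("key"), "left_outer")

// joinWithHint(big, small).explain()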

If anyone sees a Cartesian (cross) product at the final step of the plan, it is going to be problematic.

Physical Plan for the step:

[Figure: clickstream job DAG – physical plan for the step]

by the airisDATA Team

At our recent Scala + Spark SQL Workshop we gave introductory sessions on Scala and Apache Spark.   A number of questions were brought up; I have summarized many of the answers here along with some additional resources.   The Scala presentation is here.  Additional materials from the meetup will be posted soon.   Check out both the NJ Hadoop / Big Data Meetup and the NJ Data Science Spark Meetup for more great workshops and talks.

Github Resources

Setup Resources

Download Scala 2.10.x

Download Apache Spark 1.6 (prebuilt with Hadoop)

Scala Install

Setup

Make sure you have the Java 7 or Java 8 JDK installed for your platform (Mac, Linux, or Windows).   Then you’ll need to download Scala 2.10.x, SBT, and then Spark.   I also recommend downloading Maven. See our article on DZone about setting up a developer machine.

Solving Local Spark Issues

export SPARK_MASTER_IP=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1
export SCALA_HOME=~/Downloads/scala-2.10.6
export PATH=$PATH:$SCALA_HOME/bin

For Windows, use SET instead of export, and use ; rather than : as the PATH separator.
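For example (a sketch of the equivalent Windows commands, assuming the same download location):

SET SPARK_MASTER_IP=127.0.0.1
SET SPARK_LOCAL_IP=127.0.0.1
SET SCALA_HOME=%USERPROFILE%\Downloads\scala-2.10.6
SET PATH=%PATH%;%SCALA_HOME%\bin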

Training Resources

Books

Scala by Example by Odersky – most of the material comes from, or is similar to, what is covered in both of his Coursera classes.

Scala Overview by Odersky et al.

Programming in Scala, First Edition by Odersky.

Structure and Interpretation of Computer Programs – a classic computer science text that teaches advanced programming concepts using Lisp; it is the basis of Martin Odersky’s Coursera class and was formerly the MIT standard text on programming.

Scala for the Impatient

Programming Scala

A big list of Scala books linked at Scala-Lang

Learning Spark

Advanced Analytics with Spark

Spark Cookbook

Tutorials
Scala for Java Programmers
Scala Tutorial
Effective Scala (Twitter)
Scala Tour

E-Books
Books at Lightbend (Typesafe)
AtomicScala (sample)

Scala Koans/Exercises
Scala Exercises
Scala Koans

Resources
Scala Roundup for Java Engineers
Scala Info at StackOverflow
Scala Cheatsheets
Scala Notes
Cake Solutions Blog
Scala School (Twitter)
Functional Programming in Scala
How to Learn Scala
Scala Lang Overviews
Learning Scala in Small Bites

Online Free Courses – Scala
Functional Programming with Scala
Reactive Programming with Scala

Online Free Courses – Spark
Big Data Analysis with Spark
Distributed Machine Learning with Spark
Introduction to Spark
Spark Fundamentals
Data Science / Engineering Spark
CS100
CS190

We start the discussion with a brief overview of Java collections vs. Scala collections.

Advantages of using Scala Collections:

1. Immutable collections are provided by the Scala library by default – in Java you need the Google Guava library, with more verbose syntax

2. A more consistent model across the Collection and Map interfaces – both are iterable, unlike Java, where the Map interface does not implement Iterable

3. Difference between Traversable and Iterable interfaces

4. Parallel Collections

5. Scala’s rich collection API functions – map, flatMap, foldLeft, partition, zip, groupBy, and so on – which not only reduce code but are also very expressive (see the sketch below)
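A quick sketch of these functions on an ordinary Scala list (all values hypothetical):

val words = List("spark", "scala", "kafka", "parquet")

words.map(_.length)                 // List(5, 5, 5, 7)
words.flatMap(_.toList)             // flattens the words into characters
words.foldLeft(0)(_ + _.length)     // 22, total characters across all words
words.partition(_.length <= 5)      // (List(spark, scala, kafka), List(parquet))
words.zip(Stream.from(1))           // pairs each word with a 1-based index
words.groupBy(_.head)               // Map('s' -> ..., 'k' -> ..., 'p' -> ...)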

We run public workshops as well as private ones for customers.   These can range from introduction and executive-briefing-level courses to in-depth, week-long deep-dive courses on Scala, Deep Learning, Machine Learning, Apache Spark, Graph Analytics, and other topics in Data Science.  We have experienced presenters, including professors with doctorates.   At airis.DATA we are strong believers in giving back to the community and sharing our knowledge.

We have been organizing Meetups since 2010.


We have a lot of exciting hands-on workshops and presentations coming to our two Meetups hosted at Princeton University and in Hamilton Township, Mercer County.   All four meetups are filling up at record pace and will be full soon.   Click right over to the Meetup sites to sign up.  If they are full, send us an email and we’ll see if there’s a cancellation.

Scala + Spark SQL Workshop by Rajiv Singla and Kristina Rogale Plazonic
Thursday, March 10, 2016: 3525 Quakerbridge Rd #1400, IBIS Office Plaza, Suite 1400, Hamilton Township, NJ

Agenda

a) Scala and Spark
– Why the functional paradigm?
– Functional programming fundamentals
– A programming feature and hands-on exercises
(e.g. functions, collections, pattern matching, implicits – speaker’s choice)
– Tie it back to Spark

b) Spark SQL
– DataFrames and Datasets
– Logical and physical plan
– Hands-on Workshop (bring your laptop; download a VM with Apache Spark, run Apache Spark standalone, or use the Databricks community edition).

Speakers
Rajiv Singla – Data Engineer, airisDATA
Kristina Rogale Plazonic, Data Scientist, airisDATA

 

Workshop – How to Build a Recommendation Engine using Spark 1.6 and HDP by Alex Zeltov (Hortonworks)
Thursday, March 17, 2016:  7:00 PM
Princeton University – Lewis Library Rm 122
Washington Road and Ivy Lane, Princeton, NJ 08544, Princeton, NJ

Agenda

a) Hands-on – Build a data analytics application using Spark, Hortonworks, and Zeppelin. The session explains RDD concepts, DataFrames, and sqlContext; uses Spark SQL to work with DataFrames; and explores the graphical abilities of Zeppelin.

b) Follow along – Build a Recommendation Engine – This will show how to build a predictive analytics (MLlib) recommendation engine with scoring. This will give a better understanding of architecture and coding in Spark for ML.

Hands-on session (prerequisites) – please download the following before coming to the session:

* Hortonworks Sandbox on a VM – no data center, no cloud service, and no internet connection needed! Full control of the environment. http://hortonworks.com/products/hortonworks-sandbox/#install

Speaker: Alex Zeltov (Sr. Solutions Engineer, Hortonworks)

SnappyData + Spark – In-memory OLTP and OLAP

Monday, Mar 28, 2016, 7:00 PM

NJ Hadoop – Big Data Meetup
3525 Quakerbridge Road, Suite 903, Hamilton, NJ 08619


 

Machine Learning – H2O hands-on workshop by Sergey Fogelson
Thursday, March 31, 2016: 7:00 PM
Princeton University – Lewis Library Rm 122, Washington Road and Ivy Lane, Princeton, NJ 08544, Princeton, NJ

Agenda
a) Description of the Use Case
b) Machine learning algorithm(s) to solve the use case
c) Building a pipeline in H2O

Sandbox details, along with details on the data set, will be published soon.

Speaker: Sergey Fogelson, Director of Data Science, airisDATA

 

Deep dive Avro and Parquet – Read Avro/Write Parquet using Kafka and Spark by Srinivas Daruna and Tim Spann
Tuesday, April 5, 2016: 7:00 PM to 9:00 PM
3525 Quakerbridge Rd #1400, IBIS Office Plaza, Suite 1400, Hamilton Township, NJ

Agenda

a) Avro and Parquet – When and Why to use which format?
b) Data modeling – Avro and Parquet schema
c) Workshop
– Read Avro input from Kafka
– Transform data in Spark
– Write data frame to Parquet
– Read back from Parquet (see the sketch below)
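Ahead of the session, here is a minimal, hypothetical sketch of the Parquet round trip in Spark 1.6 (paths and schema are assumptions; a small in-memory data frame stands in for the Kafka/Avro input):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("parquet-roundtrip").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Stand-in for the Avro-from-Kafka input.
    val df = sc.parallelize(Seq(("click", 1L), ("view", 2L))).toDF("event", "ts")

    df.write.parquet("/tmp/events.parquet")                    // write data frame to Parquet
    val back = sqlContext.read.parquet("/tmp/events.parquet")  // read back from Parquet
    back.show()
    sc.stop()
  }
}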

Speakers

Timothy Spann – Sr. Solutions Architect, airisDATA
Srinivas Daruna – Data Engineer, airisDATA

 

For more information on our Meetups, check out our Meetups page.   For more information on our people giving these talks, check out our team.

We have been using Ignite on Spark for one of our use cases, specifically Ignite’s shared RDD feature. The following links should get you started in that direction. We have been using it for a basic use case, and it works fine so far. There is not a whole lot of documentation on Spark-Ignite integration, though. One pain point we observed: it throws serialization errors when used with non-basic data types (UDTs, etc.).
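For orientation, a rough sketch of the shared RDD usage (the cache name and types are hypothetical, and the exact IgniteContext signature varies across Ignite versions):

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext

// Build an Ignite context over an existing SparkContext `sc`, then obtain a
// shared RDD backed by an Ignite cache and save some pairs into it.
// "sharedCache" and the String/Int types are assumptions for illustration.
val ic = new IgniteContext[String, Int](sc, () => new IgniteConfiguration())
val sharedRDD = ic.fromCache("sharedCache")
sharedRDD.savePairs(sc.parallelize(1 to 5).map(i => (i.toString, i)))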
