Spark History Server: http://localhost:18080/
by Srinivasarao Daruna
A glimpse of what we are trying to achieve in the features job (statistics from one of the difficult steps, which took the longest to complete):
Some interesting statistics:
GC Configuration Options:
The winner of the garbage collector comparison is Garbage First (G1). The final configuration we worked out is:
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=400
Note that both JVM flags must go into a single spark.executor.extraJavaOptions entry; specifying the key twice would simply overwrite the first value. We experimented with many GC pause targets, and 400 milliseconds was the best choice for this use case.
Garbage Collection Framework Suggestion:
Smaller data – go with Parallel GC (Spark's default).
Huge data with heavy memory management – go with G1, using the configuration shown in the sketch below.
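As a minimal sketch, the same G1 settings can also be applied programmatically through SparkConf (the application name below is a placeholder, not our production value):

import org.apache.spark.{SparkConf, SparkContext}

// Both JVM flags go into a single spark.executor.extraJavaOptions value.
val conf = new SparkConf()
  .setAppName("features-job") // placeholder name
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -XX:MaxGCPauseMillis=400")
val sc = new SparkContext(conf)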
Physical Plan for the SQL Join:
What is an In-Memory Tabular Scan?
A Tungsten in-memory tabular scan pushes the predicates and column selections down to Parquet, so only the needed rows and columns are read.
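For illustration, a hedged sketch of that pushdown (the Parquet path and column names are hypothetical, and an existing SQLContext named sqlContext is assumed):

// Filter and column selection are pushed into the Parquet scan itself.
val events = sqlContext.read.parquet("events.parquet") // hypothetical path
events.filter(events("ts") > 1000)
  .select("id", "ts")
  .explain() // the plan should show the scan with the pushed filter
             // and only the id/ts columns being read from Parquet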
What is a Tungsten Exchange?
A shuffle in the Tungsten engine.
Why does it sort? Without sorting, the engine would not know where each key lives; with sorted data it can make intelligent merge decisions instead of searching everywhere.
In a normal RDD flow, a stage boundary appears wherever there is a shuffle; in a Tungsten plan, the stage boundary appears at a Tungsten Exchange.
Two types of joins are helpful here: broadcast hash join and sort-merge join. A broadcast hash join is only useful when one side of the join is small and, importantly, fits in memory – ideally less than 20–50 MB.
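As a sketch, Spark SQL's broadcast hint lets you mark the small side explicitly (largeDF, smallDF, and the join key are hypothetical names, not from the job above):

import org.apache.spark.sql.functions.broadcast

// Hint that the small lookup table fits in memory, so the planner
// chooses a broadcast hash join instead of a sort-merge join.
val joined = largeDF.join(broadcast(smallDF), Seq("key"))
joined.explain() // should show BroadcastHashJoin rather than SortMergeJoin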
If you see a Cross Product at the final step of the plan, it is going to be problematic.
Physical Plan for the step:
by the airisDATA Team
At our recent Scala + Spark SQL Workshop we held introductory sessions on Scala and Apache Spark. A number of questions were brought up; I have summarized many of the answers here, along with some additional resources. The Scala presentation is here. Additional materials from the meetup will be posted soon. Check out both the NJ Hadoop / Big Data Meetup and the NJ Data Science Spark Meetup for more great workshops and talks.
Github Resources
Setup Resources
Download Apache Spark 1.6 (prebuilt with Hadoop)
Setup
Make sure you have the Java 7 or Java 8 SDK installed for your platform (Mac, Linux, or Windows). Then you’ll need to download Scala 2.10.x, SBT, and then Spark. I also recommend downloading Maven. See our article on DZone about setting up a developer machine.
Solving Local Spark Issues
export SPARK_MASTER_IP=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1
export SCALA_HOME=~/Downloads/scala-2.10.6
export PATH=$PATH:$SCALA_HOME/bin
For Windows, use SET instead of export, and use ; rather than : as the path separator.
Training Resources
Books
Scala by Example by Odersky – most of the material is drawn from, or is similar to, what is covered in both of his Coursera classes.
Scala Overview by Odersky et al.
Programming in Scala, First Edition by Odersky.
Structure and Interpretation of Computer Programs – a classic computer science text that teaches advanced programming concepts using Lisp, and the basis of Martin Odersky‘s Coursera class. Formerly the MIT standard text on programming. View on Amazon
A big list of Scala books linked at Scala-Lang
Tutorials
Scala for Java Programmers
Scala Tutorial
Effective Scala (Twitter)
Scala Tour
E-Books
Books at Lightbend (Typesafe)
AtomicScala (sample)
Scala Koans/Exercises
Scala Exercises
Scala Koans
Resources
Scala Roundup for Java Engineers
Scala Info at StackOverflow
Scala Cheatsheets
Scala Notes
Cake Solutions Blog
Scala School (Twitter)
Functional Programming in Scala
How to Learn Scala
Scala Lang Overviews
Learning Scala in Small Bites
Online Free Courses – Scala
Functional Programming with Scala
Reactive Programming with Scala
Online Free Courses – Spark
Big Data Analysis with Spark
Distributed Machine Learning with Spark
Introduction to Spark
Spark Fundamentals
Data Science / Engineering Spark
CS100
CS190
We started the discussion with a brief overview of Java collections vs. Scala collections.
Advantages of using Scala Collections:
1. Immutable collections are provided by the Scala library by default – in Java you need Google’s Guava library, with more verbose syntax.
2. A more consistent model across the Collection and Map interfaces – both are iterable, unlike Java, where the Map interface does not implement Iterable.
3. A clear distinction between the Traversable and Iterable traits.
4. Parallel collections.
5. Rich collection API functions – such as map, flatMap, foldLeft, partition, zip, and groupBy – which not only reduce code but are also very expressive (see the sketch after this list).
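A minimal sketch of those functions on an ordinary Scala list (the sample words are arbitrary):

val words = List("spark", "scala", "kafka", "parquet")

val lengths       = words.map(_.length)             // List(5, 5, 5, 7)
val chars         = words.flatMap(_.toList)         // every character, flattened into one list
val totalLen      = words.foldLeft(0)(_ + _.length) // 22
val (short, long) = words.partition(_.length <= 5)  // split by a predicate
val pairs         = words.zip(lengths)              // List(("spark",5), ("scala",5), ...)
val byInitial     = words.groupBy(_.head)           // Map('s' -> List("spark", "scala"), ...)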
We run public workshops as well as private ones for customers. These can range from introduction and executive-briefing level courses to in-depth, week-long deep dives on Scala, Deep Learning, Machine Learning, Apache Spark, Graph Analytics, and other topics in Data Science. We have experienced presenters, including professors with doctorates. At airis.DATA we are strong believers in giving back to the community and sharing our knowledge.
We have been organizing Meetups since 2010.
by Tim Spann
The team from airisDATA attended Spark Summit East, talked with a lot of great people, saw some amazing presentations, and learned a lot. Check out my post at DZone for a Top 15 list.
Here are some presentations that I loved:
We have a lot of exciting hands-on workshops and presentations coming to our two Meetups hosted at Princeton University and in Hamilton Township, Mercer County. All four meetups are filling up at record pace and will be full soon. Click right over to the Meetup sites to sign up. If they are full, send us an email and we’ll see if there’s a cancellation.
Scala + Spark SQL Workshop by Rajiv Singla and Kristina Rogale Plazonic
Thursday, March 10, 2016 at 3525 Quakerbridge Rd #1400, IBIS Office Plaza, Suite 1400, Hamilton Township, NJ
Agenda
a) Scala and Spark
– Why the functional paradigm?
– Functional programming fundamentals
– Programming features, with hands-on exercises
(e.g. functions, collections, pattern matching, implicits – speaker’s choice)
– Tie it back to Spark
b) Spark SQL
– DataFrames and Datasets (sketched below)
– Logical and physical plan
– Hands-on workshop (bring your laptop; download a VM with Apache Spark, run Apache Spark standalone, or get the free Databricks Community Edition).
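For a taste of the DataFrames-vs-Datasets topic, a minimal Spark 1.6 sketch (the Person case class and sample rows are made up for illustration, and an existing SQLContext named sqlContext is assumed):

import sqlContext.implicits._

case class Person(name: String, age: Int)
val df = Seq(Person("Ann", 32), Person("Bob", 19)).toDF() // untyped DataFrame
val ds = df.as[Person]       // typed Dataset[Person], new in Spark 1.6
ds.filter(_.age > 21).show() // compile-time-checked lambda instead of a Column expression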
Speakers
Rajiv Singla – Data Engineer, airisDATA
Kristina Rogale Plazonic, Data Scientist, airisDATA
Workshop – How to Build a Recommendation Engine using Spark 1.6 and HDP by Alex Zeltov (Hortonworks)
Thursday, March 17, 2016: 7:00 PM
Princeton University – Lewis Library Rm 122
Washington Road and Ivy Lane, Princeton, NJ 08544
Agenda
a) Hands-on – Build a data analytics application using Spark, Hortonworks, and Zeppelin. The session explains RDD concepts, DataFrames, and sqlContext; uses Spark SQL to work with DataFrames; and explores the graphing capabilities of Zeppelin.
b) Follow along – Build a Recommendation Engine – This will show how to build a predictive analytics (MLlib) recommendation engine with scoring, giving a better understanding of architecture and coding in Spark for ML.
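For a flavor of what an MLlib recommender involves, here is a minimal sketch (not the workshop’s actual code; the ratings file, its format, and the hyperparameters are all assumptions):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical input: "user,product,rating" lines; `sc` is an existing SparkContext.
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

// Train a collaborative-filtering model with ALS: rank 10, 10 iterations.
val model = ALS.train(ratings, 10, 10)

// Score: recommend 5 products for user 1.
model.recommendProducts(1, 5).foreach(println)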
Hands-on session (pre-reqs) – Please download and come to the session
* Hortonworks Sandbox on a VM – no data center, no cloud service, and no internet connection needed! Full control of the environment. http://hortonworks.com/products/hortonworks-sandbox/#install
Speaker: Alex Zeltov (Sr. Solutions Engineer, Hortonworks)
Machine Learning – H2O hands-on workshop by Sergey Fogelson
Thursday, March 31, 2016: 7:00 PM
Princeton University – Lewis Library Rm 122, Washington Road and Ivy Lane, Princeton, NJ 08544
Agenda
a) Description of the Use Case
b) Machine learning algorithm(s) to solve the use case
c) Building a pipeline in H2O
Sandbox details, along with information on the data set, will be published soon.
Speaker: Sergey Fogelson, Director Data Science, airisDATA
Deep dive into Avro and Parquet – Read Avro / Write Parquet using Kafka and Spark by Srinivas Daruna and Tim Spann
Tuesday April 5, 2016: 7:00 PM to 9:00 PM
3525 Quakerbridge Rd #1400, IBIS Office Plaza, Suite 1400, Hamilton Township, NJ
Agenda
a) Avro and Parquet – When and Why to use which format?
b) Data modeling – Avro and Parquet schema
c) Workshop (a rough sketch follows the agenda)
– Read Avro input from Kafka
– Transform data in Spark
– Write data frame to Parquet
– Read back from Parquet
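As a rough sketch of that pipeline (the broker address, topic name, output path, and the decodeAvro helper are all hypothetical; the actual Avro deserialization is the subject of the workshop):

import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// `sc` is an existing SparkContext; "events" and the broker address are placeholders.
val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
  ssc, kafkaParams, Set("events"))

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

stream.foreachRDD { rdd =>
  // decodeAvro is a placeholder for your Avro-to-case-class deserialization.
  val df = rdd.map { case (_, bytes) => decodeAvro(bytes) }.toDF()
  df.write.mode("append").parquet("events.parquet") // write the transformed data as Parquet
}
// Read back later with: sqlContext.read.parquet("events.parquet")
ssc.start()
ssc.awaitTermination()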
Speakers
Timothy Spann – Sr. Solutions Architect, airisDATA
Srinivas Daruna – Data Engineer, airisDATA
For more information on our Meetups, check out our Meetups page. For more information on our people giving these talks, check out our team.
We have been using Apache Ignite on Spark for one of our use cases, relying on Ignite’s shared RDD feature. The following links should get you started in that direction. It has worked fine so far for our basic use case, though there is not a lot of documentation on Spark–Ignite integration. One pain point we observed: it throws serialization errors when used with non-basic data types (UDTs, etc.).
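For reference, a minimal shared-RDD sketch against the Ignite 1.x API (the cache name is hypothetical and an existing SparkContext named sc is assumed):

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext

// Create an Ignite context over the SparkContext (default configuration).
val ic = new IgniteContext[Int, Int](sc, () => new IgniteConfiguration())

// A shared RDD backed by an Ignite cache; its contents outlive this Spark job.
val sharedRDD = ic.fromCache("partitioned") // hypothetical cache name

// Write pairs into the cache; another Spark job can read them back.
sharedRDD.savePairs(sc.parallelize(1 to 1000).map(i => (i, i * i)))
println(sharedRDD.filter(_._2 > 100).count())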