by Srinivasarao Daruna
A glimpse of what we are trying to achieve in the features job, with statistics from one of the most difficult and time-consuming steps:
- A left outer join over 13+ billion rows.
- 13,111,464,843 rows (13+ billion) with 38 columns on the left-hand side.
- Data size: 597 GB.
- Spilled data: 3.9 TB.
GC Configuration Options:
The winner of the garbage collector comparison is Garbage First (G1); the configuration that worked best is described below.
After experimenting with many GC pause-time targets, 400 ms proved the best choice for this use case.
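The exact flags from the original write-up are not reproduced here; as a hedged sketch, a G1 setup along these lines matches the 400 ms pause-time target (all other values, and the jar name, are illustrative assumptions, not the author's final numbers):

```shell
# Illustrative G1 settings for Spark executors. Only the 400 ms pause
# target comes from the text above; the rest are example values.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC \
          -XX:MaxGCPauseMillis=400 \
          -XX:InitiatingHeapOccupancyPercent=35" \
  --conf "spark.driver.extraJavaOptions=-XX:+UseG1GC" \
  features_job.jar
```

Lowering `InitiatingHeapOccupancyPercent` makes G1 start concurrent marking earlier, which helps when large shuffles keep the heap full.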
Garbage Collection Framework Suggestion:
Smaller data – go for Parallel GC (Spark's default).
Huge data with heavy memory pressure – go for G1 with a tuned configuration like the one above.
Physical Plan for the SQL Join:
What is an In-Memory Tabular Scan?:
A Tungsten in-memory tabular scan pushes the predicates and the selected columns down to Parquet, so data that is not needed is never read.
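The effect of pushdown can be illustrated with a small sketch (plain Python standing in for the Parquet reader; the table data and function names are hypothetical, not Spark's API):

```python
# Conceptual sketch of predicate/column pushdown.
# A naive scan reads every column of every row and filters afterwards;
# a pushdown-aware scan applies the predicate and column projection
# inside the scan itself, so far less data leaves the storage layer.

ROWS = [
    {"id": 1, "country": "US", "amount": 10.0, "notes": "..."},
    {"id": 2, "country": "IN", "amount": 25.0, "notes": "..."},
    {"id": 3, "country": "US", "amount": 40.0, "notes": "..."},
]

def naive_scan():
    # Reads all columns of all rows; the caller filters later.
    return [dict(r) for r in ROWS]

def pushdown_scan(columns, predicate):
    # Predicate and projection are evaluated at scan time,
    # analogous to what Parquet does with row-group statistics.
    return [{c: r[c] for c in columns} for r in ROWS if predicate(r)]

# Naive: read everything, filter in the engine.
full = [r for r in naive_scan() if r["country"] == "US"]

# Pushdown: only the needed columns and matching rows are materialized.
pushed = pushdown_scan(["id", "amount"], lambda r: r["country"] == "US")

print(pushed)  # [{'id': 1, 'amount': 10.0}, {'id': 3, 'amount': 40.0}]
```

With 597 GB on disk, reading only the 2 of 38 columns a query actually touches is the difference this scan makes.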
What is Tungsten Exchange?:
It is the shuffle in the Tungsten execution engine: rows are repartitioned across executors by key.
Why does it sort? By sorting both sides on the join key, the engine knows exactly where each key lives and can merge the two sorted streams directly, instead of searching everywhere for matches.
In a normal RDD flow, a stage boundary is created at a shuffle; in a Tungsten plan, the stage breaks at a Tungsten Exchange.
Two join strategies are relevant here: broadcast hash join and sort merge join. A broadcast hash join is only useful when one side is small enough to fit in memory on every executor, ideally less than 20 – 50 MB.
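The two strategies can be sketched in plain Python (inner-join semantics for brevity, though the job above uses a left outer join; the data and helper names are hypothetical, not Spark's API):

```python
# Conceptual sketch of broadcast hash join vs. sort merge join.

def broadcast_hash_join(big, small, key):
    # The small side is "broadcast": built once into a hash table,
    # then the big side streams through it with no shuffle or sort.
    table = {}
    for row in small:
        table.setdefault(row[key], []).append(row)
    return [(b, s) for b in big for s in table.get(b[key], [])]

def sort_merge_join(left, right, key):
    # Both sides are sorted by the join key, then merged in one pass.
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, j = [], 0
    for l in left:
        # Advance the right pointer past smaller keys.
        while j < len(right) and right[j][key] < l[key]:
            j += 1
        # Emit every right row whose key matches.
        k = j
        while k < len(right) and right[k][key] == l[key]:
            out.append((l, right[k]))
            k += 1
    return out

big = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 2, "v": "c"}]
small = [{"id": 2, "w": "x"}, {"id": 3, "w": "y"}]

# Both strategies produce the same matches; they differ only in cost:
# broadcast avoids the shuffle and sort entirely, but requires the
# small side to fit in memory on every executor.
print(broadcast_hash_join(big, small, "id"))
print(sort_merge_join(big, small, "id"))
```

At 13+ billion rows on the left and hundreds of gigabytes overall, neither side here fits the broadcast threshold, which is why this job ends up in a sort merge join.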
If anyone sees a Cross Product at the final step, it is going to be problematic: a Cartesian product multiplies the row counts of both sides.
Physical Plan for the step: