Possible reasons why your Spark application is slow
- Insufficient resources: increase the number of executors, cores per executor, and executor memory, e.g. --num-executors 10 --executor-cores 4 --executor-memory 8G (see the sketch below).
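A minimal sketch of applying the same settings programmatically at session creation (the values are illustrative, not tuned for any workload):

```scala
import org.apache.spark.sql.SparkSession

// Equivalent of --num-executors 10 --executor-cores 4 --executor-memory 8G
val spark = SparkSession.builder()
  .appName("tuned-app")
  .config("spark.executor.instances", "10")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "8g")
  .getOrCreate()
```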
- Skewed data distribution: a few hot keys pile most of the work onto a handful of tasks. Use salting or repartitioning to redistribute skewed data, as sketched below.
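A minimal salting sketch; the DataFrames, the "key" column, and the salt count are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("salting-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in data: "hot" is the skewed key.
val skewedDF = Seq(("hot", 1), ("hot", 2), ("hot", 3), ("cold", 4)).toDF("key", "value")
val dimDF    = Seq(("hot", "H"), ("cold", "C")).toDF("key", "attr")

val numSalts = 16 // assumption: tune to the observed skew

// Spread each key across numSalts sub-keys on the large, skewed side...
val saltedLeft = skewedDF.withColumn("salt", (rand() * numSalts).cast("int"))

// ...and replicate the small side once per salt so every sub-key finds a match.
val saltedRight = dimDF.withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))

val joined = saltedLeft.join(saltedRight, Seq("key", "salt")).drop("salt")
```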
- Inefficient data processing: wide transformations like groupBy, join, or distinct cause expensive shuffles. Tune the shuffle parallelism via spark.sql.shuffle.partitions (default 200), as shown below.
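A minimal sketch of tuning shuffle parallelism at runtime; it assumes a SparkSession named spark and a DataFrame ordersDF are already in scope, and 400 is an illustrative value:

```scala
// Raise shuffle parallelism for a large aggregation (illustrative value).
spark.conf.set("spark.sql.shuffle.partitions", "400")

// On Spark 3.x, adaptive query execution can coalesce shuffle partitions for you.
spark.conf.set("spark.sql.adaptive.enabled", "true")

// ordersDF is a stand-in DataFrame with a customer_id column.
val counts = ordersDF.groupBy("customer_id").count()
```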
- Too many small partitions: thousands of tiny partitions add scheduling overhead and produce inefficient tasks. Lower spark.sql.shuffle.partitions or compact the data, as sketched below.
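A minimal sketch of compacting partitions (assumes a SparkSession named spark; the partition counts are illustrative):

```scala
// An over-partitioned example DataFrame.
val df = spark.range(1000000).toDF("id").repartition(2000)

// coalesce narrows to fewer partitions without a full shuffle.
val compacted = df.coalesce(64)

// repartition(n) does shuffle, but rebalances partition sizes evenly.
val rebalanced = df.repartition(64)
```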
- Suboptimal data management: processing raw or unoptimized formats (e.g., CSV) is slow. Use columnar formats like Parquet or ORC, which support column pruning and predicate pushdown.
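A minimal conversion sketch; the S3 paths are hypothetical placeholders:

```scala
// One-time conversion from raw CSV to Parquet (paths are hypothetical).
val rawDF = spark.read.option("header", "true").csv("s3://my-bucket/raw/events/")
rawDF.write.mode("overwrite").parquet("s3://my-bucket/curated/events/")

// Later reads scan only the columns they actually need.
val events = spark.read.parquet("s3://my-bucket/curated/events/")
```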
- Inefficient joins: large shuffles caused by a poor join strategy. Broadcast the smaller side so only the small table moves: val result = largeDF.join(broadcast(smallDF), "key")
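The same snippet with the import it needs; largeDF and smallDF are stand-ins assumed to share a "key" column:

```scala
import org.apache.spark.sql.functions.broadcast

// broadcast() ships smallDF to every executor, avoiding a shuffle of largeDF.
val result = largeDF.join(broadcast(smallDF), "key")
```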
- Excessive recomputation: repeated joins, counts, and groupBys recompute their whole lineage each time. Cache only essential, reused data, e.g. df.persist(StorageLevel.MEMORY_AND_DISK), and prefer built-in Spark SQL functions over UDFs for better performance.
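A minimal caching sketch, assuming a DataFrame df that feeds several downstream actions:

```scala
import org.apache.spark.storage.StorageLevel

// Materialize once, reuse many times; spills to disk if memory is tight.
df.persist(StorageLevel.MEMORY_AND_DISK)

val total = df.count()                // first action populates the cache
val byKey = df.groupBy("key").count() // served from the cached data

df.unpersist() // release executor memory once the reuse is over
```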
- Memory management issues: spark.memory.fraction=0.6 controls the share of the heap used for execution and storage, and spark.memory.storageFraction=0.5 sets how much of that is protected for cached data. Tune them when the Spark UI shows spills or cache evictions.
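A minimal sketch of applying these settings at session creation; the values shown are Spark's defaults, so change them only with evidence from the UI:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-tuned")
  .config("spark.memory.fraction", "0.6")        // heap share for execution + storage
  .config("spark.memory.storageFraction", "0.5") // portion of that protected for cache
  .getOrCreate()
```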
- Use a suitable garbage collector: enable G1GC with spark.executor.extraJavaOptions=-XX:+UseG1GC to reduce long GC pauses on large heaps.
- Incorrect shuffle partitions: too many partitions waste time on scheduling overhead and tiny tasks, while too few leave executors idle with oversized tasks.
- Lack of monitoring and debugging: enable event logging so the Spark History Server can replay finished applications, as shown below.
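A minimal sketch of enabling event logs; the log directory is a hypothetical placeholder:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("with-event-logs")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///var/log/spark-events") // hypothetical path
  .getOrCreate()
```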
Sample EMR cluster calculation
- Total memory per machine = 256 GB
- Total cores per machine = 16
- Total machines / nodes = 10
- Total cores = 10 × 16 = 160
- Cores per executor = 4
- Total executors = 160 / 4 = 40 executors across 10 machines (4 per node), which leaves roughly 256 / 4 = 64 GB per executor before overhead
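The same arithmetic as a runnable sketch; like the numbers above, it ignores the core and memory usually reserved on each node for the OS and YARN daemons:

```scala
val nodes            = 10
val coresPerNode     = 16
val memoryPerNodeGb  = 256
val coresPerExecutor = 4

val totalCores          = nodes * coresPerNode               // 160
val executorsPerNode    = coresPerNode / coresPerExecutor    // 4
val totalExecutors      = nodes * executorsPerNode           // 40
val memoryPerExecutorGb = memoryPerNodeGb / executorsPerNode // 64 GB before overhead

println(s"$totalExecutors executors: $coresPerExecutor cores, $memoryPerExecutorGb GB each")
```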