
Apache Spark


Possible reasons why your Spark application is slow

  1. Insufficient resources: increase the number of executors, cores, and memory, e.g. --num-executors 10 --executor-cores 4 --executor-memory 8G.
  2. Data skew: use salting or repartitioning to redistribute skewed keys (see the salting sketch after this list).
  3. Wide transformations and shuffles: operations like groupBy, join, or distinct cause expensive shuffles; tune spark.sql.shuffle.partitions (the default is 200) to match your data volume.
  4. Small partitions: too many tiny partitions add task-scheduling overhead; coalesce or repartition to a sensible count instead of leaving spark.sql.shuffle.partitions at its default (see the partition-tuning sketch after this list).
  5. Suboptimal data formats: processing raw or unoptimized formats (e.g., CSV) is slow; use columnar formats like Parquet or ORC instead (see the Parquet sketch after this list).
  6. Inefficient joins: an improper join strategy forces large shuffles; broadcast the small side: val result = largeDF.join(broadcast(smallDF), "key") (a fuller sketch with the required import follows this list).
  7. Excessive use of joins, count, and groupBy: cache only essential data, e.g. df.persist(StorageLevel.MEMORY_AND_DISK), and prefer built-in Spark SQL functions for better performance.
  8. Memory management issues: tune the unified memory settings, e.g. spark.memory.fraction=0.6 and spark.memory.storageFraction=0.5 (see the configuration sketch after this list).
  9. Use a good garbage collector: enable G1GC on the executors via spark.executor.extraJavaOptions=-XX:+UseG1GC.
  10. Incorrect shuffle partitions: too many or too few partitions leave executors either overloaded or idle.
  11. Lack of monitoring and debugging: enable event logs so the Spark History Server can replay finished applications (see the configuration sketch after this list).
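
A minimal sketch of the salting idea from item 2. The two frames, the "key" column, and numSalts = 8 are all made up for illustration; the point is that a random salt on the big side, paired with an exploded copy of the small side, spreads one hot key across several shuffle partitions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("salting-demo").getOrCreate()
import spark.implicits._

// Hypothetical inputs: "key" is the skewed join column.
val skewedDF = Seq(("hot", 1), ("hot", 2), ("hot", 3), ("cold", 4)).toDF("key", "value")
val lookupDF = Seq(("hot", "a"), ("cold", "b")).toDF("key", "info")

val numSalts = 8

// A random salt (0..numSalts-1) on the big side spreads one hot key
// across up to numSalts shuffle partitions instead of one.
val salted = skewedDF.withColumn("salt", (rand() * numSalts).cast("int"))

// Duplicate each small-side row once per salt value so every
// (key, salt) pair on the big side still finds its match.
val saltedLookup = lookupDF.withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))

val joined = salted.join(saltedLookup, Seq("key", "salt")).drop("salt")
joined.show()
```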
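
For items 3 and 4, a spark-shell sketch of the two knobs involved; the partition counts here are illustrative, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-tuning")
  .config("spark.sql.shuffle.partitions", "200") // partitions produced by shuffles
  .getOrCreate()

val df = spark.range(1000000L).toDF("id")

// Too many tiny partitions: coalesce shrinks the count without a full shuffle.
val fewer = df.coalesce(8)

// Too few large partitions: repartition shuffles but restores parallelism.
val more = df.repartition(64)

println(s"fewer: ${fewer.rdd.getNumPartitions}, more: ${more.rdd.getNumPartitions}")
```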
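
For item 5, a sketch of a one-time CSV-to-Parquet conversion; the paths and the user_id column are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-demo").getOrCreate()

// Read the raw CSV once (placeholder path), letting Spark infer the schema.
val rawDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/raw/events.csv")

// Parquet is columnar and compressed, so later jobs read only the columns they need.
rawDF.write.mode("overwrite").parquet("/data/curated/events.parquet")

val curated = spark.read.parquet("/data/curated/events.parquet")
curated.select("user_id").show(5) // column pruning: only user_id is read from disk
```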
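
The one-liner in item 6 needs an import to compile; here it is as a fuller sketch with hypothetical frames.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()
import spark.implicits._

// Hypothetical frames: largeDF is big, smallDF comfortably fits in memory.
val largeDF = spark.range(10000000L).toDF("key")
val smallDF = Seq((1L, "a"), (2L, "b")).toDF("key", "label")

// broadcast() ships smallDF to every executor, so largeDF is never shuffled.
val result = largeDF.join(broadcast(smallDF), "key")
result.explain() // the physical plan should show BroadcastHashJoin
```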
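
Items 8, 9, and 11 are plain configuration; a sketch collecting them on one SparkSession builder, using the values quoted above (the event-log directory is a placeholder).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tuned-app")
  // Unified memory: 60% of the heap for execution + storage, half of that for storage.
  .config("spark.memory.fraction", "0.6")
  .config("spark.memory.storageFraction", "0.5")
  // G1GC tends to give shorter pauses than the default collector on large heaps.
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
  // Event logs let the History Server replay finished applications.
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///spark-event-logs") // placeholder path
  .getOrCreate()
```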

Sample EMR cluster calculation

  1. Total memory = 256GB
  2. Total cores per machine = 16
  3. Total machines / nodes = 10
  4. Total cores = 16 x 10 = 160
  5. Cores per executor = 4
  6. Total executors = 160 / 4 = 40 executors across the 10 machines, i.e. 4 per machine (see the sketch below)
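
The same arithmetic as a Scala sketch, with one added step: it assumes the 256GB is memory per node (not stated above) and reserves roughly 10% of it for YARN and OS overhead before sizing each executor.

```scala
val nodes = 10
val coresPerNode = 16
val coresPerExecutor = 4

val totalCores = nodes * coresPerNode              // 160
val totalExecutors = totalCores / coresPerExecutor // 40
val executorsPerNode = totalExecutors / nodes      // 4 per machine

// Assumption: 256GB per node, with ~10% held back for YARN/OS overhead.
val memPerNodeGB = 256
val memPerExecutorGB = (memPerNodeGB / executorsPerNode * 0.9).toInt // ~57GB

println(s"--num-executors $totalExecutors " +
  s"--executor-cores $coresPerExecutor " +
  s"--executor-memory ${memPerExecutorGB}G")
```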