
Apache Spark


Possible reasons why your Spark application is slow

  1. Insufficient resources: increase the number of executors, cores, and memory, e.g. --num-executors 10 --executor-cores 4 --executor-memory 8G.
  2. Data skew: use salting or repartitioning to redistribute skewed keys (see the salting sketch after this list).
  3. Wide transformations and shuffles: operations like groupBy, join, or distinct cause expensive shuffles; tune spark.sql.shuffle.partitions (the default is 200) to match your data volume.
  4. Small partitions: too many tiny partitions add task-scheduling overhead; coalesce or repartition to a sensible count instead of leaving spark.sql.shuffle.partitions at its default (see the partition-tuning sketch after this list).
  5. Suboptimal data formats: processing raw or unoptimized formats (e.g., CSV) is slow; use columnar formats like Parquet or ORC instead (see the Parquet sketch after this list).
  6. Inefficient joins: an improper join strategy forces large shuffles; broadcast the small side: val result = largeDF.join(broadcast(smallDF), "key") (a fuller sketch with the required import follows this list).
  7. Excessive use of joins, count, and groupBy: cache only essential data, e.g. df.persist(StorageLevel.MEMORY_AND_DISK), and prefer built-in Spark SQL functions for better performance.
  8. Memory management issues: tune the unified memory settings, e.g. spark.memory.fraction=0.6 and spark.memory.storageFraction=0.5 (see the configuration sketch after this list).
  9. Use a good garbage collector: enable G1GC on the executors via spark.executor.extraJavaOptions=-XX:+UseG1GC.
  10. Incorrect shuffle partitions: too many or too few partitions leave executors either overloaded or idle.
  11. Lack of monitoring and debugging: enable event logs so the Spark History Server can replay finished applications (see the configuration sketch after this list).
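
A minimal sketch of the salting idea from item 2. The two frames, the "key" column, and numSalts = 8 are all made up for illustration; the point is that a random salt on the big side, paired with an exploded copy of the small side, spreads one hot key across several shuffle partitions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("salting-demo").getOrCreate()
import spark.implicits._

// Hypothetical inputs: "key" is the skewed join column.
val skewedDF = Seq(("hot", 1), ("hot", 2), ("hot", 3), ("cold", 4)).toDF("key", "value")
val lookupDF = Seq(("hot", "a"), ("cold", "b")).toDF("key", "info")

val numSalts = 8

// A random salt (0..numSalts-1) on the big side spreads one hot key
// across up to numSalts shuffle partitions instead of one.
val salted = skewedDF.withColumn("salt", (rand() * numSalts).cast("int"))

// Duplicate each small-side row once per salt value so every
// (key, salt) pair on the big side still finds its match.
val saltedLookup = lookupDF.withColumn("salt", explode(array((0 until numSalts).map(lit): _*)))

val joined = salted.join(saltedLookup, Seq("key", "salt")).drop("salt")
joined.show()
```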
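
For items 3 and 4, a spark-shell sketch of the two knobs involved; the partition counts here are illustrative, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-tuning")
  .config("spark.sql.shuffle.partitions", "200") // partitions produced by shuffles
  .getOrCreate()

val df = spark.range(1000000L).toDF("id")

// Too many tiny partitions: coalesce shrinks the count without a full shuffle.
val fewer = df.coalesce(8)

// Too few large partitions: repartition shuffles but restores parallelism.
val more = df.repartition(64)

println(s"fewer: ${fewer.rdd.getNumPartitions}, more: ${more.rdd.getNumPartitions}")
```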
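
For item 5, a sketch of a one-time CSV-to-Parquet conversion; the paths and the user_id column are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-demo").getOrCreate()

// Read the raw CSV once (placeholder path), letting Spark infer the schema.
val rawDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/raw/events.csv")

// Parquet is columnar and compressed, so later jobs read only the columns they need.
rawDF.write.mode("overwrite").parquet("/data/curated/events.parquet")

val curated = spark.read.parquet("/data/curated/events.parquet")
curated.select("user_id").show(5) // column pruning: only user_id is read from disk
```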
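
The one-liner in item 6 needs an import to compile; here it is as a fuller sketch with hypothetical frames.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()
import spark.implicits._

// Hypothetical frames: largeDF is big, smallDF comfortably fits in memory.
val largeDF = spark.range(10000000L).toDF("key")
val smallDF = Seq((1L, "a"), (2L, "b")).toDF("key", "label")

// broadcast() ships smallDF to every executor, so largeDF is never shuffled.
val result = largeDF.join(broadcast(smallDF), "key")
result.explain() // the physical plan should show BroadcastHashJoin
```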
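
Items 8, 9, and 11 are plain configuration; a sketch collecting them on one SparkSession builder, using the values quoted above (the event-log directory is a placeholder).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tuned-app")
  // Unified memory: 60% of the heap for execution + storage, half of that for storage.
  .config("spark.memory.fraction", "0.6")
  .config("spark.memory.storageFraction", "0.5")
  // G1GC tends to give shorter pauses than the default collector on large heaps.
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
  // Event logs let the History Server replay finished applications.
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///spark-event-logs") // placeholder path
  .getOrCreate()
```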

Sample EMR cluster calculation

  1. Total memory = 256GB
  2. Total cores per machine = 16
  3. Total machines / nodes = 10
  4. Total cores = 16 x 10 = 160
  5. Cores per executor = 4
  6. Total executors = 160 / 4 = 40 executors across the 10 machines, i.e. 4 per machine (see the sketch below)
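
The same arithmetic as a Scala sketch, with one added step: it assumes the 256GB is memory per node (not stated above) and reserves roughly 10% of it for YARN and OS overhead before sizing each executor.

```scala
val nodes = 10
val coresPerNode = 16
val coresPerExecutor = 4

val totalCores = nodes * coresPerNode              // 160
val totalExecutors = totalCores / coresPerExecutor // 40
val executorsPerNode = totalExecutors / nodes      // 4 per machine

// Assumption: 256GB per node, with ~10% held back for YARN/OS overhead.
val memPerNodeGB = 256
val memPerExecutorGB = (memPerNodeGB / executorsPerNode * 0.9).toInt // ~57GB

println(s"--num-executors $totalExecutors " +
  s"--executor-cores $coresPerExecutor " +
  s"--executor-memory ${memPerExecutorGB}G")
```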