I have an Apache Spark application that performs the following steps:
    inputFile(#s3loc)
      .mapPartitions(mapper)
      .groupByKey
      .mapPartitions(reducer)
      .saveAsHadoopFile(params)
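In case it's useful, here is a minimal compilable Scala sketch of the job. The S3 paths, the mapper/reducer bodies, and the output key/value/format classes are placeholders I've filled in for illustration; the real ones are more involved.

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.TextOutputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object Pipeline {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pipeline"))

        // Placeholder mapper: emits (firstField, line) pairs for each partition.
        def mapper(lines: Iterator[String]): Iterator[(String, String)] =
          lines.map(line => (line.split("\t", 2)(0), line))

        // Placeholder reducer: collapses each key's values into one record.
        def reducer(groups: Iterator[(String, Iterable[String])]): Iterator[(String, String)] =
          groups.map { case (key, values) => (key, values.mkString("|")) }

        sc.textFile("s3://my-bucket/input/")          // stands in for #s3loc
          .mapPartitions(mapper)
          .groupByKey()
          .mapPartitions(reducer)
          .saveAsHadoopFile("s3://my-bucket/output/", // stands in for params
            classOf[Text], classOf[Text],
            classOf[TextOutputFormat[Text, Text]])

        sc.stop()
      }
    }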
When I run this on a small data set (around 100 files, each one a gzipped file of 4K-5MB), it runs fine. When the input is large (the same file sizes, but 14k files), I get a Java heap space error whose stack trace mentions message serialization and byte arrays, or something of that sort.
I experimented a bit with my cluster (EMR). With 60 m2.2xlarge machines, each with 32 GB of RAM and 4 cores, I set spark.default.parallelism=960, i.e. 4 tasks per core, and got the same error as above. When I lowered the parallelism to 240 or 320, the tasks ran smoothly but were quite slow. What is causing this heap overflow? Most sources I have read recommend around 3-4 tasks per core, which should make 960 a good choice. How do I increase the number of tasks without causing a heap overflow?
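For reference, this is roughly how the parallelism is being applied, assuming it is set through SparkConf when the job is constructed (on EMR it could equally come in as a spark-submit --conf flag). Only the 960/240/320 values come from my actual runs; everything else is a sketch.

    import org.apache.spark.{SparkConf, SparkContext}

    object ParallelismConfig {
      def main(args: Array[String]): Unit = {
        // 960 = 60 nodes * 4 cores * 4 tasks per core; 240 and 320 are the
        // lower settings that completed without the heap error, but slowly.
        val conf = new SparkConf()
          .setAppName("pipeline")
          .set("spark.default.parallelism", "960")
        val sc = new SparkContext(conf)
        // ... job as above ...
        sc.stop()
      }
    }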