Monday, September 7, 2015

Spark SQL - Performance with HDFS

I have a 3-node HDFS setup. A single file with 10 million records (approx. 1.3 GB) is copied into it with a replication factor of 2. The file contains 10 columns of structured data.

I have a Spark standalone cluster installed on the same 3 machines. I am trying to get the top 5 talkers using a Spark SQL query, written in Scala and executed with the spark-submit command:

spark-submit --class "sparkSample" --master spark://master:7077 --executor-memory 4G target/scala-2.10/spark-sample_2.10-1.0.jar
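For reference, spark-submit in standalone mode also accepts flags that control how much of the cluster the job claims; the values below are illustrative assumptions to experiment with, not necessarily what was used for the numbers above:

    spark-submit --class "sparkSample" --master spark://master:7077 \
      --executor-memory 4G \
      --total-executor-cores 6 \
      target/scala-2.10/spark-sample_2.10-1.0.jar

Capping total executor cores explicitly makes it easy to confirm on the Spark UI that all 6 cores are actually in use.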

The program is essentially this:

    val textFile = sc.textFile("hdfs://master:54310/data/newfile.txt")
    val schemastring = "col1"
    val schema = StructType(schemastring.split(" ").map(fldname => StructField(fldname, StringType, true)))
    val rowRDD = textFile.map(_.split(",")).map(p => Row(p(0).trim))
    val pdtFrame = sqlContext.createDataFrame(rowRDD, schema)
    pdtFrame.registerTempTable("tbl")
    sqlContext.sql("set spark.sql.shuffle.partitions=300")
    sqlContext.sql("select col1, count(*) as cnt from tbl group by col1 order by cnt desc limit 5").collect().foreach(println)
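For comparison, the same top-5 computation can be expressed at the RDD level with reduceByKey plus takeOrdered, which pulls only the winning 5 rows to the driver instead of globally sorting all distinct keys. This is a sketch, assuming the same comma-separated input file and an existing SparkContext named sc; it is not the code that produced the timing above:

    // Sketch: top-5 values of the first column by count, at the RDD level.
    val counts = sc.textFile("hdfs://master:54310/data/newfile.txt")
      .map(_.split(",")(0).trim)       // project the first column
      .map(col1 => (col1, 1L))
      .reduceByKey(_ + _)              // one shuffle: per-key counts

    // takeOrdered(5) returns only the top 5 pairs, ordered by count descending.
    val top5 = counts.takeOrdered(5)(Ordering.by { case (_, cnt) => -cnt })
    top5.foreach(println)

Timing this variant against the SQL version would show how much of the 32 seconds is query planning and shuffle overhead versus raw scan time.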

I get the results in 32 seconds (as shown on the Spark UI). I see one job with two stages: 1. dataFrame.collect() and 2. top().

I am using Amazon AWS m4.large instances for all three machines (2 cores + 8 GB RAM per instance = 6 cores + 24 GB RAM total).

Is this acceptable performance for this scenario, and can I do better than this with any Spark SQL optimizations?
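Two tweaks that might help, both hedged guesses rather than measured wins on this exact setup: 300 shuffle partitions is a lot for roughly 1.3 GB on 6 cores (each partition ends up tiny, so task-scheduling overhead dominates), and if the query is run more than once per session, caching the registered table avoids re-reading and re-parsing the file from HDFS on every query:

    // Sketch of two possible tweaks (values are assumptions to experiment with):
    // 1. Fewer shuffle partitions -- e.g. 2-3x the total core count instead of 300.
    sqlContext.sql("set spark.sql.shuffle.partitions=12")

    // 2. Cache the temp table so repeated queries skip the HDFS read and parse.
    sqlContext.cacheTable("tbl")
    sqlContext.sql(
      "select col1, count(*) as cnt from tbl group by col1 order by cnt desc limit 5"
    ).collect().foreach(println)

The first run after cacheTable still pays the full scan cost (caching is lazy); the speedup, if any, shows up from the second query onward.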



