I am new to Apache Spark. My setup has 3 nodes (AWS m4.large), each with Hadoop and Apache Spark installed. I have a comma-separated file with 10 million records (approximately 1.5 GB). The file is stored on HDFS with a replication factor of 2. I have a Spark program written in Scala (built into a jar). The program reads the HDFS file, splits each record on commas, and builds key-value pairs where the key is one of the columns and the value is the number of times that key occurs. Finally, I sort the key-value pairs (K,V) by the value field (V) in descending order so the highest counts come first, then take the first 5 records and print them.
The goal of the program is to find the top 5 talkers in the file.
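For reference, here is a minimal sketch of the kind of job described above. It is not the original program: the object name `TopTalkers`, the helper `keyOf`, and the two command-line arguments (the HDFS input path and the zero-based index of the key column) are assumptions made for illustration only.

```scala
import org.apache.spark.sql.SparkSession

object TopTalkers {
  // Pull the key column out of one CSV line.
  // Returns None when the line has too few fields (hypothetical helper).
  def keyOf(line: String, col: Int): Option[String] = {
    val fields = line.split(",", -1) // -1 keeps trailing empty fields
    if (col >= 0 && col < fields.length) Some(fields(col).trim) else None
  }

  def main(args: Array[String]): Unit = {
    val inputPath = args(0)       // assumed: HDFS path of the CSV file
    val keyColumn = args(1).toInt // assumed: zero-based index of the key column

    val spark = SparkSession.builder().appName("TopTalkers").getOrCreate()
    val sc = spark.sparkContext

    // (key, 1) per record, then sum the ones per key.
    val counts = sc.textFile(inputPath)
      .flatMap(line => keyOf(line, keyColumn).iterator)
      .map(key => (key, 1L))
      .reduceByKey(_ + _)

    // Sort by count, highest first, and print the first 5 pairs.
    counts.sortBy(_._2, ascending = false)
      .take(5)
      .foreach { case (key, n) => println(s"$key\t$n") }

    spark.stop()
  }
}
```

The sketch follows the description step by step: `reduceByKey` produces the (K,V) counts and `sortBy` on the value orders them before `take(5)`.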
I get the results in 35 seconds. I can see Spark distributing the work across the nodes, and the tasks run with the best data locality. I am just wondering whether 35 seconds is expected for this kind of operation on Apache Spark. Thanks in advance for taking the time to read this message.