Friday, 3 April 2015

Specifying an external configuration file for Apache Spark

I'd like to specify all of Spark's properties in a configuration file and then load that file at runtime. I want:

  • To use different configuration files depending on the environment (local, aws)

  • To specify application-specific parameters


As a simple example, let's imagine I'd like to filter lines in a log file depending on a string. Below I've got a simple Java Spark program that reads data from a file and filters it depending on a string the user defines. The program takes one argument, the input source file.


Java Spark Code



import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SimpleSpark {
    public static void main(String[] args) {
        String inputFile = args[0]; // Should be some file on your system

        SparkConf conf = new SparkConf(); // .setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(inputFile).cache();

        final String filterString = conf.get("filterstr");

        long numberLines = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains(filterString);
            }
        }).count();

        System.out.println("Line count: " + numberLines);
    }
}


Config File


The configuration file is based on http://ift.tt/1CDbfBt and looks like this:



spark.app.name test_app
spark.executor.memory 2g
spark.master local
simplespark.filterstr a
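
For illustration, a file in this layout can also be read by hand with plain Java. The sketch below is not part of the program above and does not use Spark at all; it only assumes that java.util.Properties is an acceptable parser for the format (it treats whitespace, '=' or ':' as the key/value separator). The class name ConfigFileSketch and the helper withPrefix are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

class ConfigFileSketch {

    // Parses text in the "key value" layout of the config file.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Hypothetical helper: returns the entries whose key starts with prefix, sorted by key.
    static Map<String, String> withPrefix(Properties props, String prefix) {
        Map<String, String> result = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                result.put(key, props.getProperty(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "spark.app.name test_app",
            "spark.executor.memory 2g",
            "spark.master local",
            "simplespark.filterstr a");
        Properties props = parse(text);
        // Separate Spark's own settings from the application-specific ones.
        System.out.println("Spark settings: " + withPrefix(props, "spark."));
        System.out.println("App settings:   " + withPrefix(props, "simplespark."));
    }
}
```

Keeping the application-specific key under its own prefix (here simplespark.) makes it easy to tell the two groups apart when the file is read this way.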


The Problem


I execute the application using the following arguments:



/path/to/inputtext.txt --conf /path/to/configfile.config


However, this doesn't work, since the exception



Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration


gets thrown. To me, this means the configuration file is not being loaded.
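
One thing worth checking here is what the program actually receives. Arguments given to the application arrive in main as plain strings, and a bare new SparkConf() does not inspect them, so a "--conf" placed among the application arguments is just another element of args. The sketch below shows this with the same argument list; AppArgsSketch and valueAfter are hypothetical names, and the code runs without Spark. As far as I know, spark-submit has its own --properties-file option for pointing at a file and expects --conf in key=value form, both placed before the application jar, but treat that as something to verify against the Spark documentation.

```java
class AppArgsSketch {

    // Hypothetical helper: returns the value that follows flag in args, or null if absent.
    static String valueAfter(String[] args, String flag) {
        for (int i = 0; i + 1 < args.length; i++) {
            if (flag.equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The same arguments as above, in the form main() would receive them.
        String[] example = {"/path/to/inputtext.txt", "--conf", "/path/to/configfile.config"};
        System.out.println("Input file:  " + example[0]);
        System.out.println("Config path: " + valueAfter(example, "--conf"));
    }
}
```

Nothing in this sketch loads the file into Spark; it only shows that the path is available to the application, which would then have to read it itself.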


My questions are:



  1. What is wrong with my setup?

  2. Is specifying application-specific parameters in the Spark configuration file good practice?




