Friday, October 9, 2015

Parallel file processing using cloud services

I have many images that I need to run through a Java program to produce more image files, an embarrassingly parallel workload. Each input file is about 500 MB, needs about 4 GB of memory during processing, and takes between 30 seconds and 2 minutes to run. The Java program is multithreaded, but I gain more from processing several input files in parallel than from giving each run more threads. I need to kick off processing several times a day, and I do not want to start and stop the cluster manually or pay for it 24/7.

I'm a bit lost in the variety of cloud options out there:

  • AWS Lambda does not offer enough system resources (not enough memory).
  • Google Cloud Dataflow: it appears that I would have to write my own pipeline source to read from Cloud Storage buckets. That's fine, but I don't want to spend time on it if Dataflow is not an appropriate solution (it might be; I can't tell yet).
  • Google Cloud Dataproc: this is not a map/reduce, Hadoop-y situation, but it might work nonetheless. I'd rather not manage my own cluster, though.
  • Google Compute Engine or AWS with autoscaling, where I just kick off one process per core on each machine. This means more management on my part.
  • Microsoft Data Lake is not released yet and looks Hadoop-y.
  • Microsoft Batch seems quite appropriate (but I'm asking because I remain curious about other options).
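
To make the "one process per core" option above concrete, here is a minimal sketch of the per-machine part only. It assumes the 4 GB per-job figure from above, and everything else is a placeholder: the class name `PerMachineFanOut`, the 32 GB machine size, the file names, and the job itself, which in practice would launch the real image program instead of appending a suffix. It caps concurrency by both cores and memory, since with 4 GB per job memory may run out before cores do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of any cloud SDK.
class PerMachineFanOut {

    /** Jobs one machine can run at once: limited by cores and by memory per job, never below 1. */
    static int maxConcurrentJobs(int cores, long machineMemoryMb, long memoryPerJobMb) {
        long byMemory = machineMemoryMb / memoryPerJobMb;
        return (int) Math.max(1L, Math.min((long) cores, byMemory));
    }

    /** Runs the job once per input with at most `workers` in flight; results keep input order. */
    static <T, R> List<R> processAll(List<T> inputs, int workers, Function<T, R> job) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T input : inputs) {
                Callable<R> task = () -> job.apply(input);
                futures.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get()); // blocks until that job has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while waiting for jobs", e);
        } catch (ExecutionException e) {
            throw new RuntimeException("a job failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        long assumedMachineMemoryMb = 32L * 1024; // assumption: a 32 GB machine
        int workers = maxConcurrentJobs(cores, assumedMachineMemoryMb, 4L * 1024);

        // Hypothetical file names; the lambda stands in for the real image job.
        List<String> inputs = Arrays.asList("a.img", "b.img", "c.img");
        List<String> outputs = processAll(inputs, workers, name -> name + ".out");
        System.out.println(workers + " workers -> " + outputs);
    }
}
```

This covers only the fan-out inside one machine. Starting machines when work arrives and shutting them down afterwards is the "more management" part that the managed services would handle.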

Can anyone advise which solution(s) would be appropriate for this?



