I have many images that I need to run through a Java program to create more image files. This is an embarrassingly parallel case. Each input file is about 500 MB, needs about 4 GB of memory during processing, and takes 30 seconds to 2 minutes to run. The Java program is multithreaded, but I gain more from processing several input files in parallel than from giving one file more threads. I need to kick off a batch several times a day, and I do not want to turn the cluster on and off manually or pay for it 24/7.
I'm a bit lost in the variety of cloud options out there:
- AWS Lambda has insufficient system resources (not enough memory).
- Google Cloud Dataflow: it appears that I would have to write my own pipeline source to use their Cloud Storage buckets. Fine, but I don't want to waste time doing that if it's not an appropriate solution (which it might be, I can't tell yet).
- Google Cloud Dataproc: this is not a map/reduce, Hadoop-y situation, but it might work nonetheless. I'd rather not manage my own cluster though.
- Google Compute Engine or AWS with autoscaling, where I just kick off one process per core on each machine. This means more management on my side.
- Microsoft Data Lake is not released yet and looks Hadoop-y.
- Microsoft Batch seems quite appropriate (but I'm asking because I remain curious about other options).
Can anyone advise what appropriate solution(s) would be for this?
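To make the "one process per core" option concrete, here is a rough sketch of what each machine would run. It is only an illustration under assumptions: the class name `ParallelImageRunner`, its command-line arguments, the `-Xmx4g` heap cap, and the idea that the program is a jar taking an input file and an output directory are all hypothetical. The only point is that the number of concurrent jobs is bounded by memory (about 4 GB per file) as well as by core count.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ParallelImageRunner {

    // Concurrent jobs are limited by whichever runs out first: cores or memory.
    static int maxWorkers(int cores, long totalMemoryBytes, long perJobBytes) {
        long byMemory = totalMemoryBytes / perJobBytes;
        return (int) Math.max(1, Math.min(cores, byMemory));
    }

    // Hypothetical invocation: java -Xmx4g -jar <jar> <input> <outputDir>.
    // The real program's arguments may differ.
    static List<String> buildCommand(String jarPath, Path input, Path outputDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx4g"); // assumed heap cap matching the ~4 GB requirement
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add(input.toString());
        cmd.add(outputDir.toString());
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 4) {
            System.err.println("usage: ParallelImageRunner <jar> <inputDir> <outputDir> <machineMemoryGb>");
            return;
        }
        String jar = args[0];
        Path inputDir = Paths.get(args[1]);
        Path outputDir = Paths.get(args[2]);
        long gib = 1024L * 1024L * 1024L;
        long machineMemory = Long.parseLong(args[3]) * gib;

        int workers = maxWorkers(Runtime.getRuntime().availableProcessors(),
                                 machineMemory, 4L * gib);

        // Collect the regular files in the input directory.
        List<Path> inputs;
        try (Stream<Path> files = Files.list(inputDir)) {
            inputs = files.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        // One external process per input file, at most `workers` at a time.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Integer>> results = new ArrayList<>();
        for (Path input : inputs) {
            List<String> cmd = buildCommand(jar, input, outputDir);
            Callable<Integer> task = () -> new ProcessBuilder(cmd).inheritIO().start().waitFor();
            results.add(pool.submit(task));
        }
        pool.shutdown();

        // Report any job that exited with a non-zero status.
        for (int i = 0; i < results.size(); i++) {
            int exitCode = results.get(i).get();
            if (exitCode != 0) {
                System.err.println("failed (" + exitCode + "): " + inputs.get(i));
            }
        }
    }
}
```

With these numbers, a 16-core machine with 32 GB of memory would run 8 files at a time, not 16, so memory rather than cores would decide which instance type is cost-effective.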