I have a Spark (1.3.1) cluster on EC2 (region: us-east). For the past two months I haven't had any problems with it, but since yesterday I can't SSH into one slave (or I can, but it takes a very long time). My jobs don't fail; they just wait indefinitely because they are trying to connect to that slave and it doesn't answer.
I tried to create a new Spark cluster with spark-ec2, but I got this error:
Warning: SSH connection error. (This could be temporary.)
Host: 54.90.24.42
SSH return code: 255
SSH output: ssh: connect to host 54.90.24.42 port 22: Connection refused
.
Warning: SSH connection error. (This could be temporary.)
Host: XX.XXX.XXX.XX
SSH return code: 255
SSH output: ssh: connect to host XX.XXX.XXX.XX port 22: Connection refused
As I am writing this, a colleague reports a similar problem on another cluster:
org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-10-231-187-233.ec2.internal/10.231.187.233:54801
All those problems seem to be linked.
Does anyone have an idea what it could be?
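In case it helps narrow things down, here is a rough sketch of a check that separates "connection refused" (the host answers but nothing is listening on the port) from a silent timeout (packets dropped or the host unreachable). It uses only the Python standard library; the function names `probe` and `report` and the 5-second timeout are my own choices, not anything from spark-ec2.

```python
import socket
import time


def probe(host, port=22, timeout=5.0):
    """Try one TCP connection and return (status, elapsed_seconds).

    status is "open", "refused", "timeout", or "error: <details>".
    """
    start = time.monotonic()
    try:
        # create_connection resolves the host and completes the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            status = "open"
    except ConnectionRefusedError:
        # The host replied with a reset: it is up, but the port is closed
        status = "refused"
    except socket.timeout:
        # No reply within the timeout: dropped packets or an unresponsive host
        status = "timeout"
    except OSError as exc:
        # Anything else, e.g. DNS failure or "no route to host"
        status = "error: %s" % exc
    return status, time.monotonic() - start


def report(hosts, port=22, timeout=5.0):
    """Print one line per host with the probe result and how long it took."""
    for host in hosts:
        status, elapsed = probe(host, port, timeout)
        print("%s:%d %s (%.2fs)" % (host, port, status, elapsed))
```

For example, `report(["54.90.24.42"])` would probe the SSH port of the host from the log above, and `probe("10.231.187.233", 54801)` would test the shuffle port from my colleague's exception (that private address is only reachable from inside the cluster's network).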