I have been facing an issue with my AWS EMR cluster where some jobs in a step get stuck. I know the best solution would be to fix whatever is causing the jobs to get stuck. In the meantime, is there a way I can do the following?
- List the current step on the cluster.
- List the jobs in the step with status RUNNING.
- Kill one of those jobs at random and wait 3600 seconds before checking the running jobs again. At the same time, keep track of the number of jobs with status PENDING.
- If the number of jobs with status PENDING still has not decreased, kill another job.
I have been facing this issue for the last 3 days and have been manually killing the jobs by logging into the cluster. Help would be greatly appreciated! Thanks in advance!
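For reference, the steps above could be sketched roughly as follows. This is only a sketch under several assumptions: that the "jobs" are YARN applications, that the script runs on the master node where the `yarn` and `aws` CLIs are available, and that the PENDING status corresponds to YARN's `ACCEPTED` state. The function names and the cluster ID passed to `main` are hypothetical.

```python
import random
import subprocess
import time

CHECK_INTERVAL_SECONDS = 3600  # wait time between checks, as described above


def current_step_ids(cluster_id):
    """List IDs of steps currently RUNNING on the cluster via the AWS CLI."""
    result = subprocess.run(
        ["aws", "emr", "list-steps", "--cluster-id", cluster_id,
         "--step-states", "RUNNING", "--query", "Steps[].Id", "--output", "text"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def parse_app_ids(yarn_output):
    """Extract application IDs from `yarn application -list` output."""
    ids = []
    for line in yarn_output.splitlines():
        fields = line.split()
        # Data rows start with an ID such as application_<timestamp>_<n>.
        if fields and fields[0].startswith("application_"):
            ids.append(fields[0])
    return ids


def list_apps(state):
    """Return IDs of YARN applications in the given state."""
    result = subprocess.run(
        ["yarn", "application", "-list", "-appStates", state],
        capture_output=True, text=True, check=True,
    )
    return parse_app_ids(result.stdout)


def kill_app(app_id):
    """Kill one YARN application."""
    subprocess.run(["yarn", "application", "-kill", app_id], check=True)


def watchdog_pass(previous_waiting, list_fn=list_apps, kill_fn=kill_app,
                  choose=random.choice):
    """Run one check and kill a random running app if the queue has not shrunk.

    previous_waiting is None on the first pass, which kills one app if any
    are running. Returns (current waiting count, killed ID or None).
    """
    running = list_fn("RUNNING")
    # Assumption: "PENDING" jobs show up in YARN as ACCEPTED applications.
    waiting = len(list_fn("ACCEPTED"))
    stalled = previous_waiting is None or waiting >= previous_waiting
    killed = None
    if running and stalled:
        killed = choose(running)
        kill_fn(killed)
    return waiting, killed


def main(cluster_id):
    """Loop forever: report the current step, check jobs, then sleep."""
    previous = None
    while True:
        print("running steps:", current_step_ids(cluster_id))
        previous, killed = watchdog_pass(previous)
        print(f"waiting={previous} killed={killed}")
        time.sleep(CHECK_INTERVAL_SECONDS)
```

Calling `main("<cluster-id>")` with the real cluster ID would start the loop. Note that this cannot tell a stuck job from a slow one: it kills a random running job whenever the waiting count has not dropped since the previous check, including when nothing is waiting at all.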