I have been working on installing SolrCloud on AWS. On a fresh run everything works fine, and since Hadoop is one of the dependencies, high availability should be a given. Working from that assumption, I tried stopping Cloudera Manager (essentially freezing Hadoop, Solr, and the other components), then stopping the instances, so I could come back the next day and resume. But this never works. Below are the step-by-step actions I take before switching off and resuming.
- At 7:55, created a folder for every DataNode under ~/recovery and checked each node's health.
- Copied the NameNode and DataNode current directories (nn + dn) from all 9 hosts with the help of the 2.*.sh script (sketched after this list).
- Stopped Cloudera Manager and got ready to switch off the cluster.
- At 8:04, the cluster was stopped in Cloudera Manager; I made sure there was enough time between the copy step and the stop.
- Physically stopped the AWS instances at 8:05.
- Everything was down by 8:08.
- Started all the nodes again at 8:12.
- Everything came back up fine, except HDFS lost some blocks: some corrupted, some missing.
- SolrCloud failed completely, because most of the lost blocks were observed to belong to SolrCloud.
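For reference, here is a minimal sketch of what my 2.*.sh backup step does. The host names (node1..node9) and the /dfs/nn and /dfs/dn paths are assumptions based on Cloudera Manager defaults; substitute whatever dfs.namenode.name.dir and dfs.datanode.data.dir point to on your cluster. For brevity it pulls everything into one recovery directory instead of per-node folders.

```bash
#!/usr/bin/env bash
# Sketch of the 2.*.sh backup step. Hosts and paths are assumptions
# (CDH defaults); adjust to dfs.namenode.name.dir / dfs.datanode.data.dir.
set -euo pipefail

NN_HOST="node1"          # assumed NameNode host
DN_HOSTS=(node{1..9})    # assumed DataNode hosts
RECOVERY="$HOME/recovery"

mkdir -p "$RECOVERY"

# NameNode metadata: fsimage and edit logs live under current/.
rsync -a "${NN_HOST}:/dfs/nn/current" "$RECOVERY/nn-${NN_HOST}/"

# DataNode block storage state from every host (only practical on a
# test-sized cluster; this pulls the full block data over SSH).
for h in "${DN_HOSTS[@]}"; do
  mkdir -p "$RECOVERY/dn-${h}"
  rsync -a "${h}:/dfs/dn/current" "$RECOVERY/dn-${h}/"
done
```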
As you can see, I have taken every precaution; I even redistributed the nn + dn copies I had saved before switching off. But it did not work.
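The redistribution I mention above is essentially the reverse of the backup, again as a sketch under the same assumed hosts and paths. All HDFS roles must be fully stopped in Cloudera Manager before the directories are pushed back:

```bash
#!/usr/bin/env bash
# Reverse of the backup sketch: push the saved current/ directories back.
# Same assumed hosts/paths; run only while HDFS is completely stopped.
set -euo pipefail

NN_HOST="node1"
DN_HOSTS=(node{1..9})
RECOVERY="$HOME/recovery"

# Restore the NameNode metadata.
rsync -a "$RECOVERY/nn-${NN_HOST}/current" "${NN_HOST}:/dfs/nn/"

# Restore each DataNode's block state. Ownership and permissions must
# end up matching the hdfs user, or the roles will refuse to start.
for h in "${DN_HOSTS[@]}"; do
  rsync -a "$RECOVERY/dn-${h}/current" "${h}:/dfs/dn/"
done
```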
This is the fourth time it has failed, and restoring the cloud is a painful procedure. The reason I want to do this at all is to save the client some money when we are not running any tests.
I am still not sure why I can resume on my physical machines but not on AWS, and why only Solr loses blocks.
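To check which files actually own the lost blocks, hdfs fsck does the job; /solr below is an assumption for wherever your Solr HdfsDirectoryFactory data directory sits.

```bash
# Overall block health; the summary (missing/corrupt counts) is at the end.
hdfs fsck / | tail -n 20

# Exact files that own corrupt blocks; if they sit under the Solr data
# directory (assumed /solr here), the loss really is Solr-specific.
hdfs fsck / -list-corruptfileblocks
hdfs fsck /solr -files -blocks | grep -iE 'corrupt|missing'
```

This is how I observed that most of the corrupt blocks belong to the SolrCloud indexes.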