I want to autoscale a classification script that we use to categorize new documents in our database. This is the first time I have tried to run a script on Amazon with autoscaling, so I'm very new to this. I need to scale it because of the CPU load and the time it takes a single server to classify all the documents.
Multiple times a day we receive new documents that need to be classified. I use Amazon SQS to store the IDs of the new documents, and EC2 instances to process the documents in the queue. By default I run one instance, and I want to add more instances automatically depending on the number of messages in the queue. As far as I know, each instance should run its own copy of my script.
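To make the scaling rule concrete, here is a rough sketch of what I have in mind. The function name, the 500 messages per instance and the cap of 10 instances are made-up placeholders, not values from our setup. In practice the queue depth would come from SQS and the decision would presumably be made by an Auto Scaling policy rather than by my own code.

```python
import math


def desired_instances(queue_depth, messages_per_instance=500,
                      min_instances=1, max_instances=10):
    """Return how many worker instances to run for a given backlog.

    Hypothetical rule: one instance per `messages_per_instance` queued
    messages, clamped between `min_instances` and `max_instances`.
    """
    if queue_depth < 0:
        raise ValueError("queue_depth must be non-negative")
    needed = math.ceil(queue_depth / messages_per_instance)
    return max(min_instances, min(max_instances, needed))
```

With these placeholder numbers an empty queue keeps the single default instance running, and a backlog of 1200 messages would ask for 3 instances.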
The classifier: The classifier is the "machine" that we feed new documents into, and it returns the category that fits best. It first needs to be trained on documents we have categorized manually, so that it knows which documents belong in which category. I save the trained classifier to a file. This is the part I have trouble scaling, and I don't know whether it is even possible. As you can see, the classifier needs to be trained on all of the training data:
    # Train a classifier with the training data
    def train_classifier(self, classifier, train_data, subset=10000):
        # Train the given classifier on the first `subset` training documents
        trained_classifier = classifier.fit(train_data.data[:subset], train_data.target[:subset])
        return trained_classifier
So does anyone know whether the training can be scaled over multiple instances? Training a classifier can take quite a lot of CPU, and we don't know how big the job may get in the future if we add more root categories and more documents to train on.
The classification part is easier. After the classifier is trained I save it to a single file, which the "classify" script on each instance uses to classify one message from the queue at a time. So that part is not a problem to implement; the training part is where I'm stuck. Can anyone help?
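For reference, the classification side works roughly like the sketch below. It assumes the trained classifier is saved with Python's pickle module and exposes a scikit-learn style predict() method. The function names are only illustrative, not my actual code.

```python
import pickle


def save_classifier(trained_classifier, path):
    """Write the trained classifier to a single file (assumes it is picklable)."""
    with open(path, "wb") as f:
        pickle.dump(trained_classifier, f)


def load_classifier(path):
    """Load the classifier file once when the worker starts on an instance."""
    with open(path, "rb") as f:
        return pickle.load(f)


def classify_document(classifier, features):
    """Classify one document, i.e. one message taken from the queue.

    predict() works on a batch, so the single document is wrapped in a
    list and the first (only) prediction is returned.
    """
    return classifier.predict([features])[0]
```

Each instance would call load_classifier once at startup and then call classify_document for every message it pulls from the queue, so the instances never need to talk to each other.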