Saturday, 28 March 2015

SQS with MongoDB for handling duplicates

I have a couple of ideas for stopping duplicate handling of messages from Amazon's SQS queues. The app will also have a MongoDB server, which I think can be an effective part of either strategy:




  1. Store queue items in Mongo with a 'status' field that defaults to Pending, then use SQS to queue the ID of the new item. One of the worker processes receives the ID and does a findAndModify on the actual item in Mongo to set the status to Processing. If the item is already being processed, the worker flags the conflict instead.




  2. Store queue items in the queue itself. Workers pick up items from the queue, then attempt an insert into Mongo with the item ID and some other info. If the item already exists, the worker neither inserts nor continues, since it's a dupe.
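The claim step in strategy 1 could look roughly like the sketch below. It assumes `collection` is a Collection object from the official Node.js MongoDB driver and that the item ID is stored as `_id`; the function name `claimItem` and the `claimedAt` field are made up for illustration. It also swaps findAndModify for a conditional `updateOne`, which is enough when the worker only needs to know whether it won the claim, not what the document looked like.

```javascript
// Strategy 1 sketch: atomically flip Pending -> Processing.
// `collection` is assumed to be a MongoDB Node.js driver Collection.
// `claimItem` and `claimedAt` are hypothetical names.
async function claimItem(collection, itemId) {
  const result = await collection.updateOne(
    { _id: itemId, status: 'Pending' }, // matches only unclaimed items
    { $set: { status: 'Processing', claimedAt: new Date() } }
  );
  // modifiedCount is 1 only for the single worker whose update matched;
  // everyone else (or an unknown ID) gets 0 and can flag the conflict.
  return result.modifiedCount === 1;
}
```

A worker would call `claimItem` with the ID it received from SQS and skip the message when it returns false. The filter on `status` is what makes the second delivery lose: the update is applied to a single document atomically, so only one caller can see it as Pending.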




The problems and questions I have:




  • Solution 1 seems counter-intuitive: why use SQS at all? I think it's because having workers poll SQS is a better fit than having a whole load of worker processes poll Mongo for work.




  • Solution 2 I don't know how to implement. Is there an atomic find-and-insert-if-doesn't-exist? A simple get-or-insert-but-tell-me-which-occurred operation would do the trick.




  • Will either of these work at large scale, and/or is there a proven method that I haven't grasped?
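For the find-and-insert-if-doesn't-exist question in solution 2, one possible approach is to lean on a unique index and let the insert itself report the duplicate. The sketch below assumes a Node.js MongoDB driver Collection and uses the item ID as `_id`, which is always uniquely indexed. The name `recordIfNew` and the `receivedAt` field are hypothetical; 11000 is the server's duplicate-key error code.

```javascript
// Solution 2 sketch: insert the item ID and treat a duplicate-key error as
// "already seen". `collection` is assumed to be a MongoDB Node.js driver
// Collection; `recordIfNew` and `receivedAt` are hypothetical names.
const DUPLICATE_KEY = 11000; // server error code for a unique-index violation

async function recordIfNew(collection, itemId, info = {}) {
  try {
    await collection.insertOne({ ...info, _id: itemId, receivedAt: new Date() });
    return true; // first delivery: safe to process
  } catch (err) {
    if (err && err.code === DUPLICATE_KEY) return false; // dupe: skip it
    throw err; // anything else is a real failure
  }
}
```

The return value is the "tell me which occurred" part: true means this worker performed the insert, false means the ID was already there.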






Hmm, having just written the question above, I had a thought for a get-or-insert-but-tell-me-which-occurred operation (in JS pseudocode):



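var thingy = Math.random(); // placeholder payload

// 'items' is a placeholder collection name; item_id comes from the SQS message
var previous = db.items.findAndModify({
  query: { id: item_id },
  // operator update, so an upsert also copies 'id' from the query into the new document
  update: { $setOnInsert: { thingy: thingy } },
  upsert: true,
  new: false, // return the document as it was before this call
  fields: { thingy: 1 }
});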


If the item exists (and this is a conflict), then since new is false, the old document will be returned.


If the item didn't exist, the upsert inserts it, and since new is false there is no old document to return, so the result should be null (older MongoDB versions could apparently return an empty document {} in some cases).


So either we got null, indicating the call resulted in an insert, or an actual document, indicating it was a get and that ID already exists... all atomic. The thingy is in there because I don't know if MongoDB actually needs data in the update; I guess it would? If I used $inc on a duplicates field instead, would that work with an upsert? Then we could get stats on dupes later.
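The `$inc` variant can be sketched as follows. `$inc` does work with an upsert: on insert the field is created with the increment as its value. The sketch assumes a Node.js MongoDB driver Collection recent enough to accept the `includeResultMetadata` option, and a unique index on `id` (without one, two concurrent upserts for the same ID could both insert). The function name `getOrInsert` and the `deliveries` field are hypothetical; the field counts every delivery, so dupes are `deliveries - 1`.

```javascript
// Get-or-insert sketch using $inc with an upsert.
// Assumptions: `collection` is a MongoDB Node.js driver Collection that
// supports includeResultMetadata, and `id` has a unique index.
// `getOrInsert` and `deliveries` are hypothetical names.
async function getOrInsert(collection, itemId) {
  const result = await collection.findOneAndUpdate(
    { id: itemId },
    { $inc: { deliveries: 1 } }, // created as 1 on insert, bumped on every dupe
    { upsert: true, returnDocument: 'before', includeResultMetadata: true }
  );
  // With returnDocument 'before', value is null when the call inserted the
  // document, and the previous document when the ID already existed.
  return { inserted: result.value === null, previous: result.value };
}
```

A worker would process the message only when `inserted` is true; `previous.deliveries` on a duplicate shows how many times the ID had been seen before this delivery.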


Is that right? Would that work?




