Sunday, 26 April 2015

Fast document search - 400K files and search terms with wildcards

I'm starting work on a project where I need to search a large number of text documents. A user will essentially perform a regex-style search to see which files contain the specified search string. I need the response time to be very quick, i.e. around 3 seconds. What type of platform is suitable for this task? I presume that Hadoop, for example, would be too slow.
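To make the requirement concrete, here is a minimal sketch of the brute-force version of this search: compile the user's pattern and scan every document. The `docs` dictionary and the `search_documents` helper are hypothetical stand-ins for the real files, and case-insensitive matching is an assumption.

```python
import re

def search_documents(docs, pattern):
    """Return the sorted names of documents whose text matches the regex."""
    # Assumption: matching is case-insensitive.
    compiled = re.compile(pattern, re.IGNORECASE)
    return sorted(name for name, text in docs.items() if compiled.search(text))

# Hypothetical in-memory stand-in for the ~400K files on disk.
docs = {
    "a.txt": "A method for indexing patent documents",
    "b.txt": "Wireless charging apparatus",
    "c.txt": "Indexes and search trees",
}

print(search_documents(docs, r"index\w*"))
```

Every query in this sketch reads every document, so the cost grows linearly with the size of the collection. That is why the response time is the concern.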

For example, freepatentsonline (a site I sometimes use) gives me patent results in an instant, without the resources of Google Patents. How do they do it so quickly? I need something like that functionality. Essentially, I don't know in advance what search strings the user will enter, and they could include wildcards, to be matched against roughly 400-500K documents (I think the site above searches over millions of patents).

Do I absolutely need to pre-index the documents to get this kind of speed? And how would you even do that with wildcards? Thanks, guys.
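As a rough illustration of how pre-indexing and wildcards might coexist, here is a minimal sketch of a character trigram index. This is an assumed technique for the sake of the question, not something the sites mentioned above are known to use. The literal pieces of the query narrow the candidate set through the index, and only those candidates are then verified with a regex. All names (`trigrams`, `build_index`, `wildcard_search`, `docs`) are hypothetical, and only `*` is treated as a wildcard.

```python
import re
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Map each trigram to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for gram in trigrams(text.lower()):
            index[gram].add(name)
    return index

def wildcard_search(docs, index, query):
    """Find documents matching query, where '*' matches any run of characters."""
    pieces = query.lower().split("*")
    # Step 1: use the index to keep only documents that contain every
    # trigram of every literal piece. Pieces shorter than 3 characters
    # contribute no trigrams and therefore do not narrow the candidates.
    candidates = set(docs)
    for piece in pieces:
        for gram in trigrams(piece):
            candidates &= index.get(gram, set())
    # Step 2: verify the surviving candidates with a real regex match,
    # since sharing trigrams does not guarantee a match.
    regex = re.compile(".*".join(re.escape(p) for p in pieces), re.DOTALL)
    return sorted(n for n in candidates if regex.search(docs[n].lower()))

# Hypothetical stand-in for the document collection.
docs = {
    "a.txt": "A method for indexing patent documents",
    "b.txt": "Wireless charging apparatus",
    "c.txt": "Indexes and search trees",
}
index = build_index(docs)
print(wildcard_search(docs, index, "index*doc"))
```

The index is built once, ahead of time. At query time, the expensive regex runs only on the documents that survive the trigram filter, not on the whole collection.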



