Thursday, 21 May 2015

Web Crawler for Amazon

First I will define the problem I am trying to solve, then describe what I have done so far.

1) I am working on a web crawler for Amazon.com. I want to scrape all the data about a product: category, title, ItemId, price, color, customer reviews, technical details, etc. I want whatever information it is possible to get.

2) I started with the Amazon Product Advertising API. It provides a great set of functions and parameters to extract whatever information you are looking for. But as I proceeded, I ran into a limitation: it only lets you request pages 1 through 10 and returns an error for any page number beyond that. Also, the number of items returned per page is limited to 10, even when the corresponding page on the website lists more than 10 items. That caps a single query at 100 items, so I had to say goodbye to it.

3) Now I am thinking of working directly with the Amazon website and looking for patterns in its URLs. This can be done with regular expressions and will involve a lot of pattern searching. However, someone told me this won't be as effective, because some of the data on an HTML page is loaded by JavaScript. A crawler that only downloads the page will see the static content, not the parts that JavaScript fills in afterwards.
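To make the URL-pattern idea concrete, here is a minimal sketch of what I have in mind. It assumes that a product URL carries a 10-character alphanumeric item ID right after `/dp/` or `/gp/product/`; that layout is my assumption from looking at a few URLs, not something documented, and the class name, method name and sample URL are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ItemIdExtractor {
    // Assumed URL layout (not confirmed): a 10-character ID made of
    // uppercase letters and digits, directly after /dp/ or /gp/product/,
    // followed by '/', '?', '#' or the end of the URL.
    private static final Pattern ITEM_ID =
        Pattern.compile("/(?:dp|gp/product)/([A-Z0-9]{10})(?:[/?#]|$)");

    // Returns the item ID if the URL matches the assumed layout,
    // otherwise an empty Optional.
    public static Optional<String> extractItemId(String url) {
        Matcher m = ITEM_ID.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical URL, used only to exercise the pattern.
        String url = "https://www.amazon.com/Some-Product-Name/dp/B00EXAMPLE/ref=sr_1_1";
        System.out.println(extractItemId(url).orElse("no item ID found"));
    }
}
```

This only covers recognising product pages and pulling out their IDs. It does nothing about the content that JavaScript loads later, which is the part I am unsure about.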

Does anyone have any suggestions about this approach, or could anyone suggest something better? Keep in mind that I am trying to get as much information as possible for as many products as possible.

Also, I forgot to mention that I am using Java.

Thanks in advance.



