I am trying to crawl a site. I wrote a script that works perfectly on my local system, but when I run it on an Amazon EC2 instance it throws a connection/protocol error:
A Connection error occurred. - ConnectionError(ProtocolError('Connection aborted.', error(110, 'Connection timed out')),)
When execution reaches the requests.get(url) line, it hangs for a while and I have to interrupt the process with CTRL+C (on Ubuntu).
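To avoid having to interrupt it by hand, I can pass a timeout so the hang turns into an explicit exception instead. This is just a minimal sketch; the URL here is a placeholder, not the actual site:

import requests

# Placeholder URL for illustration; the real target is the site mentioned below.
url = "http://example.com/some-page"

try:
    # 5s to establish the TCP connection, 15s to receive data, so the call
    # fails fast instead of waiting for the OS-level timeout (errno 110).
    resp = requests.get(url, timeout=(5, 15))
    print(resp.status_code)
except requests.exceptions.ConnectTimeout:
    print("Could not establish a TCP connection within 5 seconds")
except requests.exceptions.ReadTimeout:
    print("Connected, but the server sent no data within 15 seconds")

Even with this, the request on the EC2 instance never gets past the connection stage.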
I tried this script on 3 different AWS instances, and I had never run any crawling script against that site from those instances before, so I'm fairly sure the site has not blocked those particular IPs.
I have tried everything I can think of, such as setting cookies and using a session, but with no success; all of these work fine on my local system.
I want to know whether it is possible to block all remote/data-center IPs, or whether there is some way the server can detect that the request comes from a headless client or remote machine rather than an actual browser.
I was following this approach:
import json
import requests
from bs4 import BeautifulSoup

s = requests.Session()

# Browser-like headers so the request looks like ordinary Chromium traffic.
s.headers['User-Agent'] = ("Mozilla/5.0 (X11; Linux x86_64) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                           "Ubuntu Chromium/43.0.2357.130 Chrome/43.0.2357.130 Safari/537.36")
s.headers['Connection'] = "keep-alive"
s.headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
s.headers['Accept-Encoding'] = "gzip, deflate, sdch"
s.headers['Accept-Language'] = "en-US,en;q=0.8"

url = "http://ift.tt/1Jgk1sm"
resp = s.get(url)
This code works perfectly fine locally but not on the remote instance.
I can tell you the exact site (in the comments) if you need more information. Any help/ideas/tricks/hints would be appreciated. I have a fair amount of experience in web crawling/scraping, but I still have no clue; I have already spent a whole day on this.
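One thing I am considering, to check whether the problem is at the network level rather than in requests itself, is opening a raw TCP connection to the site's host from the EC2 instance. A rough sketch follows; example.com is a placeholder since I haven't named the site here:

import socket

# Placeholder host; the real site is the one I can share in the comments.
host = "example.com"

try:
    # If even a plain TCP connect to port 80 times out, the block is at the
    # network/firewall level, and no amount of header tweaking will help.
    sock = socket.create_connection((host, 80), timeout=10)
    print("TCP connection succeeded")
    sock.close()
except socket.timeout:
    print("TCP connection timed out (same errno 110 symptom as requests)")
except OSError as exc:
    print("TCP connection failed:", exc)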