While writing my crawler program, I ran into several problems. This post records the solutions I used in my program.
Access denied when using url directly
Some websites have protections against being scraped and deny requests that do not look like they come from a browser. In this case, we should send browser-like headers with the request.
The fields in the headers:
- User-Agent: for example, Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
- Referer: for example, https://www.google.com/
(For an introduction to crawlers, see: Web Crawler Basic)
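As a minimal sketch of how these two headers might be attached to a request, assuming the requests library is used as in the rest of this post. The helper names `build_headers` and `fetch` are made up for illustration, and `session` is assumed to be a `requests.Session`:

```python
# Header values copied from the examples above; swap in your own as needed.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/74.0.3729.169 Safari/537.36"
)


def build_headers(referer="https://www.google.com/"):
    """Return browser-like headers (hypothetical helper)."""
    return {"User-Agent": USER_AGENT, "Referer": referer}


def fetch(session, url):
    """GET `url` with the headers above.

    `session` is assumed to be a requests.Session; requests.get accepts
    the same `headers` and `timeout` keyword arguments.
    """
    return session.get(url, headers=build_headers(), timeout=10)
```

With requests installed, a call could look like `fetch(requests.Session(), "https://example.com/")`. Whether these two fields are enough depends on the site; some check other headers or cookies as well.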
Max retries exceeded with url
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.mzitu.com', port=443): Max retries exceeded with url: /184325/8 (Caused by SSLError(SSLError("bad handshake: SysCallError(60, 'ETIMEDOUT')")))
My first attempt at solving this problem:
- Add sleep(1) in every iteration of the image download loop.
- This eased the problem and kept the crawler running longer, but did not solve it completely.
- The program could download around 300 images before failing.
- To better withstand the site’s anti-scraping measures, I added two more features in the second version:
  - Sleep for a random amount of time before requesting each image.
  - Choose the User-Agent at random using the fake_useragent Python module.
- The results were still not very good. I’m not sure whether my IP is the cause.
- To scrape more, the program needs more IPs. (To be continued)
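The two features of the second version could be sketched roughly as follows. This is an illustration under assumptions, not my exact code: the delay bounds (1 to 3 seconds) are arbitrary, the function names are hypothetical, `session` is assumed to be a `requests.Session`, and a small hand-written User-Agent list stands in for `fake_useragent` (where `UserAgent().random` plays the same role) so the sketch has no extra dependency.

```python
import random
import time

# Small hand-written pool standing in for fake_useragent's UserAgent().random.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) "
    "Gecko/20100101 Firefox/67.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
]


def random_delay(low=1.0, high=3.0):
    """Pick a delay in seconds between `low` and `high` (assumed bounds)."""
    return random.uniform(low, high)


def random_headers(referer):
    """Build headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS), "Referer": referer}


def download_images(session, image_urls, referer, sleep=time.sleep):
    """Fetch each image, pausing a random time before every request.

    `session` is assumed to be a requests.Session. Returns the raw bytes
    of each image, in the same order as `image_urls`.
    """
    images = []
    for url in image_urls:
        sleep(random_delay())  # random pause instead of a fixed sleep(1)
        response = session.get(url, headers=random_headers(referer), timeout=10)
        response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
        images.append(response.content)
    return images
```

The `sleep` parameter exists only so the pause can be replaced in tests. In normal use the default `time.sleep` applies.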
Connection reset by peer
I’m not sure what causes this problem.
I added proxies to work around it (see: how to get the proxies).
Pass them to the requests.get function.
Although the program becomes slower, it is much more robust while scraping.
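A rough sketch of passing proxies to a request, under stated assumptions: the proxy addresses are placeholders from a documentation-only IP range, the helper names are hypothetical, and `session` is assumed to be a `requests.Session`. `requests.get` takes the same `proxies` keyword argument.

```python
import random

# Placeholder proxy addresses (documentation-only range); replace with real ones.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:3128",
]


def pick_proxies(pool=PROXY_POOL):
    """Return a `proxies` mapping in the format requests expects."""
    proxy = random.choice(pool)
    # Route both plain HTTP and HTTPS traffic through the same proxy.
    return {"http": proxy, "https": proxy}


def fetch_via_proxy(session, url, headers=None):
    """GET `url` through a randomly chosen proxy from the pool."""
    return session.get(url, headers=headers, proxies=pick_proxies(), timeout=10)
```

Picking a proxy at random per request spreads the traffic across IPs, which is one plausible reason the crawler becomes more robust, at the cost of the extra hop that slows it down.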