Web Crawler Basics

There are several Python modules for fetching HTML from the web.

urllib and requests are two different Python modules that can be used for web crawling.

For beginners, requests is recommended.
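As a minimal sketch of the difference, here is the same page fetched with both modules. This assumes the third-party requests package is installed and that https://example.com (a stable demo page) is reachable:

```python
import urllib.request

import requests

url = "https://example.com"  # stable demo page, used here as an example

# urllib (standard library): lower-level, returns bytes you must decode
with urllib.request.urlopen(url) as resp:
    html_urllib = resp.read().decode("utf-8")

# requests (third-party): simpler API, decodes the body for you
r = requests.get(url, timeout=10)
html_requests = r.text

print(r.status_code)  # 200 on success
```

The requests version is shorter and handles encoding, redirects, and connection pooling for you, which is why it is usually the better starting point.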

Headers

An HTTP request header is information, in the form of a text record, that a user's browser sends to a web server, describing what the browser wants and what it will accept back from the server.
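You can inspect the headers that requests would attach without touching the network, by preparing a request through a Session (a sketch, assuming the requests package is installed; the URL is just a placeholder):

```python
import requests

# Build the request without sending it; the Session fills in its
# default headers (User-Agent, Accept, Accept-Encoding, ...).
s = requests.Session()
prepared = s.prepare_request(requests.Request("GET", "https://example.com"))
print(prepared.headers)
```

By default the User-Agent identifies the library itself (something like `python-requests/2.x`), which is exactly what many sites look for when blocking crawlers.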

User-Agent

The User-Agent appears in an HTTP request header, not an HTTP response. In general, the request is sent from the browser to the web application, so the User-Agent field is filled in by the browser, and different browsers fill it with different values.
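To make a crawler look like a browser, you can override the default User-Agent. A sketch, assuming requests is installed; the User-Agent string below is one example of a Chrome-style value, not a required one:

```python
import requests

# Example browser-style User-Agent string (values vary by browser/version)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36"
}
s = requests.Session()
prepared = s.prepare_request(
    requests.Request("GET", "https://example.com", headers=headers))
print(prepared.headers["User-Agent"])
```

Passing the same `headers` dict to `requests.get(url, headers=headers)` applies it to a real request.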

Blog: Web Crawler Get the User-Agent

Referer

Referer is an optional HTTP header field that identifies the address of the web page that linked to the resource being requested. By checking the Referer, the server can see where the request originated.

Some websites use this to block crawlers, so you may need to set the Referer yourself.
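Setting a Referer works the same way as setting a User-Agent: pass it in the headers dict. A sketch, assuming requests is installed; the URLs are hypothetical placeholders:

```python
import requests

# Pretend the request was linked from the site's own front page;
# some servers reject requests whose Referer doesn't match their domain.
headers = {"Referer": "https://example.com/"}
s = requests.Session()
prepared = s.prepare_request(
    requests.Request("GET", "https://example.com/page", headers=headers))
print(prepared.headers["Referer"])
```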

Reference

  1. Python 爬虫基础之urllib和requests (Python crawler basics: urllib and requests)
  2. My web crawler demo
  3. Stack Overflow: HTTP request header