There are several modules for accessing HTML through Python. urllib and requests are two different modules that can be used to write a crawler. For beginners, requests is recommended.
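As a minimal sketch of the difference, the snippet below fetches the same page with both modules; the URL is only a placeholder for illustration.

```python
import urllib.request

import requests

url = "https://example.com"  # placeholder URL

# urllib: part of the standard library, somewhat more verbose
with urllib.request.urlopen(url) as resp:
    html_urllib = resp.read().decode("utf-8")

# requests: third-party package (pip install requests), simpler API
html_requests = requests.get(url).text

print(len(html_urllib), len(html_requests))
```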
Headers
An HTTP request header is information, in the form of text records, that a user's browser sends to a web server, containing the details of what the browser wants and will accept back from the server.
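A quick way to see these records is to print the headers requests sends by default and the headers the server returns; the sketch below uses a placeholder URL.

```python
import requests

r = requests.get("https://example.com")  # placeholder URL

# Headers that requests sent on our behalf (User-Agent, Accept, ...)
print(r.request.headers)

# Headers returned by the server in the response
print(r.headers)
```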
User-Agent
The User-Agent appears in an HTTP request header, not an HTTP response header. In general, the request is sent from the browser to the web application, so the User-Agent field is filled in by the browser. Different browsers fill this field with different values.
Blog: Web Crawler Get the User-Agent
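A crawler can imitate a browser by overriding the default User-Agent. The sketch below is a minimal example; the browser string and URL are placeholder values, not required ones.

```python
import requests

# Example User-Agent string copied from a desktop Chrome browser.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

# Send a browser-like User-Agent instead of the default
# "python-requests/x.y.z" value.
r = requests.get("https://example.com", headers=headers)
print(r.status_code)
```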
Referer
The Referer is an optional HTTP header field that identifies the address of the webpage that linked to the resource being requested. By checking the Referer, the new webpage can see where the request originated.
Some websites use this field to block crawlers, so you may need to set the Referer yourself.
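A minimal sketch of setting the Referer (together with a User-Agent) on a requests call; both URLs below are placeholder values.

```python
import requests

# Pretend the request was linked from the site's own front page.
headers = {
    "Referer": "https://example.com/",
    "User-Agent": "Mozilla/5.0",
}

r = requests.get("https://example.com/protected-page", headers=headers)
print(r.status_code)
```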