Feature #758
Handle forms authentication
100%
Description
Add support for forms authentication.
History
#1 Updated by Luke Murphey over 10 years ago
"Sometimes you might need to create an account and login to access the information you need. If you have a good HTTP library that handles logins and automatically sending session cookies (did I mention how awesome Requests is?), then you just need your scraper login before it gets to work."
#3 Updated by Luke Murphey over 9 years ago
Might be able to support this with Grab: https://github.com/lorien/grab
#6 Updated by Luke Murphey about 9 years ago
Might be able to make this work using mechanize (http://wwwsearch.sourceforge.net/mechanize/)
See http://stackoverflow.com/questions/4847226/form-based-authentication-with-python
#7 Updated by Luke Murphey about 9 years ago
- Priority changed from Normal to High
#8 Updated by Luke Murphey over 8 years ago
#9 Updated by Luke Murphey over 7 years ago
httplib2 doesn't seem to persist sessions very well, see:
- https://github.com/jcgregorio/httplib2/wiki/Examples
- https://stackoverflow.com/questions/923296/keeping-a-session-in-python-while-making-http-requests
Requests is likely a better option: https://stackoverflow.com/questions/19566645/python-httplib2-https-login-fails
#10 Updated by Luke Murphey over 7 years ago
The user is going to need to define several things:
- Authentication URL
- Username
- Password
- Username field name
- Password field name
Open questions:
- How do I get the built-in client and the browser to both support this?
- Can the cookies be transferred?
- What about CSRF protection?
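One possible answer to the CSRF question: fetch the login page first, copy any hidden inputs (where CSRF tokens usually live) into the POST body, then add the username and password under the configured field names. The sketch below uses only the Python standard library. `HiddenInputParser` and `build_login_request` are hypothetical names, not part of the app, and the parser naively collects hidden inputs from every form on the page.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        input_type = (attributes.get("type") or "").lower()
        if input_type == "hidden" and attributes.get("name"):
            self.hidden_fields[attributes["name"]] = attributes.get("value") or ""


def build_login_request(authentication_url, login_page_html, username, password,
                        username_field="username", password_field="password"):
    """Build (but do not send) the POST that submits the login form."""
    parser = HiddenInputParser()
    parser.feed(login_page_html)

    # Start with the hidden fields (e.g. a CSRF token), then add credentials.
    form_data = dict(parser.hidden_fields)
    form_data[username_field] = username
    form_data[password_field] = password

    body = urlencode(form_data).encode("utf-8")
    return Request(authentication_url, data=body, method="POST")
```

Sending the request through an opener built with `urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))`, where `jar` is an `http.cookiejar.CookieJar`, would keep the session cookie for later requests. Whether those cookies can be handed to a browser is still an open question.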
#11 Updated by Luke Murphey over 7 years ago
A conf file might look like this:
    authentication_url=https://domain.com/auth
    username_field=username
    password_field=password
    authentication_action=POST
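A sketch of how settings like these could be read and validated, assuming one key=value pair per line. `parse_auth_settings`, the list of required keys, and the POST default are assumptions for illustration; the app's real configuration loading may differ.

```python
REQUIRED_KEYS = ("authentication_url", "username_field", "password_field")


def parse_auth_settings(conf_text):
    """Parse key=value lines into a dict of forms-authentication settings."""
    settings = {"authentication_action": "POST"}  # assumed default

    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first "=" only, so URLs containing "=" stay intact.
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError("Expected key=value, got: %r" % line)
        settings[key.strip()] = value.strip()

    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(missing))

    settings["authentication_action"] = settings["authentication_action"].upper()
    return settings
```

The username and password are left out of this file on purpose in the sketch; they would presumably be stored separately from the field names.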
#12 Updated by Luke Murphey over 7 years ago
Tried mechanize. It actually works. Here is a Redmine login:
    import mechanize

    url = "http://Lukemurphey.net/login"
    br = mechanize.Browser()
    br.set_handle_robots(False)  # ignore robots
    br.open(url)
    br.select_form(nr=2)
    br.form['username'] = 'Luke'
    br.form['password'] = 'OPENSESAME'
    res = br.submit()
    content = res.read()

To use it, you will need:
- mechanize: https://github.com/python-mechanize/mechanize
- html5lib 0.999999999: https://pypi.python.org/pypi/html5lib
- six 1.10: https://pypi.python.org/pypi/six
- webencodings 0.5.1: https://pypi.python.org/pypi/webencodings
#13 Updated by Luke Murphey over 7 years ago
To convert this over, I would need:
- The ability to obtain the raw HTML in get_result_built_in_client()
- The ability to obtain the response code in get_result_built_in_client()
- The ability to assign a proxy server in get_http_client()
- Note that get_http_client() is not currently used by scrape_page()
- https://stackoverflow.com/questions/1997894/pythons-mechanize-proxy-support
- The ability to assign HTTP credentials
- The ability to set the user-agent
- Break out authentication from scrape_page() into a setAuthentication() call
- Put the HTTP client under an abstraction layer so that I can switch it out as I see fit
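The list above could translate into an interface along these lines. This is only a sketch: `HTTPClient`, `CannedHTTPClient`, `get_url()` and the setter names are hypothetical, and `set_authentication()` is a snake_case spelling of the setAuthentication() call mentioned above. A mechanize-backed or Requests-backed class would implement the same two abstract methods.

```python
from abc import ABC, abstractmethod


class HTTPClient(ABC):
    """Hypothetical interface that the scraper would code against."""

    def __init__(self):
        self.user_agent = None
        self.proxy = None          # (host, port) tuple, or None for no proxy
        self.credentials = None    # (username, password) for HTTP auth
        self.response_code = None  # status code of the most recent request

    def set_user_agent(self, user_agent):
        self.user_agent = user_agent

    def set_proxy(self, host, port):
        self.proxy = (host, port)

    def set_credentials(self, username, password):
        self.credentials = (username, password)

    @abstractmethod
    def set_authentication(self, authentication_url, username, password,
                           username_field="username", password_field="password"):
        """Log in through the form so later requests carry the session."""

    @abstractmethod
    def get_url(self, url):
        """Return the raw response body and record self.response_code."""


class CannedHTTPClient(HTTPClient):
    """In-memory stand-in for tests; serves pages from a dict."""

    def __init__(self, pages):
        super().__init__()
        self.pages = pages
        self.authenticated = False

    def set_authentication(self, authentication_url, username, password,
                           username_field="username", password_field="password"):
        # A real client would submit the login form here.
        self.authenticated = True

    def get_url(self, url):
        if url in self.pages:
            self.response_code = 200
            return self.pages[url]
        self.response_code = 404
        return b""
```

With this split, scrape_page() would only call `get_url()` and read `response_code`, so swapping the underlying library would not touch the scraping logic.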
#14 Updated by Luke Murphey over 7 years ago
- Category set to Input: Web Spider
- Target version set to 4.4
#16 Updated by Luke Murphey over 7 years ago
- Related to Feature #1963: Auto-discover form fields added
#17 Updated by Luke Murphey over 7 years ago
Need to have some logging if the fields cannot be determined.
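A rough sketch of what that logging could look like, combined with a simple field-discovery heuristic: the first password input is taken as the password field, and the nearest text input before it as the username field. The heuristic, the logger name, and the function names are assumptions for illustration, not the approach taken in Feature #1963.

```python
import logging
from html.parser import HTMLParser

logger = logging.getLogger("forms_authentication")  # hypothetical logger name


class LoginFieldParser(HTMLParser):
    """Finds the first password input and the text input just before it."""

    def __init__(self):
        super().__init__()
        self.last_text_field = None
        self.username_field = None
        self.password_field = None

    def handle_starttag(self, tag, attrs):
        if tag != "input" or self.password_field is not None:
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        input_type = (attributes.get("type") or "text").lower()
        if not name:
            return
        if input_type == "password":
            self.password_field = name
            self.username_field = self.last_text_field
        elif input_type in ("text", "email"):
            self.last_text_field = name


def discover_login_fields(login_page_html):
    """Return (username_field, password_field); either may be None."""
    parser = LoginFieldParser()
    parser.feed(login_page_html)

    if parser.password_field is None:
        logger.warning("Could not determine the password field on the login page")
    if parser.username_field is None:
        logger.warning("Could not determine the username field on the login page")

    return parser.username_field, parser.password_field
```

Callers can then fall back to the configured field names whenever one of the returned values is None.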
#18 Updated by Luke Murphey over 7 years ago
- Related to Feature #1968: Add browser support for forms authentication added
#19 Updated by Luke Murphey over 7 years ago
- Status changed from New to Closed