Feature #758: Handle forms authentication - Website Input - LukeMurphey.net

#1 Updated by Luke Murphey about 11 years ago

"Sometimes you might need to create an account and login to access the information you need. If you have a good HTTP library that handles logins and automatically sending session cookies (did I mention how awesome Requests is?), then you just need your scraper login before it gets to work."

http://blog.hartleybrody.com/web-scraping/

#2 Updated by Luke Murphey almost 11 years ago

http://wwwsearch.sourceforge.net/mechanize/
http://stackoverflow.com/questions/11685235/login-using-python-in-basic-html-form

#3 Updated by Luke Murphey almost 10 years ago

Might be able to support this with grab: https://github.com/lorien/grab

#4 Updated by Luke Murphey almost 10 years ago

http://answers.splunk.com/answers/301772/website-input-how-do-i-monitor-a-forum-that-requir.html#answer-305006

#5 Updated by Luke Murphey over 9 years ago

https://answers.splunk.com/answers/339200/website-input-how-far-off-is-support-for-forms-bas.html

#6 Updated by Luke Murphey over 9 years ago

Might be able to make this work using mechanize (http://wwwsearch.sourceforge.net/mechanize/)

See http://stackoverflow.com/questions/4847226/form-based-authentication-with-python

#7 Updated by Luke Murphey over 9 years ago

Priority changed from Normal to High

#8 Updated by Luke Murphey over 9 years ago

http://wwwsearch.sourceforge.net/mechanize/

#9 Updated by Luke Murphey about 8 years ago

httplib2 doesn't seem to persist sessions very well. Requests is likely a better option, see:

https://stackoverflow.com/questions/19566645/python-httplib2-https-login-fails

#10 Updated by Luke Murphey about 8 years ago

The user is going to need to define several things:

Authentication URL
Username
Password
Username field name
Password field name

Concerns:

How do I get the built-in client and the browser to both support this?
  1. Can the cookies be transferred?
What about CSRF protection?
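
On the CSRF question, one common approach is to fetch the login page first and send back any hidden form fields (which is where CSRF tokens usually live) along with the credentials. The following is only a sketch of that idea using the standard library; HiddenFieldCollector, collect_hidden_fields and the sample markup (including the authenticity_token field name) are made up for illustration and are not part of the app.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden_fields[attributes["name"]] = attributes.get("value") or ""


def collect_hidden_fields(login_page_html):
    """Return the hidden fields found in the given HTML (hypothetical helper)."""
    parser = HiddenFieldCollector()
    parser.feed(login_page_html)
    return parser.hidden_fields


# Made-up login form; a real page would be fetched from the authentication URL
sample_html = """
<form action="/login" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

# Start from the hidden fields, then add the credentials
payload = collect_hidden_fields(sample_html)
payload.update({"username": "Luke", "password": "OPENSESAME"})
```

This only helps if the same session (cookies included) is used for both fetching the form and submitting it.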

#11 Updated by Luke Murphey about 8 years ago

A conf file might look like this:

authentication_url=https://domain.com/auth
username_field=username
password_field=password
authentication_action=POST
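
A rough sketch of how settings like these could be read and turned into a login request. The names FormAuthSettings, parse_auth_settings and build_login_request are hypothetical, and the parser only handles plain key=value lines; the real input would presumably get these values from Splunk's own configuration handling.

```python
from dataclasses import dataclass, fields


@dataclass
class FormAuthSettings:
    """Hypothetical container for the settings shown above."""
    authentication_url: str
    username_field: str = "username"
    password_field: str = "password"
    authentication_action: str = "POST"


def parse_auth_settings(conf_text):
    """Parse key=value lines, ignoring blanks, comments and unknown keys."""
    known = {f.name for f in fields(FormAuthSettings)}
    values = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() in known:
            values[key.strip()] = value.strip()
    return FormAuthSettings(**values)


def build_login_request(settings, username, password):
    """Return (method, url, form_data) describing the login submission."""
    form_data = {
        settings.username_field: username,
        settings.password_field: password,
    }
    return settings.authentication_action.upper(), settings.authentication_url, form_data


CONF_TEXT = """
authentication_url=https://domain.com/auth
username_field=username
password_field=password
authentication_action=POST
"""

settings = parse_auth_settings(CONF_TEXT)
method, url, form_data = build_login_request(settings, "Luke", "OPENSESAME")
```

The username and password themselves are left out of the conf text here, as in the example above.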

#12 Updated by Luke Murphey about 8 years ago

Tried mechanize. It actually works. Here is a Redmine login:

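import mechanize

# Login page of the Redmine instance
url = "http://Lukemurphey.net/login"

br = mechanize.Browser()
br.set_handle_robots(False)  # ignore robots.txt
br.open(url)

# Forms are indexed from zero, so nr=2 selects the third form on the page
br.select_form(nr=2)

# Fill in the credentials using the form's field names
br.form['username'] = 'Luke'
br.form['password'] = 'OPENSESAME'

# Submit the form and read the raw HTML of the response
res = br.submit()
content = res.read()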

To use it, you will need:

mechanize: https://github.com/python-mechanize/mechanize
html5lib 0.999999999: https://pypi.python.org/pypi/html5lib
six 1.10: https://pypi.python.org/pypi/six
webencodings 0.5.1: https://pypi.python.org/pypi/webencodings

#13 Updated by Luke Murphey about 8 years ago

To convert this over, I would need:

The ability to obtain the raw HTML in get_result_built_in_client()
  1. https://stackoverflow.com/questions/9552773/raw-html-vs-dom-scraping-in-python-using-mechanize-and-beautiful-soup
The ability to obtain the response code in get_result_built_in_client()
  1. https://stackoverflow.com/questions/11809696/python-mechanize-browser-openurl-status-code
The ability to assign a proxy server in get_http_client()
  1. Note that get_http_client() is not currently used by scrape_page()
  2. https://stackoverflow.com/questions/1997894/pythons-mechanize-proxy-support
The ability to assign HTTP credentials
The ability to set the user-agent
  1. http://stockrt.github.io/p/emulating-a-browser-in-python-with-mechanize/
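
For reference, the same five capabilities can be sketched with the standard library's urllib.request, which runs without extra dependencies; this is a stand-in, not the mechanize code, and the linked answers cover the mechanize equivalents. build_http_opener and fetch are hypothetical helper names, and the proxy address and credentials in the usage lines are placeholders.

```python
import urllib.request


def build_http_opener(proxy_url=None, credentials=None, user_agent=None):
    """Build an opener with an optional proxy, HTTP credentials and user-agent.

    credentials is a (url, username, password) tuple.
    """
    handlers = []

    if proxy_url:
        # Route both HTTP and HTTPS requests through the proxy
        handlers.append(urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}))

    if credentials:
        url, username, password = credentials
        manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        manager.add_password(None, url, username, password)
        handlers.append(urllib.request.HTTPBasicAuthHandler(manager))

    opener = urllib.request.build_opener(*handlers)

    if user_agent:
        # Replaces the default "Python-urllib/x.y" header
        opener.addheaders = [("User-Agent", user_agent)]

    return opener


def fetch(opener, url, timeout=30):
    """Return (response_code, raw_html_bytes) for a successful request.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses; the status
    is then available as the exception's .code attribute.
    """
    with opener.open(url, timeout=timeout) as response:
        return response.status, response.read()


# Placeholder values for illustration only
opener = build_http_opener(
    proxy_url="http://127.0.0.1:8080",
    credentials=("http://Lukemurphey.net/", "Luke", "OPENSESAME"),
    user_agent="Website Input sketch",
)
```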

I think I should also:

Break out authentication from scrape_page() into a setAuthentication() call
Put the HTTP client under an abstraction layer so that I can switch it out as I see fit
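
A minimal sketch of what that abstraction layer might look like. The class and method names are hypothetical (set_authentication stands in for the setAuthentication() call mentioned above, and this scrape_page is a toy, not the app's function); the in-memory client exists only so the example runs without a network.

```python
from abc import ABC, abstractmethod


class HTTPClient(ABC):
    """Interface that a mechanize-, Requests- or httplib2-based client could implement."""

    def __init__(self):
        self.authentication = None

    def set_authentication(self, authentication_url, username, password,
                           username_field="username", password_field="password"):
        """Store the forms authentication details separately from scraping."""
        self.authentication = {
            "authentication_url": authentication_url,
            "username": username,
            "password": password,
            "username_field": username_field,
            "password_field": password_field,
        }

    @abstractmethod
    def get_url(self, url):
        """Return a (response_code, raw_html) tuple for the URL."""


class InMemoryClient(HTTPClient):
    """Stand-in implementation that serves pages from a dictionary."""

    def __init__(self, pages):
        super().__init__()
        self.pages = pages

    def get_url(self, url):
        if url in self.pages:
            return 200, self.pages[url]
        return 404, ""


def scrape_page(client, url):
    """Toy scraper that depends only on the HTTPClient interface."""
    response_code, raw_html = client.get_url(url)
    return {"response_code": response_code, "content_length": len(raw_html)}


client = InMemoryClient({"http://example.com/": "<html><body>Hello</body></html>"})
client.set_authentication("http://example.com/login", "Luke", "OPENSESAME")
result = scrape_page(client, "http://example.com/")
```

Because scrape_page() only sees get_url(), swapping the underlying library means writing another HTTPClient subclass rather than touching the scraping code.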

#14 Updated by Luke Murphey about 8 years ago

Category set to Input: Web Spider
Target version set to 4.4

#15 Updated by Luke Murphey about 8 years ago

http://mechanize.readthedocs.io/en/latest/browser_api.html

#16 Updated by Luke Murphey about 8 years ago

Related to Feature #1963: Auto-discover form fields added

#17 Updated by Luke Murphey about 8 years ago

Need to have some logging if the fields cannot be determined.
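
Something along these lines, perhaps. This is a sketch only: the heuristic (first password input, first text or email input) and the names InputCollector and discover_fields are assumptions for illustration, not how Feature #1963 actually works.

```python
import logging
from html.parser import HTMLParser

logger = logging.getLogger("form_auth_sketch")


class InputCollector(HTMLParser):
    """Records (type, name) for every named <input> element."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("name"):
            input_type = (attributes.get("type") or "text").lower()
            self.inputs.append((input_type, attributes["name"]))


def discover_fields(login_page_html, url="(unknown)"):
    """Guess the username and password field names, logging when a guess fails."""
    parser = InputCollector()
    parser.feed(login_page_html)

    password_field = next((n for t, n in parser.inputs if t == "password"), None)
    username_field = next((n for t, n in parser.inputs if t in ("text", "email")), None)

    if username_field is None:
        logger.warning("Unable to determine the username field, url=%s", url)
    if password_field is None:
        logger.warning("Unable to determine the password field, url=%s", url)

    return username_field, password_field


# Made-up markup for illustration
login_html = """
<form action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""
found = discover_fields(login_html, url="http://example.com/login")
```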

#18 Updated by Luke Murphey about 8 years ago

Related to Feature #1968: Add browser support for forms authentication added

#19 Updated by Luke Murphey about 8 years ago

Status changed from New to Closed

Related to Website Input - Feature #1963: Auto-discover form fields (Closed, 08/02/2017)
Related to Website Input - Feature #1968: Add browser support for forms authentication (Closed, 08/07/2017)
