Feature #758

Handle forms authentication

Added by Luke Murphey over 10 years ago. Updated over 7 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Input: Web Spider
Target version:
Start date: 07/30/2017
Due date:
% Done: 100%


Description

Add support for forms authentication.


Subtasks

Task #1949: Use a wrapper around the http client (Closed)

Task #1950: Include mechanize in the app (Closed)

Task #1951: Create a web client wrapper around mechanize (Closed)

Task #1952: Add options for forms authentication (backend) (Closed)

Task #1953: Use form authentication in web client (Closed)

Task #1954: Add ability to define form authentication on front-end (Closed)

Task #1955: Update controller to work with web-client wrapper (Closed)

Task #1956: Implement proxy support for mechanize (Closed)

Feature #1961: Update controller to support previews with forms authentication (Closed)

Bug #1962: Deal with unit test that responds with a 401 (Closed)

Task #1966: Update documentation and text regarding HTTP basic authentication and form fields (Closed)

Feature #1967: Improve handling for form fields (Closed)


Related issues

Related to Website Input - Feature #1963: Auto-discover form fields (Closed, 08/02/2017)
Related to Website Input - Feature #1968: Add browser support for forms authentication (Closed, 08/07/2017)

History

#1 Updated by Luke Murphey over 10 years ago

"Sometimes you might need to create an account and login to access the information you need. If you have a good HTTP library that handles logins and automatically sending session cookies (did I mention how awesome Requests is?), then you just need your scraper login before it gets to work."

http://blog.hartleybrody.com/web-scraping/

#3 Updated by Luke Murphey about 9 years ago

Might be able to support with this: https://github.com/lorien/grab

#7 Updated by Luke Murphey almost 9 years ago

  • Priority changed from Normal to High

#10 Updated by Luke Murphey over 7 years ago

The user is going to need to define several things:

  1. Authentication URL
  2. Username
  3. Password
  4. Username field name
  5. Password field name

Concerns:
  1. How do I get the built-in client and the browser to both support this?
    1. Can the cookies be transferred?
  2. What about CSRF protection?
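
One possible way to hold these five settings and to address the CSRF concern is sketched below. Everything named here (FormAuthSettings, HiddenInputCollector, build_login_payload) is hypothetical and not taken from the app. The assumption is that the login page carries its CSRF token in a hidden input, so copying every hidden input into the POST data sends the token back along with the credentials. As a simplification, it collects hidden inputs from the whole page rather than from one specific form.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Dict


@dataclass
class FormAuthSettings:
    """The five user-defined values from the list above (hypothetical names)."""
    authentication_url: str
    username: str
    password: str
    username_field: str = "username"
    password_field: str = "password"


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.hidden: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        input_type = (attributes.get("type") or "").lower()
        name = attributes.get("name")
        if input_type == "hidden" and name:
            self.hidden[name] = attributes.get("value") or ""


def build_login_payload(settings: FormAuthSettings, login_page_html: str) -> Dict[str, str]:
    """Merge hidden fields (e.g. a CSRF token) with the credentials."""
    collector = HiddenInputCollector()
    collector.feed(login_page_html)

    payload = dict(collector.hidden)
    payload[settings.username_field] = settings.username
    payload[settings.password_field] = settings.password
    return payload
```

The resulting dictionary would be posted to authentication_url by whichever HTTP client ends up being used. Whether the session cookies can then be handed to the browser is a separate question that this sketch does not answer.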

#11 Updated by Luke Murphey over 7 years ago

A conf file might look like this:

authentication_url=https://domain.com/auth
username_field=username
password_field=password
authentication_action=POST
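
A rough sketch of how settings like these could be read with Python's standard configparser. The snippet above has no section header, so the code adds one; the section name "auth" and the function parse_auth_conf are assumptions made for illustration only.

```python
import configparser
from typing import Dict

SAMPLE_CONF = """\
authentication_url=https://domain.com/auth
username_field=username
password_field=password
authentication_action=POST
"""


def parse_auth_conf(text: str, section: str = "auth") -> Dict[str, str]:
    """Parse key=value authentication settings into a plain dict."""
    # Interpolation is disabled so that a literal % in a URL is not misread.
    parser = configparser.ConfigParser(interpolation=None)

    # configparser requires a section, so wrap the bare keys in one.
    parser.read_string("[{}]\n{}".format(section, text))

    options = dict(parser[section])

    # Normalize the HTTP method, defaulting to POST when it is omitted.
    options["authentication_action"] = options.get("authentication_action", "POST").upper()
    return options
```

Calling parse_auth_conf(SAMPLE_CONF) returns a dictionary with the four keys shown in the conf file above.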

#12 Updated by Luke Murphey over 7 years ago

Tried mechanize. It actually works. Here is a Redmine login:

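import mechanize

url = "http://Lukemurphey.net/login"

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
br.open(url)

# Select the login form by position; nr is zero-based, so this is the third form on the page
br.select_form(nr=2)

# Fill in the credential fields by their name attributes
br.form['username'] = 'Luke'
br.form['password'] = 'OPENSESAME'

# Submit the form and read the body of the page returned after logging in
res = br.submit()
content = res.read()

To use it, you will need: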

#13 Updated by Luke Murphey over 7 years ago

To convert this over, I would need:

  1. The ability to obtain the raw HTML in get_result_built_in_client()
    1. https://stackoverflow.com/questions/9552773/raw-html-vs-dom-scraping-in-python-using-mechanize-and-beautiful-soup
  2. The ability to obtain the response code in get_result_built_in_client()
    1. https://stackoverflow.com/questions/11809696/python-mechanize-browser-openurl-status-code
  3. The ability to assign a proxy server in get_http_client()
    1. Note that get_http_client() is not currently used by scrape page
    2. https://stackoverflow.com/questions/1997894/pythons-mechanize-proxy-support
  4. The ability to assign HTTP credentials
    1. https://stackoverflow.com/questions/5291576/basic-and-form-authentication-with-mechanize-ruby
    2. https://stackoverflow.com/questions/40919543/python-mechanize-implementation-of-http-basic-auth
    3. https://stackoverflow.com/questions/1097380/can-python-mechanize-handle-http-auth
  5. The ability to set the user-agent
    1. http://stockrt.github.io/p/emulating-a-browser-in-python-with-mechanize/

I think I should also:
  • Break out authentication from scrape_page() into a setAuthentication() call
  • Put the HTTP client under an abstraction layer so that I can switch it out as I see fit
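
The two ideas above could look roughly like the sketch below: an abstract client with a separate authentication setter and a single fetch method that returns both the response code and the raw HTML. The names WebClient, CannedWebClient, get_url, and uses_form_authentication are hypothetical, and set_authentication is a PEP 8 spelling of the setAuthentication() call mentioned above. A mechanize-backed subclass is assumed but not shown; the in-memory subclass only demonstrates that implementations can be swapped.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class WebClient(ABC):
    """Abstraction layer so the underlying HTTP library can be replaced."""

    def __init__(self, user_agent: Optional[str] = None, proxy: Optional[str] = None) -> None:
        self.user_agent = user_agent
        self.proxy = proxy
        self.username: Optional[str] = None
        self.password: Optional[str] = None
        self.authentication_url: Optional[str] = None
        self.username_field = "username"
        self.password_field = "password"

    def set_authentication(self, username: str, password: str,
                           authentication_url: Optional[str] = None,
                           username_field: str = "username",
                           password_field: str = "password") -> None:
        """Store credentials separately from the call that fetches pages."""
        self.username = username
        self.password = password
        self.authentication_url = authentication_url
        self.username_field = username_field
        self.password_field = password_field

    def uses_form_authentication(self) -> bool:
        # Without an authentication URL the credentials are assumed to be
        # meant for plain HTTP authentication instead of a login form.
        return self.authentication_url is not None

    @abstractmethod
    def get_url(self, url: str) -> Tuple[int, str]:
        """Return (response code, raw HTML) for the given URL."""


class CannedWebClient(WebClient):
    """Stand-in implementation that serves pages from a dict."""

    def __init__(self, pages: Dict[str, str], **kwargs) -> None:
        super().__init__(**kwargs)
        self.pages = pages

    def get_url(self, url: str) -> Tuple[int, str]:
        if url in self.pages:
            return 200, self.pages[url]
        return 404, ""
```

With this shape, the page-scraping code depends only on get_url() and set_authentication(), so it does not need to know which HTTP library sits behind the wrapper.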

#14 Updated by Luke Murphey over 7 years ago

  • Category set to Input: Web Spider
  • Target version set to 4.4

#16 Updated by Luke Murphey over 7 years ago

#17 Updated by Luke Murphey over 7 years ago

Need to have some logging if the fields cannot be determined.
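
A minimal sketch of what that logging could look like, using only the standard library. The detection heuristic is an assumption and not the app's actual logic: the first password input is taken as the password field, and the nearest text or email input before it as the username field. The logger name and the names LoginFieldFinder and detect_login_fields are hypothetical.

```python
import logging
from html.parser import HTMLParser
from typing import Optional, Tuple

logger = logging.getLogger("form_authentication")


class LoginFieldFinder(HTMLParser):
    """Guesses the username and password field names on a login page."""

    def __init__(self) -> None:
        super().__init__()
        self.username_field: Optional[str] = None
        self.password_field: Optional[str] = None
        self._last_text_field: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if not name:
            return
        # An input without a type attribute behaves as a text input.
        input_type = (attributes.get("type") or "text").lower()
        if input_type == "password" and self.password_field is None:
            self.password_field = name
            self.username_field = self._last_text_field
        elif input_type in ("text", "email"):
            self._last_text_field = name


def detect_login_fields(html: str, url: str = "(unknown)") -> Tuple[Optional[str], Optional[str]]:
    """Return (username_field, password_field), logging when either is missing."""
    finder = LoginFieldFinder()
    finder.feed(html)

    if finder.username_field is None or finder.password_field is None:
        logger.warning(
            "Could not determine the form fields for %s (username_field=%r, password_field=%r)",
            url, finder.username_field, finder.password_field,
        )
    return finder.username_field, finder.password_field
```

The warning includes whichever field was found, so a half-detected form is distinguishable from a page with no login form at all.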

#18 Updated by Luke Murphey over 7 years ago

  • Related to Feature #1968: Add browser support for forms authentication added

#19 Updated by Luke Murphey over 7 years ago

  • Status changed from New to Closed
