Task #1809

Refactor scrape_page call

Added by Luke Murphey about 7 years ago. Updated about 7 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Input: Web Spider
Target version:
Start date: 04/04/2017
Due date:
% Done: 100%


Description

The scrape_page call takes a large number of arguments, which makes it unwieldy to call and maintain.

Associated revisions

Revision 398 (diff)
Added by lukemurphey about 7 years ago

Separating scraping from the input code

Reference #1809

Revision 401 (diff)
Added by lukemurphey about 7 years ago

Moving parameters to constructor & function calls

Reference #1809

Revision 407 (diff)
Added by lukemurphey about 7 years ago

Moving parameters to constructor & function calls

Reference #1809

Revision 408 (diff)
Added by lukemurphey about 7 years ago

Refactored controller and input

Redesigning the controller and input to work with the refactored code

Reference #1809

History

#1 Updated by Luke Murphey about 7 years ago

Some options:
  1. Make a separate scraper class
  2. Pass arguments as a config object
  3. Pass arguments as *args or **kwargs
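
To make the trade-offs between these three options concrete, here is a minimal sketch of what each call shape might look like. The names ScrapeConfig, scrape_page_with_config and scrape_page_with_kwargs are hypothetical, the default values are assumptions, and the bodies only echo their settings rather than fetching a page:

```python
from dataclasses import dataclass
from typing import Optional


# Option 1: a separate scraper class keeps the settings on the instance.
class WebScraper:
    def __init__(self, page_limit: int = 1, proxy_type: Optional[str] = None):
        self.page_limit = page_limit
        self.proxy_type = proxy_type

    def scrape_page(self, url: str) -> dict:
        # Placeholder: echoes the settings instead of fetching anything.
        return {"url": url, "page_limit": self.page_limit, "proxy_type": self.proxy_type}


# Option 2: a config object bundles the settings into one argument.
@dataclass
class ScrapeConfig:
    page_limit: int = 1
    proxy_type: Optional[str] = None


def scrape_page_with_config(url: str, config: ScrapeConfig) -> dict:
    return {"url": url, "page_limit": config.page_limit, "proxy_type": config.proxy_type}


# Option 3: **kwargs shortens the signature but hides which names are valid.
def scrape_page_with_kwargs(url: str, **kwargs) -> dict:
    return {
        "url": url,
        "page_limit": kwargs.get("page_limit", 1),
        "proxy_type": kwargs.get("proxy_type"),
    }
```

Options 1 and 2 keep the accepted settings visible in a signature, while option 3 silently ignores a misspelled keyword unless extra validation is added.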

#2 Updated by Luke Murphey about 7 years ago

Here are the functions invoked by scrape_page():

  • resolve_proxy_type
  • get_result_single
    • get_result_built_in_client
      • detect_encoding
    • get_result_browser
      • get_display
      • get_firefox_profile
      • add_auth_to_url
    • unescape
    • escape_field_name
    • extract_links
      • cleanup_link

#3 Updated by Luke Murphey about 7 years ago

Here are the functions that are modular input specific:

  • get_file_path
  • run

I think the best approach is to:

  1. break out the modular input and leave everything else in the scraper
  2. make the arguments to scrape_page class variables
  3. update the controller, search command, and tests
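
A rough sketch of that split, assuming the modular input keeps only run and get_file_path and delegates the rest to a scraper object. The class name WebInput, the method signatures, and the file naming in get_file_path are all hypothetical, and scrape_page returns a placeholder instead of real results:

```python
import os


class WebScraper:
    """Scraping logic only; knows nothing about the modular input."""

    def __init__(self, page_limit=1):
        # Step 2: a former scrape_page argument stored on the instance.
        self.page_limit = page_limit

    def scrape_page(self, url):
        # Placeholder result; a real method would fetch and parse the page.
        return [{"url": url, "page_limit": self.page_limit}]


class WebInput:
    """Modular-input side: keeps run and get_file_path, delegates scraping."""

    def get_file_path(self, checkpoint_dir, stanza):
        # Hypothetical naming scheme for a per-stanza checkpoint file.
        return os.path.join(checkpoint_dir, stanza + ".json")

    def run(self, url, page_limit=1):
        # Step 1: the input builds a scraper and hands the work to it.
        scraper = WebScraper(page_limit=page_limit)
        return scraper.scrape_page(url)
```

With this shape, the controller and search command could construct a WebScraper directly without importing anything specific to the modular input.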

#4 Updated by Luke Murphey about 7 years ago

What should be in the constructor call versus the command call?

  • basics: url, username, password, etc. (in scrape_page call)
  • output customization: output_matches_as_mv, output_matches_as_separate_fields, etc. (in constructor)
  • charset detection: charset_detect_meta_enabled, etc.
  • proxy info: proxy_type, etc.
  • spider info: page_limit, etc. (in constructor)
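
One way the constructor/call split above might look in code is sketched here. Only the groups with a stated placement are shown (basics on the call; output customization and spider info on the constructor). The default values and the shape of the returned dictionary are assumptions, and no page is actually fetched:

```python
class WebScraper:
    def __init__(self, output_matches_as_mv=True,
                 output_matches_as_separate_fields=False, page_limit=1):
        # Settings that stay the same across pages live on the instance.
        self.output_matches_as_mv = output_matches_as_mv
        self.output_matches_as_separate_fields = output_matches_as_separate_fields
        self.page_limit = page_limit

    def scrape_page(self, url, username=None, password=None):
        # Per-request basics arrive with each call.
        # Placeholder: reports what would be used rather than scraping.
        return {
            "url": url,
            "authenticated": username is not None and password is not None,
            "output_matches_as_mv": self.output_matches_as_mv,
            "output_matches_as_separate_fields": self.output_matches_as_separate_fields,
            "page_limit": self.page_limit,
        }


# One configured scraper can then be reused for several pages.
scraper = WebScraper(output_matches_as_mv=False, page_limit=10)
first = scraper.scrape_page("http://example.com/a")
second = scraper.scrape_page("http://example.com/b", username="user", password="secret")
```

Keeping the per-request values on the call means the long-lived settings are written once, which is the main reduction in argument count.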

#5 Updated by Luke Murphey about 7 years ago

Phases:
  1. Break out WebScraper class
  2. Add constructor options for proxy info and charset
    1. Modify get_result_browser, get_result_single, get_result_built_in_client, get_http_client
  3. Update search command
  4. Update modular input
  5. Update controller

#6 Updated by Luke Murphey about 7 years ago

  • Status changed from New to Closed
  • Target version set to 4.2
  • % Done changed from 0 to 100
