Task #1809
Refactor scrape_page call
Start date: 04/04/2017
Due date:
% Done: 100%
Description
The scrape_page call takes a large number of arguments, which makes it unwieldy.
Associated revisions
Separating scraping from the input code
Reference #1809
Moving parameters to constructor & function calls
Reference #1809
Moving parameters to constructor & function calls
Reference #1809
Refactored controller and input
Redesigning the controller and input to work with the refactored code
Reference #1809
History
#1 Updated by Luke Murphey over 7 years ago
Some options:
- Make a separate scraper class
- Pass arguments as a config object
- Pass arguments as *args or **kwargs
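For comparison, here is a minimal sketch of the config-object and **kwargs options. It is only an illustration: ScrapeConfig, the selector argument, the two function names, and all default values are assumptions made for this example, not the actual code. The bodies just echo their inputs instead of fetching anything.

```python
from dataclasses import dataclass


@dataclass
class ScrapeConfig:
    """Hypothetical container for options that rarely change between calls."""
    page_limit: int = 1
    proxy_type: str = "http"  # assumed default
    output_matches_as_mv: bool = True


def scrape_page_with_config(url, selector, config):
    """Config-object option: one object replaces many separate arguments."""
    # Placeholder body: report the effective settings instead of scraping.
    return {"url": url, "selector": selector, "page_limit": config.page_limit}


def scrape_page_with_kwargs(url, selector, **kwargs):
    """**kwargs option: accept keyword options and fill in defaults."""
    options = {"page_limit": 1, "proxy_type": "http"}
    unknown = set(kwargs) - set(options)
    if unknown:
        # Without this check, a misspelled option would be silently ignored.
        raise TypeError("unexpected options: %s" % ", ".join(sorted(unknown)))
    options.update(kwargs)
    return {"url": url, "selector": selector, "page_limit": options["page_limit"]}
```

The config object keeps the option names discoverable in one place, while **kwargs needs explicit validation to catch typos. A separate scraper class can combine with either one.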
#2 Updated by Luke Murphey over 7 years ago
Here are the functions invoked by scrape_page():
- resolve_proxy_type
- get_result_single
- get_result_built_in_client
- detect_encoding
- get_result_browser
- get_display
- get_firefox_profile
- add_auth_to_url
- unescape
- escape_field_name
- extract_links
- cleanup_link
#3 Updated by Luke Murphey over 7 years ago
Here are the functions that are modular input specific:
- get_file_path
- run
I think the best approach is to:
- break out the modular input and leave everything else in the scraper
- make the arguments to scrape_page class variables
- update the controller, search command, and tests
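A rough sketch of what that split could look like, with the modular input delegating to a standalone scraper. The class name WebInput, the method signatures, the selector argument, and the checkpoint-file behavior of get_file_path are all assumptions for illustration. Only the names scrape_page, get_file_path, and run come from the notes above.

```python
import os


class WebScraper:
    """Stand-in for the scraper; it knows nothing about modular inputs."""

    def __init__(self, page_limit=1):
        # Former scrape_page arguments become instance state.
        self.page_limit = page_limit

    def scrape_page(self, url, selector):
        # Placeholder: a real implementation would fetch and parse the page.
        return [{"url": url, "selector": selector}]


class WebInput:
    """Hypothetical modular input that only handles input plumbing."""

    def __init__(self, scraper, checkpoint_dir):
        self.scraper = scraper
        self.checkpoint_dir = checkpoint_dir

    def get_file_path(self, stanza):
        # Assumed behavior: one checkpoint file per input stanza.
        return os.path.join(self.checkpoint_dir, stanza + ".json")

    def run(self, stanza, url, selector):
        # The input delegates the actual scraping to the scraper.
        results = self.scraper.scrape_page(url, selector)
        return {"checkpoint": self.get_file_path(stanza), "results": results}
```

With this shape, the controller, search command, and tests could each create a WebScraper directly without pulling in any modular input code.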
#4 Updated by Luke Murphey over 7 years ago
What should be in the constructor call versus the command call?
- basics: url, username, password, etc. (in scrape_page call)
- output customization: output_matches_as_mv, output_matches_as_separate_fields, etc. (in constructor)
- charset detection: charset_detect_meta_enabled, etc. (in constructor)
- proxy info: proxy_type, etc. (in constructor)
- spider info: page_limit, etc. (in constructor)
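One way the constructor/call split might look in code. This is a sketch under assumptions, not the final design: the default values and the returned dictionary are invented for illustration, and placing the charset and proxy options in the constructor follows the later phase note about adding constructor options for proxy info and charset.

```python
class WebScraper:
    """Sketch: long-lived settings in the constructor, per-request data in the call."""

    def __init__(self,
                 output_matches_as_mv=True,                # output customization
                 output_matches_as_separate_fields=False,  # output customization
                 charset_detect_meta_enabled=True,         # charset detection
                 proxy_type="http",                        # proxy info (assumed default)
                 page_limit=1):                            # spider info
        self.output_matches_as_mv = output_matches_as_mv
        self.output_matches_as_separate_fields = output_matches_as_separate_fields
        self.charset_detect_meta_enabled = charset_detect_meta_enabled
        self.proxy_type = proxy_type
        self.page_limit = page_limit

    def scrape_page(self, url, username=None, password=None):
        # Placeholder body: describe the request rather than performing it.
        return {
            "url": url,
            "authenticated": username is not None and password is not None,
            "page_limit": self.page_limit,
            "proxy_type": self.proxy_type,
        }


# One configured scraper can be reused for several pages.
scraper = WebScraper(page_limit=5)
first = scraper.scrape_page("http://example.com/a")
second = scraper.scrape_page("http://example.com/b", username="user", password="secret")
```

The rule of thumb in this sketch is that anything likely to stay the same across pages goes in the constructor, and anything specific to a single request stays in the scrape_page call.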
#5 Updated by Luke Murphey over 7 years ago
Phases:
- Break out WebScraper class
- Add constructor options for proxy info and charset
- Modify get_result_browser, get_result_single, get_result_built_in_client, get_http_client
- Update search command
- Update modular input
- Update controller
#6 Updated by Luke Murphey over 7 years ago
- Status changed from New to Closed
- Target version set to 4.2
- % Done changed from 0 to 100