Feature #762
Web spider support
100% done
Description
Add the ability for the input to gather content that is spread across multiple pages. To do this, the input would need to be updated to support the following (see the sketch at the end of this description):
- A seed URL (that indicates where to start looking for matching URLs)
- A URL filter to limit what to look at
- Depth limit (limits how deep to keep looking for content)
- URL limit (limits how many pages to evaluate)
- Selector to use for grabbing content
- (optional) content-type filter
This may not be done as a modular input but rather as a one-time input to get data into Splunk.
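A rough sketch of how these parameters could be grouped, written as a simple Python structure. Every field name and default here is hypothetical, for illustration only, and not necessarily what the input will actually use:

```python
# Hypothetical parameter names for illustration only; not the input's actual settings.
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpiderInputConfig:
    seed_url: str                               # where to start looking for matching URLs
    url_filter: str                             # limits which URLs are considered
    depth_limit: int = 2                        # how deep to keep looking for content
    url_limit: int = 100                        # how many pages to evaluate
    selector: str = "body"                      # selector used for grabbing content
    content_type_filter: Optional[str] = None   # e.g. "text/html" (optional)
```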
History
#1 Updated by Luke Murphey over 10 years ago
- Assignee deleted (Luke Murphey)
#2 Updated by Luke Murphey almost 10 years ago
This may work well as a search command. With a search command, users could set up processing to do things like retrieve specific page numbers. For example, to pull notes from Bible.com (https://www.bible.com/users/LukeMurphey/notes), I would need to increment the page-number variable to get the next page.
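A minimal sketch of that page-number idea, assuming the site accepts a "page" query parameter (an assumption made for illustration) and using the requests library:

```python
import requests

BASE_URL = "https://www.bible.com/users/LukeMurphey/notes"

# Walk the paginated notes by incrementing the page-number variable.
for page in range(1, 6):  # arbitrary cap for the example
    response = requests.get(BASE_URL, params={"page": page})
    if response.status_code != 200:
        break
    # Each page's HTML would then be handed to the selector/extraction step.
    print(page, len(response.text))
```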
#3 Updated by Luke Murphey over 9 years ago
- Assignee set to Luke Murphey
- Target version set to 2.0
#5 Updated by Luke Murphey over 9 years ago
- Status changed from New to In Progress
#7 Updated by Luke Murphey over 9 years ago
To simplify, I think I could just use the main URL as the seed URL and then have a separate section for spidering.
The spider section would include the following (see the sketch after this list):
- URL filter
- Depth limit
- URL limit
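A sketch of that split, with hypothetical key names: the existing main URL doubles as the seed, and the spidering options live in their own section.

```python
# Hypothetical layout: the main URL is reused as the seed; spidering is a separate section.
input_config = {
    "url": "https://www.example.com/products/list",  # existing main URL, used as the seed
    "spider": {
        "url_filter": "https://www.example.com/products/*",
        "depth_limit": 2,
        "url_limit": 50,
    },
}
```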
#8 Updated by Luke Murphey over 9 years ago
A good test case is Ryobi Tools. In that case, I want it to extract the content from https://www.ryobitools.com/outdoor/products/list/family/one-plus and then go on to the additional pages. The app needs to be able to distinguish between pages that are identical but merely have a different page number.
#9 Updated by Luke Murphey over 9 years ago
Could use ETags, but these are not necessarily provided by all servers.
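A hedged sketch of the idea in notes #8 and #9: prefer the server's ETag when one is returned, and fall back to hashing the response body so identical pages reached under different page-number URLs are only processed once. The URLs and library choice (requests) are placeholders.

```python
import hashlib

import requests


def page_fingerprint(response):
    """Use the ETag when the server provides one; otherwise hash the body."""
    etag = response.headers.get("ETag")
    if etag:
        return etag
    return hashlib.sha1(response.content).hexdigest()


seen = set()
for url in ["https://www.example.com/list?page=1",
            "https://www.example.com/list?page=2"]:
    fingerprint = page_fingerprint(requests.get(url))
    if fingerprint in seen:
        continue  # identical content already handled, skip it
    seen.add(fingerprint)
```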
#10 Updated by Luke Murphey almost 9 years ago
- Priority changed from Normal to High
#11 Updated by Luke Murphey almost 9 years ago
- Status changed from In Progress to New
#12 Updated by Luke Murphey over 8 years ago
Could use Scrapy: http://scrapy.org/download/
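A minimal sketch of how Scrapy might cover the spidering requirements (seed URL, URL filter, depth limit, URL limit, selector). The selector and the settings values are placeholders, and this is not the app's actual code:

```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ContentSpider(CrawlSpider):
    name = "content_spider"
    start_urls = ["https://www.ryobitools.com/outdoor/products/list/family/one-plus"]

    # URL filter: only follow links whose URL matches this pattern.
    rules = (
        Rule(LinkExtractor(allow=r"/outdoor/products/"), callback="parse_page", follow=True),
    )

    custom_settings = {
        "DEPTH_LIMIT": 2,             # depth limit
        "CLOSESPIDER_PAGECOUNT": 25,  # URL limit: stop after this many pages
        "USER_AGENT": "website-input-test",
    }

    def parse_page(self, response):
        # Placeholder selector for grabbing content from each matching page.
        for node in response.css(".product-title"):
            yield {"url": response.url, "match": node.get()}


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(ContentSpider)
    process.start()
```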
#13 Updated by Luke Murphey over 8 years ago
I will likely need to break up the scrape_page function (see the sketch after this list). It currently:
- Preps and validates the arguments
- Resolves the proxy configuration
- Resolves the user-agent string
- Performs the HTTP request and records stats on load time and size
- Resolves the encoding and decodes the content
- Parses the HTML
- Runs the selector against the HTML
- Outputs the matches
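A rough sketch of one way those steps could be separated so the spidering code can reuse the pieces individually. All function names are hypothetical, and the HTTP/parsing layers (requests, lxml) are stand-ins rather than the app's actual code:

```python
import time

import requests
from lxml import html  # cssselect() below also requires the cssselect package


def perform_request(url, user_agent, proxy=None):
    """Perform the HTTP request and record load-time/size stats."""
    start = time.time()
    response = requests.get(
        url,
        headers={"User-Agent": user_agent},
        proxies={"http": proxy, "https": proxy} if proxy else None,
    )
    stats = {"load_time": time.time() - start, "size": len(response.content)}
    return response, stats


def decode_content(response):
    """Resolve the encoding and decode the content."""
    if not response.encoding:
        response.encoding = response.apparent_encoding
    return response.text


def run_selector(text, selector):
    """Parse the HTML and run the selector against it."""
    tree = html.fromstring(text)
    return tree.cssselect(selector)


def scrape_page(url, selector, user_agent="website-input-test", proxy=None):
    """Thin wrapper that chains the steps and outputs the matches."""
    response, stats = perform_request(url, user_agent, proxy)
    text = decode_content(response)
    matches = [node.text_content() for node in run_selector(text, selector)]
    return {"url": url, "matches": matches, **stats}
```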
#14 Updated by Luke Murphey over 8 years ago
Questions about using Scrapy (see the sketch after this list):
- Can a limit be applied?
- How are links extracted?
- How to handle proxies?
- How to set user-agent string?
- How is encoding detected?
- What types of proxy servers does Scrapy support?
- How can authentication be handled?
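A hedged sketch of how several of these questions might be answered with stock Scrapy facilities: limits and user-agent via settings, HTTP basic auth via spider attributes, proxies via request.meta, and encoding via the response object. This is a starting point to verify against the Scrapy docs, and all values are placeholders:

```python
import scrapy


class QuestionsSpider(scrapy.Spider):
    name = "questions_spider"
    start_urls = ["https://www.example.com/"]

    # Authentication: HTTP basic auth via the built-in HttpAuthMiddleware.
    http_user = "username"
    http_pass = "password"

    custom_settings = {
        "USER_AGENT": "website-input-test",  # user-agent string
        "DEPTH_LIMIT": 2,                    # limit on crawl depth
        "CLOSESPIDER_PAGECOUNT": 25,         # limit on pages fetched
    }

    def start_requests(self):
        for url in self.start_urls:
            # Proxies: set per request via request.meta (HTTP proxies are supported natively).
            yield scrapy.Request(url, meta={"proxy": "http://proxy.example.com:8080"})

    def parse(self, response):
        # Encoding: Scrapy resolves it from headers/markup and exposes response.encoding.
        self.logger.info("Detected encoding: %s", response.encoding)
```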
#15 Updated by Luke Murphey over 8 years ago
- Write extractors
- Loop on the results
#16 Updated by Luke Murphey over 8 years ago
- Status changed from New to Closed