Feature #762

Web spider support

Added by Luke Murphey over 9 years ago. Updated almost 8 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 04/29/2016
Due date: -
% Done: 100%


Description

Add the ability for the input to spread across multiple pages. To do this, the input would need to be updated to support:

  • A seed URL (that indicates where to start looking for matching URLs)
  • A URL filter to limit what to look at
  • Depth limit (limits how deep to keep looking for content)
  • URL limit (limits how many files to evaluate)
  • Selector to use for grabbing content
  • (optional) content-type filter

This may not be done as a modular input but rather as a one-time input to get data into Splunk.
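
A minimal sketch of how these options might be grouped, purely for illustration; the parameter names and values below are hypothetical and do not come from the app itself:

    # Hypothetical option set for the spidering input; all names are illustrative only.
    spider_options = {
        "seed_url": "https://example.com/",          # where to start looking for matching URLs
        "url_filter": "https://example.com/docs/*",  # limits which URLs are examined
        "depth_limit": 3,                            # how deep to keep looking for content
        "url_limit": 100,                            # how many pages to evaluate
        "selector": ".content",                      # selector used to grab content
        "content_type_filter": "text/html",          # optional: skip non-HTML responses
    }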


Subtasks

Task #1303: Convert scrape_page to use more helper functions (Closed, Luke Murphey)

Task #1304: Update input to output multiple results (Closed, Luke Murphey)

Task #1305: Update input to recursively spider websites (Closed)

Task #1309: Implement extracted link processing (Closed, Luke Murphey)

Task #1310: Implement domain_limit (Closed)

Task #1306: Update search command to include options for scraping (Closed)

Task #1307: Update input page to include options for scraping (Closed)

Task #1308: Update preview to output scraping results (Closed)

History

#1 Updated by Luke Murphey over 9 years ago

  • Assignee deleted (Luke Murphey)

#2 Updated by Luke Murphey about 9 years ago

This may be a good search command. With a search command, users could set up some processing to do things like get certain page numbers. For example, if I wanted to pull notes from Bible.com (https://www.bible.com/users/LukeMurphey/notes), then I would need to increment the page number variable to get the next page.
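
As a rough illustration of that kind of processing, assuming a hypothetical page query parameter, a search command could generate the paginated URLs like this:

    # Illustrative only: step a hypothetical "page" query parameter to walk paginated results.
    def paginated_urls(base_url, max_pages):
        """Yield base_url with an incrementing page number appended."""
        for page in range(1, max_pages + 1):
            separator = "&" if "?" in base_url else "?"
            yield "%s%spage=%d" % (base_url, separator, page)

    for url in paginated_urls("https://www.bible.com/users/LukeMurphey/notes", 3):
        print(url)  # .../notes?page=1 through .../notes?page=3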

#3 Updated by Luke Murphey almost 9 years ago

  • Assignee set to Luke Murphey
  • Target version set to 2.0

#5 Updated by Luke Murphey almost 9 years ago

  • Status changed from New to In Progress

#7 Updated by Luke Murphey over 8 years ago

To simplify, I think I could just use the main URL as the seed URL and then have a separate section for spidering.

The spider section would include:

  • URL filter
  • Depth limit
  • URL limit

#8 Updated by Luke Murphey over 8 years ago

A good test case is Ryobi Tools. In that case, I want it to extract the content from https://www.ryobitools.com/outdoor/products/list/family/one-plus and then go to the subsequent pages. I need the app to distinguish between pages that are identical but just have a different page number.

#9 Updated by Luke Murphey over 8 years ago

Could use ETags, but these are not necessarily provided by all servers.
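
One fallback, sketched here only as an assumption about how duplicates could be detected without ETags, is to hash the fetched content and skip anything already seen:

    import hashlib

    # Fallback de-duplication when a server does not provide ETags:
    # hash the page content and skip pages whose hash has already been seen.
    seen_hashes = set()

    def is_duplicate(content):
        """Return True if this content has already been processed."""
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            return True
        seen_hashes.add(digest)
        return False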

#10 Updated by Luke Murphey about 8 years ago

  • Priority changed from Normal to High

#11 Updated by Luke Murphey about 8 years ago

  • Status changed from In Progress to New

#12 Updated by Luke Murphey almost 8 years ago

Could use Scrapy: http://scrapy.org/download/
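
For comparison, a minimal Scrapy spider covering the same ideas (seed URL, depth limit, URL limit, selector); the URLs, selector, and limits are placeholders, not the app's actual configuration:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        """Sketch of a Scrapy crawl bounded by depth and page-count limits."""
        name = "example"
        start_urls = ["https://example.com/"]
        custom_settings = {
            "DEPTH_LIMIT": 3,              # comparable to the depth limit above
            "CLOSESPIDER_PAGECOUNT": 100,  # comparable to the URL limit
        }

        def parse(self, response):
            # Emit content matched by the selector on this page.
            for match in response.css(".content"):
                yield {"url": response.url, "content": match.get()}
            # Follow links so the crawl continues.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)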

#13 Updated by Luke Murphey almost 8 years ago

I will likely need to break the scrape_page function up (see the sketch after this list). It currently:

  1. Preps and validates the arguments
  2. Resolves the proxy configuration
  3. Resolves the user-agent string
  4. Performs the HTTP request, records the stats regarding load time, size
  5. Resolves the encoding and decodes the content
  6. Parses the HTML
  7. Runs the selector against the HTML
  8. Outputs the matches
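
A possible decomposition, sketched with hypothetical helper names and using the requests and lxml libraries purely for illustration (not necessarily what the app uses):

    # Hypothetical breakup of scrape_page into helpers; names and libraries are illustrative only.
    import time

    import requests
    from lxml import html
    from lxml.cssselect import CSSSelector

    def perform_request(url, user_agent, proxies=None):
        """Step 4: perform the HTTP request and record load time and size."""
        start = time.time()
        response = requests.get(url, headers={"User-Agent": user_agent}, proxies=proxies)
        return response, {"load_time": time.time() - start, "size": len(response.content)}

    def decode_content(response):
        """Step 5: resolve the encoding and decode the content."""
        response.encoding = response.encoding or "utf-8"
        return response.text

    def run_selector(text, selector):
        """Steps 6 and 7: parse the HTML and run the selector against it."""
        document = html.fromstring(text)
        return [element.text_content() for element in CSSSelector(selector)(document)]

    def scrape_page(url, selector, user_agent="example-agent", proxies=None):
        """Stitches the steps together; argument validation and proxy/user-agent
        resolution (steps 1-3) are reduced to plain arguments in this sketch."""
        response, stats = perform_request(url, user_agent, proxies)
        matches = run_selector(decode_content(response), selector)
        return {"url": url, "matches": matches, "stats": stats}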

#15 Updated by Luke Murphey almost 8 years ago

To do my own spider (see the sketch below), I would just need to:
  • Write extractors
  • Loop on the results
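
A minimal sketch of what that loop could look like, using only the standard library and ignoring error handling, the URL filter, and robots.txt; this is an assumption about the approach, not the app's actual code:

    # Hand-rolled spider loop: extract links, then loop on the results while
    # honoring depth and URL limits.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collect href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(value for name, value in attrs if name == "href" and value)

    def spider(seed_url, depth_limit=3, url_limit=100):
        queue = deque([(seed_url, 0)])
        seen = {seed_url}
        results = []

        while queue and len(results) < url_limit:
            url, depth = queue.popleft()
            content = urlopen(url).read().decode("utf-8", errors="replace")
            results.append((url, content))

            if depth < depth_limit:
                extractor = LinkExtractor()
                extractor.feed(content)
                for link in extractor.links:
                    absolute = urljoin(url, link)
                    if absolute not in seen:
                        seen.add(absolute)
                        queue.append((absolute, depth + 1))

        return results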

#16 Updated by Luke Murphey almost 8 years ago

  • Status changed from New to Closed
