Feature #762: Web spider support - Website Input - LukeMurphey.net

Feature #762

Web spider support

Added by Luke Murphey about 11 years ago. Updated over 9 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Luke Murphey

Category:

Target version:

2.0

Start date:

04/29/2016

Due date:

% Done:

100%

Description

Add ability to have input spread to multiple pages. To do this, the input would need to be updated to support:

A seed URL (that indicates where to start looking for matching URLs)
A URL filter to limit what to look at
Depth limit (limits how deep to keep looking for content)
URL limit (limits how many files to evaluate)
Selector to use for grabbing content
(optional) content-type filter

This may not be done as a modular input but rather as a one-time input to get data into Splunk.

Subtasks

History

#1 Updated by Luke Murphey about 11 years ago

Assignee deleted (~~Luke Murphey~~)

#2 Updated by Luke Murphey over 10 years ago

This may be a good search command. With a search command, users could set up some processing to do things like get certain page numbers. For example, if I wanted to pull notes from Bible.com (https://www.bible.com/users/LukeMurphey/notes), then I would need to increment the page number variable to get the next page.

#3 Updated by Luke Murphey about 10 years ago

Assignee set to Luke Murphey
Target version set to 2.0

#4 Updated by Luke Murphey about 10 years ago

See http://answers.splunk.com/answers/239251/what-is-the-best-way-to-monitor-a-web-page-contain.html

#5 Updated by Luke Murphey about 10 years ago

Status changed from New to In Progress

#6 Updated by Luke Murphey about 10 years ago

http://www.xavierdupre.fr/blog/2013-10-27_nojs.html

#7 Updated by Luke Murphey almost 10 years ago

To simplify, I think I could just use the main URL as the seed URL and then have a separate section for spidering.

The spider section would include:

URL filter
Depth limit
URL limit

#8 Updated by Luke Murphey almost 10 years ago

A good test case is Ryobi Tools. In that case, I want it to extract the content from https://www.ryobitools.com/outdoor/products/list/family/one-plus and then go to the pages. I need to have the app distinguish between pages that are identical but just have a different page number.