Bug #2859

Scrape page does not work for some pages

Added by Luke Murphey over 3 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
10/21/2020
Due date:
% Done:
100%

History

#1 Updated by Luke Murphey over 3 years ago

Observations:

  • It works with the search command:
    • | webscrape selector=".u\.txt\:lg" url="https://splunkbase.splunk.com/app/1818/" depth_limit=25 empty_matches=0
  • It works with Firefox
  • I see an exception
    2020-10-21 08:47:14,495 ERROR Exception generated while attempting to content for url=https://splunkbase.splunk.com/app/1818/
    Traceback (most recent call last):
      File "/Users/lmurphey/Splunk/7291/etc/apps/website_input/bin/website_input_ops_rest_handler.py", line 285, in get_load_page
        content = web_client.get_url(url, 'GET')
      File "/Users/lmurphey/Splunk/7291/etc/apps/website_input/bin/website_input_app/web_client.py", line 492, in get_url
        raise ConnectionFailure(str(e), e)
    ConnectionFailure: <urlopen error ('The read operation timed out',)>, caused by URLError(SSLError('The read operation timed out',),)
    
Questions:
  • What happens when I strip the endpoint down?
    • Still fails
  • Does this fail in the unit tests?
    • It does
  • Does the httplib2 client work?
    • DefaultWebClient is the MechanizeClient
    • It successfully returns but times out
  • Does this work from the CLI?
    import os
    import sys
    sys.path.append(os.path.join("."))

    import mechanize

    browser = mechanize.Browser()
    response = browser.open("https://splunkbase.splunk.com/app/1818/", timeout=15)
    content = response.read(800)
    
    • ImportError: This package should not be accessible on Python 3. Either you are trying to run from the python-future src folder or your installation of python-future is corrupted.
      
      • This is because I tried the Python 3 client on an old host
Possible causes:
  • Python 3 issues
  • Old mechanize client
    • Tried the new one: same issue
  • SSL problem
  • Low timeout
    • This was it
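A minimal repro sketch of the root cause above (hypothetical, not from the ticket): a local HTTP server that delays its reply shows how a read timeout shorter than the page's response time raises the same "The read operation timed out" error, while a more generous timeout lets the fetch complete. It uses only the stdlib `urllib` client rather than the app's `web_client`/mechanize stack.

```python
# Hypothetical repro (not from the ticket): a local HTTP server that delays
# its reply, demonstrating that a short read timeout fails with
# "The read operation timed out" while a larger timeout succeeds.
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

DELAY = 1.0  # seconds the simulated "slow page" takes to respond

class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(DELAY)
        try:
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the client gave up (timed out) before we answered

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# A timeout shorter than the server's delay fails mid-read, as in the
# ticket's traceback.
try:
    urllib.request.urlopen(url, timeout=0.2).read()
    short_timed_out = False
except socket.timeout:
    short_timed_out = True
except urllib.error.URLError as e:
    short_timed_out = isinstance(e.reason, socket.timeout)

# A timeout longer than the delay lets the read complete.
content = urllib.request.urlopen(url, timeout=5).read()
server.shutdown()
```

The same principle applies to the app: raising the timeout passed to the underlying client (as with `timeout=15` in the mechanize check above) avoids the read timing out on slow pages.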

#2 Updated by Luke Murphey over 3 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100
