Feature #2220

Stream results without caching them to reduce memory usage

Added by Luke Murphey about 6 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Input: Web Spider
Target version:
Start date: 02/24/2018
Due date:
% Done: 100%


non-streaming.png (88.5 KB) Luke Murphey, 03/02/2018 09:08 PM

streaming.png (93.8 KB) Luke Murphey, 03/02/2018 09:23 PM

Associated revisions

Revision 644 (diff)
Added by lukemurphey about 6 years ago

Adding test server that includes a large chunk of text

This is useful for performance testing

Reference #2220

Revision 645 (diff)
Added by lukemurphey about 6 years ago

Adding streaming of results as they arrive

Reference #2220

Revision 647 (diff)
Added by lukemurphey about 6 years ago

Increasing the size of test output to increase chance of seeing performance problems

Reference #2220

Revision 648 (diff)
Added by lukemurphey about 6 years ago

Forcing garbage collection to make performance analysis easier

This also reduces overall memory usage by cleaning some things up
while the input waits for the next run

Reference #2220
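
For illustration, forcing a collection after each run could look roughly like the sketch below. It is not the code from revision 648; run_and_collect() and run_once are hypothetical names, and only gc.collect() is a real standard-library call.

```python
import gc
from typing import Callable, TypeVar

T = TypeVar("T")


def run_and_collect(run_once: Callable[[], T]) -> T:
    """Run one pass of the input, then force a full garbage collection.

    run_once is a hypothetical stand-in for whatever performs a single
    scrape. Collecting in a finally block means the cleanup also happens
    when the pass raises an exception.
    """
    try:
        return run_once()
    finally:
        # Reclaim unreachable objects (including reference cycles) before
        # the input goes idle until its next scheduled run.
        gc.collect()
```

Collecting at this point also makes memory graphs easier to read, since garbage left over from one run is less likely to be counted against the next.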

History

#1 Updated by Luke Murphey about 6 years ago

web_input.py: calls output_event in the modular input class

To make this work, I would need to:
  1. Pass an output result function to scrape_page()
  2. Pass the output result function to get_result_single()
  3. Keep a result count around, perhaps by having get_result_single() return only a small set of fields (such as just the URL).
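
The three steps above can be sketched roughly as follows. This is only an illustration: the signatures of scrape_page() and get_result_single() shown here are assumed and do not match web_input.py, and the pages argument is a hypothetical stand-in for the spider's fetching logic.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical callback type; in the real input this role is played by
# output_event on the modular input class.
OutputFn = Callable[[Dict[str, str]], None]


def get_result_single(url: str, content: str, output_fn: OutputFn) -> Dict[str, str]:
    """Build the result for one page, stream it out, and return a small summary."""
    result = {"url": url, "content": content}
    output_fn(result)      # step 2: emit immediately instead of caching
    return {"url": url}    # step 3: keep only a small set of fields


def scrape_page(pages: Iterable[Tuple[str, str]], output_fn: OutputFn) -> int:
    """Stream every (url, content) pair through output_fn; return the count."""
    count = 0
    for url, content in pages:
        # step 1: the output function is passed down from the caller
        get_result_single(url, content, output_fn)
        count += 1         # only the count is retained, not the results
    return count
```

Because each full result is handed to the callback and then dropped, only one page's content needs to be held in memory at a time, provided pages is itself a lazy iterable.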

#2 Updated by Luke Murphey about 6 years ago

Before making the change, I should test the existing design and monitor its memory usage. I could do this by patching the internal web server to return content with variable URLs, which should cause the existing design to load up memory.

Then I can run the same test after the change to show that memory usage is down.
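
A minimal sketch of such a test server, using only the Python standard library, might look like the following. It is not the server added in revision 644; the page size, link count, port, and all function and class names are assumptions chosen for illustration.

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE_SIZE = 1024 * 1024  # assumed filler size in bytes; tune as needed
LINK_COUNT = 5           # assumed number of fresh links per page


def build_page(size: int = PAGE_SIZE, links: int = LINK_COUNT) -> bytes:
    """Return an HTML page with `size` bytes of filler and links to new URLs."""
    # Random paths mean a spider keeps discovering URLs it has not seen yet.
    anchors = "".join(
        '<a href="/page/%d">next</a>' % random.randrange(10 ** 9)
        for _ in range(links)
    )
    chunk = "lorem ipsum "
    filler = (chunk * (size // len(chunk) + 1))[:size]
    html = "<html><body>%s<p>%s</p></body></html>" % (anchors, filler)
    return html.encode("utf-8")


class LargePageHandler(BaseHTTPRequestHandler):
    """Serve a large generated page for every GET request, whatever the path."""

    def do_GET(self):
        body = build_page()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the test server until interrupted (the port is an assumption)."""
    HTTPServer(("127.0.0.1", port), LargePageHandler).serve_forever()
```

Calling serve() and pointing the input at http://127.0.0.1:8080/ should give a crawl that never runs out of new, large pages, which is the condition expected to make a result-caching design grow in memory.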

#3 Updated by Luke Murphey about 6 years ago

  • Target version changed from 4.5.3 to 4.5.2

#4 Updated by Luke Murphey about 6 years ago

Monitoring memory with:

source=top PID=57782 | timechart max(RES) as memory

#5 Updated by Luke Murphey about 6 years ago

It seems like some memory isn't being reclaimed. When I output results only when the contents change, memory usage actually seems lower, and memory is reclaimed faster once the input is done.

Next steps:

  1. Disable streaming, see if the problem changes
  2. Disable parts of the output function, see if the problem changes

#6 Updated by Luke Murphey about 6 years ago

  • % Done changed from 0 to 70

#7 Updated by Luke Murphey about 6 years ago

Without streaming (see the attached non-streaming.png):

#8 Updated by Luke Murphey about 6 years ago

With streaming (see the attached streaming.png):

#9 Updated by Luke Murphey about 6 years ago

  • Status changed from New to Closed
  • % Done changed from 70 to 100
