Feature #2220: Stream results without caching them to reduce memory usage - Website Input - LukeMurphey.net

Feature #2220

Stream results without caching them to reduce memory usage

Added by Luke Murphey over 7 years ago. Updated over 7 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Luke Murphey

Category:

Input: Web Spider

Target version:

4.5.2

Start date:

02/24/2018

Due date:

% Done:

100%

non-streaming.png (88.5 KB) Luke Murphey, 03/02/2018 09:08 PM

streaming.png (93.8 KB) Luke Murphey, 03/02/2018 09:23 PM

Associated revisions

Revision 644 (diff)
Added by lukemurphey over 7 years ago

Adding test server that includes a large chunk of text

This is useful for performance testing

Reference #2220

Revision 645 (diff)
Added by lukemurphey over 7 years ago

Adding streaming of results as they arrive

Reference #2220

Revision 647 (diff)
Added by lukemurphey over 7 years ago

Increasing the size of the test output to increase the chance of seeing performance problems

Reference #2220

Revision 648 (diff)
Added by lukemurphey over 7 years ago

Forcing garbage collection to make performance analysis easier

This also reduces overall memory usage by cleaning some things up
while the input waits for the next run

Reference #2220
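A minimal sketch of what forcing garbage collection between runs might look like, using the standard `gc` module. The `run_and_collect` helper and the `run_once` callable are hypothetical names for illustration, not the code from this revision.

```python
import gc


def run_and_collect(run_once):
    """Run one pass of the input, then force a full garbage collection.

    `run_once` is a hypothetical callable standing in for a single run
    of the input. Returns the number of unreachable objects that
    gc.collect() found.
    """
    run_once()
    # Collect now, while the input is idle, so that reference cycles left
    # over from the run are freed before the next run starts.
    return gc.collect()
```

Collecting at a known point also makes memory graphs easier to read, because reclaimable memory is released right after each run rather than at an arbitrary later time.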

History

#1 Updated by Luke Murphey over 7 years ago

web_input.py: calls output_event in the modular input class

To make this work, I would need to:

Pass an output result function to scrape_page()
Pass the output result function to get_result_single()
Keep a result count around, perhaps by having get_result_single() return only a small set of fields, such as just the URL.
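The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the real signatures of scrape_page() and get_result_single() are not shown in this ticket, so the parameters here (`urls`, `fetch`, `output_result`) are hypothetical.

```python
def get_result_single(url, fetch, output_result):
    """Fetch one page and emit it immediately instead of caching it.

    `fetch` and `output_result` are hypothetical callables: one returns
    the page content for a URL, the other writes a result out (for
    example, by calling output_event in the modular input class).
    """
    result = {"url": url, "content": fetch(url)}
    output_result(result)   # stream the full result out right away
    return {"url": url}     # return only a small set of fields


def scrape_page(urls, fetch, output_result):
    """Process each URL, keeping only a count instead of the full results."""
    count = 0
    for url in urls:
        get_result_single(url, fetch, output_result)
        count += 1
    return count
```

Because each full result is handed to the callback and then dropped, the page content only needs to stay in memory for one page at a time.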

#2 Updated by Luke Murphey over 7 years ago

Before I make the change, I should test this and monitor memory usage. I could do this by patching the internal web server to return content with variable URLs, which should cause the existing design to load up memory.

Then, I can run this afterwards to show that memory usage is down.
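A test server along these lines could be sketched with the standard library. This is an assumption-laden illustration, not the project's internal web server: `build_page`, `LargePageHandler`, `run`, the port, and the page size are all hypothetical.

```python
import itertools
from http.server import BaseHTTPRequestHandler, HTTPServer

# Shared counter so every generated link points at a URL not seen before.
_counter = itertools.count()


def build_page(size_bytes=1_000_000, links=5):
    """Return an HTML page with unique links and a large chunk of filler text."""
    anchors = "".join(
        '<a href="/page/%d">next</a>' % next(_counter) for _ in range(links)
    )
    filler = "x" * size_bytes
    return ("<html><body>%s<p>%s</p></body></html>" % (anchors, filler)).encode("utf-8")


class LargePageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every path returns a fresh large page with new outgoing links.
        body = build_page()
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port=8888):
    """Serve large pages on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), LargePageHandler).serve_forever()
```

Calling `run()` and pointing the spider at `http://127.0.0.1:8888/` should give it an endless supply of large, distinct pages, which makes any caching of results visible in the memory graph.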

#3 Updated by Luke Murphey over 7 years ago

Target version changed from 4.5.3 to 4.5.2

#4 Updated by Luke Murphey over 7 years ago

Monitoring memory with:

source=top PID=57782 | timechart max(RES) as memory

#5 Updated by Luke Murphey over 7 years ago

It seems like some memory isn't being reclaimed. When I output results only when the contents change, memory usage actually seems lower and is reclaimed faster once the input is done.

Next steps: