Task #471

Feature #466: Lemma lookup

Diogenes Lemma Importer

Added by Luke Murphey about 12 years ago. Updated almost 12 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date:
Due date:
% Done: 100%


Description

Add the ability to import the Diogenes lemma list.

Associated revisions

Revision 234 (diff)
Added by Luke Murphey about 12 years ago

Added an importer capable of importing Diogenes lemma importer. Closes #471.

Revision 226 (diff)
Added by Luke Murphey about 12 years ago

Added an importer capable of importing Diogenes lemma importer. Closes #471.

History

#1 Updated by Luke Murphey about 12 years ago

  • Subject changed from Diogenes Lemm Importer to Diogenes Lemma Importer

#2 Updated by Luke Murphey about 12 years ago

  • % Done changed from 0 to 80

#3 Updated by Luke Murphey about 12 years ago

  • Status changed from New to Closed
  • % Done changed from 80 to 100

#4 Updated by Luke Murphey about 12 years ago

The importer is taking too long to complete and uses a lot more memory than expected.

#5 Updated by Luke Murphey about 12 years ago

  • Status changed from Closed to In Progress

#6 Updated by Luke Murphey about 12 years ago

The file contains 114,098 lines. According to the unit tests, it takes about 8 seconds to handle about 100 entries. Doing the math, importing the entire file should take about 9,100 seconds, or roughly two and a half hours.

My experience was that it took much longer than this, but that may have been because it ran out of memory and was swapping excessively.
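The estimate above is a straight linear extrapolation. A minimal sketch of the arithmetic, assuming one entry per line and a constant per-entry cost (the function name is made up for illustration):

```python
def estimate_seconds(total_lines, sample_entries, sample_seconds):
    """Linearly extrapolate a sample timing to the whole file.

    Assumes one entry per line and a constant cost per entry.
    """
    return total_lines * sample_seconds / sample_entries


total_lines = 114098

# Unit-test figure from this comment: about 8 seconds per 100 entries.
seconds = estimate_seconds(total_lines, 100, 8)
print("%.0f seconds, or %.2f hours" % (seconds, seconds / 3600.0))
```

Any swapping or per-entry slowdown breaks the constant-cost assumption, so this is a lower bound at best.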

#7 Updated by Luke Murphey about 12 years ago

Importing 1,000 entries took 63.85 seconds. Memory usage does climb during the run, from about 26 MB to 30.1 MB.

#8 Updated by Luke Murphey about 12 years ago

With the return of the created lemmas disabled, memory peaked at about 29.5 MB. Memory growth was still observed, though.

#9 Updated by Luke Murphey about 12 years ago

Caching Dialect and Case instances improved the import performance for 1,000 items from 63 seconds to 49 seconds. Memory use was unchanged, topping out at about 29.5 MB.
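The caching idea can be sketched roughly as below. This is not the importer's actual code: CachedLookup and the stand-in loader are hypothetical, and the real loader would be whatever fetches or creates a Dialect or Case row.

```python
class CachedLookup(object):
    """Memoize an expensive lookup (for example a database fetch) by name."""

    def __init__(self, loader):
        self._loader = loader  # called once per distinct name
        self._cache = {}

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._loader(name)
        return self._cache[name]


# Stand-in loader that records how often it is called (hypothetical).
calls = []

def load_dialect(name):
    calls.append(name)
    return {"name": name}

dialects = CachedLookup(load_dialect)
for name in ["attic", "doric", "attic", "attic"]:
    dialects.get(name)

print(len(calls))  # the loader ran once per distinct name
```

In a Django project the loader might be something like lambda name: Dialect.objects.get_or_create(name=name)[0], assuming the model has a name field. Since the set of dialects and cases is small, the cache stays tiny while saving a query per entry.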

#10 Updated by Luke Murphey about 12 years ago

Removing the calls to convert the beta-code to unicode did not improve performance noticeably.

#11 Updated by Luke Murphey about 12 years ago

The problem is in parse_description(). Appending cases and dialects to the word_description instance is very slow. With both removed, the import of 1,000 items completes in about 10 seconds. Each of the two seems to account for about the same amount of time.

Memory usage goes up to about 27 MB.

#12 Updated by Luke Murphey about 12 years ago

Oddly enough, commenting out the call to parse_form() in parse_lemma() along with a call to gc.collect() seems to slow the growth in memory usage.

#13 Updated by Luke Murphey about 12 years ago

This has something to do with the database calls. I commented out all of the save() calls and memory usage stays at 21.2 MB.

#14 Updated by Luke Murphey about 12 years ago

Arg.

The cause is simple: Django was in debug mode, which causes it to keep a record of every DB query it runs.

See http://stackoverflow.com/questions/2338041/python-django-polling-of-database-has-memory-leak.
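For a long-running import that has to run with debug mode on, one common workaround is to clear the query log periodically. A minimal sketch, where import_in_batches and its parameters are hypothetical names and the cleanup hook is passed in so the loop itself does not depend on Django:

```python
def import_in_batches(lines, handle_line, after_batch, batch_size=1000):
    """Process lines one by one, calling after_batch() every batch_size lines.

    after_batch is a cleanup hook; the intent is to release per-batch state
    such as a debug-mode query log.
    """
    processed = 0
    for line in lines:
        handle_line(line)
        processed += 1
        if processed % batch_size == 0:
            after_batch()
    return processed


# Demonstration with stand-ins instead of a real database.
seen = []
cleanups = []
total = import_in_batches(
    ["entry %d" % i for i in range(2500)],
    seen.append,
    lambda: cleanups.append(len(seen)),
)
print(total, cleanups)
```

In a Django project, after_batch could be django.db.reset_queries, which empties the saved query list. Running the import with DEBUG = False avoids the problem without any code change.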

#15 Updated by Luke Murphey about 12 years ago

Memory usage is staying solidly at 24 MB.

#16 Updated by Luke Murphey about 12 years ago

  • Status changed from In Progress to Closed

#17 Updated by Luke Murphey about 12 years ago

The last import took 15307 seconds or about 4 hours and 15 minutes.

#18 Updated by Luke Murphey about 12 years ago

We currently don't support the "alphabetic" attribute:

2012-12-16 19:22:51,276 [WARNING] reader.importer.Diogenes: Attribute was not expected: attribute=alphabetic, line_number=113795

#19 Updated by Luke Murphey about 12 years ago

BTW: unexpected attributes can be found with the following search:

sourcetype="django" "Attribute was not expected:" | rex field=_raw "Attribute was not expected: (?<attribute>.*)," 
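The same extraction can be done outside the search tool. A small sketch in Python, assuming the log format shown in the warning above (the function name is made up, and the pattern is stricter than the rex expression in that it captures only the attribute value):

```python
import re

# Matches the message format from the warning logged by reader.importer.Diogenes.
PATTERN = re.compile(
    r"Attribute was not expected: "
    r"attribute=(?P<attribute>[^,]+), line_number=(?P<line_number>\d+)"
)


def parse_unexpected_attribute(log_line):
    """Return (attribute, line_number), or None if the line does not match."""
    match = PATTERN.search(log_line)
    if match is None:
        return None
    return match.group("attribute"), int(match.group("line_number"))


sample = (
    "2012-12-16 19:22:51,276 [WARNING] reader.importer.Diogenes: "
    "Attribute was not expected: attribute=alphabetic, line_number=113795"
)
print(parse_unexpected_attribute(sample))  # ('alphabetic', 113795)
```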

#20 Updated by Luke Murphey almost 12 years ago

  • Assignee set to Luke Murphey
