Task #479
Feature #466: Lemma lookup
Diogenes Analyses Importer
100%
Description
Need to write an importer that takes the analyses file and marks up the lemma entries with the appropriate meaning.
Subtasks
History
#1 Updated by Luke Murphey almost 12 years ago
- Status changed from New to In Progress
#2 Updated by Luke Murphey almost 12 years ago
Hmm, lining up the analyses file with the lemmas is proving difficult. Consider the following log output, which shows that matches are not occurring:
Forms discovered, count=2, line_number=100000, form_number=1
Forms discovered, count=2, line_number=100000, form_number=2
Forms discovered, count=2, line_number=100000, form_number=3
Forms discovered, count=2, line_number=100000, form_number=4
Forms discovered, count=1, line_number=100001, form_number=1
Forms discovered, count=1, line_number=100001, form_number=2
Forms discovered, count=1, line_number=100001, form_number=3
Forms discovered, count=1, line_number=100001, form_number=4
Forms discovered, count=1, line_number=100001, form_number=5
Forms discovered, count=1, line_number=100001, form_number=6
Forms discovered, count=1, line_number=100002, form_number=1
Forms discovered, count=1, line_number=100002, form_number=2
Forms discovered, count=1, line_number=100002, form_number=3
Forms discovered, count=1, line_number=100003, form_number=1
Forms discovered, count=1, line_number=100003, form_number=2
Forms discovered, count=1, line_number=100003, form_number=3
Forms discovered, count=1, line_number=100004, form_number=1
Forms discovered, count=1, line_number=100004, form_number=2
Forms discovered, count=1, line_number=100004, form_number=3
Forms discovered, count=0, line_number=100005, form_number=1
Forms discovered, count=0, line_number=100005, form_number=2
Here are lines 100000 through 100005:
a)nalou=to {7566000 9 a)na__lou=to,a)nali/skw use up imperf ind mp 3rd sg (doric aeolic)} {7566000 9 a)na_lou=to,a)nali/skw use up imperf ind mp 3rd sg (homeric ionic)} {7566000 9 a_)nalou=to,a)nalo/w use up imperf ind mp 3rd sg (doric aeolic)} {7566000 9 a)nalo/w use up imperf ind mp 3rd sg (homeric ionic)}
a)naloume/nas {7585355 9 a)naloume/na_s,a)na/llomai leap fut part mid fem acc pl (attic epic doric)} {7585355 9 a)naloume/na_s,a)na/llomai leap fut part mid fem gen sg (doric)} {7566000 9 a)na_loume/na_s,a)nali/skw use up pres part mp fem acc pl} {7566000 9 a)na_loume/na_s,a)nali/skw use up pres part mp fem gen sg (doric aeolic)} {7566000 9 a)naloume/na_s,a)nalo/w use up pres part mp fem acc pl} {7566000 9 a)naloume/na_s,a)nalo/w use up pres part mp fem gen sg (doric aeolic)}
a)naloume/nh {7585355 9 a)na/llomai leap fut part mid fem nom/voc sg (attic epic)} {7566000 9 a)na_loume/nh,a)nali/skw use up pres part mp fem nom/voc sg (attic epic ionic)} {7566000 9 a)nalo/w use up pres part mp fem nom/voc sg (attic epic ionic)}
a)naloume/nhn {7585355 9 a)na/llomai leap fut part mid fem acc sg (attic epic)} {7566000 9 a)na_loume/nhn,a)nali/skw use up pres part mp fem acc sg (attic epic ionic)} {7566000 9 a)nalo/w use up pres part mp fem acc sg (attic epic ionic)}
a)naloume/nhs {7585355 9 a)na/llomai leap fut part mid fem gen sg (attic epic)} {7566000 9 a)na_loume/nhs,a)nali/skw use up pres part mp fem gen sg (attic epic ionic)} {7566000 9 a)nalo/w use up pres part mp fem gen sg (attic epic ionic)}
a)naloume/nou {7585355 9 a)na/llomai leap fut part mid masc/neut gen sg (attic epic doric)} {7566000 9 a)na_loume/nou,a)nali/skw use up pres part mp masc/neut gen sg} {7566000 9 a)nalo/w use up pres part mp masc/neut gen sg}
#3 Updated by Luke Murphey almost 12 years ago
I wonder, should I just import the analyses file directly? Or rather, what does the lemmata file give me that the analyses file doesn't?
#4 Updated by Luke Murphey almost 12 years ago
From the analyses file:
a(/bra {537850 9 a(/bra_,a(/bra favourite slave fem nom/voc/acc dual}{537850 9 a(/bra_,a(/bra favourite slave fem nom/voc sg (attic doric aeolic)}
a(/brai {537850 9 a(/bra favourite slave fem nom/voc pl}{537850 9 a(/bra_|,a(/bra favourite slave fem dat sg (attic doric aeolic)}
a(/brais {537850 9 a(/bra favourite slave fem dat pl}
a(/bran {537850 9 a(/bra_n,a(/bra favourite slave fem acc sg (attic doric aeolic)}
a(/bras {537850 9 a(/bra_s,a(/bra favourite slave fem acc pl}{537850 9 a(/bra_s,a(/bra favourite slave fem gen sg (attic doric aeolic)}
a(/bra| {537850 9 a(/brai,a(/bra favourite slave fem nom/voc pl}{537850 9 a(/bra_|,a(/bra favourite slave fem dat sg (attic doric aeolic)}
...
a(bra=n {537850 9 a(/bra favourite slave fem gen pl (doric aeolic)}{555266 9 a(bro/s graceful masc/fem gen pl (doric)}
...
a(brw=n {537850 9 a(/bra favourite slave fem gen pl}{555266 9 a(bro/s graceful fem gen pl}{555266 9 a(bro/s graceful masc/neut gen pl}{555266 9 a(bro/s graceful masc/fem/neut gen pl}
From the lemmata file:
a(/bra 537850 a(/bra (fem nom/voc/acc dual) (fem nom/voc sg (attic doric aeolic)) a(/brai (fem nom/voc pl) (fem dat sg (attic doric aeolic)) a(/brais (fem dat pl) a(/bran (fem acc sg (attic doric aeolic)) a(/bras (fem acc pl) (fem gen sg (attic doric aeolic)) a(/bra| (fem nom/voc pl) (fem dat sg (attic doric aeolic)) a(bra=n (fem gen pl (doric aeolic)) a(brw=n (fem gen pl)
It seems that the lemmata file contains all forms of a lemma on a single line, while the analyses file puts each form on its own line, with that line listing every possible meaning of the form. If this is the case, then I really need to be using the analyses file and not the lemmata file.
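To make the per-form layout concrete, here is a minimal sketch of how one analyses line could be split into its form and its brace-delimited entries. It is not the importer's actual code: the function name `parse_analysis_line` and the regex are hypothetical, and it assumes the fields inside each `{...}` group are separated by whitespace as they appear in the samples above (the real file may use tabs).

```python
import re

# Assumed layout of one entry: {reference_number 9 rest-of-entry}
# The second number is captured but not interpreted here.
ENTRY_RE = re.compile(r"\{(\d+)\s+(\d+)\s+([^}]*)\}")


def parse_analysis_line(line):
    """Split an analyses line into (form, [(reference_number, body), ...])."""
    # Everything before the first "{" is the inflected form.
    form, _, rest = line.partition("{")
    entries = [
        (int(match.group(1)), match.group(3).strip())
        for match in ENTRY_RE.finditer("{" + rest)
    ]
    return form.strip(), entries


sample = (
    "a)naloume/nh {7585355 9 a)na/llomai leap fut part mid fem nom/voc sg (attic epic)} "
    "{7566000 9 a)nalo/w use up pres part mp fem nom/voc sg (attic epic ionic)}"
)
form, entries = parse_analysis_line(sample)
for reference_number, body in entries:
    print(form, reference_number, body)
```

Each reference number could then be used to look up the matching lemma, which is the step the log output above reports as failing.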
#5 Updated by Luke Murphey almost 12 years ago
The greek-analyses file has 911871 lines.
The following search will return the progress of the analyses import:
sourcetype="django" | stats max(line_number) as line_number | eval progress=100*line_number/911871
#6 Updated by Luke Murphey almost 12 years ago
You can clear out the entries in the database related to the Greek lemma:
drop table reader_lemma;
drop table reader_case;
drop table reader_dialect;
drop table reader_worddescription;
drop table reader_worddescription_cases;
drop table reader_worddescription_dialects;
drop table reader_wordform;
#7 Updated by Luke Murphey almost 12 years ago
Some of the entries in the analyses file are not in the greek-lemmata file. These can be found with the following search:
sourcetype="django" "Unable to find the lemma for an analysis entry" | stats count(sourcetype)
#8 Updated by Luke Murphey almost 12 years ago
Successfully imported 911871 entries from the analyses file. However, the lemmas could not be found for 5,581 entries (see attached for a list).
The unmatched entries can be viewed with the following Splunk search:
sourcetype="django" "Unable to find the lemma for an analysis entry" | sort _time | table form line_number form_number
#9 Updated by Luke Murphey almost 12 years ago
- File unmatched_analyses.csv added
#10 Updated by Luke Murphey almost 12 years ago
It seems to take about six hours to import the analyses.
#11 Updated by Luke Murphey almost 12 years ago
- % Done changed from 0 to 60
#12 Updated by Luke Murphey almost 12 years ago
Some leftover issues:
- ζῳάγρια lists the definition as "ransom paid for a prisoner taken alive) reward for life saved" (note the unmatched parenthesis)
- Some entries are not getting imported correctly because they contain exclamation marks, which are not valid beta-code as far as I can tell
#13 Updated by Luke Murphey almost 12 years ago
- % Done changed from 60 to 70
#14 Updated by Luke Murphey almost 12 years ago
- Tracker changed from Bug to Task
#15 Updated by Luke Murphey almost 12 years ago
- % Done changed from 70 to 90
Some entries have leading characters that are preventing them from being matched correctly.
#16 Updated by Luke Murphey almost 12 years ago
Lots of forms are being skipped. You can view them with the following search:
sourcetype="django" "Unable to find the lemma for an analysis entry" | table form line_number form_number
Also, many are not matching the regex. You can see these with the following search:
sourcetype="django" "Analysis entry does not match the regex" | table form line_number form_number
#17 Updated by Luke Murphey almost 12 years ago
The importer is now taking about 80 seconds to import 1,000 entries. The import started at 04:13:27 and had imported 352,000 entries by 2012-12-24 12:05:36, so it is importing about 12 entries per second.
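The throughput figure can be checked, and the remaining time estimated, with a short calculation. This is only a sketch: the helper names are hypothetical, the start time is assumed to fall on the same day (2012-12-24) as the later timestamp, and the rate is assumed to stay constant for the rest of the import.

```python
from datetime import datetime

TOTAL_LINES = 911871  # line count of the greek-analyses file


def entries_per_second(start, now, imported):
    """Average import rate between two timestamps."""
    return imported / (now - start).total_seconds()


def hours_remaining(rate, imported, total=TOTAL_LINES):
    """Time left at a constant rate, in hours."""
    return (total - imported) / rate / 3600


start = datetime(2012, 12, 24, 4, 13, 27)  # assumed to be the same day
now = datetime(2012, 12, 24, 12, 5, 36)
rate = entries_per_second(start, now, 352_000)

print(f"rate: {rate:.1f} entries/s")
print(f"remaining: {hours_remaining(rate, 352_000):.1f} hours")
```

Under those assumptions the average rate comes out at roughly 12 entries per second, which agrees with the 80 seconds per 1,000 entries observed above.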
#18 Updated by Luke Murphey almost 12 years ago
Moving the commit to the file level did not improve performance noticeably.
#19 Updated by Luke Murphey almost 12 years ago
I tried deferring loading of all fields in the lemma query to speed up performance (http://stackoverflow.com/questions/2846029/django-set-foreign-key-using-integer) but this didn't seem to help much.
#20 Updated by Luke Murphey almost 12 years ago
Loading each entry seems to take about 0.03 seconds.
#21 Updated by Luke Murphey almost 12 years ago
The slowdown seems to be in the query of the lemmas for the lemma ID associated with the reference number. This is somewhat surprising since the reference field is indexed.
#22 Updated by Luke Murphey almost 12 years ago
Actually, the reference number field was not indexed for some reason. I manually created the index with:
CREATE INDEX "lemma_reference_number" ON "reader_lemma" ("reference_number");
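One way to confirm that a lookup by reference number actually uses the new index is to compare the query plan before and after creating it. The sketch below assumes SQLite (the ticket does not say which database backs the site) and uses a stand-in `reader_lemma` table. Only the `reference_number` column and the index name come from the ticket; the other columns and the `query_plan` helper are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's EXPLAIN QUERY PLAN detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Stand-in schema; the real table has more columns.
conn.execute(
    'CREATE TABLE "reader_lemma" '
    "(id INTEGER PRIMARY KEY, lexical_form TEXT, reference_number INTEGER)"
)

lookup = 'SELECT id FROM "reader_lemma" WHERE reference_number = ?'

plan_before = query_plan(conn, lookup, (537850,))
conn.execute(
    'CREATE INDEX "lemma_reference_number" ON "reader_lemma" ("reference_number")'
)
plan_after = query_plan(conn, lookup, (537850,))

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index exists the plan should describe a full table scan. Afterwards it should name `lemma_reference_number`, which is the change that makes the per-entry lemma lookup cheap.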
#23 Updated by Luke Murphey almost 12 years ago
Many entries have single quotes at odd places ('ναι, εῦσθ'). Perhaps I should drop these entirely.
#24 Updated by Luke Murphey almost 12 years ago
- Status changed from In Progress to Closed