Task #479

Feature #466: Lemma lookup

Diogenes Analyses Importer

Added by Luke Murphey over 11 years ago. Updated over 11 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 12/24/2012
Due date:
% Done: 100%


Description

Need to write an importer that takes the analyses file and marks up the lemma entries with the appropriate meaning.

unmatched_analyses.csv (164 KB) Luke Murphey, 12/20/2012 07:21 PM


Subtasks

Bug #492: Analysis entries are not matching the extraction regular expression (Closed, Luke Murphey)

History

#1 Updated by Luke Murphey over 11 years ago

  • Status changed from New to In Progress

#2 Updated by Luke Murphey over 11 years ago

Hmm, lining up the analyses file with the lemmata is proving difficult. Consider the following log output, which shows that matches are not occurring:

Forms discovered, count=2, line_number=100000, form_number=1
Forms discovered, count=2, line_number=100000, form_number=2
Forms discovered, count=2, line_number=100000, form_number=3
Forms discovered, count=2, line_number=100000, form_number=4
Forms discovered, count=1, line_number=100001, form_number=1
Forms discovered, count=1, line_number=100001, form_number=2
Forms discovered, count=1, line_number=100001, form_number=3
Forms discovered, count=1, line_number=100001, form_number=4
Forms discovered, count=1, line_number=100001, form_number=5
Forms discovered, count=1, line_number=100001, form_number=6
Forms discovered, count=1, line_number=100002, form_number=1
Forms discovered, count=1, line_number=100002, form_number=2
Forms discovered, count=1, line_number=100002, form_number=3
Forms discovered, count=1, line_number=100003, form_number=1
Forms discovered, count=1, line_number=100003, form_number=2
Forms discovered, count=1, line_number=100003, form_number=3
Forms discovered, count=1, line_number=100004, form_number=1
Forms discovered, count=1, line_number=100004, form_number=2
Forms discovered, count=1, line_number=100004, form_number=3
Forms discovered, count=0, line_number=100005, form_number=1
Forms discovered, count=0, line_number=100005, form_number=2

Here are lines 100000 through 100005:

a)nalou=to
    {7566000 9 a)na__lou=to,a)nali/skw    use up    imperf ind mp 3rd sg (doric aeolic)}
    {7566000 9 a)na_lou=to,a)nali/skw    use up    imperf ind mp 3rd sg (homeric ionic)}
    {7566000 9 a_)nalou=to,a)nalo/w    use up    imperf ind mp 3rd sg (doric aeolic)}
    {7566000 9 a)nalo/w    use up    imperf ind mp 3rd sg (homeric ionic)}
a)naloume/nas
    {7585355 9 a)naloume/na_s,a)na/llomai    leap    fut part mid fem acc pl (attic epic doric)}
    {7585355 9 a)naloume/na_s,a)na/llomai    leap    fut part mid fem gen sg (doric)}
    {7566000 9 a)na_loume/na_s,a)nali/skw    use up    pres part mp fem acc pl}
    {7566000 9 a)na_loume/na_s,a)nali/skw    use up    pres part mp fem gen sg (doric aeolic)}
    {7566000 9 a)naloume/na_s,a)nalo/w    use up    pres part mp fem acc pl}
    {7566000 9 a)naloume/na_s,a)nalo/w    use up    pres part mp fem gen sg (doric aeolic)}
a)naloume/nh
    {7585355 9 a)na/llomai    leap    fut part mid fem nom/voc sg (attic epic)}
    {7566000 9 a)na_loume/nh,a)nali/skw    use up    pres part mp fem nom/voc sg (attic epic ionic)}
    {7566000 9 a)nalo/w    use up    pres part mp fem nom/voc sg (attic epic ionic)}
a)naloume/nhn
    {7585355 9 a)na/llomai    leap    fut part mid fem acc sg (attic epic)}
    {7566000 9 a)na_loume/nhn,a)nali/skw    use up    pres part mp fem acc sg (attic epic ionic)}
    {7566000 9 a)nalo/w    use up    pres part mp fem acc sg (attic epic ionic)}
a)naloume/nhs
    {7585355 9 a)na/llomai    leap    fut part mid fem gen sg (attic epic)}
    {7566000 9 a)na_loume/nhs,a)nali/skw    use up    pres part mp fem gen sg (attic epic ionic)}
    {7566000 9 a)nalo/w    use up    pres part mp fem gen sg (attic epic ionic)}
a)naloume/nou
    {7585355 9 a)na/llomai    leap    fut part mid masc/neut gen sg (attic epic doric)}
    {7566000 9 a)na_loume/nou,a)nali/skw    use up    pres part mp masc/neut gen sg}
    {7566000 9 a)nalo/w    use up    pres part mp masc/neut gen sg}

#3 Updated by Luke Murphey over 11 years ago

I wonder, should I just import the analyses file directly? Or rather, what does the lemmata file give me that the analyses file doesn't?

#4 Updated by Luke Murphey over 11 years ago

From the analyses file:

a(/bra    {537850 9 a(/bra_,a(/bra    favourite slave    fem nom/voc/acc dual}{537850 9 a(/bra_,a(/bra    favourite slave    fem nom/voc sg (attic doric aeolic)}
a(/brai    {537850 9 a(/bra    favourite slave    fem nom/voc pl}{537850 9 a(/bra_|,a(/bra    favourite slave    fem dat sg (attic doric aeolic)}
a(/brais    {537850 9 a(/bra    favourite slave    fem dat pl}
a(/bran    {537850 9 a(/bra_n,a(/bra    favourite slave    fem acc sg (attic doric aeolic)}
a(/bras    {537850 9 a(/bra_s,a(/bra    favourite slave    fem acc pl}{537850 9 a(/bra_s,a(/bra    favourite slave    fem gen sg (attic doric aeolic)}
a(/bra|    {537850 9 a(/brai,a(/bra    favourite slave    fem nom/voc pl}{537850 9 a(/bra_|,a(/bra    favourite slave    fem dat sg (attic doric aeolic)}
...
a(bra=n    {537850 9 a(/bra    favourite slave    fem gen pl (doric aeolic)}{555266 9 a(bro/s    graceful    masc/fem gen pl (doric)}
...
a(brw=n    {537850 9 a(/bra    favourite slave    fem gen pl}{555266 9 a(bro/s    graceful    fem gen pl}{555266 9 a(bro/s    graceful    masc/neut gen pl}{555266 9 a(bro/s    graceful    masc/fem/neut gen pl}

From the lemmata file:

a(/bra    537850    a(/bra (fem nom/voc/acc dual) (fem nom/voc sg (attic doric aeolic))    a(/brai (fem nom/voc pl) (fem dat sg (attic doric aeolic))    a(/brais (fem dat pl)    a(/bran (fem acc sg (attic doric aeolic))    a(/bras (fem acc pl) (fem gen sg (attic doric aeolic))    a(/bra| (fem nom/voc pl) (fem dat sg (attic doric aeolic))    a(bra=n (fem gen pl (doric aeolic))    a(brw=n (fem gen pl)

It seems that the lemmata file lists all forms of a lemma on a single line, whereas the analyses file gives each form its own line, with that line containing all possible meanings for the form. If this is the case, then I really need to be importing from the analyses file and not the lemmata file.
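A minimal sketch of how a line in the one-form-per-line layout quoted above (a form followed by one or more `{...}` entries) might be split into its parts. It assumes the fields inside each entry are separated by a tab (or, as pasted here, a run of spaces) and that the lemma is the last comma-separated item in the first field. The names `parse_analyses_line` and `Analysis` are hypothetical and not taken from the importer, and the meaning of the second number in each entry (the `9`) is not stated here, so it is ignored.

```python
import re
from typing import List, NamedTuple

# Assumption: fields are tab-separated in the file; the pasted text shows runs of spaces.
FIELD_SEP = re.compile(r"\t| {2,}")
# One brace-delimited entry; beta-code uses parentheses but no braces.
ENTRY = re.compile(r"\{([^{}]*)\}")


class Analysis(NamedTuple):
    reference_number: int
    lemma: str
    meaning: str
    description: str


def parse_analyses_line(line: str):
    """Return (form, [Analysis, ...]) for one line, skipping malformed entries."""
    form = line.partition("{")[0].strip()
    analyses: List[Analysis] = []
    for body in ENTRY.findall(line):
        fields = FIELD_SEP.split(body.strip())
        if len(fields) != 3:
            continue  # does not have the expected three fields
        head = fields[0].split(" ", 2)  # reference number, unknown code, form(s)
        if len(head) != 3 or not head[0].isdigit():
            continue
        lemma = head[2].split(",")[-1]  # lemma is the last comma-separated item
        analyses.append(Analysis(int(head[0]), lemma, fields[1], fields[2]))
    return form, analyses


form, analyses = parse_analyses_line(
    "a(/bras    {537850 9 a(/bra_s,a(/bra    favourite slave    fem acc pl}"
    "{537850 9 a(/bra_s,a(/bra    favourite slave    fem gen sg (attic doric aeolic)}"
)
```

Entries that do not fit the expected shape are skipped here rather than raised, which mirrors the "does not match" logging idea only loosely; a real importer would presumably log them.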

#5 Updated by Luke Murphey over 11 years ago

The greek-analyses file has 911871 lines.

The following search will return the progress of the analyses import:

sourcetype="django" | stats max(line_number) as line_number | eval progress=100*line_number/911871

#6 Updated by Luke Murphey over 11 years ago

You can clear out the Greek lemma data by dropping the related tables:

drop table reader_lemma;
drop table reader_case;
drop table reader_dialect;
drop table reader_worddescription;
drop table reader_worddescription_cases;
drop table reader_worddescription_dialects;
drop table reader_wordform;

#7 Updated by Luke Murphey over 11 years ago

Some of the entries in the analyses file are not in the greek-lemmata file. These can be counted with the following search:

sourcetype="django" "Unable to find the lemma for an analysis entry" | stats count(sourcetype)

#8 Updated by Luke Murphey over 11 years ago

Successfully imported 911871 entries from the analyses file. However, the lemmas could not be found for 5,581 entries (see attached for a list).

The unmatched entries can be viewed with the following Splunk search:

sourcetype="django" "Unable to find the lemma for an analysis entry" | sort _time | table form line_number form_number

#10 Updated by Luke Murphey over 11 years ago

It seems to take about six hours to import the analyses.

#11 Updated by Luke Murphey over 11 years ago

  • % Done changed from 0 to 60

#12 Updated by Luke Murphey over 11 years ago

Some leftover issues:

  1. ζῳάγρια lists the definition as "ransom paid for a prisoner taken alive) reward for life saved" (note the unmatched parenthesis)
  2. Some entries are not getting imported correctly because they contain exclamation marks, which are not valid beta-code from what I can tell so far

#13 Updated by Luke Murphey over 11 years ago

  • % Done changed from 60 to 70

#14 Updated by Luke Murphey over 11 years ago

  • Tracker changed from Bug to Task

#15 Updated by Luke Murphey over 11 years ago

  • % Done changed from 70 to 90

Some entries have leading characters that are preventing them from being matched correctly.

#16 Updated by Luke Murphey over 11 years ago

Lots of forms are being skipped. You can view them with the following search:

sourcetype="django" "Unable to find the lemma for an analysis entry" | table form line_number form_number

Also, many are not matching the regex. You can see these with the following search:

sourcetype="django" "Analysis entry does not match the regex" | table form line_number form_number

#17 Updated by Luke Murphey over 11 years ago

The importer is now taking about 80 seconds to import 1,000 entries. The import started at 04:13:27 and had imported 352,000 entries by 2012-12-24 12:05:36. It is importing about 12 entries per second.
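The figures above can be checked with a short calculation. This is only a sketch: it assumes the start time of 04:13:27 falls on the same day (2012-12-24) as the later timestamp, and it uses the 911871-line total mentioned earlier to estimate the time remaining at a constant rate.

```python
from datetime import datetime

# Assumption: the start time is on the same day as the later timestamp.
started = datetime(2012, 12, 24, 4, 13, 27)
checked = datetime(2012, 12, 24, 12, 5, 36)
imported = 352_000
total_lines = 911_871  # line count of the greek-analyses file

elapsed_seconds = (checked - started).total_seconds()
entries_per_second = imported / elapsed_seconds
seconds_per_thousand = 1000 / entries_per_second
# Estimate only: assumes the rate stays constant for the rest of the file.
remaining_hours = (total_lines - imported) / entries_per_second / 3600

print(f"{entries_per_second:.1f} entries/s, {seconds_per_thousand:.0f} s per 1,000")
print(f"about {remaining_hours:.1f} hours remaining at this rate")
```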

#18 Updated by Luke Murphey over 11 years ago

Moving the commit to the file level did not improve performance noticeably.

#19 Updated by Luke Murphey over 11 years ago

I tried deferring loading of all fields in the lemma query to speed up performance (http://stackoverflow.com/questions/2846029/django-set-foreign-key-using-integer) but this didn't seem to help much.

#20 Updated by Luke Murphey over 11 years ago

Loading each entry seems to take about 0.03 seconds.

#21 Updated by Luke Murphey over 11 years ago

The slowdown seems to be in the query that looks up the lemma ID by reference number. This is somewhat surprising since the reference field is indexed.

#22 Updated by Luke Murphey over 11 years ago

Actually, the reference number field was not indexed for some reason. I manually created the index with:

CREATE INDEX "lemma_reference_number" ON "reader_lemma" ("reference_number");
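One way to confirm that such an index is actually used is to compare the query plan before and after creating it. The sketch below does this against an in-memory SQLite database; the database engine, the simplified two-column table, and the `query_plan` helper are all assumptions made for illustration, since the real schema and backend are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for the real reader_lemma table.
conn.execute(
    'CREATE TABLE "reader_lemma" (id INTEGER PRIMARY KEY, reference_number INTEGER)'
)
conn.executemany(
    'INSERT INTO "reader_lemma" (reference_number) VALUES (?)',
    [(n,) for n in range(1000)],
)

QUERY = 'SELECT id FROM "reader_lemma" WHERE reference_number = ?'


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's plan description for the lemma lookup query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (537850,)).fetchall()
    return " ".join(row[-1] for row in rows)  # the last column holds the detail text


plan_before = query_plan(conn)
conn.execute(
    'CREATE INDEX "lemma_reference_number" ON "reader_lemma" ("reference_number")'
)
plan_after = query_plan(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

In a Django model, an index like this is normally declared with `db_index=True` on the field, so that it is created along with the table instead of by hand.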

#23 Updated by Luke Murphey over 11 years ago

Many entries have single quotes at odd places ('ναι, εῦσθ'). Perhaps I should drop these entirely.

#24 Updated by Luke Murphey over 11 years ago

  • Status changed from In Progress to Closed
