Feature #2783
Improved lemmatization
Start date: 06/13/2020
Due date:
% Done: 100%
History
#1 Updated by Luke Murphey almost 4 years ago
Obs:
- https://raw.githubusercontent.com/PhVerkerk/Eulexis_off_line/master/Eulexis_data/analyses_gr.txt has parses but no short definitions
- https://raw.githubusercontent.com/PhVerkerk/Eulexis_off_line/master/Eulexis_data/LSJ.csv has links to LSJ
- https://github.com/PhVerkerk/Eulexis_off_line/blob/master/Eulexis_data/trad_gr_en_fr_de.csv has short definitions in multiple languages
- This project is related to https://github.com/pjheslin/diogenes
- Perseus has an API too: http://sites.tufts.edu/perseusupdates/2012/11/01/morphology-service-beta/
- Is the trad_gr_en_fr_de.csv better than the one I have?
- How would I want to use these online tools?
- Have a UI option to parse it
- Have the UI parse it when it cannot find a parse
- Have the UI parse it when it cannot find an LSJ definition
- Where is the Diogenes list? How does it compare?
- What does the Diogenes data look like?
- What are some strange parses I ought to consider as test cases?
- https://textcritical.net/work/new-testament/Luke
- καὶ
- γενόμενοι
- ἔδοξε
- κἀμοὶ
- καθεξῆς
- λόγων
- τὴν
- ΕΓΕΝΕΤΟ
- στεῖρα
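A rough sketch of the "parse it when it cannot find a parse" idea above, which also covers test cases like ΕΓΕΝΕΤΟ (all caps, no accents) and καὶ (grave accent). Everything named here is hypothetical: `normalize_form`, `find_parses`, the `local_parses` dict and the `remote_parse` callable are stand-ins, not the actual TextCritical code or any Eulexis/Perseus API. It assumes the local parse table is keyed by the lowercased form with diacritics stripped.

```python
import unicodedata


def normalize_form(form):
    """Lowercase a Greek form and drop its combining marks.

    Assumption: accents, breathings and iota subscripts are ignored for
    lookup, so 'ΕΓΕΝΕΤΟ' and 'ἐγένετο' end up with the same key.
    """
    decomposed = unicodedata.normalize("NFD", form)
    bare = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return bare.lower()


def find_parses(form, local_parses, remote_parse=None):
    """Return (parses, source): local table first, then the optional fallback.

    local_parses: hypothetical dict mapping normalized forms to parse lists.
    remote_parse: hypothetical callable wrapping an online parser.
    """
    parses = local_parses.get(normalize_form(form))
    if parses:
        return parses, "local"
    if remote_parse is not None:
        parses = remote_parse(form)
        if parses:
            return parses, "remote"
    return [], "none"
```

The same shape would work for the LSJ definition fallback: try the local definitions first and only call the online tool when nothing comes back. Stripping diacritics makes the lookup more forgiving, but it also merges forms that differ only by accent, so it may return extra parses.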
#2 Updated by Luke Murphey almost 4 years ago
- Target version deleted (4.2.0)
#3 Updated by Luke Murphey over 3 years ago
- Target version set to 4.6.7
https://textcritical.net/work/antiquitates-judaicae
- Τοῖς
- lentil
- τὰς
- lentil
- ἱστορίας
- συγγράφειν
- ἀντί-συγγράφω (pres inf act): write
- βουλομένοις
- ἀνά-βούλομαι (pres part mid-pass masc/nuet dat pl): will
- οὐ
- οὐ (adverbial indeclinable): in truth
- μίαν
- εἷς (fem acc sg): sem
- οὐδὲ
- οὐδόσ1 (masc voc sg): threshold
- τὴν
- αὐτὴν
- ἀυτή (fem acc sg): cry
- ὁρῶ
- ἀμφί-ὁράω (pres mid imperat 2nd sg): Inscr. destombeaux des rois
- τῆς
- σπουδῆς
- ἀνά-σπουδάζω (fut act ind 2nd sg): to be busy
- γινομένην
- ἀνά-γίγνομαι (pres part mid-pass fem acc sg): come into a new state of being
- αἰτίαν,
- ἀλλὰ
- ἄλλος (nuet nom/voc/acc pl): y
- πολλὰς
- καὶ
- ἀντί-καίω (pres act imperat 2nd sg): kindle
- πλεῖστον
- ἀλλήλων
- διαφερούσας.
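Several of the parses above attach a preverb (ἀντί-, ἀνά-, ἀμφί-) that does not appear in the word itself. A minimal sketch of a check that could flag these, assuming lemmas use the "preverb-verb" hyphen format shown above. The function names are made up for illustration, and comparing only the first two letters of the preverb is a crude heuristic, not how any of the tools mentioned here work.

```python
import unicodedata


def strip_diacritics(text):
    """Lowercase the text and remove accents and breathings."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return bare.lower()


def has_unsupported_preverb(form, lemma):
    """Return True when the lemma has a preverb that the form does not start with.

    Only the first two letters are compared so that elided preverbs
    (e.g. ἀνά- becoming ἀν-) still match.
    """
    if "-" not in lemma:
        return False
    preverb = strip_diacritics(lemma.split("-", 1)[0])
    return not strip_diacritics(form).startswith(preverb[:2])


# (form, lemma) pairs copied from the parses listed in this update.
pairs = [
    ("συγγράφειν", "ἀντί-συγγράφω"),
    ("βουλομένοις", "ἀνά-βούλομαι"),
    ("οὐ", "οὐ"),
    ("μίαν", "εἷς"),
    ("καὶ", "ἀντί-καίω"),
]
flagged = [form for form, lemma in pairs if has_unsupported_preverb(form, lemma)]
```

This would not catch the other kinds of bad parse in the list (οὐδὲ as οὐδόσ1, αὐτὴν as ἀυτή), and it may wrongly flag preverbs that assimilate to the following consonant (ἐν- becoming ἐμ-, for example).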
#4 Updated by Luke Murphey over 3 years ago
I lemmatized "Τοῖς τὰς ἱστορίας συγγράφειν βουλομένοις οὐ μίαν οὐδὲ τὴν αὐτὴν ὁρῶ τῆς σπουδῆς γινομένην αἰτίαν, ἀλλὰ πολλὰς καὶ πλεῖστον ἀλλήλων διαφερούσας" at https://outils.biblissima.fr/en/eulexis-web/
This doesn't look much better than what I have.
#5 Updated by Luke Murphey over 3 years ago
There is a lemmatizer here: https://docs.cltk.org/en/latest/greek.html
This is based on greek_models_cltk (https://github.com/nodage/greek_models_cltk)
Import it here: https://docs.cltk.org/en/latest/importing_corpora.html
from cltk.tag.pos import POSTag
tagger = POSTag('greek')
tagger.tag_ngram_123_backoff('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')
It also parses TEI: https://docs.cltk.org/en/latest/greek.html#tei-xml
#6 Updated by Luke Murphey over 3 years ago
#7 Updated by Luke Murphey over 3 years ago
#8 Updated by Luke Murphey over 3 years ago
from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('greek')
corpus_importer.import_corpus('greek_models_cltk')
#10 Updated by Luke Murphey over 3 years ago
- Target version deleted (4.6.7)
#11 Updated by Luke Murphey about 2 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100