Bug #446

Feature #403: Perseus Book Importer

Fix issue where some works import no divisions or verses

Added by Luke Murphey about 12 years ago. Updated about 12 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date:
Due date:
% Done: 100%

Associated revisions

Revision 172 (diff)
Added by Luke Murphey about 12 years ago

Chapters are now treated as chunks. Reference #446.

Revision 165 (diff)
Added by Luke Murphey about 12 years ago

Chapters are now treated as chunks. Reference #446.

Revision 178 (diff)
Added by Luke Murphey about 12 years ago

Fixed issue where the importer did not import some works which had a head tag. Reference #446.

Revision 171 (diff)
Added by Luke Murphey about 12 years ago

Fixed issue where the importer did not import some works which had a head tag. Reference #446.

Revision 179 (diff)
Added by Luke Murphey about 12 years ago

Changed the import policy to allow import of plut.068_teubner_gk.xml. Reference #446.

Revision 172 (diff)
Added by Luke Murphey about 12 years ago

Changed the import policy to allow import of plut.068_teubner_gk.xml. Reference #446.

History

#1 Updated by Luke Murphey about 12 years ago

The following works are affected:

1 xen.mem_gk.xml 
2 xen.hell_gk.xml 
3 xen.cyrop_gk.xml 
4 xen.anab_gk.xml 
5 char_gk.xml 
6 idylls_gk.xml 
7 plut.083_loeb_gk.xml 
8 plut.068_teubner_gk.xml 
9 pind_gk.xml 
10 paus_gk.xml 
11 nonnos_03.xml 
12 nonnos_02.xml 
13 nonnos_01.xml 
14 lys_gk.xml 
15 lyc_gk.xml 
16 71_gk.xml 
17 hyp_gk.xml 
18 hp.littre_gk.xml 
19 dl_gk.xml 
20 diochr01_gk.xml
21 din_gk.xml 
22 dem51-61_gk.xml 
23 dem41-50_gk.xml 
24 dem31-40_gk.xml 
25 dem21-30_gk.xml 
26 dem11-20_gk.xml 
27 dem01-10_gk.xml 
28 demad_gk.xml 
29 call_02.xml 
30 nt_gk.xml 
31 bacchyl_gk.xml 
32 aret_gk.xml 
33 andoc_gk.xml 

#2 Updated by Luke Murphey about 12 years ago

The problem is that these works are chunked by chapter, but the chapters are not marked as chunks, so the importer doesn't know to break them into divisions.
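A minimal sketch of the idea, assuming the importer decides what becomes a division from a set of chunk tag names. The names CHUNK_TAGS and find_divisions are hypothetical and are not taken from the importer's code; the sketch only shows why a work yields no divisions until chapter counts as a chunk.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of tag names that mark chunk boundaries (an assumption,
# not the importer's real configuration).
CHUNK_TAGS = {"chapter"}

def find_divisions(root, chunk_tags=CHUNK_TAGS):
    """Return the elements that would become divisions, in document order."""
    return [el for el in root.iter() if el.tag in chunk_tags]

SAMPLE = (
    "<text><body>"
    "<chapter><p>first</p></chapter>"
    "<chapter><p>second</p></chapter>"
    "</body></text>"
)

root = ET.fromstring(SAMPLE)

# With no chunk tags configured, nothing is recognized as a division.
without_chapters = find_divisions(root, chunk_tags=set())

# Once "chapter" is treated as a chunk, each chapter becomes a division.
with_chapters = find_divisions(root)
```

In this sketch the first call finds nothing to split on, which corresponds to the "No divisions were discovered" failure, while the second call finds both chapters.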

#3 Updated by Luke Murphey about 12 years ago

  • % Done changed from 0 to 50

#4 Updated by Luke Murphey about 12 years ago

  • % Done changed from 50 to 70

#5 Updated by Luke Murphey about 12 years ago

Making chapters chunks reduced the number of works with this error to 27 (from 33).

Theophrastus' Characters is not importing because the importer finds no verses in it. The division content looks correct, but the milestones are not being detected. Below is an example chapter that does not import:

<chapter>
<head>*ei)rwnei/as *a'</head>
<p><milestone n="1" unit="section"/>*(h me\n ou)=n ei)rwnei/a do/ceien a)\n ei)=nai, w(s tu/pw| labei=n,
prospoi/hsis e)pi\ xei=ron pra/cewn kai\ lo/gwn, o( de\ ei)/rwn
<milestone n="2" unit="section"/>toiou=to/s tis, oi(=os proselqw\n toi=s e)xqroi=s e)qe/lein lalei=n,
ou) misei=n: kai\ e)painei=n paro/ntas, oi(=s e)pe/qeto la/qra, kai\
tou/tois sullupei=sqai h(ttwme/nois: kai\ suggnw/mhn de\ e)/xein
toi=s au(to\n kakw=s le/gousi kai\ e)pi\ toi=s kaq' e(autou= legome/nois.
<milestone n="3" unit="section"/>kai\ pro\s tou\s a)dikoume/nous kai\ a)ganaktou=ntas pra/ws
diale/gesqai: kai\ toi=s e)ntugxa/nein kata\ spoudh\n boulome/nois
<milestone n="4" unit="section"/>prosta/cai e)panelqei=n. kai\ mhde\n w(=n pra/ttei o(mologh=sai,
a)lla\ fh=sai bouleu/esqai: kai\ prospoih/sasqai a)/rti paragegone/nai
<milestone n="5" unit="section"/>kai\ o)ye\ gene/sqai au)to\n kai\ malakisqh=nai. kai\
pro\s tou\s daneizome/nous kai\ e)rani/zontas . . . w(s ou) pwlei=, kai\
mh\ pwlw=n fh=sai pwlei=n: kai\ a)kou/sas ti mh\ prospoiei=sqai,
kai\ i)dw\n fh=sai mh\ e(orake/nai, kai\ o(mologh/sas mh\ memnh=sqai:
kai\ ta\ me\n ske/yasqai fa/skein, ta\ de\ ou)k ei)de/nai, ta\ de\
<pb/>qauma/zein, ta\ d' h)/dh pote\ kai\ au)to\s ou(/tws dialogi/sasqai.
</p>

</chapter>
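For illustration, here is one way the section milestones in a paragraph like the one above could be turned into verses. This is a sketch under assumptions, not the importer's actual code: split_sections is a hypothetical helper, it uses xml.etree.ElementTree, and it only looks at the direct children of the p element. The sample text is shortened placeholder content rather than the Greek above.

```python
import xml.etree.ElementTree as ET

def split_sections(p):
    """Map each section milestone's n attribute to the text that follows it.

    Only direct children of the paragraph are considered (an assumption made
    to keep the sketch short). In ElementTree, the text following an empty
    element such as <milestone/> or <pb/> is stored in that element's tail.
    """
    sections = {}
    current = None
    for child in p:
        if child.tag == "milestone" and child.get("unit") == "section":
            current = child.get("n")
            sections[current] = ""
        if current is not None:
            sections[current] += child.tail or ""
    # Collapse line breaks and repeated spaces inside each section.
    return {n: " ".join(text.split()) for n, text in sections.items()}

SAMPLE = """<chapter>
<head>title</head>
<p><milestone n="1" unit="section"/>first
part <milestone n="2" unit="section"/>second <pb/>part
</p>
</chapter>"""

chapter = ET.fromstring(SAMPLE)
sections = split_sections(chapter.find("p"))
```

Text after a page break (pb) stays attached to the section that was open when the break occurred, which matches how the last section in the example above continues past its pb element.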

#6 Updated by Luke Murphey about 12 years ago

The problem is that the importer fails to recurse into nodes that follow a head node: once a head node is seen, the recurse flag stays changed for all following nodes. I'm surprised that this didn't prevent other works from importing. All unit tests pass after fixing this, though.
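The class of bug described here can be sketched as follows. This is an illustration only, assuming a recursive walk with a local recurse flag; the function names are hypothetical and the code is not the importer's.

```python
import xml.etree.ElementTree as ET

def collect_milestones_buggy(node, found):
    """Walk the tree, but leave the flag cleared after the first head node."""
    recurse = True
    for child in node:
        if child.tag == "head":
            recurse = False  # bug: stays False for every later sibling
        if child.tag == "milestone":
            found.append(child.get("n"))
        if recurse:
            collect_milestones_buggy(child, found)
    return found

def collect_milestones_fixed(node, found):
    """Same walk, but the flag is decided separately for each child."""
    for child in node:
        recurse = child.tag != "head"  # only the head node itself is skipped
        if child.tag == "milestone":
            found.append(child.get("n"))
        if recurse:
            collect_milestones_fixed(child, found)
    return found

SAMPLE = (
    "<chapter><head>title</head>"
    '<p><milestone n="1" unit="section"/>a'
    '<milestone n="2" unit="section"/>b</p>'
    "</chapter>"
)

chapter = ET.fromstring(SAMPLE)
buggy = collect_milestones_buggy(chapter, [])
fixed = collect_milestones_fixed(chapter, [])
```

In the buggy walk the p element comes after head, so it is never entered and no milestones are found. Works without a head tag ahead of their content would not hit this path, which may be why other imports were unaffected.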

#7 Updated by Luke Murphey about 12 years ago

The following two works are not being imported:

1    plut.068_teubner_gk.xml
2    71_gk.xml

The error is:

2012-11-18 11:03:52,864 [ERROR] reader.importer.PerseusBatchImporter: Exception generated when attempting to process file="plut.068_teubner_gk.xml" 
Traceback (most recent call last):
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/importer/PerseusBatchImporter.py", line 326, in process_directory
    if self.__process_file__( os.path.join( root, f) ):
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/importer/PerseusBatchImporter.py", line 277, in __process_file__
    return self.process_file(file_path, document_xml, title, author, language)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/importer/PerseusBatchImporter.py", line 455, in process_file
    perseus_importer.import_file(file_path)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/importer/Perseus.py", line 143, in import_file
    return self.import_xml_document(doc)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/django/db/transaction.py", line 209, in inner
    return func(*args, **kwargs)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/importer/Perseus.py", line 643, in import_xml_document
    raise Exception("No divisions were discovered, title=%s" % (self.work.title) )
Exception: No divisions were discovered, title=Quomodo adolescens poetas audire debeat

#8 Updated by Luke Murphey about 12 years ago

  • % Done changed from 70 to 90

#9 Updated by Luke Murphey about 12 years ago

71_gk.xml (Podagra) is the only file that cannot be imported (the importer complains that no divisions were discovered).

#10 Updated by Luke Murphey about 12 years ago

All works now import successfully. The import took 4606 seconds (about 77 minutes) and the resulting file is 330 MB.

#11 Updated by Luke Murphey about 12 years ago

  • Status changed from New to Closed
  • % Done changed from 90 to 100
