Bug #1258
Work indexer fails
100%
Description
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/Library/Python/2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/Library/Python/2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Library/Python/2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Library/Python/2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/management/commands/make_search_indexes.py", line 36, in handle
    WorkIndexer.index_all_works()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 124, in index_all_works
    cls.index_work(work)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 164, in index_work
    cls.index_division(division)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 187, in index_division
    writer.commit()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/writing.py", line 922, in commit
    finalsegments = self._merge_segments(mergetype, optimize, merge)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/writing.py", line 827, in _merge_segments
    return mergetype(self, self.segments)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/writing.py", line 101, in MERGE_SMALL
    writer.add_reader(reader)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/writing.py", line 709, in add_reader
    docmap = self.write_per_doc(fieldnames, reader)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/writing.py", line 690, in write_per_doc
    v = reader.vector(docnum, fieldname, fieldobj.vector)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/reading.py", line 832, in vector
    return self._perdoc.vector(docnum, fieldname, vformat)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 482, in vector
    byteids=True)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 879, in __init__
    self._read_header()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 890, in _read_header
    raise Exception("Block tag error %r" % magic)
Exception: Block tag error 'VPST'
The last log entry is:
reader.contentsearch: Successfully indexed verse, verse=4, division="36", work="Cynegeticus"
History
#1 Updated by Luke Murphey almost 9 years ago
Seems to be crashing after the following:
Successfully indexed verse, verse=4, division="36", work="Cynegeticus"
Tried to regenerate it with the following:
python manage.py make_search_indexes -c -w "Cynegeticus"
This worked, however.
#2 Updated by Luke Murphey almost 9 years ago
Moving the index commit to the work level makes it faster.
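A minimal sketch of the idea, not the actual contentsearch.py code: the writer is passed down and committed once per work instead of once per division. The commit/writer keyword pattern mirrors the calls visible in the later tracebacks; CountingWriter and the sample data are hypothetical stand-ins so the example runs without Whoosh. Note that a real Whoosh writer cannot be reused after commit(), which the stub ignores.

```python
class CountingWriter(object):
    """Hypothetical stand-in for an index writer; it only counts calls."""

    def __init__(self):
        self.documents = 0
        self.commits = 0

    def add_document(self, **fields):
        self.documents += 1

    def commit(self):
        self.commits += 1


def index_division(verses, writer, commit=True):
    """Add every verse of one division; commit only if asked to."""
    for verse in verses:
        writer.add_document(content=verse)
    if commit:
        writer.commit()


def index_work(divisions, writer, commit_per_division):
    """Index a work, committing either per division or once per work."""
    for verses in divisions:
        index_division(verses, writer, commit=commit_per_division)
    if not commit_per_division:
        writer.commit()


# A made-up work with three divisions of two verses each.
work = [["v1", "v2"], ["v3", "v4"], ["v5", "v6"]]

per_division = CountingWriter()
index_work(work, per_division, commit_per_division=True)

per_work = CountingWriter()
index_work(work, per_work, commit_per_division=False)

print(per_division.commits, per_work.commits)
```

Both runs add the same documents; only the number of commits differs, which is where the time goes when each commit may trigger a segment merge.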
#3 Updated by Luke Murphey almost 9 years ago
Now it is dying at:
Successfully indexed division, division="chapter 89", work="The Deipnosophists, Book 10"
#4 Updated by Luke Murphey almost 9 years ago
Consistently fails on:
python manage.py make_search_indexes -w the-deipnosophists-book-10
#5 Updated by Luke Murphey almost 9 years ago
Try indexing the work and committing by verse to identify the verse that is the problem.
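One way to structure that experiment, as a sketch only: commit after every verse and stop at the first exception. find_first_failure and fake_index_and_commit are hypothetical names; the real step would add the verse to the index and commit. Because the original traceback fails while merging existing segments, the verse reported this way may only be where the problem surfaces, not where it was introduced.

```python
def find_first_failure(verses, index_and_commit):
    """Return (position, verse, error) for the first verse whose
    index-and-commit step raises, or None if every verse succeeds."""
    for position, verse in enumerate(verses):
        try:
            index_and_commit(verse)
        except Exception as error:
            # Stop immediately: later commits are not attempted.
            return position, verse, error
    return None


def fake_index_and_commit(verse):
    # Hypothetical stand-in for "add the verse, then commit".
    if verse == "":
        raise ValueError("simulated commit failure")


verses = ["alpha", "beta", "", "gamma"]
result = find_first_failure(verses, fake_index_and_commit)
if result is not None:
    position, verse, error = result
    print("failed at position %d: %r (%s)" % (position, verse, error))
```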
#6 Updated by Luke Murphey almost 9 years ago
Observations:
- The indexing fails at the commit stage
- Once the error is observed, no more commits will work
- It is failing on the-deipnosophists-book-10
- Indexing the-deipnosophists-book-10 works after clearing the indexes; indexing the-deipnosophists-book-11 and Cynegeticus afterwards works too
- Whoosh 2.4.1 seemed to index fine
#7 Updated by Luke Murphey almost 9 years ago
I could try indexing on Windows to see if it reproduces on other platforms.
#8 Updated by Luke Murphey almost 9 years ago
Trying different versions of Whoosh:
- 2.4.1: works
- 2.7.2: doesn't
- 2.5.0: fails with a different error (IndexError: list index out of range) after the log entry: reader.contentsearch: Successfully indexed verse, verse=3, division="35", work="Cynegeticus"
2.5.0 error:
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/Library/Python/2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/Library/Python/2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Library/Python/2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Library/Python/2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/management/commands/make_search_indexes.py", line 42, in handle
    WorkIndexer.index_all_works()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 165, in index_all_works
    cls.index_work(work, commit=True)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 216, in index_work
    cls.index_division(division, commit=False, writer=writer)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 239, in index_division
    cls.index_verse(verse, division=division, writer=writer, commit=False)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 324, in index_verse
    author = author_str
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/writing.py", line 752, in add_document
    perdocwriter.add_vector_items(fieldname, field, vitems)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 232, in add_vector_items
    vinfo = vpostwriter.finish_postings()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 648, in finish_postings
    terminfo.add_block(self)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 1083, in add_block
    self._minid = block.min_id()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 776, in min_id
    return self._ids[0]
#9 Updated by Luke Murphey almost 9 years ago
Going to try disabling term vectors
#10 Updated by Luke Murphey almost 9 years ago
2.4.1 indexing fails:
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/Library/Python/2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/Library/Python/2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Library/Python/2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Library/Python/2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/management/commands/make_search_indexes.py", line 42, in handle
    WorkIndexer.index_all_works()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 165, in index_all_works
    cls.index_work(work, commit=True)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 219, in index_work
    writer.commit()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filewriting.py", line 502, in commit
    finalsegments = self._merge_segments(mergetype, optimize, merge)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filewriting.py", line 432, in _merge_segments
    return mergetype(self, self.segments)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filewriting.py", line 68, in MERGE_SMALL
    writer.add_reader(reader)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filewriting.py", line 328, in add_reader
    self._merge_per_doc(reader, docmap)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filewriting.py", line 309, in _merge_per_doc
    perdocwriter.add_vector_matcher(fieldname, field, v)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh2.py", line 219, in add_vector_matcher
    self.add_vector_items(fieldname, fieldobj, readitems())
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh2.py", line 209, in add_vector_items
    self.vindex.add((self.docnum, fieldname), startoffset)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filetables.py", line 490, in add
    self._add(self.keycoder(key), self.valuecoder(data))
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filetables.py", line 122, in add
    self.add_all(((key, value),))
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filetables.py", line 346, in add_all
    raise ValueError("Keys must increase: %r .. %r" % (lk, key))
ValueError: Keys must increase: '\x00\x00\x00^\x00\x01' .. '\x00\x00\x00^\x00\x00'
#11 Updated by Luke Murphey almost 9 years ago
Fails on Windows too.
#12 Updated by Luke Murphey almost 9 years ago
Trying to index without vectors
#13 Updated by Luke Murphey almost 9 years ago
Debugging:
- Try on Windows: still fails
- Try on older versions: different error
- Try without vectors: works (including no vectors on the content field)
- Index by verse to identify bad work:
#14 Updated by Luke Murphey almost 9 years ago
Indexing without vectors works.
#15 Updated by Luke Murphey almost 9 years ago
Ok, I have a fairly minimal repro:
python .\manage.py make_search_indexes -c -w "Cynegeticus"
python .\manage.py make_search_indexes -w "the-deipnosophists-book-1"
After importing Cynegeticus, the index cannot be opened.
#16 Updated by Luke Murphey almost 9 years ago
Even smaller repro:
python .\manage.py make_search_indexes -c -w indica
#17 Updated by Luke Murphey almost 9 years ago
#18 Updated by Luke Murphey almost 9 years ago
Employed a workaround but this still didn't allow the entire index to be built.
Blew up after:
reader.contentsearch: Successfully indexed division, division="Book 12", work="Laws"
#19 Updated by Luke Murphey almost 9 years ago
I'm changing the analyzer such that it matches at least one character to see if that makes a difference.
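To illustrate what "matches at least one character" means for a regex-based token pattern, the sketch below compares a pattern that can match the empty string with one that cannot. Both patterns and the tokens helper are hypothetical; the analyzer actually used by the indexer is not shown in this ticket.

```python
import re


def tokens(pattern, text):
    """Return every match of pattern in text, including empty matches."""
    return [match.group(0) for match in re.finditer(pattern, text)]


text = "a b"

# "\w*" allows zero characters, so it can produce empty tokens.
may_be_empty = tokens(r"\w*", text)

# "\w+" requires at least one character, so every token is non-empty.
never_empty = tokens(r"\w+", text)

print(may_be_empty)
print(never_empty)
```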
#20 Updated by Luke Murphey almost 9 years ago
Nope, still fails.
#21 Updated by Luke Murphey almost 9 years ago
Trying without the analyzer.
#22 Updated by Luke Murphey almost 9 years ago
Interesting. Removing the analyzer causes the indexer to blow up much earlier (blows up on Argonautica).
#23 Updated by Luke Murphey almost 9 years ago
Logged from file contentsearch.py, line 250
Process SubWriterTask-416:
Traceback (most recent call last):
  File "C:\Program Files (x86)\Python2.7\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "D:\Users\Luke\Workspace\TextCritical.com\src\whoosh\multiproc.py", line 129, in run
    runname, fieldnames, segment = finish_subsegment(writer, k)
  File "D:\Users\Luke\Workspace\TextCritical.com\src\whoosh\multiproc.py", line 49, in finish_subsegment
    runname = writer.pool.runs[0]
IndexError: list index out of range
#24 Updated by Luke Murphey almost 9 years ago
- laws
- commentary-on-plato-protagoras-adam
- speeches-hyperides-english
#25 Updated by Luke Murphey almost 9 years ago
- Target version changed from 1.3 to 1.3.1
#26 Updated by Luke Murphey almost 9 years ago
- Status changed from New to In Progress
#27 Updated by Luke Murphey almost 9 years ago
- Status changed from In Progress to Closed
#28 Updated by Luke Murphey almost 9 years ago
- % Done changed from 0 to 100