Bug #1258

Work indexer fails

Added by Luke Murphey over 8 years ago. Updated over 8 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 03/11/2016
Due date:
% Done: 100%


Description

Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/Library/Python/2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/Library/Python/2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Library/Python/2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Library/Python/2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/management/commands/make_search_indexes.py", line 36, in handle
    WorkIndexer.index_all_works()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 124, in index_all_works
    cls.index_work(work)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 164, in index_work
    cls.index_division(division)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 187, in index_division
    writer.commit()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/writing.py", line 922, in commit
    finalsegments = self._merge_segments(mergetype, optimize, merge)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/writing.py", line 827, in _merge_segments
    return mergetype(self, self.segments)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/writing.py", line 101, in MERGE_SMALL
    writer.add_reader(reader)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/writing.py", line 709, in add_reader
    docmap = self.write_per_doc(fieldnames, reader)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/writing.py", line 690, in write_per_doc
    v = reader.vector(docnum, fieldname, fieldobj.vector)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/reading.py", line 832, in vector
    return self._perdoc.vector(docnum, fieldname, vformat)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 482, in vector
    byteids=True)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 879, in __init__
    self._read_header()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 890, in _read_header
    raise Exception("Block tag error %r" % magic)
Exception: Block tag error 'VPST'

The last log entry is:

reader.contentsearch: Successfully indexed verse, verse=4, division="36", work="Cynegeticus" 

History

#1 Updated by Luke Murphey over 8 years ago

The indexer seems to crash right after logging the following:

Successfully indexed verse, verse=4, division="36", work="Cynegeticus" 

Tried to regenerate it with the following:

python manage.py make_search_indexes -c -w "Cynegeticus" 

However, this worked.

#2 Updated by Luke Murphey over 8 years ago

Moving the index commit from the division level to the work level makes indexing faster.
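
A rough sketch of why fewer commits help, assuming a writer with add_document() and commit() methods. CountingWriter and both index functions are hypothetical stand-ins, not the code in contentsearch.py; the stub is reused after commit() for brevity, which a real Whoosh writer would not allow.

```python
class CountingWriter(object):
    """Hypothetical stand-in for an index writer; it only counts calls."""

    def __init__(self):
        self.documents = 0
        self.commits = 0

    def add_document(self, **fields):
        self.documents += 1

    def commit(self):
        self.commits += 1


def index_work_commit_per_division(work, writer):
    # One commit per division: each commit can trigger a segment merge.
    for division, verses in work.items():
        for verse in verses:
            writer.add_document(division=division, content=verse)
        writer.commit()


def index_work_commit_per_work(work, writer):
    # Add every verse first, then commit once for the whole work.
    for division, verses in work.items():
        for verse in verses:
            writer.add_document(division=division, content=verse)
    writer.commit()


# Made-up work: three divisions holding five verses in total.
work = {"1": ["a", "b"], "2": ["c"], "3": ["d", "e"]}

per_division = CountingWriter()
index_work_commit_per_division(work, per_division)

per_work = CountingWriter()
index_work_commit_per_work(work, per_work)

print("per division: %d documents, %d commits"
      % (per_division.documents, per_division.commits))
print("per work:     %d documents, %d commits"
      % (per_work.documents, per_work.commits))
```

Both variants add the same five documents; only the number of commits differs (three versus one).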

#3 Updated by Luke Murphey over 8 years ago

Now it is dying at:

Successfully indexed division, division="chapter 89", work="The Deipnosophists, Book 10" 

#4 Updated by Luke Murphey over 8 years ago

Consistently fails on:

python manage.py make_search_indexes -w the-deipnosophists-book-10

#5 Updated by Luke Murphey over 8 years ago

Next step: index the work and commit after every verse to identify the verse that causes the problem.
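
A minimal sketch of that approach, assuming a fresh writer can be opened for each verse. find_first_failing_verse, open_writer and FakeWriter are hypothetical names used for illustration; the fake writer simply fails on one chosen verse.

```python
def find_first_failing_verse(verses, open_writer):
    """Index verses one at a time, committing after each one.

    Returns the first verse whose add or commit raises, or None.
    """
    for verse in verses:
        writer = open_writer()
        try:
            writer.add_document(content=verse)
            writer.commit()
        except Exception as error:
            print("Failed on verse %r: %s" % (verse, error))
            return verse
    return None


class FakeWriter(object):
    """Hypothetical writer whose commit fails for one chosen verse."""

    def __init__(self, bad_content):
        self.bad_content = bad_content
        self.pending = []

    def add_document(self, content):
        self.pending.append(content)

    def commit(self):
        if self.bad_content in self.pending:
            raise Exception("simulated commit failure")
        self.pending = []


verses = ["verse 1", "verse 2", "verse 3", "verse 4"]
culprit = find_first_failing_verse(verses, lambda: FakeWriter("verse 3"))
print("First failing verse: %r" % culprit)
```

A commit that fails may only be the point where earlier damage to the index surfaces, so the verse found this way is a lead rather than proof.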

#6 Updated by Luke Murphey over 8 years ago

Observations:

  • The indexing fails at the commit stage
  • Once the error is observed, no more commits will work
  • It is failing on the-deipnosophists-book-10
  • Indexing the-deipnosophists-book-10 works after clearing the indexes; indexing the-deipnosophists-book-11 and Cynegeticus afterwards works too
  • Whoosh 2.4.1 seemed to index fine

#7 Updated by Luke Murphey over 8 years ago

I could try indexing on Windows to see if it reproduces on other platforms.

#8 Updated by Luke Murphey over 8 years ago

Trying different versions of Whoosh:

  1. 2.4.1: works
  2. 2.7.2: fails
  3. 2.5.0: fails with a different error, IndexError: list index out of range, after the log entry: reader.contentsearch: Successfully indexed verse, verse=3, division="35", work="Cynegeticus"

2.5.0 error:

Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/Library/Python/2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/Library/Python/2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Library/Python/2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Library/Python/2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/management/commands/make_search_indexes.py", line 42, in handle
    WorkIndexer.index_all_works()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 165, in index_all_works
    cls.index_work(work, commit=True)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 216, in index_work
    cls.index_division(division, commit=False, writer=writer)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 239, in index_division
    cls.index_verse(verse, division=division, writer=writer, commit=False)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 324, in index_verse
    author        = author_str
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/writing.py", line 752, in add_document
    perdocwriter.add_vector_items(fieldname, field, vitems)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 232, in add_vector_items
    vinfo = vpostwriter.finish_postings()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 648, in finish_postings
    terminfo.add_block(self)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 1083, in add_block
    self._minid = block.min_id()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh3.py", line 776, in min_id
    return self._ids[0]
IndexError: list index out of range

#9 Updated by Luke Murphey over 8 years ago

Going to try disabling term vectors.

#10 Updated by Luke Murphey over 8 years ago

Indexing with Whoosh 2.4.1 fails as well:

Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/Library/Python/2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/Library/Python/2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Library/Python/2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Library/Python/2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/management/commands/make_search_indexes.py", line 42, in handle
    WorkIndexer.index_all_works()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 165, in index_all_works
    cls.index_work(work, commit=True)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/reader/contentsearch.py", line 219, in index_work
    writer.commit()
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filewriting.py", line 502, in commit
    finalsegments = self._merge_segments(mergetype, optimize, merge)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filewriting.py", line 432, in _merge_segments
    return mergetype(self, self.segments)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filewriting.py", line 68, in MERGE_SMALL
    writer.add_reader(reader)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filewriting.py", line 328, in add_reader
    self._merge_per_doc(reader, docmap)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filewriting.py", line 309, in _merge_per_doc
    perdocwriter.add_vector_matcher(fieldname, field, v)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh2.py", line 219, in add_vector_matcher
    self.add_vector_items(fieldname, fieldobj, readitems())
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/codec/whoosh2.py", line 209, in add_vector_items
    self.vindex.add((self.docnum, fieldname), startoffset)
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filetables.py", line 490, in add
    self._add(self.keycoder(key), self.valuecoder(data))
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filetables.py", line 122, in add
    self.add_all(((key, value),))
  File "/Users/lmurphey/Documents/SP/Workspace/TextCritical.com/src/whoosh/filedb/filetables.py", line 346, in add_all
    raise ValueError("Keys must increase: %r .. %r" % (lk, key))
ValueError: Keys must increase: '\x00\x00\x00^\x00\x01' .. '\x00\x00\x00^\x00\x00'
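
The two keys in that message can be decoded to see what went backwards. This is a sketch under one assumption: each 6-byte key is a big-endian 4-byte document number followed by a 2-byte field number. That layout is inferred from the bytes in the message, not confirmed against the Whoosh source.

```python
import struct

# Assumed layout: 4-byte unsigned document number, then 2-byte unsigned
# field number, both big-endian. Inferred from the 6-byte keys above.
KEY_LAYOUT = struct.Struct(">IH")

last_key = b"\x00\x00\x00^\x00\x01"  # first key in the error message
new_key = b"\x00\x00\x00^\x00\x00"   # second key in the error message

last_doc, last_field = KEY_LAYOUT.unpack(last_key)
new_doc, new_field = KEY_LAYOUT.unpack(new_key)

print("last key: document %d, field %d" % (last_doc, last_field))
print("new key:  document %d, field %d" % (new_doc, new_field))
print("keys increase: %s" % (new_key > last_key))
```

Under that assumption both keys belong to document 94, and the writer was handed field 0 after it had already written field 1.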

#11 Updated by Luke Murphey over 8 years ago

Fails on Windows too.

#12 Updated by Luke Murphey over 8 years ago

Trying to index without vectors.

#13 Updated by Luke Murphey over 8 years ago

Debugging:

  • Try on Windows: still fails
  • Try on older versions: different error
  • Try without vectors: works (including no vectors on the content field)
  • Index by verse to identify the bad work

#14 Updated by Luke Murphey over 8 years ago

Indexing without vectors works.

#15 Updated by Luke Murphey over 8 years ago

Ok, I have a fairly minimal repro:

python .\manage.py make_search_indexes -c -w "Cynegeticus" 
python .\manage.py make_search_indexes  -w "the-deipnosophists-book-1" 

After importing Cynegeticus, the index cannot be opened.

#16 Updated by Luke Murphey over 8 years ago

Even smaller repro:

python .\manage.py make_search_indexes -c -w indica

#18 Updated by Luke Murphey over 8 years ago

Employed a workaround, but the entire index still could not be built.

Blew up after:

reader.contentsearch: Successfully indexed division, division="Book 12", work="Laws" 

#19 Updated by Luke Murphey over 8 years ago

I'm changing the analyzer so that it always matches at least one character, to see if that makes a difference.

#20 Updated by Luke Murphey over 8 years ago

Nope, still fails.

#21 Updated by Luke Murphey over 8 years ago

Trying without the analyzer.

#22 Updated by Luke Murphey over 8 years ago

Interesting. Removing the analyzer causes the indexer to blow up much earlier (blows up on Argonautica).

#23 Updated by Luke Murphey over 8 years ago

Logged from file contentsearch.py, line 250
Process SubWriterTask-416:
Traceback (most recent call last):
  File "C:\Program Files (x86)\Python2.7\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "D:\Users\Luke\Workspace\TextCritical.com\src\whoosh\multiproc.py", line 129, in run
    runname, fieldnames, segment = finish_subsegment(writer, k)
  File "D:\Users\Luke\Workspace\TextCritical.com\src\whoosh\multiproc.py", line 49, in finish_subsegment
    runname = writer.pool.runs[0]
IndexError: list index out of range

#24 Updated by Luke Murphey over 8 years ago

Indexing works provided I exclude:
  • laws
  • commentary-on-plato-protagoras-adam
  • speeches-hyperides-english
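
A small sketch of skipping those works during a full rebuild. The three excluded slugs are the ones listed above; works_to_index and the sample slug list are hypothetical, and the real command may select works differently.

```python
# Slugs of the works that had to be excluded for indexing to finish.
EXCLUDED_WORKS = frozenset([
    "laws",
    "commentary-on-plato-protagoras-adam",
    "speeches-hyperides-english",
])


def works_to_index(work_slugs, excluded=EXCLUDED_WORKS):
    """Return the slugs to index, in order, leaving out the excluded ones."""
    return [slug for slug in work_slugs if slug not in excluded]


# Hypothetical sample input; real slugs would come from the database.
selected = works_to_index([
    "indica",
    "laws",
    "cynegeticus",
    "speeches-hyperides-english",
])
print(selected)
```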

#25 Updated by Luke Murphey over 8 years ago

  • Target version changed from 1.3 to 1.3.1

#26 Updated by Luke Murphey over 8 years ago

  • Status changed from New to In Progress

#27 Updated by Luke Murphey over 8 years ago

  • Status changed from In Progress to Closed

#28 Updated by Luke Murphey over 8 years ago

  • % Done changed from 0 to 100
