Bug #2355

Feature #551: Lexicon support: Liddell and Scott (Middle)

Cannot index LSJ

Added by Luke Murphey over 5 years ago. Updated over 5 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 12/22/2018
Due date:
% Done: 100%


Description

python manage.py make_search_indexes -w middlelsj
Creating search indexes for work...
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/Users/lmurphey/venv/django18/lib/python2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/Users/lmurphey/venv/django18/lib/python2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Users/lmurphey/venv/django18/lib/python2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Users/lmurphey/venv/django18/lib/python2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/Users/lmurphey/git/textcritical_net/src/reader/management/commands/make_search_indexes.py", line 60, in handle
    WorkIndexer.index_work(work)
  File "/Users/lmurphey/git/textcritical_net/src/reader/contentsearch.py", line 219, in index_work
    cls.index_division(division, commit=False, writer=writer)
  File "/Users/lmurphey/git/textcritical_net/src/reader/contentsearch.py", line 242, in index_division
    cls.index_verse(verse, division=division, writer=writer, commit=False)
  File "/Users/lmurphey/git/textcritical_net/src/reader/contentsearch.py", line 344, in index_verse
    section       = cls.get_section_index_text(division),
  File "/Users/lmurphey/git/textcritical_net/src/reader/contentsearch.py", line 265, in get_section_index_text
    descriptions.append(unicode(division.get_division_description(use_titles=False).decode("UTF-8")))
  File "/Users/lmurphey/venv/django18/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-8: ordinal not in range(128)

Associated revisions

Revision 955 (diff)
Added by lukemurphey over 5 years ago

Adding test that identifies the error in the lexicon entry encoding

Reference #2355

Revision 956 (diff)
Added by lukemurphey over 5 years ago

Fixed badly encoded string in Division model

Reference #2355

Revision 957 (diff)
Added by lukemurphey over 5 years ago

Adding test for bad unicode data

Reference #2355

Revision 958 (diff)
Added by lukemurphey over 5 years ago

Fixed issue where middle LSJ doesn't index

Reference #2355

History

#1 Updated by Luke Murphey over 5 years ago

It is hanging up on the division without titles, the one titled "*a ἀάατος".
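
The title is consistent with the traceback in the description, which reports unencodable characters at positions 3-8. A minimal sketch that lists the non-ASCII positions in the title. It is written for Python 3, whereas the ticket's environment is Python 2.7, and the NFC normalization is an assumption about how the title is stored:

```python
import unicodedata

# Title copied from the comment above; NFC normalization is an assumption
title = unicodedata.normalize("NFC", "*a ἀάατος")

# Positions of the characters that the ASCII codec cannot encode
non_ascii_positions = [i for i, ch in enumerate(title) if ord(ch) > 127]
print(non_ascii_positions)
```

With composed characters the six Greek letters occupy positions 3 through 8, the same range the UnicodeEncodeError reports.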

#2 Updated by Luke Murphey over 5 years ago

These reproduce the issue:

unicode("ἀάατος")
"ἀάατος".encode("ascii").decode("utf-8")
unicode(u"ἀάατος".decode("utf-8"))
u"ἀάατος".decode("utf-8")
str(u"ἀάατος")

But these do not:

unicode("ἀάατος".decode("utf-8"))
u"ἀάατος".encode("utf-8").decode("utf-8")
str("ἀάατος")
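
A likely explanation for the failing decode calls, based on general Python 2 behaviour rather than anything confirmed in this ticket: calling .decode() on a value that is already unicode first encodes it implicitly with the ASCII codec, which is why a decode call can raise UnicodeEncodeError. The sketch below makes that hidden step explicit. It is written for Python 3, and py2_style_decode is a hypothetical helper, not code from the project:

```python
# Python 3 sketch: spell out the implicit step that Python 2 performs when
# .decode() is called on a value that is already unicode.
text = "ἀάατος"  # already-decoded text, like a Python 2 unicode object


def py2_style_decode(value, encoding="utf-8"):
    """Mimic Python 2's unicode.decode(): encode to ASCII first, then decode."""
    return value.encode("ascii").decode(encoding)


try:
    py2_style_decode(text)
    error_message = None
except UnicodeEncodeError as exc:
    # The failure comes from the hidden encode step, not from the UTF-8 decode
    error_message = str(exc)

print(error_message)
```

The working cases avoid the hidden step: they either decode a byte string, or encode the unicode value to UTF-8 explicitly before decoding it.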

#3 Updated by Luke Murphey over 5 years ago

Questions:
  • What is the division title encoded in?
    • Plain unicode
  • Does changing the division titles to unicode fix it?
  • Is it the decode call or the unicode construction that causes the error?
    • It fails on the decode, not on the constructor call
  • Why does it fail even though the get_division_description() call returns unicode?
    • I confirmed that unicode is returned
  • Does telling the unicode() call the encoding help?
  • Does encoding the join work?
    • No
    • return ",".encode('utf8').join(descriptions)
  • Does removing the unicode call help?
    • No, it just moves the failure later (to get_division_description line 389)
  • Does adding an extra encode() call help (.encode("utf-8").decode("UTF-8"))?
  • Is str on the Division model doing something wrong? It might not be handling strings correctly
  • Why is get_division_description() line 389 trying to force a string?
  • Does removing the decode and unicode calls fix it?
    • No
  • Is it possible that the unicode data is wrong in the database?
    • Seeing django.utils.encoding.DjangoUnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 0: ordinal not in range(128). You passed in <Division: [Bad Unicode data]> (<class 'reader.models.Division'>)
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/Users/lmurphey/venv/django18/lib/python2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/Users/lmurphey/venv/django18/lib/python2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/Users/lmurphey/venv/django18/lib/python2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/Users/lmurphey/venv/django18/lib/python2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/Users/lmurphey/git/textcritical_net/src/reader/management/commands/make_search_indexes.py", line 60, in handle
    WorkIndexer.index_work(work)
  File "/Users/lmurphey/git/textcritical_net/src/reader/contentsearch.py", line 219, in index_work
    cls.index_division(division, commit=False, writer=writer)
  File "/Users/lmurphey/git/textcritical_net/src/reader/contentsearch.py", line 242, in index_division
    cls.index_verse(verse, division=division, writer=writer, commit=False)
  File "/Users/lmurphey/git/textcritical_net/src/reader/contentsearch.py", line 348, in index_verse
    section       = cls.get_section_index_text(division),
  File "/Users/lmurphey/git/textcritical_net/src/reader/contentsearch.py", line 270, in get_section_index_text
    descriptions.append(division.get_division_description(use_titles=True))
  File "/Users/lmurphey/git/textcritical_net/src/reader/models.py", line 389, in get_division_description
    title = str(next_division)
  File "/Users/lmurphey/venv/django18/lib/python2.7/site-packages/django/db/models/base.py", line 503, in __str__
    return force_text(self).encode('utf-8')
  File "/Users/lmurphey/venv/django18/lib/python2.7/site-packages/django/utils/encoding.py", line 102, in force_text
    raise DjangoUnicodeDecodeError(s, *e.args)
django.utils.encoding.DjangoUnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 5: ordinal not in range(128). You passed in <Division: [Bad Unicode data]> (<class 'reader.models.Division'>)
  • Does it work with other works?
    • It does with ad-ammaeum
Observations:
  • Apparently get_division_description() line 394 gets something other than a string at times

#4 Updated by Luke Murphey over 5 years ago

The data in the model appears to be bad:

import reader.models
model = reader.models.Division.objects.get(id=54009)

This outputs:

<Division: [Bad Unicode data]>
model.descriptor.decode('utf8')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/lmurphey/venv/django18/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

#5 Updated by Luke Murphey over 5 years ago

Now that the issue is narrowed down to the

Questions:
  • Is the issue in make_division()?
    • It is probably ImportTransforms.convert_descriptors_from_beta_code()
  • What field is the problem?
    • descriptor field

Observations: *

#6 Updated by Luke Murphey over 5 years ago

This reproduces the issue that I am seeing in contentsearch.py:

descriptions = []
descriptions.append("main ἈΆΑΤΟΣ")
descriptions.append(u"ἀάατος")

",".join(descriptions)

This does not:

descriptions = []
descriptions.append("main ἈΆΑΤΟΣ")
descriptions.append(u"ἀάατος".encode("utf-8"))

",".join(descriptions)

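The difference between the two snippets can be sketched as follows. In Python 2, joining a byte string with a unicode value promotes the byte string to unicode using the ASCII codec, which fails on UTF-8 bytes. The code is written for Python 3 and makes that promotion explicit; py2_style_join is a hypothetical helper used only for illustration:

```python
# Python 3 sketch of what Python 2 does implicitly in ",".join([str, unicode]).
byte_item = "main ἈΆΑΤΟΣ".encode("utf-8")  # like a Python 2 str holding UTF-8 bytes
text_item = "ἀάατος"                       # like a Python 2 unicode literal


def py2_style_join(separator, items):
    """Promote byte strings to text with the ASCII codec, then join."""
    promoted = [i.decode("ascii") if isinstance(i, bytes) else i for i in items]
    return separator.join(promoted)


try:
    py2_style_join(",", [byte_item, text_item])
    join_error = None
except UnicodeDecodeError as exc:
    join_error = exc  # the first non-ASCII byte sits right after "main "

# Encoding the text item keeps every element as bytes, so nothing is promoted
joined_bytes = b",".join([byte_item, text_item.encode("utf-8")])
```

The second form succeeds only because every element is a byte string, so the result is bytes as well.
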
#7 Updated by Luke Murphey over 5 years ago

No exceptions are thrown in contentsearch.py when I do the following:

        descriptions.append(division.get_division_description(use_titles=False).encode("utf-8"))
        descriptions.append(division.get_division_description(use_titles=True).decode("utf-8").encode("utf-8"))

However, Whoosh wants unicode.
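
One way to satisfy that requirement would be to normalize every description to text before joining, instead of encoding to bytes. This is only a sketch under assumptions, not the fix that was committed for this ticket: to_text is a hypothetical helper, the sample values stand in for the results of get_division_description(), and the code is written for Python 3 (in Python 2.7 the check would be against str and the result would be unicode):

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings and leaving text untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Stand-ins for the two descriptions: one byte string and one text value
descriptions = ["main ἈΆΑΤΟΣ".encode("utf-8"), "ἀάατος"]

# Every element is text before the join, so no implicit ASCII conversion occurs
section_text = ",".join(to_text(d) for d in descriptions)
print(section_text)
```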

#8 Updated by Luke Murphey over 5 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100
