Bug #480: Make morphology lookups insensitive to browser differences - TextCritical.net - LukeMurphey.net

Bug #480

Feature #466: Lemma lookup

Task #472: Lemma lookup REST API

Make morphology lookups insensitive to browser differences

Added by Luke Murphey over 12 years ago. Updated over 12 years ago.

Status:

Closed

Priority:

Normal

Assignee:

Luke Murphey

Target version:

0.2

Start date:

12/22/2012

Due date:

% Done:

100%

Description

For some reason, different browsers seem to send different requests when looking up words. For an example, go to Romans 16 and look up ἐκκλησίας in verse one in both Firefox on a desktop and Safari on iOS. The query from the desktop succeeds while the one from the iOS device fails. This does not apply to all words, though.

History

#1 Updated by Luke Murphey over 12 years ago

Subject changed from Fix browser lookup bug to Make morphology lookups insensitive to browser differences

#2 Updated by Luke Murphey over 12 years ago

On iOS the request for Ἀσπάσασθε looks like:

[22/Dec/2012 02:38:47] "GET /api/word_parse/%E1%BC%88%CF%83%CF%80%CE%AC%CF%83%CE%B1%CF%83%CE%B8%CE%B5 HTTP/1.1" 200 2

On the desktop:

[22/Dec/2012 02:38:21] "GET /api/word_parse/%E1%BC%88%CF%83%CF%80%E1%BD%B1%CF%83%CE%B1%CF%83%CE%B8%CE%B5 HTTP/1.1" 200 460

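From what I can tell, the diacritics are represented differently: for the accented alpha, iOS sends %CE%AC (U+03AC, alpha with tonos) while the desktop sends %E1%BD%B1 (U+1F71, alpha with oxia).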
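
To see exactly where the two requests diverge, the percent-encoded paths can be decoded and compared character by character. This is a minimal sketch, assuming Python 3 (the project itself ran on Python 2 at the time); the variable names are only illustrative.

```python
from urllib.parse import unquote
import unicodedata

# Percent-encoded path segments copied from the two log lines above
ios = unquote("%E1%BC%88%CF%83%CF%80%CE%AC%CF%83%CE%B1%CF%83%CE%B8%CE%B5")
desktop = unquote("%E1%BC%88%CF%83%CF%80%E1%BD%B1%CF%83%CE%B1%CF%83%CE%B8%CE%B5")

# Collect (index, iOS character, desktop character) wherever the two differ
differences = [(i, a, b) for i, (a, b) in enumerate(zip(ios, desktop)) if a != b]

for i, a, b in differences:
    print("position %d: U+%04X %s vs U+%04X %s" % (
        i, ord(a), unicodedata.name(a), ord(b), unicodedata.name(b)))
```

Only one character differs between the two strings, and applying NFC to the desktop string yields the iOS string, which suggests iOS normalizes the word before sending it.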

#3 Updated by Luke Murphey over 12 years ago

I tried normalizing the Unicode of the incoming word using all four normalization forms accepted by unicodedata.normalize (http://docs.python.org/2/library/unicodedata.html); none worked.

#4 Updated by Luke Murphey over 12 years ago

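The issue is that the data in the database must be normalized too. The stored forms are not in any of the normalization forms, so normalizing only the query can never produce a match.

άβαις in the database is equivalent to u'\u1f71\u03b2\u03b1\u03b9\u03c2'

Normalizing in Python results in the following, none of which equal the stored value: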

NFC: u'\u03ac\u03b2\u03b1\u03b9\u03c2'
NFKC: u'\u03ac\u03b2\u03b1\u03b9\u03c2'
NFD: u'\u03b1\u0301\u03b2\u03b1\u03b9\u03c2'
NFKD: u'\u03b1\u0301\u03b2\u03b1\u03b9\u03c2'

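However, normalizing the stored value manually and then querying with the normalized form works:

import unicodedata
from reader.models import WordForm

# Fetch the row by its stored (non-normalized) form and rewrite it as NFKC
s = u'\u1f71\u03b2\u03b1\u03b9\u03c2'
wf = WordForm.objects.filter(form=s)[0]
wf.form = unicodedata.normalize("NFKC", s)
wf.save()

# This returns successfully
WordForm.objects.filter(form=unicodedata.normalize("NFKC", s))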
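
The same idea can be checked without a database: once both the stored value and the incoming word go through the same normalization form, they compare equal. This is a minimal sketch, assuming Python 3; `normalize_word` is a hypothetical helper, not a function from the project, and `from_ios` is an assumed example of the tonos spelling a browser might send for this word.

```python
import unicodedata

def normalize_word(word, form="NFKC"):
    """Return word in the given Unicode normalization form (hypothetical helper)."""
    return unicodedata.normalize(form, word)

stored = u'\u1f71\u03b2\u03b1\u03b9\u03c2'    # value as found in the database (alpha with oxia)
from_ios = u'\u03ac\u03b2\u03b1\u03b9\u03c2'  # assumed request value (alpha with tonos)

# The raw strings differ, so an exact-match lookup fails
print(stored == from_ios)

# After normalizing both sides, the comparison succeeds
print(normalize_word(stored) == normalize_word(from_ios))
```

In the lookup itself this would presumably mean passing the incoming word through the same helper before filtering, for example `WordForm.objects.filter(form=normalize_word(word))`, once the stored forms have been normalized.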

#5 Updated by Luke Murphey over 12 years ago

We may also want to store a version of each form without diacritical marks, as a fallback for when a search fails because the diacritics differ. See http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string
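
One common way to produce such a diacritic-free version is to decompose the string (NFD) and drop the combining marks. This is a minimal sketch, assuming Python 3; `strip_diacritics` is a hypothetical name, and nothing in this issue confirms that this approach was adopted.

```python
import unicodedata

def strip_diacritics(text):
    """Return text with all combining marks removed (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Both spellings of the accented alpha reduce to the same bare form
oxia_form = u'\u1f71\u03b2\u03b1\u03b9\u03c2'
tonos_form = u'\u03ac\u03b2\u03b1\u03b9\u03c2'
print(strip_diacritics(oxia_form) == strip_diacritics(tonos_form))
```

A fallback like this matches more loosely than an exact lookup, since words that differ only in accents or breathing marks collapse to the same key.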

#6 Updated by Luke Murphey over 12 years ago

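The following script should convert an existing database to use normalized Unicode:

import unicodedata
import reader.models
from django.db import transaction

# Run all updates in a single transaction so a failure leaves the data untouched.
# Note: commit_on_success was removed in Django 1.8; newer versions use transaction.atomic.
@transaction.commit_on_success
def normalize_database_unicode():
    # Rewrite every stored word form in NFKC so it matches normalized queries
    for wf in reader.models.WordForm.objects.all():
        wf.form = unicodedata.normalize("NFKC", wf.form)
        wf.save()

normalize_database_unicode()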

#7 Updated by Luke Murphey over 12 years ago

Status changed from New to In Progress

#8 Updated by Luke Murphey over 12 years ago

Description updated (diff)
Status changed from In Progress to Closed
% Done changed from 0 to 100
