Bug #480
Feature #466: Lemma lookup
Task #472: Lemma lookup REST API
Make morphology lookups insensitive to browser differences
100%
Description
For some reason, different browsers seem to send different requests when looking up words. For example, go to Romans 16 and look up ἐκκλησίας in verse one on both Firefox on a desktop and Safari on iOS. The query from the desktop succeeds while the query from the iOS device fails. This does not apply to all words, though.
History
#1 Updated by Luke Murphey almost 12 years ago
- Subject changed from Fix browser lookup bug to Make morphology lookups insensitive to browser differences
#2 Updated by Luke Murphey almost 12 years ago
On iOS the request for Ἀσπάσασθε looks like:
[22/Dec/2012 02:38:47] "GET /api/word_parse/%E1%BC%88%CF%83%CF%80%CE%AC%CF%83%CE%B1%CF%83%CE%B8%CE%B5 HTTP/1.1" 200 2
On the desktop:
[22/Dec/2012 02:38:21] "GET /api/word_parse/%E1%BC%88%CF%83%CF%80%E1%BD%B1%CF%83%CE%B1%CF%83%CE%B8%CE%B5 HTTP/1.1" 200 460
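From what I can tell, the diacritics are represented differently: the accented alpha is sent as %CE%AC (U+03AC) by iOS and as %E1%BD%B1 (U+1F71) by the desktop browser. The iOS request returns an empty result (2 bytes) while the desktop request returns data (460 bytes).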
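To confirm where the two requests differ, the percent-encoded paths from the log lines can be decoded and compared code point by code point. This is a minimal sketch in Python 3 (the ticket itself dates from Python 2); the helper name differing_code_points is made up for illustration.

```python
import unicodedata
from urllib.parse import unquote

# Percent-encoded path segments copied from the two log lines in this note.
IOS_PATH = "%E1%BC%88%CF%83%CF%80%CE%AC%CF%83%CE%B1%CF%83%CE%B8%CE%B5"
DESKTOP_PATH = "%E1%BC%88%CF%83%CF%80%E1%BD%B1%CF%83%CE%B1%CF%83%CE%B8%CE%B5"

# unquote() decodes the escapes as UTF-8 by default.
ios_word = unquote(IOS_PATH)
desktop_word = unquote(DESKTOP_PATH)


def differing_code_points(a, b):
    """Return (index, name_in_a, name_in_b) for each position where a and b differ.

    Hypothetical helper; assumes both strings have the same length.
    """
    return [
        (i, unicodedata.name(x), unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


for index, ios_name, desktop_name in differing_code_points(ios_word, desktop_word):
    print(index, ios_name, "vs", desktop_name)
```

Only the accented alpha differs: iOS sends the "tonos" code point from the basic Greek block, while the desktop sends the "oxia" code point from the Greek Extended block.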
#3 Updated by Luke Murphey almost 12 years ago
I tried normalizing the Unicode of the incoming word using all four normalization forms supported by unicodedata.normalize (http://docs.python.org/2/library/unicodedata.html); none worked.
#4 Updated by Luke Murphey almost 12 years ago
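The issue is that the data in the database is not normalized, so normalizing only the incoming word cannot produce a match. The stored data must be normalized as well.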
άβαις in the database is equivalent to u'\u1f71\u03b2\u03b1\u03b9\u03c2'
Normalizing that value in Python results in the following, none of which is equal to the stored value:
- NFC: u'\u03ac\u03b2\u03b1\u03b9\u03c2'
- NFKC: u'\u03ac\u03b2\u03b1\u03b9\u03c2'
- NFD: u'\u03b1\u0301\u03b2\u03b1\u03b9\u03c2'
- NFKD: u'\u03b1\u0301\u03b2\u03b1\u03b9\u03c2'
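The list above can be reproduced with a short check. This is a sketch in Python 3 (so the u'' prefixes are dropped); the variable names are illustrative only.

```python
import unicodedata

# Value of the form as stored in the database, per this note.
stored = "\u1f71\u03b2\u03b1\u03b9\u03c2"

# Apply each of the four normalization forms to the stored value.
forms = {
    name: unicodedata.normalize(name, stored)
    for name in ("NFC", "NFKC", "NFD", "NFKD")
}

for name, value in forms.items():
    # ascii() prints the escaped code points so the forms can be compared by eye.
    print(name, ascii(value), value == stored)
```

U+1F71 has a canonical decomposition to U+03AC, so every normalization form replaces it. No normalized query can therefore equal the raw stored string.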
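However, normalizing the stored value manually and then querying with the same normalization works:
import unicodedata
from reader.models import WordForm

s = u'\u1f71\u03b2\u03b1\u03b9\u03c2'
# Fetch the row using the raw (unnormalized) value, then store the NFKC form
wf = WordForm.objects.filter(form=s)[0]
wf.form = unicodedata.normalize("NFKC", s)
wf.save()
# This returns successfully
WordForm.objects.filter(form=unicodedata.normalize("NFKC", s))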
#5 Updated by Luke Murphey almost 12 years ago
May want to store versions without diacritic marks in case a search fails because the diacritics are different. See http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string
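A minimal sketch of the diacritic-free fallback, assuming the approach from the linked Stack Overflow question (decompose, then drop combining marks). The function name strip_diacritics is hypothetical and is not part of the reader code; Python 3 syntax.

```python
import unicodedata


def strip_diacritics(word):
    """Return word with accents, breathings and other combining marks removed.

    Hypothetical helper for illustration only.
    """
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # combining() returns a non-zero class for combining marks, 0 for base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose anything that is left so the result is in a single, stable form.
    return unicodedata.normalize("NFC", stripped)


# Both encodings of the accented alpha reduce to the same bare form.
print(strip_diacritics("\u1f71\u03b2\u03b1\u03b9\u03c2") ==
      strip_diacritics("\u03ac\u03b2\u03b1\u03b9\u03c2"))
```

A bare form like this could be stored in a second column and used only when the exact lookup finds nothing, since it also merges words that differ only by accent.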
#6 Updated by Luke Murphey almost 12 years ago
The following script should convert an existing database to use normalized Unicode:
import unicodedata
import reader.models
from django.db import transaction

@transaction.commit_on_success
def normalize_database_unicode():
    for wf in reader.models.WordForm.objects.all():
        wf.form = unicodedata.normalize("NFKC", wf.form)
        wf.save()

normalize_database_unicode()
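Once the stored forms are normalized, the word taken from the request needs the same NFKC treatment before the query, as in note #4. The sketch below shows the idea without Django: a plain dict stands in for the WordForm table, and normalize_lookup_key and lookup are hypothetical names, not the actual view code. Python 3 syntax.

```python
import unicodedata


def normalize_lookup_key(word):
    """Normalize a word the same way the conversion script normalizes stored forms."""
    return unicodedata.normalize("NFKC", word)


# Stand-in for the WordForm table after the conversion script has run:
# keys are already NFKC-normalized. The value is a placeholder, not real data.
word_forms = {
    normalize_lookup_key("\u1f71\u03b2\u03b1\u03b9\u03c2"): "placeholder entry",
}


def lookup(word):
    """Return the entry for word, or None, regardless of how the browser encoded it."""
    return word_forms.get(normalize_lookup_key(word))


# The oxia, tonos and fully decomposed spellings all reach the same entry.
for variant in ("\u1f71\u03b2\u03b1\u03b9\u03c2",
                "\u03ac\u03b2\u03b1\u03b9\u03c2",
                "\u03b1\u0301\u03b2\u03b1\u03b9\u03c2"):
    print(ascii(variant), lookup(variant))
```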
#7 Updated by Luke Murphey almost 12 years ago
- Status changed from New to In Progress
#8 Updated by Luke Murphey almost 12 years ago
- Description updated
- Status changed from In Progress to Closed
- % Done changed from 0 to 100