Feature #526

Add ability to search beta-code

Added by Luke Murphey almost 12 years ago. Updated almost 12 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date:
Due date:
% Done: 100%


Description

Provide users with the ability to search using beta-code. In the best-case scenario, the results would highlight the sections that match the beta-code.

This could be done by:

  1. Converting the beta-code to a Greek character set in the search
  2. Providing a mechanism for people to select Greek chars (like a drop-down Greek keyboard)
  3. Indexing a beta-code version of the content

The first and second options are nice because users can see the actual Greek content they are searching for and we don't have to index more content (which would make the indexes larger). Also, the search results will automatically highlight the Greek characters.
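A minimal sketch of option 1 (converting beta-code to Greek before searching), using only the Python standard library. The tables and the `beta_to_unicode` name are illustrative assumptions, not the project's actual converter, and capitals marked with `*` are not handled.

```python
import re
import unicodedata

# Partial beta-code letter table (illustrative, lowercase only).
LETTERS = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ",
    "h": "η", "q": "θ", "i": "ι", "k": "κ", "l": "λ", "m": "μ",
    "n": "ν", "c": "ξ", "o": "ο", "p": "π", "r": "ρ", "s": "σ",
    "t": "τ", "u": "υ", "f": "φ", "x": "χ", "y": "ψ", "w": "ω",
}

# Beta-code diacritics follow the letter; map them to combining marks.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def beta_to_unicode(beta):
    """Convert a lowercase-only beta-code string to composed (NFC) Greek."""
    out = []
    for ch in beta.lower():
        if ch in LETTERS:
            out.append(LETTERS[ch])
        elif ch in DIACRITICS:
            out.append(DIACRITICS[ch])
        else:
            out.append(ch)  # pass spaces, punctuation, etc. through unchanged
    text = unicodedata.normalize("NFC", "".join(out))
    # Use the final form of sigma when no word character follows it.
    return re.sub(r"σ(?!\w)", "ς", text)
```

With these tables, `beta_to_unicode("NO/MOU")` gives "νόμου" and `beta_to_unicode("pa/sxa")` gives "πάσχα". Composing to NFC at the end keeps the output in one consistent Unicode form.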


Related issues

Blocks TextCritical.net - Feature #528: Search help page (Closed, 02/17/2013)

History

#1 Updated by Luke Murphey almost 12 years ago

Could also add a button to convert the beta-code to Unicode.

#4 Updated by Luke Murphey almost 12 years ago

  • % Done changed from 0 to 30

I have variations partially working. The only problem is that the query parser is splitting the query based on non-word characters such as slashes. This means that a search for "NO/MOU" results in a search for "no" and "mou" as opposed to expanding into "νόμου".

#5 Updated by Luke Murphey almost 12 years ago

I see information about how the Tokenizers split up the content for searching but I cannot find docs about how the searches are split up.

I think the SpaceSeparatedTokenizer is the correct one to use.

#8 Updated by Luke Murphey almost 12 years ago

It looks like the query parser uses the analyzer associated with the schema to parse the search. See QueryParser.__init__:

:param schema: a :class:`whoosh.fields.Schema` object to use when
parsing. The appropriate fields in the schema will be used to
tokenize terms/phrases before they are turned into query objects.
You can specify None for the schema to create a parser that does
not analyze the text of the query, usually for testing purposes.

#9 Updated by Luke Murphey almost 12 years ago

Changing the regular expression that SimpleAnalyzer uses to "[\w/*()=\+|&']+(\.?\w+)*" does get the unit tests to pass. However, the searches still fail. I'm guessing that the analyzer gets persisted in the index file and thus the indexes have to be re-created.
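To see what the changed expression does to a query, here is a sketch using only Python's `re` module rather than Whoosh itself. `DEFAULT_PATTERN` is my assumption about the analyzer's original expression (check it against the installed Whoosh version), and `tokenize` is a hypothetical helper that also lowercases, as I'd expect a simple analyzer to do.

```python
import re

# Assumed original token pattern: word characters with optional inner dots.
DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")

# Pattern from the comment above: also accepts beta-code marks such as / ( ) = + |
BETA_PATTERN = re.compile(r"[\w/*()=\+|&']+(\.?\w+)*")


def tokenize(pattern, text):
    """Return the lowercased tokens that the pattern finds in text."""
    return [m.group(0).lower() for m in pattern.finditer(text)]


default_tokens = tokenize(DEFAULT_PATTERN, "NO/MOU")  # split at the slash
beta_tokens = tokenize(BETA_PATTERN, "NO/MOU")        # kept as one token
```

Here `default_tokens` is `['no', 'mou']` and `beta_tokens` is `['no/mou']`. One thing to watch: the character class contains an escaped `+` but no literal backslash, so a beta-code grave accent (`\`) would still end a token.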

#10 Updated by Luke Murphey almost 12 years ago

Re-indexing the database didn't seem to help. A search for "pa/sxa" returns no results even though it is converted to "πάσχα". Oddly enough, searching for "πάσχα" does return results. The two strings actually differ at the byte level (UTF-8):

CF 80 E1 BD B1 CF 83 CF 87 CE B1
CF 80 CE AC    CF 83 CF 87 CE B1

Somehow, the accented alpha is being saved as a different Unicode code point: E1 BD B1 is U+1F71, while CE AC is U+03AC.
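A quick way to inspect the difference is to decode both byte sequences and name the code points that differ. This sketch uses only the standard library; `describe` is a hypothetical helper.

```python
import unicodedata

# The two UTF-8 byte sequences from the comment above.
first = bytes.fromhex("CF 80 E1 BD B1 CF 83 CF 87 CE B1").decode("utf-8")
second = bytes.fromhex("CF 80 CE AC CF 83 CF 87 CE B1").decode("utf-8")


def describe(ch):
    """Format one character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Pair up the characters and keep only the positions where they differ.
differing = [(describe(a), describe(b)) for a, b in zip(first, second) if a != b]
```

Only the second character differs: one string carries GREEK SMALL LETTER ALPHA WITH OXIA (U+1F71) and the other GREEK SMALL LETTER ALPHA WITH TONOS (U+03AC). They render alike but compare as unequal.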

#11 Updated by Luke Murphey almost 12 years ago

Normalizing the content before saving it in the index worked. The only problem is that highlighting doesn't happen.

#12 Updated by Luke Murphey almost 12 years ago

  • % Done changed from 30 to 60

#13 Updated by Luke Murphey almost 12 years ago

The analyzer does seem to be pickled. I'll need to recreate the indexes from scratch.

#14 Updated by Luke Murphey almost 12 years ago

I rebuilt the indexes. The search works provided I apply Unicode normalization to the search term. The problem is that the terms are not being highlighted, either with a search for πάσχα or pa/sxa.

#15 Updated by Luke Murphey almost 12 years ago

Hmm, not sure what is going on here. Highlighting is inconsistent. ἑορτὴ does get highlighted.

#16 Updated by Luke Murphey almost 12 years ago

Somehow the content is getting stored in a different format in the indexes. I normalized the content before providing it to the highlights function, and this fixed the problem.
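The effect can be sketched without the search library: if the stored text and the search term use different code points for the same accented letter, a plain match fails until both are put in the same normalization form. The `highlight` function below is a hypothetical stand-in for the real highlights function, and NFC is an assumed choice of form.

```python
import unicodedata


def nfc(text):
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def highlight(content, term, start="[", end="]"):
    """Wrap each occurrence of term in content, comparing both in NFC."""
    content, term = nfc(content), nfc(term)
    return content.replace(term, f"{start}{term}{end}")


stored = "\u03c0\u1f71\u03c3\u03c7\u03b1"  # alpha with oxia (U+1F71)
query = "\u03c0\u03ac\u03c3\u03c7\u03b1"   # alpha with tonos (U+03AC)

found_raw = query in stored            # False: the code points differ
marked = highlight(stored, query)      # matches once both sides are normalized
```

Whatever form is chosen, it has to be applied in all three places: the indexed content, the search term, and the text passed to the highlighter.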

#17 Updated by Luke Murphey almost 12 years ago

Nice, it looks like the beta-code causes the text to get highlighted correctly now. Yay

#18 Updated by Luke Murphey almost 12 years ago

  • % Done changed from 60 to 80

I wonder: could I use the variations to search for all forms of a word? If so, then I could allow users to look up all possible forms of a given word.

#19 Updated by Luke Murphey almost 12 years ago

  • Status changed from New to Closed
  • % Done changed from 80 to 100
