Feature #1224

Morphology tool word search

Added by Luke Murphey about 8 years ago. Updated about 8 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 03/04/2016
Due date:
% Done: 100%


Description

Add the ability to do a word search from the morphology tool that would provide:

  • Count of this word in the work
  • Count of this word in the section
  • Pie chart of the forms (if searching for related forms)
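
As a rough illustration of the three statistics listed above, the sketch below tallies them from plain (section, text) pairs. The `word_stats` helper, the input shape, the whitespace tokenization, and the sample data are all assumptions made for illustration; this is not the TextCritical.net implementation.

```python
from collections import Counter

def word_stats(verses, forms, section):
    """Count occurrences of any of `forms` in the whole work and in one section.

    verses: iterable of (section_name, text) pairs (hypothetical input shape).
    forms: set of word forms to treat as matches.
    section: name of the section to report on separately.
    """
    work_counts = Counter()
    section_counts = Counter()
    for section_name, text in verses:
        for token in text.split():
            if token in forms:
                work_counts[token] += 1
                if section_name == section:
                    section_counts[token] += 1
    return {
        "count_in_work": sum(work_counts.values()),
        "count_in_section": sum(section_counts.values()),
        "forms": dict(work_counts),  # per-form counts, e.g. for a pie chart
    }

# Made-up sample data, not real verse text
verses = [
    ("Galatians 2", "νόμου νόμον ἔργων"),
    ("Galatians 3", "νόμος νόμου"),
    ("Romans 3", "νόμου"),
]
stats = word_stats(verses, {"νόμος", "νόμου", "νόμον"}, "Galatians 3")
```

The `forms` entry is the piece a charting library would consume; the two totals correspond to the work and section counts.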

Subtasks

Task #1235: Add JavaScript charting library (Closed, Luke Murphey)

Task #1236: Add endpoint to return stats (Closed, Luke Murphey)

Task #1237: Add tabs to switch between results, stats and help (Closed, Luke Murphey)

Feature #1239: Chart of works matched (Closed, Luke Murphey)

Feature #1240: Show message when chart has no data (Closed, Luke Murphey)


Related issues

Related to TextCritical.net - Bug #1250: Word frequency chart is incorrect (Closed, 03/07/2016)

History

#1 Updated by Luke Murphey about 8 years ago

It would be nice to be able to do this from the search tool too. That way one could get counts for several words in an OR search. Another option would be to just add a count of word matches to the search page (right now it lists verse matches, not word matches).

#2 Updated by Luke Murphey about 8 years ago

  • Target version changed from 3.0 to 1.3

#3 Updated by Luke Murphey about 8 years ago

If this is on the search page then this could include:

  • The various forms if similar forms are searched for
  • The count of forms by division if multiple are returned
  • The count of each search term that matched

#4 Updated by Luke Murphey about 8 years ago

Not sure how the highlights work or whether the forms can be extracted.

#5 Updated by Luke Murphey about 8 years ago

Doing some testing to see what the results object includes:

# Python 2
from whoosh.qparser import QueryParser
from reader.contentsearch import *

inx = WorkIndexer.get_index()
parser = QueryParser("content", inx.schema, termclass=GreekBetaCodeVariations)

searcher = inx.searcher()

def do_search(search_text):
    # Parse the query, then fetch page 1 (20 results per page);
    # terms=True asks Whoosh to record which terms matched
    search_query = parser.parse(unicode(search_text))
    print search_query
    r = searcher.search_page(search_query, 1, 20, terms=True)
    print len(r.results)
    return r

r = do_search(u'NO/MOU')

This example provides the matched terms:

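# Python 2
from whoosh.qparser import QueryParser
from reader.contentsearch import *

inx = WorkIndexer.get_index()
parser = QueryParser("content", inx.schema, termclass=GreekVariations)

searcher = inx.searcher()

def do_search(search_text):
    # Parse the query, then fetch page 1 (20 results per page);
    # terms=True asks Whoosh to record which terms matched
    search_query = parser.parse(unicode(search_text))
    print search_query
    r = searcher.search_page(search_query, 1, 20, terms=True)
    print len(r.results)
    return r

r = do_search(u'NO/MOU')
# List the terms that matched in the results
r.results.matched_terms()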

#6 Updated by Luke Murphey about 8 years ago

Whoosh doesn't seem to store the number of matches in the matched terms. Perhaps a collector could do this: http://whoosh.readthedocs.org/en/latest/api/collectors.html

#7 Updated by Luke Murphey about 8 years ago

frequency() and doc_frequency() are helpful:

Number of occurrences of the term:
searcher.frequency("content", u'εἰς')

Number of verses that include it:
searcher.doc_frequency("content", u'εἰς')
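
The difference between the two calls can be illustrated without an index: one counts every occurrence of a term, the other counts the documents (verses here) that contain it at least once. The sketch below mimics that distinction over a plain list of strings. The helper names and sample data are made up, and the code does not call Whoosh.

```python
def term_frequency(docs, term):
    """Total occurrences of `term` across all documents."""
    return sum(doc.split().count(term) for doc in docs)

def document_frequency(docs, term):
    """Number of documents containing `term` at least once."""
    return sum(1 for doc in docs if term in doc.split())

# Made-up documents; each string stands in for one verse
docs = ["εἰς a εἰς b", "c d", "e εἰς"]
total = term_frequency(docs, "εἰς")        # 3 occurrences
n_docs = document_frequency(docs, "εἰς")   # 2 documents
```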

#8 Updated by Luke Murphey about 8 years ago

s = searcher.postings("content", u'εἰς')
i = s.all_items()
i.next()

See http://whoosh.readthedocs.org/en/latest/recipes.html?highlight=doc_frequency

Might want to consider:
  1. term vectors
  2. Iterating through postings

#9 Updated by Luke Murphey about 8 years ago

Might be able to:

  1. Use searcher.postings() to match a term
  2. Skip to the ID of the earliest verse in the document (presumes that documents are imported in order)
  3. Iterate through results until the last ID within the document (or some limit)
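
The steps above can be sketched over a plain sorted list instead of a real postings object. The `(doc_id, count)` pairs, the `count_in_range` helper, and the assumption that a work's verses occupy a contiguous ID range are all hypothetical; this only shows the skip-then-scan idea, not Whoosh's API.

```python
from bisect import bisect_left

def count_in_range(postings, first_id, last_id):
    """Sum per-document hit counts for doc IDs in [first_id, last_id].

    postings: list of (doc_id, count) pairs sorted by doc_id, a stand-in
    for what iterating a term's postings might yield.
    """
    ids = [doc_id for doc_id, _ in postings]
    start = bisect_left(ids, first_id)  # step 2: skip to the earliest verse
    total = 0
    for doc_id, count in postings[start:]:
        if doc_id > last_id:            # step 3: stop after the last ID
            break
        total += count
    return total

# Made-up postings for one term
postings = [(3, 1), (10, 2), (11, 1), (40, 5)]
hits = count_in_range(postings, 10, 20)  # only IDs 10 and 11 fall in range
```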

#10 Updated by Luke Murphey about 8 years ago

It turns out that Whoosh doesn't seem to indicate the number of hits within a document. Instead, it only highlights them when you provide the content.

#11 Updated by Luke Murphey about 8 years ago

A raw SQL query works surprisingly well:

select * from reader_verse
inner join reader_division on reader_verse.division_id = reader_division.id
inner join reader_work on reader_work.id = reader_division.work_id
where
reader_work.title_slug = 'new-testament'
AND reader_verse.content like '%καὶ%'
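
To turn a query like this into word counts, one option is to let SQL find the matching verses and count occurrences in application code. The sketch below does this against an in-memory SQLite database. The table and column names are taken from the query above, but the reduced schema, the sample rows, and the `count_word` helper are assumptions for illustration.

```python
import sqlite3

# Minimal stand-in schema with only the columns the query uses
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reader_work (id INTEGER PRIMARY KEY, title_slug TEXT);
    CREATE TABLE reader_division (id INTEGER PRIMARY KEY, work_id INTEGER);
    CREATE TABLE reader_verse (id INTEGER PRIMARY KEY, division_id INTEGER, content TEXT);
    INSERT INTO reader_work VALUES (1, 'new-testament'), (2, 'other-work');
    INSERT INTO reader_division VALUES (1, 1), (2, 2);
    INSERT INTO reader_verse VALUES
        (1, 1, 'καὶ a καὶ b'),
        (2, 1, 'c d'),
        (3, 2, 'καὶ e');
""")

def count_word(conn, title_slug, word):
    """Return (matching_verses, total_occurrences) for `word` in one work."""
    rows = conn.execute(
        """
        SELECT reader_verse.content FROM reader_verse
        INNER JOIN reader_division ON reader_verse.division_id = reader_division.id
        INNER JOIN reader_work ON reader_work.id = reader_division.work_id
        WHERE reader_work.title_slug = ? AND reader_verse.content LIKE ?
        """,
        (title_slug, "%" + word + "%"),
    ).fetchall()
    # LIKE only selects verses; occurrences are counted in Python
    return len(rows), sum(content.count(word) for (content,) in rows)

verses, occurrences = count_word(conn, "new-testament", "καὶ")
```

Note that both `LIKE` and `str.count` match substrings, so a word embedded in a longer word would also be counted; a real implementation would likely need word-boundary handling.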

#12 Updated by Luke Murphey about 8 years ago

r.results.termdocs indicates the terms that matched

#13 Updated by Luke Murphey about 8 years ago

Things here don't make sense.

work:"New Testament" section:"Galatians" νόμον

This returns 25 matches (counted 32) and:
νόμον: 16
νόμος: 7
νόμου: 8
νόμῳ: 4

work:"New Testament" section:"Galatians" νόμος

This returns 7 matches and:
νόμος: 7

work:"New Testament" section:"Galatians" νόμῳ

This returns 25 (counted 32 again) matches and:
νόμον: 8
νόμος: 7
νόμου: 8
νόμῳ: 8

#14 Updated by Luke Murphey about 8 years ago

It turns out that only one form of νόμος is being found: ΝΟΜΟΣ

#15 Updated by Luke Murphey about 8 years ago

Also getting too many responses. For example, the following returns 9 verses but 11 matches:

(work:new-testament)  xaris  (section:"Galatians 1" OR section:"Galatians 2" OR section:"Galatians 3" OR section:"Galatians 4" OR section:"Galatians 5" OR section:"Galatians 6")

Four instances of χάρις are matched while I can only find 2. Matching on χάρις directly only returns two (either with or without diacritics). Also, no variations of χάρις are being found.

I'm wondering if variations include duplicates?
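
If the generated list of variations did contain duplicates, summing per-form counts over that list would inflate the total. The sketch below only illustrates that hypothesis with made-up data; the helper names, the variation list, and the counts are assumptions, and this is not a diagnosis of the actual behavior.

```python
def total_matches(variations, counts):
    """Sum per-form counts over the list of variations, as given."""
    return sum(counts.get(form, 0) for form in variations)

def total_matches_deduplicated(variations, counts):
    """Same sum, but each distinct form contributes only once."""
    return sum(counts.get(form, 0) for form in set(variations))

counts = {"χάρις": 2}             # hypothetical per-form count
variations = ["χάρις", "χάρις"]   # hypothetical duplicated variation

inflated = total_matches(variations, counts)
deduped = total_matches_deduplicated(variations, counts)
```

Comparing the two totals for a given query would be one way to check whether duplicates are involved.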

#16 Updated by Luke Murphey about 8 years ago

These counts don't make complete sense.

The following indicates four matches each for νόμον and εργόν. However, there are 7 instances of νόμου.

(work:new-testament) (νόμον εργόν) (section:"galatians 1" OR section:"Galatians 2" OR section:"Galatians 3" OR section:"Galatians 4" OR section:"Galatians 5" OR section:"Galatians 6")

#17 Updated by Luke Murphey about 8 years ago

This seems to have something to do with variations.

The following indicates 10 matches in six verses (all of the search terms are exact matches):

(work:new-testament) (ἔργων) (section:"galatians 1" OR section:"Galatians 2" OR section:"Galatians 3" OR section:"Galatians 4" OR section:"Galatians 5" OR section:"Galatians 6")

The following only shows 6 matches in six verses (using variations):

(work:new-testament) (εργόν) (section:"galatians 1" OR section:"Galatians 2" OR section:"Galatians 3" OR section:"Galatians 4" OR section:"Galatians 5" OR section:"Galatians 6")

#18 Updated by Luke Murphey about 8 years ago

Wow, this matches 16 in six verses:

(work:new-testament) (εργόν OR ἔργων) (section:"galatians 1" OR section:"Galatians 2" OR section:"Galatians 3" OR section:"Galatians 4" OR section:"Galatians 5" OR section:"Galatians 6")

#19 Updated by Luke Murphey about 8 years ago

  • Status changed from New to Closed

#20 Updated by Luke Murphey about 8 years ago

  • Related to Bug #1250: Word frequency chart is incorrect added
