
Bug #1250

Word frequency chart is incorrect

Added by Luke Murphey about 8 years ago. Updated about 8 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 03/07/2016
Due date:
% Done: 100%


Related issues

Related to TextCritical.net - Feature #1224: Morphology tool word search Closed 03/04/2016

History

#1 Updated by Luke Murphey about 8 years ago

  • The match counts represent the number of matching verses, not the number of matches within those verses. See (ὅτι OR ἔργων) (section:"Galatians 2").
  • The variations are throwing off the results count
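
The discrepancy can be illustrated with a toy example (hypothetical verse data, plain Python; the real counts come from the Whoosh index):

```python
# Hypothetical verses: the chart should count occurrences, not matching verses.
verses = {
    "Galatians 2:16": "εργων εργων οτι",  # the term appears twice here
    "Galatians 2:17": "οτι",
    "Galatians 2:18": "αλλος",
}

term = "εργων"
# What the chart currently shows: how many verses contain the term.
matching_verses = sum(1 for text in verses.values() if term in text.split())
# What it should show: how many times the term occurs in total.
total_occurrences = sum(text.split().count(term) for text in verses.values())

print(matching_verses, total_occurrences)  # 1 2
```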

#2 Updated by Luke Murphey about 8 years ago

Some solutions:

  1. Kick off a different search that introspects the results and looks up the results manually.

See https://whoosh.readthedocs.org/en/latest/api/reading.html and http://stackoverflow.com/questions/35565900/how-do-i-get-the-list-of-all-terms-in-a-whoosh-index.

#3 Updated by Luke Murphey about 8 years ago

I wonder if term_info(fieldname, text) can do what I want. It has a max_weight() function, which suggests weights could be used to get these counts; weight() can be used to find the term's total frequency across all documents.

#4 Updated by Luke Murphey about 8 years ago

I could list the terms that did not match with the following:

q.all_terms() - results.terms()
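
Both sides yield (fieldname, text) tuples in Whoosh (Results.terms() requires running the search with terms=True), so the unmatched terms fall out as a set difference. A plain-Python illustration with hypothetical terms:

```python
# Hypothetical (fieldname, text) tuples as Whoosh would return them.
query_terms = {("content", "οτι"), ("content", "εργων"), ("content", "νομος")}
matched_terms = {("content", "οτι"), ("content", "εργων")}

# Terms the user searched for that matched nothing:
unmatched = query_terms - matched_terms
print(unmatched)  # {('content', 'νομος')}
```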

#5 Updated by Luke Murphey about 8 years ago

There is a score() function that could likely indicate the total raw number of matches.

#7 Updated by Luke Murphey about 8 years ago

This will get the bag-of-words (it needs vector=True in the schema):

from reader.contentsearch import *
inx = WorkIndexer.get_index()
searcher = inx.searcher()

from reader.models import *
vs = Verse.objects.filter(division__work__title_slug='new-testament')
docnum = searcher.document_number(verse_id=vs[0].id)
g = searcher.vector(docnum, "content").items_as("frequency")

# (term, frequency) pairs for this document
for h in g:
    print(h[0], h[1])

searcher.idf("no_diacritics", "και")     # Returns 1.4421964751262426
searcher.idf("no_diacritics", "κυριος")  # Returns 4.8326777289115554

#8 Updated by Luke Murphey about 8 years ago

I wonder if I could use a collector for this: search_with_collector (http://whoosh.readthedocs.org/en/latest/api/searching.html)

#9 Updated by Luke Murphey about 8 years ago

python manage.py make_search_indexes -w new-testament -c

#10 Updated by Luke Murphey about 8 years ago

I'm starting to think I should implement a word summary on the morphological dialog that lists:

  • Count of this word in the current division
  • Count of this word in the current work
  • Count of this word's related forms in the current division
  • Count of this word's related forms in the current work
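
A rough sketch of computing such a summary (plain Python over hypothetical occurrence data; in practice the counts would come from the index):

```python
# Hypothetical occurrences as (work, division, surface_form) tuples.
occurrences = [
    ("new-testament", "Galatians 2", "νομον"),
    ("new-testament", "Galatians 2", "νομου"),
    ("new-testament", "Galatians 3", "νομον"),
    ("new-testament", "Romans 3", "νομος"),
]
related_forms = {"νομον", "νομου", "νομος"}  # forms of the same lemma

def word_summary(word, division, work):
    """Return the four counts proposed for the morphological dialog."""
    return {
        "word_in_division": sum(1 for w, d, f in occurrences
                                if w == work and d == division and f == word),
        "word_in_work": sum(1 for w, d, f in occurrences
                            if w == work and f == word),
        "related_in_division": sum(1 for w, d, f in occurrences
                                   if w == work and d == division
                                   and f in related_forms),
        "related_in_work": sum(1 for w, d, f in occurrences
                               if w == work and f in related_forms),
    }

summary = word_summary("νομον", "Galatians 2", "new-testament")
```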

#11 Updated by Luke Murphey about 8 years ago

I could use document_numbers() to get a list of the documents within a given division or work and then look for the related words:

for doc in searcher.document_numbers(work="new-testament"):
    print doc

#12 Updated by Luke Murphey about 8 years ago

#13 Updated by Luke Murphey about 8 years ago

The following is a good test case:

work:"New Testament" section:"Galatians" νόμον

#14 Updated by Luke Murphey about 8 years ago

With stored=False, the search indexes for the New Testament are 18.7 MB.

With stored=True, they are 20.1 MB.

#15 Updated by Luke Murphey about 8 years ago

  • Status changed from New to Closed

#16 Updated by Luke Murphey about 8 years ago

  • % Done changed from 0 to 100
