
Bug #1250

Word frequency chart is incorrect

Added by Luke Murphey about 8 years ago. Updated about 8 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 03/07/2016
Due date:
% Done: 100%


Related issues

Related to TextCritical.net - Feature #1224: Morphology tool word search Closed 03/04/2016

History

#1 Updated by Luke Murphey about 8 years ago

  • The match counts represent the number of matching verses, not the number of matches within those verses. See (ὅτι OR ἔργων) (section:"Galatians 2").
  • The variations are throwing off the results count
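
The discrepancy can be illustrated with a toy example (hypothetical verse data, plain Python; the real counts come from the Whoosh index):

```python
# Hypothetical verses: the chart should count occurrences, not matching verses.
verses = {
    "Galatians 2:16": "εργων εργων οτι",  # the term appears twice here
    "Galatians 2:17": "οτι",
    "Galatians 2:18": "αλλος",
}

term = "εργων"
# What the chart currently shows: how many verses contain the term.
matching_verses = sum(1 for text in verses.values() if term in text.split())
# What it should show: how many times the term occurs in total.
total_occurrences = sum(text.split().count(term) for text in verses.values())

print(matching_verses, total_occurrences)  # 1 2
```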

#2 Updated by Luke Murphey about 8 years ago

Some solutions:

  1. Kick off a different search that introspects the results and looks up the results manually.

See https://whoosh.readthedocs.org/en/latest/api/reading.html and http://stackoverflow.com/questions/35565900/how-do-i-get-the-list-of-all-terms-in-a-whoosh-index.

#3 Updated by Luke Murphey about 8 years ago

I wonder if term_info(fieldname, text) can do what I want. It has a max_weight() function, which suggests weights could be used to get these counts; weight() can be used to find the term's total frequency across all documents.

#4 Updated by Luke Murphey about 8 years ago

I could list the terms that did not match with the following:

q.all_terms() - results.terms()
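
Both sides yield (fieldname, text) tuples in Whoosh (Results.terms() requires running the search with terms=True), so the unmatched terms fall out as a set difference. A plain-Python illustration with hypothetical terms:

```python
# Hypothetical (fieldname, text) tuples as Whoosh would return them.
query_terms = {("content", "οτι"), ("content", "εργων"), ("content", "νομος")}
matched_terms = {("content", "οτι"), ("content", "εργων")}

# Terms the user searched for that matched nothing:
unmatched = query_terms - matched_terms
print(unmatched)  # {('content', 'νομος')}
```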

#5 Updated by Luke Murphey about 8 years ago

There is a score() function that could likely indicate the total raw number of matches.

#7 Updated by Luke Murphey about 8 years ago

This will get the bag-of-words (it needs vector=True in the schema):

from reader.contentsearch import *
inx = WorkIndexer.get_index()
searcher = inx.searcher()

from reader.models import *
vs = Verse.objects.filter(division__work__title_slug='new-testament')
docnum = searcher.document_number(verse_id=vs[0].id)
g = searcher.vector(docnum, "content").items_as("frequency")

# (term, frequency) pairs for this document
for h in g:
    print(h[0], h[1])

searcher.idf("no_diacritics", "και")     # Returns 1.4421964751262426
searcher.idf("no_diacritics", "κυριος")  # Returns 4.8326777289115554

#8 Updated by Luke Murphey about 8 years ago

I wonder if I could use a collector for this: search_with_collector (http://whoosh.readthedocs.org/en/latest/api/searching.html)

#9 Updated by Luke Murphey about 8 years ago

python manage.py make_search_indexes -w new-testament -c

#10 Updated by Luke Murphey about 8 years ago

I'm starting to think I should implement a word summary on the morphological dialog that lists:

  • Count of this word in the current division
  • Count of this word in the current work
  • Count of this word's related forms in the current division
  • Count of this word's related forms in the current work
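
A rough sketch of computing such a summary (plain Python over hypothetical occurrence data; in practice the counts would come from the index):

```python
# Hypothetical occurrences as (work, division, surface_form) tuples.
occurrences = [
    ("new-testament", "Galatians 2", "νομον"),
    ("new-testament", "Galatians 2", "νομου"),
    ("new-testament", "Galatians 3", "νομον"),
    ("new-testament", "Romans 3", "νομος"),
]
related_forms = {"νομον", "νομου", "νομος"}  # forms of the same lemma

def word_summary(word, division, work):
    """Return the four counts proposed for the morphological dialog."""
    return {
        "word_in_division": sum(1 for w, d, f in occurrences
                                if w == work and d == division and f == word),
        "word_in_work": sum(1 for w, d, f in occurrences
                            if w == work and f == word),
        "related_in_division": sum(1 for w, d, f in occurrences
                                   if w == work and d == division
                                   and f in related_forms),
        "related_in_work": sum(1 for w, d, f in occurrences
                               if w == work and f in related_forms),
    }

summary = word_summary("νομον", "Galatians 2", "new-testament")
```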

#11 Updated by Luke Murphey about 8 years ago

I could use document_numbers() to get a list of the documents within a given division or work and then look for the related words:

for doc in searcher.document_numbers(work="new-testament"):
    print doc

#12 Updated by Luke Murphey about 8 years ago

#13 Updated by Luke Murphey about 8 years ago

The following is a good test case:

work:"New Testament" section:"Galatians" νόμον

#14 Updated by Luke Murphey about 8 years ago

With stored=False, the search indexes for the New Testament are 18.7 MB.

With stored=True, they are 20.1 MB.

#15 Updated by Luke Murphey about 8 years ago

  • Status changed from New to Closed

#16 Updated by Luke Murphey about 8 years ago

  • % Done changed from 0 to 100
