Bug #1250
Word frequency chart is incorrect
100%
Related issues
History
#1 Updated by Luke Murphey over 8 years ago
- The match counts represents the number of matching verses, not matches within the verses. See (ὅτι OR ἔργων) (section:"Galatians 2").
- The variations are throwing off the results count
#2 Updated by Luke Murphey over 8 years ago
Some solutions:
- Kick off a different search that introspects the results and looks up the results manually.
See https://whoosh.readthedocs.org/en/latest/api/reading.html and http://stackoverflow.com/questions/35565900/how-do-i-get-the-list-of-all-terms-in-a-whoosh-index.
#3 Updated by Luke Murphey over 8 years ago
I wonder if term_info(fieldname, text) can do what I want it to do. It has a max_weight() function that may indicate I could use weights to get these counts. weight() can be used to find the frequency in all documents.
#4 Updated by Luke Murphey over 8 years ago
I could list the terms that did not match with the following:
q.all_terms() - results.terms()
#5 Updated by Luke Murphey over 8 years ago
There is a score() function that could likely indicate the total raw number of matches.
#7 Updated by Luke Murphey over 8 years ago
This will get the bag-of-words (it needs vector=True in the schema):
from reader.contentsearch import * inx = WorkIndexer.get_index() searcher = inx.searcher() from reader.models import * vs = Verse.objects.filter(division__work__title_slug='new-testament') docnum = searcher.document_number(verse_id=vs[0].id) g = searcher.vector(docnum,"content").items_as("frequency") for h in g: print h[0], h[1] searcher.idf("no_diacritics", "και") # Returns 1.4421964751262426 searcher.idf("no_diacritics", "κυριος") # Returns 4.8326777289115554
#8 Updated by Luke Murphey over 8 years ago
I wonder if I could use a collector for this: search_with_collector (http://whoosh.readthedocs.org/en/latest/api/searching.html)
#9 Updated by Luke Murphey over 8 years ago
python manage.py make_search_indexes -w new-testament -c
#10 Updated by Luke Murphey over 8 years ago
I'm starting to think I should implement a word summary on the morphological dialog that lists:
- Count of this word in the current division
- Count of this word in the current work
- Count of this word's related forms in the current division
- Count of this word's related forms in the current work
#11 Updated by Luke Murphey over 8 years ago
I could use document_numbers() to get a list of the documents within a given division or work and then look for the related words:
for doc in searcher.document_numbers(work="new-testament"): print doc
#12 Updated by Luke Murphey over 8 years ago
- Related to Feature #1224: Morphology tool word search added
#13 Updated by Luke Murphey over 8 years ago
The following is a good test case:
work:"New Testament" section:"Galatians" νόμον
#14 Updated by Luke Murphey over 8 years ago
With stored=False, the search indexes for the New Testament is 18.7 MB.
With store=True, they are 20.1 MB.
#15 Updated by Luke Murphey over 8 years ago
- Status changed from New to Closed
#16 Updated by Luke Murphey over 8 years ago
- % Done changed from 0 to 100