Apache Lucene relevance models

1/10/2024

I was asked to write a blog post about what I've been working on within the Lucene.Net project as a committer. Who in their right mind would ask me to do that? Cough, coworker, cough. They already hang signs around my desk that tell you not to feed my ego. Why should I have to write about my rock star status? I have fans for that.

Despite the popular belief that I'm a sexy nerd with narcissistic tendencies, more awesome than Captain Awesome, I am just a down-to-earth guy like Wil Wheaton. Seriously though, this post isn't going to be about flaunting or bragging rights. It is in poor taste to do something that diminishes what really takes a team effort and a community to accomplish. What this post is really going to focus on is the why, the motivations behind the actions, which are sometimes more important than the actions themselves.

The decision to be a committer

Lucene.Net was on the verge of being forked into oblivion and put into Apache's Attic. The Attic, which sounds like a reference to Joss Whedon's Dollhouse, fundamentally works the same way it does on that show. All frakked projects go there to hibernate or die.

Lucene implements a variant of the TF-IDF scoring model. The authoritative document for scoring is found on the Lucene site. The factors involved in Lucene's scoring algorithm are as follows:

1. tf = term frequency in document = measure of how often a term appears in the document.
2. idf = inverse document frequency = measure of how often the term appears across the index.
3. coord = number of terms in the query that were found in the document.
4. lengthNorm = measure of the importance of a term according to the total number of terms in the field.
5. queryNorm = normalization factor so that queries can be compared.
6. boost (index) = boost of the field at index-time.
7. boost (query) = boost of the field at query-time.

The implementation, implication and rationale of factors 1, 2, 3 and 4 in DefaultSimilarity.java, which is what you get if you don't explicitly specify a similarity, are listed here. Note: the implication of each factor should be read as "everything else being equal".

tf:
* Implication: the more frequently a term occurs in a document, the greater its score
* Rationale: documents which contain more of a term are generally more relevant

idf:
* Implementation: log(numDocs/(docFreq+1)) + 1
* Implication: the greater the occurrence of a term in different documents, the lower its score
* Rationale: common terms are less important than uncommon ones

coord:
* Implication: of the terms in the query, a document that contains more terms will have a higher score

lengthNorm:
* Implication: a term matched in a field with fewer terms has a higher score
* Rationale: a term in a field with fewer terms is more important than one in a field with more

queryNorm is not related to the relevance of the document, but rather tries to make scores between different queries comparable. It is implemented as 1/sqrt(sumOfSquaredWeights).

So, in summary (quoting Mark Harwood from the mailing list):

* Documents containing *all* the search terms are good
* Matches on rare words are better than for common words
* Long documents are not as good as short ones
* Documents which mention the search terms many times are good

The mathematical definition of the scoring can be found in the Lucene documentation.

It's easy to customize the scoring algorithm. Subclass DefaultSimilarity and override the method you want to customize. Hint: look at NutchSimilarity in Nutch to see an example of how web pages can be scored for relevance.
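To make the scoring factors and the "subclass and override" idea a bit more concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: SketchSimilarity is a hypothetical stand-in for DefaultSimilarity, and BinaryTfSimilarity and singleTermScore are names made up for this illustration. The idf and queryNorm formulas are the ones quoted in this post. The tf, coord and lengthNorm formulas, and the way the factors are multiplied together, are assumptions based on my understanding of the classic DefaultSimilarity, so check them against the Lucene version you actually run.

```java
// Illustrative sketch only: none of these classes are part of Lucene.
class ScoringSketch {

    /** Hypothetical stand-in for DefaultSimilarity (NOT the real Lucene class). */
    static class SketchSimilarity {
        // tf: assumed to be sqrt(freq); this formula is not stated in the post.
        float tf(float freq) {
            return (float) Math.sqrt(freq);
        }

        // idf: log(numDocs/(docFreq+1)) + 1, as quoted in the post.
        float idf(int docFreq, int numDocs) {
            return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        }

        // coord: assumed to be the fraction of query terms found in the document.
        float coord(int overlap, int maxOverlap) {
            return overlap / (float) maxOverlap;
        }

        // lengthNorm: assumed to be 1/sqrt(numTerms), so shorter fields score higher.
        float lengthNorm(int numTerms) {
            return (float) (1.0 / Math.sqrt(numTerms));
        }

        // queryNorm: 1/sqrt(sumOfSquaredWeights), as quoted in the post.
        float queryNorm(float sumOfSquaredWeights) {
            return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
        }
    }

    /** Customization by subclassing: count a term once, however often it occurs. */
    static class BinaryTfSimilarity extends SketchSimilarity {
        @Override
        float tf(float freq) {
            return freq > 0 ? 1.0f : 0.0f;
        }
    }

    /**
     * Simplified score of one term in one field: tf * idf^2 * lengthNorm.
     * Assumptions: all boosts are 1, coord and queryNorm are left out, and idf
     * is applied twice (once for the query weight, once for the document).
     */
    static float singleTermScore(SketchSimilarity sim, float freq,
                                 int docFreq, int numDocs, int numTerms) {
        float idf = sim.idf(docFreq, numDocs);
        return sim.tf(freq) * idf * idf * sim.lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        SketchSimilarity base = new SketchSimilarity();
        SketchSimilarity custom = new BinaryTfSimilarity();

        // Same term, same field, same index statistics; only tf() differs.
        System.out.println("default tf: " + singleTermScore(base, 4f, 9, 10, 4));
        System.out.println("binary tf:  " + singleTermScore(custom, 4f, 9, 10, 4));
    }
}
```

The point of the subclass is that only tf() changes, while idf, lengthNorm and the rest are inherited untouched. That is the same shape the customization advice in this post describes, just without the real Lucene classes behind it.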
