Is it possible to know how the internal scoring is carried out? So far, based on the results we are getting, we have assumed it is based on the number of occurrences of the search term (or close matches to it, since fuzzy is turned on) in the title + body.
The score returned by the endpoint is numeric; is it on a 0–100 scale?
Good question, @technology, and sorry our documentation isn't very explicit about this.
For now I'm going to share a preliminary answer previously written by a dev:
By using fuzzy search, we return documents that contain terms similar to the search term, as measured by a Levenshtein edit distance (the number of single-character changes that need to be made to one string to make it the same as another string).
The Levenshtein edit distance is configured based on the length of the search term:
length 0…2: must match exactly
length 3…5: one edit allowed
length >5: two edits allowed
At the moment we do not allow configuring a custom Levenshtein edit distance.
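To make those two ideas concrete, here is a minimal Python sketch of a plain Levenshtein distance plus the length-based edit allowance listed above (the same thresholds Elasticsearch uses for its AUTO fuzziness). It is only an illustration, not our actual implementation, and the function names are mine:

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def allowed_edits(term: str) -> int:
    """Edit distance allowed for a search term, based on its length."""
    if len(term) <= 2:
        return 0   # must match exactly
    if len(term) <= 5:
        return 1   # one edit allowed
    return 2       # two edits allowed

def fuzzy_match(term: str, candidate: str) -> bool:
    return levenshtein(term, candidate) <= allowed_edits(term)

print(fuzzy_match("test", "tests"))  # True: distance 1, one edit allowed
print(fuzzy_match("hi", "ho"))       # False: two-letter terms must match exactly
```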
But this isn't quite as clear as it could be; for instance, I am not sure whether the score is the Levenshtein distance or something else. Let me try to get more info and get back to you on that.
(Edit: The higher the score, the more relevant the result is to the search term. I don't believe there is a maximum score. It might be the BM25 algorithm used by Elasticsearch with some custom weights, but let me verify that for you.)
The score is a measure of relevance: the higher, the "better" (i.e., higher scores are more relevant to the search query). The score is different from the Levenshtein distance.
Title and body text are both taken into account, but the title is weighted more. A page that has the query in its title will score higher than a page that only has the query in its body. If the query is found in both the title and the body, we average the scores.
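Purely to illustrate that idea (the exact field weights are internal, so the weights, names and numbers below are hypothetical), the combination might look something like this:

```python
TITLE_WEIGHT = 2.0   # assumption: title counts more than body
BODY_WEIGHT = 1.0

def combined_score(title_score: float | None, body_score: float | None) -> float:
    """Combine hypothetical per-field relevance scores into one document score."""
    weighted = []
    if title_score is not None:
        weighted.append(title_score * TITLE_WEIGHT)
    if body_score is not None:
        weighted.append(body_score * BODY_WEIGHT)
    # If the query matches both fields, average the weighted field scores.
    return sum(weighted) / len(weighted) if weighted else 0.0

print(combined_score(title_score=3.2, body_score=None))  # title-only match
print(combined_score(title_score=3.2, body_score=1.4))   # average of both fields
```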
How fuzziness affects scoring: it essentially "expands" the search keywords, so a search for "test" would internally expand into "tset" and "tess" and so on, and the results are scored accordingly. But if your search query were long enough, that would result in gazillions of permutations, so it doesn't really do that manually; instead it performs some fancy magic against pre-indexed matches within X Damerau–Levenshtein distance away. (It's a crazy algorithm: see https://www.elastic.co/blog/found-fuzzy-search and http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata.) Anyway, that isn't a Dato-specific thing, just how fuzzy search in Elasticsearch works by default.
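If you want a feel for why the naive expansion isn't practical, here's a quick sketch (my own illustration of the combinatorics, not what Elasticsearch actually does) that enumerates every string one edit away from a term:

```python
import string

def one_edit_variants(term: str) -> set[str]:
    """All strings exactly one insertion, deletion or substitution away."""
    alphabet = string.ascii_lowercase
    variants = set()
    for i in range(len(term) + 1):
        for c in alphabet:
            variants.add(term[:i] + c + term[i:])          # insertion
    for i in range(len(term)):
        variants.add(term[:i] + term[i + 1:])              # deletion
        for c in alphabet:
            variants.add(term[:i] + c + term[i + 1:])      # substitution
    variants.discard(term)
    return variants

# Already a couple hundred variants for a 4-letter term; distance 2 is far worse,
# which is why Elasticsearch walks a Levenshtein automaton over the index instead.
print(len(one_edit_variants("test")))
```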
What I'm still looking into:
We are still trying to find the exact scoring algorithm used by our particular Elasticsearch setup (if that matters to you). It ought to be BM25, which is the newer default, unless we're on an older version of Elasticsearch.
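For reference, the textbook BM25 contribution of a single query term looks like the sketch below (my own illustration; Elasticsearch/Lucene layers field boosts and other normalization on top, so real scores will differ):

```python
import math

def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    doc_count: int, docs_with_term: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Contribution of one query term to a document's BM25 score."""
    # Rarer terms get a larger inverse document frequency (IDF) boost.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term frequency saturates via k1; b normalizes for document length.
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

# A rare term in a shorter-than-average document scores highest:
print(bm25_term_score(tf=3, doc_len=100, avg_doc_len=200,
                      doc_count=10_000, docs_with_term=5))
```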
Whether there is a maximum possible score value. So far, I don't believe there is. It should be able to go higher and higher if your query is both unique and precise enough against a huge dataset of dissimilar results; i.e., a truly unique "needle in a giant haystack" could score very high compared to a query that returned many results among similar pages.