OK, a dev pointed me to this previous thread:
So what we know so far:
- The score is a measure of relevance, the higher the "better" (i.e., higher scores are more relevant to the search query). The score is different from the Levenshtein distance.
- Title and body text are both taken into account, but the title is weighted more. A page that has the query in its title will score higher than a page that only has the query in its body. If the query is found in both the title and the body, we average the scores.
- How fuzziness affects scoring... it essentially "expands" the search keywords, so a search for "test" would internally expand into "tset" and "tess" and so on, and the results will be scored accordingly. But if your search query was long enough, that would result in gazillions of permutations... so it doesn't really enumerate them one by one, but rather performs some fancy magic against pre-indexed terms within X Damerau-Levenshtein distance of the query. (It's a crazy algorithm: see https://www.elastic.co/blog/found-fuzzy-search and http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata). Anyway, that isn't a Dato-specific thing, just how fuzzy search works in Elasticsearch by default.
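To make "within X edits" a bit more concrete, here is a minimal Python sketch of the edit distance itself. It uses the optimal-string-alignment variant of Damerau-Levenshtein, where swapping two adjacent letters counts as a single edit. The function names and the sample word list are made up for illustration, and the brute-force loop is not how Elasticsearch does it (it uses the automaton approach from the links above). It only shows which indexed terms would count as fuzzy matches for a query.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute and
    adjacent-transposition as one edit each (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "es" <-> "se"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def within_distance(query: str, indexed_terms: list[str], max_edits: int) -> list[str]:
    """Hypothetical helper: brute-force filter of terms within max_edits.
    Illustration only; a real engine avoids scanning every term."""
    return [t for t in indexed_terms if osa_distance(query, t) <= max_edits]


if __name__ == "__main__":
    terms = ["tset", "tess", "toast", "best", "testing"]
    print(within_distance("test", terms, 1))
```

With a limit of one edit, "tset" (a swap), "tess" and "best" (one substitution each) match "test", while "toast" (two edits) and "testing" (three edits) do not.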
What I'm still looking into:
- We are still trying to find the exact scoring algorithm used by our particular Elasticsearch (if that matters to you). It ought to be BM25, which is the newer default, unless we're on an older version of Elasticsearch.
- Whether there is a maximum possible score value. So far, I don't believe there is. It should be able to go higher and higher if your query is both unique and precise enough against a huge dataset of dissimilar results... i.e., a truly unique "needle in a giant haystack" could score very, very high compared to a query that returned many results among similar pages.
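If it does turn out to be BM25 (still unconfirmed for our setup), the "needle in a haystack" effect can be sketched with the textbook BM25 formula for a single term. This is an assumption-laden illustration, not our actual scoring code: the function names are made up, k1=1.2 and b=0.75 are the standard Elasticsearch defaults for BM25, and the real Lucene implementation may differ in constant factors. It also ignores the title/body weighting and fuzziness described above.

```python
import math


def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency: rarer terms get a larger value."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_term_score(
    tf: int,
    doc_len: int,
    avg_doc_len: float,
    total_docs: int,
    docs_with_term: int,
    k1: float = 1.2,   # term-frequency saturation (assumed default)
    b: float = 0.75,   # length normalization (assumed default)
) -> float:
    """Textbook BM25 contribution of one query term to one document."""
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(total_docs, docs_with_term) * tf * (k1 + 1) / (tf + length_norm)


if __name__ == "__main__":
    # Same document, same term count; only the index size and term rarity change.
    common = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100,
                             total_docs=1_000, docs_with_term=500)
    rare_small = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100,
                                 total_docs=1_000, docs_with_term=1)
    rare_huge = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100,
                                total_docs=1_000_000, docs_with_term=1)
    print(common, rare_small, rare_huge)
```

In this sketch the IDF part grows with the logarithm of the index size, and a multi-term query sums one such contribution per term, which is consistent with there being no fixed ceiling on the score.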