Question: How is the score calculated for internal search? (DatoCMS Site Search algorithm)

OK, a dev pointed me to this previous thread:

So what we know so far:

  • The score is a measure of relevance, the higher the “better” (i.e., higher scores are more relevant to the search query). The score is different from the Levenshtein distance.
  • Title and body text are both taken into account, but the title is weighted more. A page that has the query in its title will score higher than a page that only has the query in its body. If the query is found in both the title and the body, we average the scores.
  • How fuzziness affects scoring: conceptually, it “expands” the search keywords, so a search for “test” would internally expand into “tset”, “tess”, and so on, and the results are scored accordingly. For a long query, though, that would produce a huge number of permutations, so it doesn’t literally generate them all. Instead, it matches the query against pre-indexed terms that are within X Damerau-Levenshtein distance of it. (It’s a crazy algorithm: see https://www.elastic.co/blog/found-fuzzy-search and http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata). Anyway, that isn’t a Dato-specific thing, just how fuzzy Elasticsearch queries work by default.
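
To make the “within X edits” idea concrete, here is a minimal Python sketch of the optimal string alignment variant of Damerau-Levenshtein distance, plus a brute-force filter over a list of indexed terms. The names `osa_distance` and `fuzzy_matches` are made up for illustration, and this is not how Elasticsearch does it internally (it uses the Levenshtein automata described in the links above rather than comparing every term); it only shows what “distance 1 from test” means.

```python
def osa_distance(a: str, b: str) -> int:
    """Edits (insert, delete, substitute, swap adjacent chars) to turn a into b."""
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all i characters
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "es" <-> "se"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def fuzzy_matches(query: str, indexed_terms: list[str], max_edits: int) -> list[str]:
    """Brute-force stand-in for a fuzzy lookup: keep terms within max_edits."""
    return [t for t in indexed_terms if osa_distance(query, t) <= max_edits]


print(osa_distance("test", "tset"))  # 1 (one transposition)
print(fuzzy_matches("test", ["tset", "tess", "toast", "banana"], max_edits=1))
```

With `max_edits=1`, “tset” (a swap) and “tess” (a substitution) match, while “toast” is two edits away and does not.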

What I’m still looking into:

  • We are still trying to find the exact scoring algorithm used by our particular Elasticsearch (if that matters to you). It ought to be BM25, which is the newer default, unless we’re on an older version of Elasticsearch.
  • Whether there is a maximum possible score value. So far, I don’t believe there is. It should be able to go higher and higher if your query is both unique and precise enough against a huge dataset of dissimilar results… i.e., a truly unique “needle in a giant haystack” could score very high compared to a query that returned many results among similar pages.
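
If it does turn out to be BM25 (still unconfirmed for our setup), the “needle in a haystack” effect can be sketched with a textbook single-term BM25 formula. Everything here is an assumption for illustration: the function name `bm25_term_score` is made up, `k1=1.2` and `b=0.75` are the commonly cited defaults, and the exact formula in a given Elasticsearch/Lucene version may differ in its constant factors.

```python
import math


def bm25_term_score(
    term_freq: int,       # times the term appears in this document
    doc_len: int,         # length of this document, in terms
    avg_doc_len: float,   # average document length across the index
    n_docs: int,          # total documents in the index
    doc_freq: int,        # documents that contain the term
    k1: float = 1.2,      # assumed default: term-frequency saturation
    b: float = 0.75,      # assumed default: length normalization strength
) -> float:
    """Textbook BM25 score for one query term in one document."""
    # Rare terms (small doc_freq relative to n_docs) get a large idf.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates: repeating a term helps less and less.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    tf_part = term_freq * (k1 + 1) / (term_freq + k1 * length_norm)
    return idf * tf_part


# Same document, same term frequency; only the rarity of the term changes.
needle = bm25_term_score(1, 100, 100.0, n_docs=1_000_000, doc_freq=1)
common = bm25_term_score(1, 100, 100.0, n_docs=1_000_000, doc_freq=500_000)
small_index = bm25_term_score(1, 100, 100.0, n_docs=1_000, doc_freq=1)
print(needle, common, small_index)
```

In this sketch the idf part keeps growing (logarithmically) as the index grows while the term stays rare, which is consistent with there being no fixed maximum score: the unique term in the million-document index scores higher than the same term in a thousand-document index, and far higher than a term found in half the documents.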