Question: How is the score calculated for internal search? (DatoCMS Site Search algorithm)

Describe the issue:

  • Is it possible to know how the internal scoring is carried out? So far, based on the results we are getting, we have assumed it is based on the number of occurrences of the search term (or close matches to it, since fuzzy is turned on) in title + body.

  • The score returned by the endpoint is numeric. Is it 0-100 based?

Good question, @technology, and sorry our documentation isn’t very explicit about this.

For now I’m going to share a preliminary answer previously written by a dev:

By using fuzzy search, we return documents that contain terms similar to the search term, as measured by Levenshtein edit distance (the number of one-character changes needed to turn one string into another).

The Levenshtein edit distance is configured based on the length of the search term:
length 0…2 : must match exactly
length 3…5 : one edit allowed
length >5 : two edits allowed

At the moment, we do not allow configuring a custom Levenshtein edit distance.
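
To make the length rule above concrete, here is a minimal Python sketch. The function name `allowed_edits` is made up for illustration and is not part of any DatoCMS or Elasticsearch API; it only restates the three thresholds listed above.

```python
def allowed_edits(term: str) -> int:
    """Return the maximum edit distance tolerated for a search term.

    Hypothetical helper: it mirrors the thresholds quoted above
    (0-2 chars: exact match, 3-5 chars: one edit, >5 chars: two edits).
    """
    length = len(term)
    if length <= 2:
        return 0  # must match exactly
    if length <= 5:
        return 1  # one edit allowed
    return 2      # two edits allowed


for term in ["ab", "test", "search"]:
    print(term, allowed_edits(term))
```

So a 4-letter query like "test" tolerates one edit, while a 6-letter query like "search" tolerates two.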

But this isn’t quite as clear as it could be, i.e., I am not sure whether the score is the Levenshtein distance or something else. Let me try to get more info and get back to you on that.

(Edit: The higher the score, the more relevant the result is to the search term. I don’t believe there is a maximum score. It might be the BM25 algorithm used by Elasticsearch with some custom weights, but let me verify that for you.)

OK, a dev pointed me to this previous thread:

So what we know so far:

  • The score is a measure of relevance, the higher the “better” (i.e., higher scores are more relevant to the search query). The score is different from the Levenshtein distance.
  • Title and body text are both taken into account, but the title is weighted more. A page that has the query in its title will score higher than a page that only has the query in its body. If the query is found in both the title and the body, we average the scores.
  • How fuzziness affects scoring… it essentially “expands” the search keywords, like a search for “test” would internally expand into “tset” and “tess” and so on, and the results will be scored accordingly. But if your search query was long enough, that would result in gazillions of permutations… so it doesn’t really do that manually, but rather performs some fancy magic against pre-indexed matches within X Damerau-Levenshtein distance away. (It’s a crazy algorithm: see https://www.elastic.co/blog/found-fuzzy-search and http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata). Anyway, that isn’t a Dato-specific thing, just how fuzzy searches in Elasticsearch work by default.
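
To illustrate why “tset” and “tess” both count as one edit away from “test”, here is a small Python sketch of the Damerau-Levenshtein distance in its common “optimal string alignment” form, where swapping two adjacent characters counts as a single edit. This is only a plain dynamic-programming version for illustration; it is not how Elasticsearch computes matches internally (that uses the automaton approach from the links above), and the function name `osa_distance` is my own.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute, and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition, e.g. "es" <-> "se"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for candidate in ["test", "tset", "tess", "toast"]:
    print(candidate, osa_distance("test", candidate))
```

With the one-edit limit that applies to a 4-letter query, “tset” and “tess” would match “test”, while “toast” (two edits) would not.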

What I’m still looking into:

  • We are still trying to find the exact scoring algorithm used by our particular Elasticsearch (if that matters to you). It ought to be BM25, which is the newer default, unless we’re on an older version of Elasticsearch.
  • Whether there is a maximum possible score value. So far, I don’t believe there is. It should be able to go higher and higher if your query is both unique and precise enough against a huge dataset of dissimilar results… i.e., a truly unique “needle in a giant haystack” could score very very high compared to a query that returned many results among similar pages.
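
As a rough illustration of the “needle in a haystack” point, here is a Python sketch of the textbook BM25 score for a single term. Treat it as an assumption-laden sketch: the exact formula, the title/body weights, and the parameters in our setup are not confirmed in this thread, the function names are my own, and `k1=1.2` / `b=0.75` are simply the commonly cited defaults.

```python
import math


def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_term_score(freq: int, doc_len: int, avg_doc_len: float,
                    total_docs: int, docs_with_term: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 score of one term in one document (illustrative only)."""
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term-frequency part saturates as freq grows.
    tf = freq * (k1 + 1) / (freq + norm)
    return bm25_idf(total_docs, docs_with_term) * tf


# Same document, same single occurrence; only the rarity of the term changes.
common = bm25_term_score(1, 100, 100, total_docs=1_000, docs_with_term=500)
rare = bm25_term_score(1, 100, 100, total_docs=1_000, docs_with_term=1)
rare_big_index = bm25_term_score(1, 100, 100, total_docs=1_000_000, docs_with_term=1)
print(common, rare, rare_big_index)
```

In this form, the score for a term found in only one document keeps growing as the index grows, which is consistent with there being no fixed upper bound like 100.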

Just to check back in, I do believe the above is correct for our version of Elasticsearch.

Hopefully that helps, and please let us know if you have any other questions about the scoring.

We really appreciate the thorough answers, thank you!
