I've noticed that the scores returned by `db.index.vector.queryNodes(...)` are too high. But perhaps this is only because I expected cosine similarity scores, and nothing in the docs explicitly tells me that the 'score' returned by this procedure isn't actually cosine similarity.
The difference is quite large. To test, I ran a similarity search, picked two nodes and noted the score returned by `queryNodes`, then manually looked up those nodes, copied their embeddings, and calculated the cosine similarity locally. In one test the true similarity is 0.67, but Neo4j reports 0.84.
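For reference, the local check is just the textbook formula; a minimal sketch (the embeddings below are made-up placeholders standing in for the ones copied from the two matched nodes):

```python
import math

def cosine_similarity(a, b):
    """Plain cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for the vectors copied from the two nodes.
emb1 = [0.1, 0.3, 0.5, 0.2]
emb2 = [0.2, 0.1, 0.4, 0.4]

similarity = cosine_similarity(emb1, emb2)
```

This value is what I compared against the `score` field returned by `queryNodes`.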
I assume this is already known and is explained by the first 'A' in ANN (approximate nearest neighbour), but it certainly wasn't clear to me as a newbie.
I think many users will assume this 'score' is actual cosine similarity, apply intuitions they already have (like filtering out items where similarity is < 0.9), and then find that things don't work as expected.
Suggestions: you could make it clearer in the docs and course that this score is NOT cosine similarity and can differ from it by several tenths. Alternatively, for the subset of matched items, calculate the actual cosine similarity (which is fast for just a handful of results, and would be even faster if I could indicate via an option that my embeddings are normalised, so a plain dot product would give the correct result).
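To illustrate that last point: once both embeddings are unit-length, the norms in the cosine formula's denominator are 1, so the dot product alone gives the cosine similarity. A quick sketch with illustrative vectors:

```python
import math

def normalise(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = normalise([0.1, 0.3, 0.5, 0.2])
b = normalise([0.2, 0.1, 0.4, 0.4])

# For unit vectors, dot(a, b) == dot(a, b) / (|a| * |b|), i.e. the
# dot product IS the cosine similarity of the original vectors.
dot = sum(x * y for x, y in zip(a, b))
```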
The workaround I'll use is to disregard the Neo4j score and calculate the cosine similarity manually before applying further filters, which I assume is standard practice for anyone wishing to filter on the actual cosine similarity value.
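The workaround looks roughly like this. The `hits` list is a hypothetical stand-in for rows returned by `queryNodes` (with each node's stored embedding fetched alongside it); the Neo4j score is ignored and the threshold is applied to the locally computed cosine similarity instead:

```python
import math

def cosine_similarity(a, b):
    """Plain cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Placeholder data standing in for queryNodes results plus each
# node's stored embedding; shapes here are illustrative only.
query_embedding = [0.1, 0.3, 0.5, 0.2]
hits = [
    {"name": "close_match", "embedding": [0.1, 0.3, 0.5, 0.2]},
    {"name": "weak_match", "embedding": [-0.5, 0.2, 0.0, 0.1]},
]

# Filter on the true cosine similarity, not the Neo4j score.
THRESHOLD = 0.9
kept = [
    h for h in hits
    if cosine_similarity(query_embedding, h["embedding"]) >= THRESHOLD
]
```

Cheap enough for a handful of matched items, even though it means ignoring the score the index already computed.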