[personal profile] ciphergoth
It's easy to see how to organise a database if you want to be able to, say, find all the people in it whose names are between "Charybdis" and "Scylla" in alphabetical order. But how would you organise it if you want to find all the people whose pictures look most like a particular sample picture, when all you have is a tool that measures how similar two pictures are? There's a family of very general approaches to the problem called "metric space searches"; in these approaches, we don't need to know anything about what we're storing except that we can measure the distance between two items and get a sensible answer.
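That black-box setting can be sketched in a few lines: the only operation the search may perform is a call to the distance function, and the baseline any clever structure has to beat is the exhaustive linear scan. This is purely illustrative; the function name and toy metric are mine, not from the post:

```python
# A minimal sketch of the metric-space search problem: the only thing we
# know about the items is a black-box distance function d(x, y).
def nearest(items, query, d):
    """Return the item minimising d(query, item).

    The naive approach: one distance measurement per stored item."""
    return min(items, key=lambda x: d(query, x))

# Toy example with numbers and absolute difference as the metric.
print(nearest([3, 10, 42], 7, lambda a, b: abs(a - b)))  # -> 10
```

Every metric-space index is an attempt to answer the same query with fewer calls to `d` than this scan makes.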

I thought I'd invented a smashing new algorithm for searching in metric spaces. But I'm having a hard time finding out if it's any good.

Searching in Metric Spaces (1999) provides an overview of the state of the art as of five years ago.

The usual assumption when designing these algorithms is that the distance measurement function is very expensive, and so the first order of business is to minimise the number of distance measurements. I implemented my algorithm and got very good results on random points in a cube. But when I tried a harder data set - random lines from "A Tale of Two Cities" with string edit distance as the metric, a standard test from Brin's experiments on his GNAT algorithm - I found it would compare the target string to over half the strings in the database before returning an answer.
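For reference, the string edit distance used in that test is the standard Levenshtein distance, which is a genuine metric (symmetric, zero only for equal strings, and satisfying the triangle inequality). A minimal dynamic-programming implementation, included only as an illustration:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning string a into b."""
    prev = list(range(len(b) + 1))          # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                            # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute ca -> cb
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # -> 3
```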

So I tried implementing one of the standard algorithms from the paper - AESA, a simple one that's supposed to be very good at minimising distance measurements at search time, at the cost of very high startup time and storage - and that doesn't do much better!
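The pruning idea behind AESA can be sketched roughly as follows: precompute every pairwise distance up front (the expensive O(n^2) preprocessing and storage mentioned above), then at query time use the triangle inequality, d(q, x) >= |d(q, p) - d(p, x)|, to rule candidates out without measuring them. This is a simplified sketch of that general scheme under my own naming, not the tuned algorithm from the literature:

```python
def aesa_nn(items, query, d):
    """Nearest-neighbour search in the spirit of AESA (a sketch).

    Preprocessing: the full pairwise distance matrix, O(n^2) calls to d.
    Query: repeatedly measure the candidate with the smallest
    triangle-inequality lower bound, and prune anything whose bound
    already exceeds the best distance found so far.
    Returns (best item, number of query-time distance measurements)."""
    n = len(items)
    pair = [[d(items[i], items[j]) for j in range(n)] for i in range(n)]
    lb = [0.0] * n                 # lower bound on d(query, items[i])
    alive = set(range(n))
    best, best_dist, calls = None, float("inf"), 0
    while alive:
        p = min(alive, key=lambda i: lb[i])   # most promising candidate
        alive.discard(p)
        dp = d(query, items[p])
        calls += 1
        if dp < best_dist:
            best, best_dist = items[p], dp
        # Tighten every survivor's bound using p as a pivot, then prune.
        for x in list(alive):
            lb[x] = max(lb[x], abs(dp - pair[p][x]))
            if lb[x] >= best_dist:
                alive.discard(x)
    return best, calls
```

On well-behaved data the bounds eliminate most candidates after a handful of measurements; when distances concentrate, the bounds stay loose and the loop degenerates towards a linear scan.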

Results using Hamming distance between random binary strings are even worse - AESA averages 882.4 measurements in a dataset of 1000 items, while my algorithm averages 999.3...

How am I supposed to assess my algorithm if I can't find an interesting test case on which any algorithm performs even vaguely OK? Random points in a cube is not a good test, because there are already much better algorithms for the special case of vector spaces...

Date: 2004-04-04 12:46 am (UTC)
From: [identity profile] pavlos.livejournal.com
Hmmm... I don't understand how/if you can evaluate such an algorithm generically, i.e. without reference to particular metrics. I take it that the worst-case runtime is something pretty boring, and you are looking for good average runtime for metrics of interest. This presumably depends on some statistical properties of your metric, such as dimensionality, continuity, distribution etc. Again I'm guessing that these properties are rarely explicit and often poorly understood, so failing that your best bet is probably the experimental approach that you are taking.

Date: 2004-04-05 12:20 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
Worst-case is the metric where x != y => d(x, y) == 1, the "are we nearly there yet" metric. You always have to compare to half or all the elements with that metric. So you are exactly right. And not just properties of your metric - statistical properties of your likely dataset too. Eg if it's in n-dimensional space but tends to lie very close to a plane, then it will perform as if it lay in a 2-dimensional space.
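To see concretely why that metric is the worst case: for any pivot p, the triangle-inequality lower bound |d(q, p) - d(p, x)| is |1 - 1| = 0 whenever the query matches nothing, so no index based on precomputed distances can ever prune a candidate. A small illustrative check (my own, not from the comment):

```python
# The "are we nearly there yet" metric: distance 1 between any two
# distinct items. Every triangle-inequality lower bound collapses to 0,
# so search degenerates to a linear scan no matter how clever the index.
def discrete(x, y):
    return 0 if x == y else 1

items = ["a", "b", "c", "d"]
query = "z"                      # not in the dataset
bounds = [abs(discrete(query, p) - discrete(p, x))
          for p in items for x in items if x != p]
print(set(bounds))  # -> {0}
```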

Paul Crowley
