It's easy to see how to organise a database if you want to be able to, say, find all the people in it whose names fall between "Charybdis" and "Scylla" in alphabetical order. But how would you organise it if you want to find all the people whose pictures look most like a particular sample picture, when all you have is a tool that measures how similar two pictures are? There's a family of very general approaches to this problem called "metric space searches": we don't need to know anything about what we're storing except that we can measure the distance between any two items and get a sensible answer (in particular, one obeying the triangle inequality, which is what lets a search rule items out without measuring them).
I thought I'd invented a smashing new algorithm for searching in metric spaces. But I'm having a hard time finding out if it's any good.
"Searching in Metric Spaces" (1999) provides an overview of the state of the art as of five years ago.
The usual assumption when designing these algorithms is that the distance function is very expensive to compute, so the first order of business is to minimise the number of distance measurements. I implemented my algorithm and got very good results on random points in a cube. But when I tried a harder data set - random lines from "A Tale of Two Cities", with string edit distance as the metric, the standard test Brin used to evaluate his GNAT algorithm - I found it would compare the target string to over half the strings in the database before returning an answer.
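For concreteness, "string edit distance" here is plain Levenshtein distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. A minimal Python sketch of the metric, just to pin down what's being counted:

    def edit_distance(a, b):
        # Levenshtein distance via the standard dynamic programme,
        # keeping only one row of the table at a time.
        prev = list(range(len(b) + 1))      # distances from "" to each prefix of b
        for i, ca in enumerate(a, 1):
            curr = [i]                      # distance from a[:i] to ""
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # delete ca
                                curr[j - 1] + 1,             # insert cb
                                prev[j - 1] + (ca != cb)))   # substitute or match
            prev = curr
        return prev[-1]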
So I tried implementing one of the standard algorithms from the paper - AESA (the Approximating and Eliminating Search Algorithm), a simple one that's supposed to be very good at minimising distance measurements at search time, at the cost of very high startup time and storage - and it doesn't do much better!
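For anyone who hasn't met it, the search-time half of AESA fits in a few lines. Here's a rough Python sketch of the range-search version, assuming the full pairwise distance matrix D was precomputed at build time (that's the quadratic startup cost and storage):

    def aesa_range_search(query, points, D, dist, r):
        # Find all points within distance r of query, using as few
        # calls to the expensive metric dist() as possible.
        # D[i][j] is the precomputed distance between points i and j.
        candidates = set(range(len(points)))
        lower = [0.0] * len(points)   # lower bound on dist(query, points[i])
        results = []
        while candidates:
            # next pivot: the candidate with the smallest lower bound
            p = min(candidates, key=lambda i: lower[i])
            candidates.discard(p)
            dp = dist(query, points[p])       # one real distance measurement
            if dp <= r:
                results.append(points[p])
            # triangle inequality: |dp - D[p][u]| <= dist(query, points[u]),
            # so anything whose bound exceeds r is eliminated unmeasured
            for u in list(candidates):
                lower[u] = max(lower[u], abs(dp - D[p][u]))
                if lower[u] > r:
                    candidates.discard(u)
        return results

Everything hinges on how often that elimination test fires; on the edit-distance and Hamming data it evidently hardly ever does.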
Results using Hamming distance between random binary strings are even worse - AESA averages 882.4 measurements in a dataset of 1000 items, while my algorithm averages 999.3...
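(For reference, the Hamming setup looks like this; the 32-bit string length is just an arbitrary illustrative choice:)

    import random

    def hamming(a, b):
        # number of bit positions at which the two strings differ
        return sum(x != y for x, y in zip(a, b))

    # 1000 random binary strings; the 32-bit length is illustrative only
    data = [[random.randint(0, 1) for _ in range(32)] for _ in range(1000)]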
How am I supposed to assess my algorithm if I can't find an interesting test case on which any algorithm performs even vaguely OK? Random points in a cube is not a good test, because there are already much better algorithms for the special case of vector spaces...
no subject
Date: 2004-04-03 01:23 am (UTC)
Some of the research focused on pattern recognition. The only link I was able to find is below.
http://www.lic.gov.uk/research/information_retrieval/ir-im-rr.html
no subject
Date: 2004-04-03 04:38 am (UTC)
I guess what you've discovered is that all the existing algorithms suck. I think this is probably why Google Image search doesn't have a "Find images similar to this one..." function.
From my brief read of AESA, I really don't see how it's supposed to be O(1) at all. It looks to me like its performance is heavily dependent on the value of r used - smaller r should eliminate many more elements of U on each iteration. What r are you using?
I'd be interested to hear what your algorithm is.
Assessing your algorithm...
Date: 2004-04-07 06:42 pm (UTC)
- Your cube case is intrinsically 3D (if your cube was 3D) and the distribution is totally even (no interesting structure), so this is nearly a pathological case.
- Your random lines from "A Tale of Two Cities" test is (I would imagine) pretty close to n-dimensional (since the edit distances would be pretty large between random strings), but once again with a fairly even distribution.
Your test cases only cover two corners of the dimensionality/distribution property space. What it seems you need are just some more interesting test cases, starting with cases that test distributions that have structure:
- Simple shapes like sines, a sombrero, or some other simple function embedded in whatever dimensional space you want to test (pick random coordinates for the function input and set the final dimension's value with the function).
- Points in a cube with a Perlin noise distribution (pick a point in the space and discard ones where the noise value is outside a certain threshold range, creating a nice marble-like structure); see the sketch at the end of this comment.
- Load in the points of a model file (say of a video game character).
In fact, now that I think about it, Perlin noise (or fBm) functions are ideal for testing all sorts of algorithms, since you can easily control how much or how little structure there is to the space. With simple input parameter changes you can control the dimensionality, structure, frequency, and fractal dimension of each test set. It should be rather easy to write a program which generates an extremely thorough collection of test sets, and those should give a very clear picture of comparative performance under different environments for each algorithm.

Heck, I'll even write the program to generate the test sets for you if you want me to...
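Something along these lines, say. (A sketch only: it uses a cheap hash-based value noise as a stand-in for real Perlin noise, and the grid scale and threshold band are arbitrary choices.)

    import math, random

    def value_noise(x, y, z, seed=0):
        # Cheap smooth noise: a deterministic pseudo-random value at each
        # integer lattice point, trilinearly interpolated in between.
        # A stand-in for proper Perlin noise; fine for generating test sets.
        def lattice(ix, iy, iz):
            h = hash((ix, iy, iz, seed)) & 0xFFFFFFFF
            return h / 0x100000000
        def smooth(t):
            return t * t * (3 - 2 * t)
        x0, y0, z0 = math.floor(x), math.floor(y), math.floor(z)
        fx, fy, fz = smooth(x - x0), smooth(y - y0), smooth(z - z0)
        v = 0.0
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    w = ((fx if dx else 1 - fx) *
                         (fy if dy else 1 - fy) *
                         (fz if dz else 1 - fz))
                    v += w * lattice(x0 + dx, y0 + dy, z0 + dz)
        return v    # in [0, 1)

    def marble_points(n, lo=0.45, hi=0.55, scale=4.0):
        # Rejection-sample points in the unit cube whose noise value falls
        # inside [lo, hi]: a thin, sheeted, marble-like distribution.
        pts = []
        while len(pts) < n:
            p = (random.random(), random.random(), random.random())
            if lo <= value_noise(p[0] * scale, p[1] * scale, p[2] * scale) <= hi:
                pts.append(p)
        return pts

Narrowing or widening the [lo, hi] band and changing the scale gives you a dial from "nearly uniform cube" to "highly structured sheets", which is exactly the axis your two existing tests don't cover.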