[personal profile] ciphergoth
It's easy to see how to organise a database if you want to be able to, say, find all the people in it whose names are between "Charybdis" and "Scylla" in alphabetical order. But how would you organise it if you want to find all the people whose pictures look most like a particular sample picture, when all you have is a tool that measures how similar two pictures are? There's a family of very general approaches to the problem called "metric space searches"; in this approach, we don't need to know anything about what we're storing except that we can measure the distance between two items and get a sensible answer.
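To make that concrete: the naive answer is a linear scan that calls the distance function once per stored item, and everything below is about beating this baseline. A minimal sketch in Python:

    def nearest(query, items, distance):
        """Linear-scan nearest neighbour: one distance call per stored item."""
        best_item, best_dist = None, float("inf")
        for item in items:
            d = distance(query, item)
            if d < best_dist:
                best_item, best_dist = item, d
        return best_item, best_dist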

I thought I'd invented a smashing new algorithm for searching in metric spaces. But I'm having a hard time finding out if it's any good.

Searching in Metric Spaces (1999) provides an overview of the state of the art as of five years ago.

The usual assumption when designing these algorithms is that the distance measurement function is very expensive, so the first order of business is to minimise the number of distance measurements. I implemented my algorithm and got very good results on random points in a cube. But when I tried a harder data set - random lines from "A Tale of Two Cities" with string edit distance as the metric, a standard test Brin used to evaluate his GNAT algorithm - I found it would compare the target string to over half the strings in the database before returning an answer.
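A rough reconstruction of that kind of test (a sketch, not my actual test code, and the filename is a placeholder): the standard dynamic-programming edit distance, plus a wrapper that counts how often the metric gets called.

    def edit_distance(a, b):
        """Levenshtein distance via the standard dynamic-programming recurrence."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # delete ca
                               cur[j - 1] + 1,               # insert cb
                               prev[j - 1] + (ca != cb)))    # substitute
            prev = cur
        return prev[-1]

    class CountingMetric:
        """Wraps a distance function and counts how often it is called."""
        def __init__(self, fn):
            self.fn, self.calls = fn, 0
        def __call__(self, x, y):
            self.calls += 1
            return self.fn(x, y)

    # Usage, reusing nearest() from the sketch above ("two_cities.txt" is a
    # placeholder filename):
    # lines = open("two_cities.txt").read().splitlines()
    # metric = CountingMetric(edit_distance)
    # print(nearest(lines[0], lines[1:], metric), metric.calls, "measurements")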

So I tried implementing one of the standard algorithms from the paper - AESA, a simple one that's supposed to be very good at minimising distance measurements at search time, at the cost of very high startup time and storage - and that doesn't do much better!
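For reference, the core of AESA is small. Here's my own sketch of the idea (not the published pseudocode): pay O(n^2) up front for the full pairwise-distance matrix, then at query time measure the distance to one pivot at a time and use the triangle inequality to throw out every candidate whose lower bound already exceeds the best match found so far.

    import random

    class AESA:
        """Sketch of the AESA idea, not the published pseudocode: precompute
        all pairwise distances (O(n^2) time and space), then answer queries
        using triangle-inequality lower bounds to eliminate candidates."""

        def __init__(self, items, distance):
            self.items, self.distance = items, distance
            n = len(items)
            # Full n x n matrix of pairwise distances, paid for at setup time.
            self.D = [[distance(items[i], items[j]) for j in range(n)]
                      for i in range(n)]

        def nearest(self, query):
            n = len(self.items)
            alive = set(range(n))        # candidates not yet eliminated
            lower = [0.0] * n            # lower bound on d(query, items[i])
            best_i, best_d = None, float("inf")
            pivot = random.choice(tuple(alive))
            while alive:
                alive.discard(pivot)
                d = self.distance(query, self.items[pivot])  # one real measurement
                if d < best_d:
                    best_i, best_d = pivot, d
                # |d(q,p) - d(p,x)| <= d(q,x): anything whose bound exceeds the
                # best distance so far cannot be the nearest neighbour.
                for x in list(alive):
                    lower[x] = max(lower[x], abs(d - self.D[pivot][x]))
                    if lower[x] > best_d:
                        alive.discard(x)
                if not alive:
                    break
                # Next pivot: the survivor with the smallest lower bound.
                pivot = min(alive, key=lambda x: lower[x])
            return self.items[best_i], best_d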

Results using Hamming distance between random binary strings are even worse - AESA averages 882.4 measurements in a dataset of 1000 items, while my algorithm averages 999.3...
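Setting that test up is trivial (a sketch; the 64-bit length is an arbitrary choice for illustration, and CountingMetric/nearest are from the sketches above):

    import random

    def hamming(a, b):
        """Number of positions at which two equal-length bit strings differ."""
        return sum(ca != cb for ca, cb in zip(a, b))

    def random_bitstring(length=64):
        return "".join(random.choice("01") for _ in range(length))

    # data = [random_bitstring() for _ in range(1000)]
    # metric = CountingMetric(hamming)
    # nearest(random_bitstring(), data, metric)
    # print(metric.calls, "measurements out of", len(data))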

How am I supposed to assess my algorithm if I can't find an interesting test case on which any algorithm performs even vaguely OK? Random points in a cube is not a good test, because there are already much better algorithms for the special case of vector spaces...
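For comparison, the vector special case really is well served: with coordinates to exploit, a k-d tree answers the random-points-in-a-cube query directly. An illustrative sketch using scipy's cKDTree - nothing like it is available once all you have is a black-box distance function:

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(1000, 3)             # random points in the unit cube
    tree = cKDTree(points)                       # build once
    dist, index = tree.query(np.random.rand(3))  # nearest neighbour of one query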

Date: 2004-04-05 08:05 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
What graph? There's just a bag of points (where a "point" might be a picture, or a string, or whatever); there are no explicit arcs connecting the points.

Here's an example application: you have pictures of a million faces, and a function that computes distance between pictures. You do some preprocessing. Later, you capture a picture of a face, and you want to find the picture in the database that most closely matches your capture, without explicitly comparing with all million.

You could store the distances between all points as AESA does, but the O(n^2) cost makes this pretty much prohibitive if you're matching against a few million points.

Date: 2004-04-05 09:30 am (UTC)
From: [identity profile] drdoug.livejournal.com
I was saying you generate the graph from the points by doing the sums, and just take the O(n^2) hit at setup. Adding a single new point to the graph is only O(n). (This does sound a lot like AESA but I'm responding off the cuff here so haven't checked I'm afraid.)

For your million photo example I'm guessing this would still be your best bet if you have enough hardware to throw at it and your distance metric isn't woefully expensive to calculate.

Another option for that example would be to come up with a quick 'candidate match' algorithm - maybe based on a small set of numbers you generate cheaply for each photo individually. When you want to match a new photo, you find a largish number of candidate matches using these numbers, and use your expensive picture-picture matching algorithm on the results. (I'd also want to fill in the lines on the graph at that point by storing the distances, but that's still O(n^2) and you'll struggle with six billion photos on any near future kit.) Of course this depends on coming up with some cunning 'candidate match' algorithm that works.
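Very roughly, something like this - a sketch where cheap_features and the candidate count are stand-ins for whatever actually works on the data, and where you give up any guarantee that the true nearest photo survives the first stage:

    def filter_and_refine(query, items, cheap_features, expensive_distance,
                          n_candidates=100):
        """Two-stage search: rank everything by a cheap feature-space distance,
        then run the expensive metric only on the best few candidates."""
        q = cheap_features(query)

        def cheap_distance(item):
            return sum((a - b) ** 2 for a, b in zip(q, cheap_features(item)))

        candidates = sorted(items, key=cheap_distance)[:n_candidates]
        return min(candidates, key=lambda item: expensive_distance(query, item))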

If your requirement is to find the closest matching photo(s), I don't see how you can avoid the O(n^2) hit at some point (and throughout in storage terms), unless there are some strong properties of your match algorithm that don't leap out instantly?

My guess is that the known search algorithms are crap because the problem is bloody hard. It's bad enough doing NP optimisation stuff where evaluating candidate solutions is cheap - if evaluating solutions is hard you're stuffed unless you know something about how your data is structured/tends to cluster.

All of this is dodging the problem as you stated it - solving the metric space search problem elegantly - but that's because my instinct is that if a problem is well known but not well solved I'm unlikely to find a solution myself in polynomial time :-) (This is also probably why you have some distinction as an inventor of algorithms and I have none!)

Date: 2004-04-05 09:40 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
The key property is that it's a metric space, so the triangle inequality holds: AC <= AB + BC for all points A, B, C in the set (writing XY for the distance from X to Y). You can use this to trim your search. How well depends on the properties of your data set...
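Concretely: if you've measured d(q,p) from the query to some pivot p, and you stored d(p,x) at setup time, then |d(q,p) - d(p,x)| <= d(q,x), so x can be discarded without ever measuring d(q,x) whenever that lower bound exceeds your search radius (or the best match so far). Tiny sketch:

    def can_discard(d_qp, d_px, radius):
        """True if x cannot lie within `radius` of the query q, given only
        d(q,p) and d(p,x) for some pivot p: d(q,x) >= |d(q,p) - d(p,x)|."""
        return abs(d_qp - d_px) > radius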

I think an algorithm something like what you describe is in the survey paper, but I can't remember the name...

Date: 2004-04-05 03:06 pm (UTC)
From: [identity profile] pavlos.livejournal.com
How would you go about convincing yourself that that inequality holds for a particular real-world application, like the face matching, for example?
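(The only brute-force check I can think of is to sample triples and test the inequality directly - which only gives evidence, not proof - something like:)

    import random

    def check_triangle_inequality(items, distance, samples=10000):
        """Sample random triples and return the first violation of
        d(a, c) <= d(a, b) + d(b, c), or None if every sample passes.
        Passing is evidence, not proof, that the measure is a metric here."""
        for _ in range(samples):
            a, b, c = random.sample(items, 3)
            if distance(a, c) > distance(a, b) + distance(b, c) + 1e-9:
                return (a, b, c)
        return None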
