ciphergoth: (tree)
[personal profile] ciphergoth
It's easy to see how to organise a database if you want to be able to, say, find all the people in it whose names are between "Charybdis" and "Scylla" in alphabetical order. But how would you organise it if you want to find all the people whose pictures look most like a particular sample picture, when all you have is a tool that measures how similar two pictures are? There's a family of very general approaches to this problem called "metric space searches"; these need to know nothing about what's being stored, except that we can measure the distance between two items and get a sensible answer.

I thought I'd invented a smashing new algorithm for searching in metric spaces. But I'm having a hard time finding out if it's any good.

Searching in Metric Spaces (1999) provides an overview of the state of the art as of five years ago.

The usual assumption when designing these algorithms is that the distance measurement function is very expensive, so the first order of business is to minimise the number of distance measurements. I implemented my algorithm and got very good results on random points in a cube. But when I tried a harder data set - random lines from "A Tale of Two Cities" with string edit distance as the metric, a standard test Brin used on his GNAT algorithm - I found it would compare the target string to over half the strings in the database before returning an answer.
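
(By "string edit distance" I mean plain Levenshtein distance. For concreteness, a minimal Python sketch of the metric:)

    def edit_distance(a, b):
        # Standard Levenshtein distance by dynamic programming: the minimum
        # number of single-character insertions, deletions and substitutions
        # needed to turn a into b. This is a metric.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # delete ca
                               cur[j - 1] + 1,             # insert cb
                               prev[j - 1] + (ca != cb)))  # substitute
            prev = cur
        return prev[-1]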

So I tried implementing one of the standard algorithms from the paper - AESA, a simple one that's supposed to be very good at minimising distance measurements at search time, at the cost of very high startup time and storage - and that doesn't do much better!
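
The core of AESA is simple enough to sketch. Roughly - and this is a simplified Python rendering of the idea, not the paper's exact bookkeeping:

    def aesa_nearest(points, pairwise, target, dist):
        # pairwise[p][q] holds the precomputed d(p, q) for every pair -
        # that's the O(n^2) setup and storage. Each round spends one real
        # distance measurement on a pivot, then uses the triangle
        # inequality to tighten a lower bound on every surviving candidate.
        alive = set(points)
        lower = {p: 0.0 for p in points}   # lower bound on d(target, p)
        best, best_d = None, float("inf")
        while alive:
            pivot = min(alive, key=lambda p: lower[p])  # most promising
            alive.discard(pivot)
            d = dist(target, pivot)        # the one expensive measurement
            if d < best_d:
                best, best_d = pivot, d
            for p in list(alive):
                lower[p] = max(lower[p], abs(d - pairwise[pivot][p]))
                if lower[p] >= best_d:
                    alive.discard(p)       # provably can't beat the best
        return best, best_d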

Results using Hamming distance between random binary strings are even worse - AESA averages 882.4 measurements in a dataset of 1000 items, while my algorithm averages 999.3...
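
(Hamming distance itself is trivial - and random bit strings are close to the worst case for any of these methods, since the pairwise distances all bunch up around n/2 and the triangle inequality gets almost no purchase:)

    def hamming(a, b):
        # Number of positions at which two equal-length strings differ.
        return sum(ca != cb for ca, cb in zip(a, b))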

How am I supposed to assess my algorithm if I can't find an interesting test case on which any algorithm performs even vaguely OK? Random points in a cube is not a good test, because there are already much better algorithms for the special case of vector spaces...

Date: 2004-04-03 01:23 am (UTC)
From: [identity profile] barking-watcher.livejournal.com
I don't know if this is going to be of any interest to you, but one of my lecturers at Brighton did a lot of research regarding picture storage and retrieval.

Some of the research focused on pattern recognition. The only link I was able to find is below.

http://www.lic.gov.uk/research/information_retrieval/ir-im-rr.html

Date: 2004-04-03 04:38 am (UTC)
babysimon: (Default)
From: [personal profile] babysimon
I read some of the paper and now my head is hurting.

I guess what you've discovered is that all the existing algorithms suck. I think this is probably why Google Image search doesn't have a "Find images similar to this one..." function.

From my brief read of AESA, I really don't see how it's supposed to be O(1) at all. It looks to me like its performance is heavily dependent on the value of r used - smaller r should eliminate many more elements of U on each iteration. What r are you using?

I'd be interested to hear what your algorithm is.

Date: 2004-04-03 04:59 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
I'm doing a "Nearest neighbour" search, so r starts at infinity and gets smaller as I go along. I suppose I should re-do the earlier tests to reject more points every time r gets smaller - only just thought of that.

By O(1) they mean about O(1) distance measurements - it's at least O(n) overall work, and that's not counting the O(n^2) initialisation and storage requirements.

My algorithm is a variant of GHT. The big ideas are
(*) at each node in the tree, you use one old point (one you already know the distance to) and one new point to define the dividing line between the two subtrees. But the old point can be any point already considered, not just one from the parent node; you choose the one that gives the biggest range of distances to the points in the subtree, then take the point furthest from it as the new point

(*) at each node in the tree, you record the range of distances from *all* points already considered to the points contained in the node, to maximise the chances you'll be able to prune that branch

(*) instead of hyperplanes as the dividing line, you use hyperhyperbolae, so as to divide the remaining points exactly in two. I.e. your test is not d[1] - d[0] < 0 but d[1] - d[0] < node.bias. This turned out to make a substantial positive difference

(*) I'm considering beefing up the pruning some more by adding hyperhyperbola-based pruning: i.e. recording the range of values of d_a - d_b over every point in the node, for all pairs of comparators a and b considered so far.

Er, that was all clear as mud, wasn't it. OK, here's another way to put it. The function that constructs a node takes two arguments: a set of points to put in the node, and a set of "comparator points". When the search method on the node is called, it's passed the target point, the range r, and the distance from the target point to each of the comparator points. It uses all these distances to try to "early reject". Failing that, it adds a new comparator point to the set, measures the distance from the target to the new point, and uses this new point together with an existing one to define a hyperbolic dividing surface. It decides which side of that surface the target point is on, searches the subtree corresponding to that side first, and then tries the other subtree, unless it can demonstrate that nothing on the other side can be close enough.
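
Maybe code is clearer than mud. Here's a stripped-down Python sketch of the scheme - not my actual implementation, the names are made up and the pruning is simplified to just the stored ranges:

    class Node:
        def __init__(self, points, comparators, dist):
            # points must be hashable (tuples, strings, ...)
            if not points:
                self.ranges, self.leaf = {}, []
                return
            # Range of distances from every comparator to our points,
            # recorded for pruning at search time.
            self.ranges = {c: (min(dist(c, x) for x in points),
                               max(dist(c, x) for x in points))
                           for c in comparators}
            if len(points) <= 2:
                self.leaf = points
                return
            self.leaf = None
            # Old point: the comparator with the widest spread of
            # distances. New point: the point in the node furthest from it.
            self.old = max(comparators,
                           key=lambda c: self.ranges[c][1] - self.ranges[c][0])
            self.new = max(points, key=lambda x: dist(self.old, x))
            rest = [x for x in points if x is not self.new]
            # Hyperbolic split: cut at the median of d(new, x) - d(old, x),
            # recording the cut value as the bias.
            rest.sort(key=lambda x: dist(self.new, x) - dist(self.old, x))
            mid = len(rest) // 2
            self.bias = dist(self.new, rest[mid]) - dist(self.old, rest[mid])
            comps = comparators + [self.new]
            self.left = Node(rest[:mid], comps, dist)
            self.right = Node(rest[mid:], comps, dist)

        def search(self, target, cdists, best, dist):
            # cdists: comparator -> d(target, comparator), measured so far.
            # best: [radius, point], mutated as better matches are found.
            # Early reject: if the target's distance to any comparator is
            # incompatible with the stored range, nothing here can win.
            for c, (lo, hi) in self.ranges.items():
                if cdists[c] + best[0] < lo or cdists[c] - best[0] > hi:
                    return
            if self.leaf is not None:
                for x in self.leaf:
                    d = dist(target, x)
                    if d < best[0]:
                        best[0], best[1] = d, x
                return
            d_new = dist(target, self.new)   # the one fresh measurement
            if d_new < best[0]:
                best[0], best[1] = d_new, self.new
            cdists = {**cdists, self.new: d_new}
            # Search the target's side of the hyperbolic boundary first;
            # the far side only survives if its own ranges can't rule it out.
            near, far = ((self.left, self.right)
                         if d_new - cdists[self.old] < self.bias
                         else (self.right, self.left))
            near.search(target, cdists, best, dist)
            far.search(target, cdists, best, dist)

    def nearest(points, target, dist):
        seed = points[0]                       # arbitrary first comparator
        root = Node(points[1:], [seed], dist)
        d0 = dist(target, seed)
        best = [d0, seed]                      # r shrinks as we go
        root.search(target, {seed: d0}, best, dist)
        return best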

Er, slightly clearer mud?

Date: 2004-04-03 05:13 am (UTC)
babysimon: (Default)
From: [personal profile] babysimon
By O(1) they mean about O(1) distance measurements

I still don't get it. AESA repeatedly measures distance between the pivot and the query node, rejecting a proportion of the data set each time. Not a constant proportion, so we can't be exact, but that sounds more like O(log n) to me. What am I missing?

Must read up on GHT before peering into your mud. Actually fuck that; must have bath, sort costume, talk to lovers etc...

Date: 2004-04-03 05:14 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
I would have thought it would be log(n) as well, but they say O(1) was determined experimentally - *shrug*...

Date: 2004-04-03 08:08 am (UTC)
From: [identity profile] eqe.livejournal.com
It's nonsense to determine O experimentally -- it's too easy for a constant or linear term to dominate a log term over all testable values. For example, you can sort integers in O(n) if you assume an upper bound on their size (say, 64-bit integers). I had a bit of a red face while trying to convince a Cray programmer that comparison sorting is inherently O(n log n) -- he just kept pointing at his sort function, which obviously had O(n) performance. And of course, he's right -- he can sort any useful set of integers in O(n) time. Doesn't change the fact that comparison sorting is, mathematically, O(n log n).
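
For instance, here's the bounded-integer trick as a toy counting sort - linear in n precisely because it never compares two elements:

    def counting_sort(xs, max_value):
        # O(n + k) for n integers in [0, max_value]: the key space is
        # bounded, so no comparisons are needed and the n log n lower
        # bound for comparison sorts doesn't apply.
        counts = [0] * (max_value + 1)
        for x in xs:
            counts[x] += 1
        out = []
        for value, count in enumerate(counts):
            out.extend([value] * count)
        return out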

"In theory, there's no difference between theory and practice."

Date: 2004-04-05 12:23 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
For the reasons [livejournal.com profile] pavlos outlines, the purely theoretical worst-case analysis always yields meaningless results, so you have to rely to some extent on experiment.

I don't suppose it can be O(1), but if it's, say, O(log log n) then it would be very hard to determine that by experiment, so they're doing the right thing by reporting what they observe as best they can. And it's a survey paper so there's a limit on how much detail they can go into.

Date: 2004-04-03 09:47 am (UTC)
From: [identity profile] webcowgirl.livejournal.com
I love it when you geek out.

Date: 2004-04-04 12:46 am (UTC)
From: [identity profile] pavlos.livejournal.com
Hmmm... I don't understand how/if you can evaluate such an algorithm generically, i.e. without reference to particular metrics. I take it that the worst-case runtime is something pretty boring, and you are looking for good average runtime for metrics of interest. This presumably depends on some statistical properties of your metric, such as dimensionality, continuity, distribution etc. Again I'm guessing that these properties are rarely explicit and often poorly understood, so failing that, your best bet is probably the experimental approach that you are taking.

Date: 2004-04-05 12:20 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
Worst-case is the metric where x != y => d(x, y) == 1, the "are we nearly there yet" metric. With that metric you always have to compare against half or all of the elements. So you are exactly right. And it's not just properties of your metric - it's statistical properties of your likely dataset too. E.g. if it's in n-dimensional space but tends to lie very close to a plane, then it will perform as if it lay in a 2-dimensional space.
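
(That metric, for concreteness:)

    def are_we_nearly_there_yet(x, y):
        # The discrete metric: every distinct pair is at distance 1, so
        # the triangle inequality gives no purchase and any search
        # degenerates into a linear scan.
        return 0 if x == y else 1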

Date: 2004-04-05 07:31 am (UTC)
From: [identity profile] drdoug.livejournal.com
Looking at this quickly, I'd naively say that for practical purposes, with your particular problem, you might well be better off walking the tree completely at setup time, so you have a graph with all the distances filled in already. Then at run time you just walk the graph, using the same approach as Dijkstra's algorithm, to get the nearest matches (stopping either when you've hit as many results as you want or when you've exceeded some closeness threshold).

Date: 2004-04-05 08:05 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
What graph? There's just a bag of points (where a "point" might be a picture, or a string, or whatever); there are no explicit arcs connecting the points.

Here's an example application: you have pictures of a million faces, and a function that computes distance between pictures. You do some preprocessing. Later, you capture a picture of a face, and you want to find the picture in the database that most closely matches your capture, without explicitly comparing with all million.

You could store the distances between all points as AESA does, but the O(n^2) cost makes this pretty much prohibitive if you're matching against a few million points.

Date: 2004-04-05 09:30 am (UTC)
From: [identity profile] drdoug.livejournal.com
I was saying you generate the graph from the points by doing the sums, and just take the O(n^2) hit at setup. Adding a single new point to the graph is only O(n). (This does sound a lot like AESA but I'm responding off the cuff here so haven't checked I'm afraid.)

For your million photo example I'm guessing this would still be your best bet if you have enough hardware to throw at it and your distance metric isn't woefully expensive to calculate.

Another option for that example would be to come up with a quick 'candidate match' algorithm - maybe based on a small set of numbers you generate cheaply for each photo individually. When you want to match a new photo, you find a largish number of candidate matches using these numbers, then use your expensive picture-picture matching algorithm on the results. (I'd also want to fill in the lines on the graph at that point by storing the distances, but that's still O(n^2) and you'll struggle with six billion photos on any near-future kit.) Of course this depends on coming up with some cunning 'candidate match' algorithm that works.
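
Very roughly, something like this - the 'features' here are pure hand-waving, just to show the two-pass shape, and of course the cheap pass can throw away the true nearest match:

    def cheap_features(photo, bands=4):
        # Hand-waved stand-in for a cheap per-photo signature: the mean
        # pixel value in each of a few bands of a flattened pixel list.
        # Assumes len(photo) >= bands; not a real image descriptor.
        n = len(photo) // bands
        return [sum(photo[i * n:(i + 1) * n]) / n for i in range(bands)]

    def best_match(db, query, expensive_dist, k=100):
        # Cheap pass: rank the whole database by signature similarity,
        # then run the expensive matcher on only the k best candidates.
        q = cheap_features(query)
        scored = sorted(db, key=lambda photo: sum(
            (a - b) ** 2 for a, b in zip(cheap_features(photo), q)))
        return min(scored[:k], key=lambda photo: expensive_dist(query, photo))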

If your requirement is to find the closest matching photo(s), I don't see how you can avoid the O(n^2) hit at some point (and throughout in storage terms), unless there are some strong properties of your match algorithm that don't leap out instantly?

My guess is that the known search algorithms are crap because the problem is bloody hard. It's bad enough doing NP optimisation stuff where evaluating candidate solutions is cheap - if evaluating solutions is hard you're stuffed unless you know something about how your data is structured/tends to cluster.

All of this is dodging the problem as you stated it - solving the metric space search problem elegantly - but that's because my instinct is that if a problem is well known but not well solved I'm unlikely to find a solution myself in polynomial time :-) (This is also probably why you have some distinction as an inventor of algorithms and I have none!)

Date: 2004-04-05 09:40 am (UTC)
From: [identity profile] ciphergoth.livejournal.com
The key property is that it's a metric space, so the triangle inequality AC <= AB + BC holds for all points A, B, C in the set. You can use this to trim your search. How well depends on the properties of your data set...
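
The basic trimming move looks like this (a sketch, with the distances from the pivot to every stored point assumed precomputed):

    def survivors(points, d_target_pivot, d_pivot, radius):
        # Having paid for one real measurement d(target, pivot), and with
        # d_pivot[x] = d(pivot, x) precomputed, the triangle inequality
        # makes |d(target, pivot) - d(pivot, x)| a lower bound on
        # d(target, x) - so anything whose bound exceeds the search
        # radius can be dropped without measuring it.
        return [x for x in points
                if abs(d_target_pivot - d_pivot[x]) <= radius]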

I think an algorithm something like what you describe is in the survey paper, but I can't remember the name...

Date: 2004-04-05 03:06 pm (UTC)
From: [identity profile] pavlos.livejournal.com
How would you go about convincing yourself that that inequality holds for a particular real-world application, like the face matching, for example?

Assessing your algorithm...

Date: 2004-04-07 06:42 pm (UTC)
From: [identity profile] meico.livejournal.com
I don't know of any standard battery of tests to run on metric space NN algorithms (I would suspect that none exist), but I can say that it sounds to me like the test cases you were using were rather incomplete. They were on opposite ends of the spectrum in terms of dimensionality but at the same end of the spectrum in terms of distribution properties (smoothness and structure).
  • Your cube case is intrinsically 3D (if your cube was 3D) and the distribution is totally even (no interesting structure), so this is nearly a pathological case.
  • Your random lines from "A Tale of Two Cities" test is (I would imagine) pretty close to n-dimensional (since the edit distances would be pretty large between random strings), but once again with a fairly even distribution.
Your test cases only cover two corners of the dimensionality/distribution property space. What it seems you need are just some more interesting test cases, starting with cases that test distributions that have structure:
  • Simple shapes like sines, a sombrero, or some other simple function embedded in whatever dimensional space you want to test (pick random coordinates for the function input and set the final dimension's value with the function - see the sketch below).
  • Points in a cube with a Perlin noise distribution (pick points in the space and discard the ones where the noise value is outside a certain threshold range - creating a nice marble-like structure).
  • Load in the points of a model file (say of a video game character).
In fact, now that I think about it, Perlin noise (or fBm) functions are ideal for testing all sorts of algorithms, since you can easily control how much or how little structure the space has. With simple input parameter changes you can control the dimensionality, structure, frequency, and fractal dimension of each test set. It should be rather easy to write a program which generates an extremely thorough collection of test sets, and these should give a very clear picture of each algorithm's comparative performance in different environments.
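
For the first bullet above, something as simple as this would do (a Python sketch; all the names and ranges are placeholders):

    import math, random

    def sombrero_points(n, dim, jitter=0.0):
        # Structured test set: random inputs in the first dim-1
        # coordinates, last coordinate set to sinc of the radius (the
        # 'sombrero'), plus optional Gaussian jitter.
        points = []
        for _ in range(n):
            coords = [random.uniform(-10.0, 10.0) for _ in range(dim - 1)]
            r = math.sqrt(sum(c * c for c in coords))
            coords.append((math.sin(r) / r if r else 1.0)
                          + random.gauss(0.0, jitter))
            points.append(tuple(coords))
        return points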

Heck, I'll even write the program to generate the test sets for you if you want me to...

Re: Assessing your algorithm...

Date: 2004-05-12 05:21 pm (UTC)
From: [identity profile] meico.livejournal.com
Well, I just put up the code for a data set generator on my home page...

Check it out and let me know if it is of any use to you. :)
