Previous month: November 14, 2004 - November 20, 2004
Next month: November 28, 2004 - December 4, 2004
N-gram string matching
As my game review collection has grown quite large (92 reviews now), I thought a search for game titles would a nice addition to the different means I have for visitors to find the reviews they want. As I'm a student of information studies, which includes a healthy dose of computer information retrieval, I knew I would be able to tackle the task.
I didn't want to do full-text indexing of the reviews, but make the game titles searchable. Most searches of this kind use simple keyword searching with truncation. That kind of search produces a list of titles which have the keyword in them and nothing else. The list can't be ordered by hit quality and even a slightest mistake makes the search fail. That, I believe, is not satisfactory. A course I took in cross-language information retrieval had equipped me with just the tools I needed to create a better system.
Cross-language information retrieval uses so called N-grams. N can be whatever, but digrams and trigrams are most used. For example, digrams of word "foobar" are *f, fo, oo, ob, ba, ar, r* where * denotes a padding space. Both the indexed words and the search key are broken into N-grams and then compared using a simple formula of intersection of N-grams in the phrases divided by sum of unique N-grams in the phrases. Let's have an example.
How well does foobar match with fubar? Digrams of the words are *f, fo, oo, ob, ba, ar, r* and *f, fu, ub, ba, ar, r*. The union (sum) of unique digrams in both phrases is *f, fo, oo, ob, fu, ub, ba, ar, r*. The intersection of the phrases' digrams is *f, ba, ar, r*. There are thus four digrams that appear in both phrases and nine different digrams that appear in at least one the phrases.
A simple division gives a score of 4/9 or 0.444... for the match between the phrases. That's pretty good, but still far from complete match score of 1.
With this method, it's easy to compare the search key to different indexed keys and find out which are the closest ones. It allows for small mistakes and variations in spelling (which is one reason why it's used in CLIR: N-gram matching can overcome different spellings of the same words in different languages). It also eliminates one problem with straight pattern matching: small keys are easy to search. Try searching BoardGameGeek for Go or Ra and you'll know what I mean. In this system, there's no problem.
Of course, working with the N-grams takes more computing power, which can be a problem. Also, two words can have the same N-grams but in completely different order. Still, I think N-gram matching is a good tool for simple title or name searches such as the one I'm doing.
I'm sharing the PHP code that does the tricks of forming the N-grams and comparing two sets of N-grams. The compare function is very elegant, thanks to good tools already implemented in PHP. You can find it here: ngram.php. If you're planning to do something with it, I'd appreciate if you'd drop me a note.
In addition to that, I have indexing code that goes through all the files in the directory that has the reviews, searching for HTML meta field that contains the title of the game - or titles, as each game can have several titles in different languages. File names, game titles and the N-grams are then stored in a MySQL database to be accessed by the search script.
Creating the search engine was very straighforward. I was surprised how quickly I got it all together. Having thought a lot about N-grams before helped a bit, obviously.
27.11.2004 klo 12:30
|
1 comment(s)
| TrackBack (0)
Qveere Eye For Thye Mediaeval Man
Qveere Eye For Thye Medieval Man is a really neat parody on everybody's favourite reality show Queer Eye for the Straight Guysta. It's a great show, a nice combination of educational and entertaining!
Thy hovel moste certainly is a pigstye!
(via Linko)
25.11.2004 klo 12:41
| Be the first to comment
| TrackBack (0)
Spam filtering survey
John Graham-Cunning, the mastermind behind POPFile (my favourite anti-spam tool) is doing a Spam Filtering Survey. If you have some spare time (John says about 20 minutes, but it didn't take me that long), go and do the survey.
The survey is a bit untypical. It has four pages of questions on spam filtering software and e-mail use and then 16 pages of e-mail to classify. Spam or ham? That takes a while and can be a bit boring, but advancement of spam filtering software is a cause noble enough to see the effort.
24.11.2004 klo 12:56
| Be the first to comment
| TrackBack (0)
21 Grams, Godsend, Day After Tomorrow, Dahmer
Here's some quick reviews on movies we've watched recently:
21 Grams - High recommendations, good reviews, Movielens says I'll enjoy it - but no, it fell very flat. The movie was interesting because it's so fragmentary - it took a while to understand how the three different storylines join together, but once that "click" happened, the movie suddenly became rather boring. It's the story - only reason it was interesting was the fragmentary way it was told.
Godsend - Eight-year old boy dies accidentally. Parents clone the boy, everything's fine for eight years and then, something bad starts to happen. Why? I thought the storyline was kind of interesting, but no, the movie was very boring. It's pitched as sci-fi thriller and it falls badly short on both accounts.
Day After Tomorrow - I like the concept: save the nature, or else... Very un-American, that's always refreshing. There are also very neat special effects, but that's where the good stuff ends. Rest of it is very typical and all the plot elements are familiar from other high-budget mainstream movies. Boring!
Dahmer - The story of the serial killer and sexual ritualistic cannibal Jeffrey Dahmer! This must be terrifying! But no! Instead of crazy murdering and ritualistic killings, we get psychological stuff about Dahmer's repressed homosexuality and bad child-parent relationship. Which is nice, unless you were expecting a horror movie. Once again: boring!
Looks like we've been watching boring movies recently. Hopefully we'll see something better soon!
22.11.2004 klo 13:37
| Be the first to comment
| TrackBack (0)