Previous month: October 2004
Next month: December 2004
Interesting blog
A friend of mine, Phil Schwarzmann has a blog called Finland for Thought. Phil is an American IT professional living in Finland. His blog is mostly about politics, current events and culture. I certainly don't agree with many of his views, but it's interesting nevertheless.
30.11.2004 klo 09:14
| Be the first to comment
| TrackBack (0)
N-gram string matching
As my game review collection has grown quite large (92 reviews now), I thought a search for game titles would a nice addition to the different means I have for visitors to find the reviews they want. As I'm a student of information studies, which includes a healthy dose of computer information retrieval, I knew I would be able to tackle the task.
I didn't want to do full-text indexing of the reviews, but make the game titles searchable. Most searches of this kind use simple keyword searching with truncation. That kind of search produces a list of titles which have the keyword in them and nothing else. The list can't be ordered by hit quality and even a slightest mistake makes the search fail. That, I believe, is not satisfactory. A course I took in cross-language information retrieval had equipped me with just the tools I needed to create a better system.
Cross-language information retrieval uses so called N-grams. N can be whatever, but digrams and trigrams are most used. For example, digrams of word "foobar" are *f, fo, oo, ob, ba, ar, r* where * denotes a padding space. Both the indexed words and the search key are broken into N-grams and then compared using a simple formula of intersection of N-grams in the phrases divided by sum of unique N-grams in the phrases. Let's have an example.
How well does foobar match with fubar? Digrams of the words are *f, fo, oo, ob, ba, ar, r* and *f, fu, ub, ba, ar, r*. The union (sum) of unique digrams in both phrases is *f, fo, oo, ob, fu, ub, ba, ar, r*. The intersection of the phrases' digrams is *f, ba, ar, r*. There are thus four digrams that appear in both phrases and nine different digrams that appear in at least one the phrases.
A simple division gives a score of 4/9 or 0.444... for the match between the phrases. That's pretty good, but still far from complete match score of 1.
With this method, it's easy to compare the search key to different indexed keys and find out which are the closest ones. It allows for small mistakes and variations in spelling (which is one reason why it's used in CLIR: N-gram matching can overcome different spellings of the same words in different languages). It also eliminates one problem with straight pattern matching: small keys are easy to search. Try searching BoardGameGeek for Go or Ra and you'll know what I mean. In this system, there's no problem.
Of course, working with the N-grams takes more computing power, which can be a problem. Also, two words can have the same N-grams but in completely different order. Still, I think N-gram matching is a good tool for simple title or name searches such as the one I'm doing.
I'm sharing the PHP code that does the tricks of forming the N-grams and comparing two sets of N-grams. The compare function is very elegant, thanks to good tools already implemented in PHP. You can find it here: ngram.php. If you're planning to do something with it, I'd appreciate if you'd drop me a note.
In addition to that, I have indexing code that goes through all the files in the directory that has the reviews, searching for HTML meta field that contains the title of the game - or titles, as each game can have several titles in different languages. File names, game titles and the N-grams are then stored in a MySQL database to be accessed by the search script.
Creating the search engine was very straighforward. I was surprised how quickly I got it all together. Having thought a lot about N-grams before helped a bit, obviously.
27.11.2004 klo 12:30
|
1 comment(s)
| TrackBack (0)
Qveere Eye For Thye Mediaeval Man
Qveere Eye For Thye Medieval Man is a really neat parody on everybody's favourite reality show Queer Eye for the Straight Guysta. It's a great show, a nice combination of educational and entertaining!
Thy hovel moste certainly is a pigstye!
(via Linko)
25.11.2004 klo 12:41
| Be the first to comment
| TrackBack (0)
Spam filtering survey
John Graham-Cunning, the mastermind behind POPFile (my favourite anti-spam tool) is doing a Spam Filtering Survey. If you have some spare time (John says about 20 minutes, but it didn't take me that long), go and do the survey.
The survey is a bit untypical. It has four pages of questions on spam filtering software and e-mail use and then 16 pages of e-mail to classify. Spam or ham? That takes a while and can be a bit boring, but advancement of spam filtering software is a cause noble enough to see the effort.
24.11.2004 klo 12:56
| Be the first to comment
| TrackBack (0)
21 Grams, Godsend, Day After Tomorrow, Dahmer
Here's some quick reviews on movies we've watched recently:
21 Grams - High recommendations, good reviews, Movielens says I'll enjoy it - but no, it fell very flat. The movie was interesting because it's so fragmentary - it took a while to understand how the three different storylines join together, but once that "click" happened, the movie suddenly became rather boring. It's the story - only reason it was interesting was the fragmentary way it was told.
Godsend - Eight-year old boy dies accidentally. Parents clone the boy, everything's fine for eight years and then, something bad starts to happen. Why? I thought the storyline was kind of interesting, but no, the movie was very boring. It's pitched as sci-fi thriller and it falls badly short on both accounts.
Day After Tomorrow - I like the concept: save the nature, or else... Very un-American, that's always refreshing. There are also very neat special effects, but that's where the good stuff ends. Rest of it is very typical and all the plot elements are familiar from other high-budget mainstream movies. Boring!
Dahmer - The story of the serial killer and sexual ritualistic cannibal Jeffrey Dahmer! This must be terrifying! But no! Instead of crazy murdering and ritualistic killings, we get psychological stuff about Dahmer's repressed homosexuality and bad child-parent relationship. Which is nice, unless you were expecting a horror movie. Once again: boring!
Looks like we've been watching boring movies recently. Hopefully we'll see something better soon!
22.11.2004 klo 13:37
| Be the first to comment
| TrackBack (0)
Secret Santa and the first snow
Once again, it's time for Secret Santa. Everyone who participates must have a personal website and an Amazon wishlist. The organizers randomly select a wishlist for everyone participating. You check out your victim's website and then buy something from his or her wishlist. Someone does the same for you. It's all very simple and great fun.
Yesterday we got some snow, but it was already over when I left work. Today it was snowing properly when I walked to work. How nice, and the timing with Secret Santa announcement was simply lovely.
20.11.2004 klo 10:45
| Be the first to comment
| TrackBack (0)
Theater and music
We've enjoyed lots of quality culture recently. First we saw Macbeth, a classic Shakespeare tragedy on power and greed. The acting was strong, the set gorgeous and everything simply well done. We adored it. I wish the Tampere theaters did more Shakespeare - I wouldn't mind seeing more of his work, especially the tragedies.
To-morrow, and to-morrow, and to-morrow, creeps in this petty pace from day to day, to the last syllable of recorded time; and all our yesterdays have lighted fools the way to dusty death. Out, out, brief candle! Life's but a walking shadow, a poor player that struts and frets his hour upon the stage, and then is heard no more; it is a tale told by an idiot, full of sound and fury, signifying nothing. (Macbeth, V.v)
Next play we saw was Wait Until Dark, best known for the 1967 movie starring Audrey Hepburn. The play was thrilling, especially towards the end when the theater went completely dark and the only light was a cigarette lighter waved by an actor. Very exciting! The story, about a doll full of drugs, a gang of criminals and a blind woman, was well weaved and intriguing. Another great piece of theater.
Marianne Faithfull is almost sixty, but still kicking strong. Her performance on Monday was strong and captivating. Thanks to the first-row seats, I suppose, but it was a really good show anyway. Johanna, who has already seen her twice before, really loved this one, even saying this was the best gig she's ever seen. I agree it was a great performance, but I think the Nick Cave show we saw in September was even better.
Anyway, Marianne did lots of good songs Johanna hadn't heard before and of course the best tracks from her new album, Before the Poison (which is, by the way, highly recommended). The band featured Barry Reynolds and Fernando Saunders. Saunders did a small opening number, playing few songs. Nice, but unnecessary delay.
17.11.2004 klo 13:47
| Be the first to comment
| TrackBack (0)
Changes under the hood
*Hop*. There we go. I made a move from Nucleus to Movable Type. Despite the new, nice version of Nucleus, Movable Type is simply superior blog platform. Especially as I bought the new 3.121 version, which is just smooth.
There are some technical issues which are a bit unfinished, but most of this blog should work. Syndicated readers should note that while the old RSS url of http://www.melankolia.net/xml-rss2en.php is supported for legacy reasons, I recommend moving to one of the new urls (index.xml, index.rdf or atom.xml).
The old blog will remain, also for legacy reasons. Old links should stay alive, kind of. If you have linked to http://www.melankolia.net/index.en.php, please update your links. Linking to http://www.melankolia.net/ is a better idea anyway.
17.11.2004 klo 12:00
| Be the first to comment
| TrackBack (0)
Thinking Machine
Thinking Machine 4 is a Chess-playing AI-slash-piece-of-art. It's not that good as a Chess AI - average player can beat it, but that's not what's interesting about it. While it thinks, it visualizes it's thought process. The result looks pretty (after all, it's a work of art) and is quite interesting, too.
(via Inconsequential Ruminations)
10.11.2004 klo 09:02
|
1 comment(s)
| TrackBack (0)
Book news
Some of my favourite authors have been busy lately. I recently bought Redemption Ark by Alastair Reynolds (when I noticed that Play sells books with very pleasant prices). Not that I've had time to read it! Then there's Absolution Gap and Diamond Dogs, Turquoise Days I don't even have and to make my lag even worse, he's went and published a new book! Century Rain has a really pretty name, that's one thing for sure.
Then - Johanna bought me Altered Carbon by Richard Morgan. I read it during our honeymoon and enjoyed it a lot and then got his next book, Broken Angels as a birthday gift from Johanna. Now I found out he has a third book out called Market Forces. And that's not all! Woken Furies is coming out in March - fortunately.
Oh, and Iain M. Banks - The Algebraist is his latest book. I've read all his books I own (except The Crow Road, but that's not a M. Banks book) so there's not a large pile of those waiting. I've actually read most of the books I don't own.
Then again, there's A Game of Thrones, which everybody likes, except me it seems. I've read most of it, but haven't touched the book in months. I doubt I'll ever finish it. Enjoying books is fine, but has it's downsides, especially if one prefers books that aren't generally available in libraries...
3.11.2004 klo 15:10
| Be the first to comment
| TrackBack (0)