Variable-length fuzzy hashes with Nilsimsa for "did you mean" correction
Or, why fuzzy hashing isn't helpful for improving a search engine. Welcome to another blog post about one of my special interests: search engines - specifically the implementation thereof :D
I've blogged about search engines before, where I looked at taking my existing search engine implementation to the next level by switching to a SQLite-based key-value datastore backend and stress-testing it with ~5M words. Still not satisfied, I'm now turning my attention to query correction. Have you ever made a typo in a search and seen something like this?
Surprisingly, this is actually quite challenging to achieve. While it's easy to determine that a word contains a typo, it's hard to determine what the correct version of that word is. Consider a wordlist like this:
apple
orange
pear
grape
pineapple
If the user entered something like pinneapple, it's obvious to us that the correct spelling would be pineapple - but to determine this algorithmically, you need an algorithm capable of determining how close 2 different words are to one another.
The most popular algorithm for this is the Levenshtein distance. Given 2 words a and b, it calculates the number of single-character edits required to turn a into b. For example, the edit distance between pinneapple and pineapple is 1.
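Handily, PHP ships with a built-in levenshtein() function, so we can check this directly:

<?php
// The distance is 1: a single 'n' has to be deleted
// to turn "pinneapple" into "pineapple".
echo(levenshtein("pinneapple", "pineapple") . "\n"); // prints 1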
This is useful, but on its own it still doesn't get us very far. With this, we'd have to calculate the Levenshtein distance between the typo and every word in the list. That could easily mean millions of words for a large wiki, so this approach is completely impractical.
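To make the problem concrete, here's a hypothetical sketch of what that brute-force approach would look like - fine for a 5-word list, hopeless for a 5-million-word one:

<?php
// Hypothetical brute-force correction: compare the typo against every
// word in the list and keep the closest match. This needs one
// levenshtein() call per word in the list for every single query.
function naive_correct(string $typo, array $wordlist) : string {
    $best = ""; $best_distance = PHP_INT_MAX;
    foreach($wordlist as $word) {
        $distance = levenshtein($typo, $word);
        if($distance < $best_distance) {
            $best = $word;
            $best_distance = $distance;
        }
    }
    return $best;
}

echo(naive_correct("pinneapple", [ "apple", "orange", "pear", "grape", "pineapple" ]) . "\n"); // pineapple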
To this end, we need a better idea. In this post, I'm going to talk about my first attempt at solving this problem. I feel it's important to document failures as well as successes, so this is part 1 of a 2 part series.
The first order of business is to track down a Nilsimsa implementation in PHP - since it doesn't come built-in, and it's pretty complicated to implement. Thankfully, this isn't too hard - I found this one on GitHub.
Nilsimsa is a fuzzy hashing algorithm. This means that if you hash 2 similar words, then you'll get 2 similar hashes:
Word | Hash |
---|---|
pinneapple | 020c2312000800920004880000200002618200017c1021108200421018000404 |
pineapple | 0204239242000042000428018000213364820000d02421100200400018080200256 |
If you look closely, you'll notice that the hashes are quite similar. My thinking is that if we vary the hash size, then words that are similar will have identical hashes, allowing the search space to be cut down significantly. The existing Nilsimsa implementation I've found doesn't support that though, so we'll need to alter it.
This didn't turn out to be too much of a problem. By removing some magic numbers and adding a class member variable, it seems to work like a charm:
(Can't view the above? Try this direct link.)
I removed the comparison functions since I'm not using them (yet?), and also added a static convenience method for generating hashes. If I end up using this for large quantities of hashes, I may come back to it and make it resettable, to avoid having to create a new object for every hash.
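For reference, the static convenience method gets called like this. I'm assuming the class file is called Nilsimsa.php here, but the signature - input string first, then the hash length in bits - is the one used throughout the rest of this post:

<?php
require_once("Nilsimsa.php"); // hypothetical filename - adjust to suit

// Generate progressively shorter hashes of the same word.
foreach([ 256, 128, 64, 32, 16, 8 ] as $bits) {
    echo(str_pad($bits, 4) . Nilsimsa::hash("pineapple", $bits) . "\n");
}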
With this, we can get the variable-length hashes we wanted:
256 0a200240020004a180810950040a00d033828480cd16043246180e54444060a5
128 3ba286c0cf1604b3c6990f54444a60f5
64 02880ed0c40204b1
32 060a04f0
16 06d2
8 06
The number there is the number of bits in the hash, and the hex value is the hash itself. The algorithm defaults to 256-bit hashes. Next, we need to determine which hash size works best. The easiest way to do this is to take a list of typos, hash each typo and its correction, and count how many pairs produce identical hashes.
Thankfully, there's a great dataset just for this purpose. Since it's formatted in CSV, we can download it and extract the typos and corrections in 1 go like this:
curl https://raw.githubusercontent.com/src-d/datasets/master/Typos/typos.csv | cut -d',' -f2-3 >typos.csv
There's also a much larger dataset, but that one is formatted as JSON objects and would require a bunch of processing to get it into a useful shape here. Since this is just a relatively quick test to get a feel for how well the idea works, I don't think it's crucial that we use the larger dataset just yet.
With the dataset downloaded, we can run our test. First, we need to read the file in line by line, once for each hash length we want to test:
<?php
$handle = fopen("typos.csv", "r");
$sizes = [ 256, 128, 64, 32, 16, 8 ];

foreach($sizes as $size) {
    fseek($handle, 0); // Jump back to the beginning
    fgets($handle); // Skip the first line since it's the header
    
    while(($line = fgets($handle)) !== false) {
        // Do something with the next line here
    }
}
PHP has an inbuilt function fgets() that reads the next line of input from a file handle, which is convenient here. Next, we need to actually compute the hashes and compare them:
<?php
// .....

$parts = explode(",", trim($line), 2);
if(strlen($parts[1]) < 3) {
    $skipped++;
    continue;
}

$hash_a = Nilsimsa::hash($parts[0], $size);
$hash_b = Nilsimsa::hash($parts[1], $size);

$count++;
if($hash_a == $hash_b) {
    $count_same++;
    $same[] = $parts;
}
else {
    $not_same[] = $parts;
}

echo("$count_same / $count ($skipped skipped)\r");

// .....
Finally, a bit of extra logic around the edges and we're ready for our test:
<?php
$handle = fopen("typos.csv", "r");
$line_count = lines_count($handle);
echo("$line_count lines total\n");

$sizes = [ 256, 128, 64, 32, 16, 8 ];
foreach($sizes as $size) {
    fseek($handle, 0); fgets($handle); // Skip the first line since it's the header
    $count = 0; $count_same = 0; $skipped = 0;
    $same = []; $not_same = [];
    
    while(($line = fgets($handle)) !== false) {
        $parts = explode(",", trim($line), 2);
        if(strlen($parts[1]) < 3) {
            $skipped++;
            continue;
        }
        
        $hash_a = Nilsimsa::hash($parts[0], $size);
        $hash_b = Nilsimsa::hash($parts[1], $size);
        
        $count++;
        if($hash_a == $hash_b) {
            $count_same++;
            $same[] = $parts;
        }
        else $not_same[] = $parts;
        
        echo("$count_same / $count ($skipped skipped)\r");
    }
    
    file_put_contents("$size-same.csv", implode("\n", array_map(function ($el) {
        return implode(",", $el);
    }, $same)));
    file_put_contents("$size-not-same.csv", implode("\n", array_map(function ($el) {
        return implode(",", $el);
    }, $not_same)));
    
    echo(str_pad($size, 10)."→ $count_same / $count (".round(($count_same/$count)*100, 2)."%), $skipped skipped\n");
}
I'm writing the pairs that hash the same and those that don't to separate files here for visual inspection. I'm also skipping words that are less than 3 characters long, and that lines_count() function is just a quick helper that counts the number of lines in a file for the progress indicator (if you write a \r without a \n to the terminal, the cursor jumps back to the beginning of the current line):
<?php
function lines_count($handle) : int {
    fseek($handle, 0);
    $count = 0;
    while(fgets($handle) !== false) $count++;
    return $count;
}
Unfortunately, the results of running the test aren't too promising. Even with the shortest hash the algorithm will generate without getting upset, only ~23% of typos generate the same hash as their correction:
7375 lines total
256 → 7 / 7322 (0.1%), 52 skipped
128 → 9 / 7322 (0.12%), 52 skipped
64 → 13 / 7322 (0.18%), 52 skipped
32 → 64 / 7322 (0.87%), 52 skipped
16 → 347 / 7322 (4.74%), 52 skipped
8 → 1689 / 7322 (23.07%), 52 skipped
Furthermore, digging deeper, with an 8-bit hash you start to get large numbers of different words that all share the same hash, which isn't ideal at all.
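That's to be expected: an 8-bit hash only has 256 possible values, so heavy collisions are guaranteed for any decent-sized wordlist. A quick hypothetical way to see this for yourself is to bucket a wordlist by its 8-bit hash and look at the bucket sizes (wordlist.txt here is a stand-in for whatever wordlist you have to hand):

<?php
// Hypothetical collision check: group a wordlist by its 8-bit Nilsimsa
// hash and report the largest bucket. With only 2^8 = 256 possible
// hash values, large buckets are unavoidable for any real wordlist.
$buckets = [];
foreach(file("wordlist.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $word) {
    $buckets[Nilsimsa::hash($word, 8)][] = $word;
}

$bucket_sizes = array_map("count", $buckets);
arsort($bucket_sizes);
echo(count($buckets) . " buckets, largest contains " . reset($bucket_sizes) . " words\n");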
A potential solution here would be to use the Hamming distance - basically counting the number of bits that differ between 2 binary strings - to determine which hashes are similar to one another, in much the same way that the Levenshtein distance does for words. But that doesn't really help either, because we'd then be left with a problem almost identical to the one we started with.
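For completeness, here's a minimal sketch of what that Hamming distance calculation could look like for 2 equal-length hex-encoded hashes (the helper function here is my own, not part of the Nilsimsa class):

<?php
// Minimal sketch: the Hamming distance between 2 equal-length hex
// strings, i.e. the number of bit positions at which they differ.
function hamming_distance(string $hex_a, string $hex_b) : int {
    $xor = hex2bin($hex_a) ^ hex2bin($hex_b); // byte-wise XOR of the raw bytes
    $distance = 0;
    foreach(str_split($xor) as $byte)
        $distance += substr_count(decbin(ord($byte)), "1"); // count the set bits
    return $distance;
}

// Arbitrary example values:
echo(hamming_distance("060a04f0", "070a04f1") . "\n"); // 2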
In the second part of this mini-series, I'm going to talk about how I ultimately solved this problem. While the algorithm I ended up using (a BK-Tree - more on them next time) is certainly not the most efficient out there (it's O(log n) if I understand it correctly), it's very simple to implement and much less complicated than SymSpell, which seems to be the most efficient algorithm around at the moment.
Additionally, I have been able to optimise said algorithm to return results for a 172K wordlist in ~110ms, which is fine for my purposes.
Found this interesting? Got another algorithm I should check out? Got confused somewhere along the way? Comment below!