Next Gen Search, Part 1: Backend Storage
I've got a bit of a thing about full-text search engines. I've talked about one in particular before for Pepperminty Wiki, and I was thinking about it again the other day - and how I could optimise it further.
If you haven't already, I do recommend reading my previous post on the curious question - as a number of things in this post might not make sense otherwise.
Between the time I wrote that last post and now, I've done quite a bit more digging into the root causes of that ~450ms search time, and I eventually determined that most of it was actually being spent deserialising the inverted index from the JSON file it's stored in back into memory.
This is horribly inefficient and is actually taking up 90% of that query time, so I got to thinking about what I could do about it.
My solution was multi-faceted. Alongside fixing that bottleneck, I separately developed a new search-term analysis system (which I've abbreviated to STAS, because it sounds cool :D) to add advanced query syntax, such as -dog to exclude pages that contain the word dog - and folded that in at the same time.
As of the time of typing, the work on STAS is still ongoing. This doesn't mean that I can't blog about the rest of it though! I've recently-ish come into contact with key-value data stores, such as Redis and LevelDB (and its fork RocksDB). They work rather like a C♯ dictionary or a browser's localStorage, in that they store values associated with unique keys (Redis is a bit of a special case though, for reasons that I won't get into here).
Such a data store would suit an inverted index surprisingly well. I devised a key system to represent an inverted index:
- The first key, |termindex|, is used to store a master list of words that have entries in the page index.
- The second key, term, is simply the word itself (e.g. cat, chicken, rocket, etc.), and stores a list of the ids of the pages that contain that word.
- The third and final key, term|pageid, is a word followed by a separator and finally the id of a page (e.g. bill|1, cat|45, boosters|69, etc.). These keys store the locations at which a particular word appears in a given document in the index.

A separate id index is needed to convert between a page id and its associated name - Pepperminty Wiki provides this functionality out-of-the-box.
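To make that a bit more concrete, here's a rough sketch of what a few entries might look like. The words, page ids, and word locations here are made up, and the exact shape of the values isn't important - the point is that each key maps to a small, self-contained chunk of the index:

|termindex|   →   ["cat", "rocket", "boosters"]
cat           →   [42, 45]
cat|45        →   [3, 27, 102]
boosters|69   →   [8]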
The more I thought about it, the more I liked it. If I use a key-value data store in this manner, I can store the values as JSON objects - and then I only have to deserialise the parts of the index that I actually use. Furthermore, adding this extra layer of abstraction allows for some clever trickery to speed things up even more.
The problem here is that Pepperminty Wiki is supposed to be portable, so I try not to use any heavy external libraries or depend on odd PHP modules that not everyone will have installed. While a LevelDB extension for PHP does exist, it's not installed by default - it's a PECL module, and those are awkward to install.
All isn't lost though, because it turns out that SQLite functions surprisingly well as a key-value data store:
CREATE TABLE store (
	key TEXT UNIQUE NOT NULL,
	value TEXT
);
Yes - it really is that simple! Now all we need is some code to create and interface with such a database. Some simple getters and setters should suffice!
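A minimal sketch of such a getter and setter, built on PDO's built-in SQLite driver, might look something like this (the class and method names here are placeholders of my own for illustration, not necessarily those used in Pepperminty Wiki):

class KeyValueStore {
	private $db;
	
	public function __construct(string $filename) {
		// PDO ships with SQLite support, so no extra extensions are needed
		$this->db = new PDO("sqlite:$filename");
		$this->db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
		$this->db->exec("CREATE TABLE IF NOT EXISTS store (key TEXT UNIQUE NOT NULL, value TEXT);");
	}
	
	// Fetch a value by key, decoding it from JSON; returns null if it doesn't exist
	public function get(string $key) {
		$statement = $this->db->prepare("SELECT value FROM store WHERE key = :key;");
		$statement->execute([ "key" => $key ]);
		$json = $statement->fetchColumn();
		return $json === false ? null : json_decode($json);
	}
	
	// Store a value against a key, encoding it to JSON as we go
	public function set(string $key, $value) {
		$statement = $this->db->prepare("INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value);");
		$statement->execute([ "key" => $key, "value" => json_encode($value) ]);
	}
}

Usage is then as simple as $store = new KeyValueStore("index.sqlite"); followed by $store->set("cat", [ 42, 45 ]); and $store->get("cat");.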
While this works, I quickly ran into several major problems:
- Every time I read from the data store I'm decoding JSON, which is expensive
- Every time I'm saving to the data store, I'm encoding to JSON, which is also expensive
- I'm reading and writing the same thing multiple times, which is very expensive
- Writing anything to the data store takes a long time
The (partial) solution here was to refactor it such that the JSON encoding and decoding is handled by the storage provider, and to implement a cache.
Such a cache could just be an associative array:
private $cache = [];
Then, to fetch a value, we can do this:
// If it's not in the cache, fetch it from the database and insert it
if(!isset($this->cache[$key])) {
	$this->cache[$key] = [
		"modified" => false,
		"value" => json_decode($this->query(
			"SELECT value FROM store WHERE key = :key;",
			[ "key" => $key ]
		)->fetchColumn())
	];
}
return $this->cache[$key]["value"];
Notice how each item in the cache is also an associative array. This is so that we can flag items that have been modified in memory, such that when we next sync to disk we can write multiple changes all at once in a single batch. That turns the setter into something like this:
if(!isset($this->cache[$key])) $this->cache[$key] = [];
$this->cache[$key]["value"] = $value;
$this->cache[$key]["modified"] = true;
Very cool! Now all we need is a function to batch-write all the changes to disk. This isn't hard either:
foreach($this->cache as $key => $value_data) {
	// If it wasn't modified, there's no point in saving it, is there?
	if(!$value_data["modified"])
		continue;
	
	$this->query(
		"INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value)",
		[
			"key" => $key,
			"value" => json_encode($value_data["value"])
		]
	);
}
I'll get onto the cogs and wheels behind that query() function a bit later in this post. It's one of those utility functions that's great to have around - I keep copying it from one project to the next.
Much better, but still not great. Why does it still take ages to write all the changes to disk?
Well, it turns out that by default SQLite wraps every INSERT in its own transaction. If we wrap our code in an explicit transaction, we can seriously boost the speed even further:
$this->db->beginTransaction();
// Do batch writing here
$this->db->commit();
Excellent (boomed the wizard).
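Putting the transaction and the batch-write loop together, the sync routine might end up looking something like this. This is a sketch rather than the exact code - note in particular that I reset the modified flag after writing, so a value doesn't get needlessly rewritten on the next sync:

// Flush all modified values in the cache to disk in a single transaction
public function sync() {
	$this->db->beginTransaction();
	foreach($this->cache as $key => $value_data) {
		// If it wasn't modified, there's no point in saving it
		if(!$value_data["modified"])
			continue;
		
		$this->query(
			"INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value)",
			[
				"key" => $key,
				"value" => json_encode($value_data["value"])
			]
		);
		// Mark it as clean again so it isn't rewritten next time
		$this->cache[$key]["modified"] = false;
	}
	$this->db->commit();
}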
But wait, there's more! The PHP PDO database driver supports prepared statements, which let us compile an SQL query once and then execute it repeatedly, substituting different values in each time. We use these already, but since we only ever run a small set of distinct SQL queries, we can squeeze out some additional performance by caching their prepared forms (preparing them is relatively computationally expensive, after all).
This is really easy to do. Let's create another associative array to store them in:
private $query_cache = [];
Then, we can alter the query() function to look like this:
/**
 * Makes a query against the database.
 * @param string $sql The (potentially parameterised) query to make.
 * @param array $variables Optional. The variables to substitute into the SQL query.
 * @return \PDOStatement The result of the query, as a PDOStatement.
 */
private function query(string $sql, array $variables = []) {
	// Add the query to the cache if it isn't already there
	if(!isset($this->query_cache[$sql]))
		$this->query_cache[$sql] = $this->db->prepare($sql);
	
	$this->query_cache[$sql]->execute($variables);
	return $this->query_cache[$sql]; // fetchColumn(), fetchAll(), etc. are defined on the statement, not the return value of execute()
}
If a prepared statement for the given SQL exists in the query cache already, we re-use it instead of preparing a brand-new one. This works especially well, since we perform a large number of queries using just a small set of distinct SQL statements. Get operations all use the same SQL query, so why not cache it?
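To round this off, here's a rough sketch of how the finished store might be driven from the indexing side of the search engine. The class and method names match the illustrative sketches above, and the value formats are simplified:

// Hypothetical example: indexing the word "cat" on page 45
$store = new KeyValueStore("search_index.sqlite");

// Add the word to the master term list if it isn't there already
$terms = $store->get("|termindex|") ?? [];
if(!in_array("cat", $terms)) {
	$terms[] = "cat";
	$store->set("|termindex|", $terms);
}

// Record which pages contain the word, and where it appears in this one
$pageids = $store->get("cat") ?? [];
if(!in_array(45, $pageids)) {
	$pageids[] = 45;
	$store->set("cat", $pageids);
}
$store->set("cat|45", [ 3, 27, 102 ]);

// Once the batch of changes is complete, flush them to disk in one go
$store->sync();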
This completes our work on the backend storage for the next-generation search system I've built for Pepperminty Wiki. Not only will it boost the speed of Pepperminty Wiki's search engine, but it's also a cool reusable component, which I can apply to other applications to give them a boost, too!
Next time, we'll take a look at generating the test data required to stress-test the system.
I also want to give an overview of the new search-term analysis system and associated changes to the page ranking system I implemented (and aspire to implement) too, but those may have to wait until another post (teaser: PageRank isn't really what I'm after).