Spam statistics are live!

I've blogged about spam a few times before, and as you might have guessed defending against it and analysing the statistics thereof is a bit of a hobby of mine. Since I first installed the comment key system (and then later upgraded) in 2015, I've been keeping a log of all the attempts to post spam comments on my blog. Currently it amounts to ~27K spam attempts, which is about ~14 comments per day overall(!) - so far too many to sort out manually!

This tracking system is based on mistakes. I have a number of defences in place, and each time a defence is tripped, the mistake is logged. For example, here are the mistake codes for some of my defences:

Code                 Meaning
website              A web address was entered (you'll notice you can't see a website address field in the comment form below - it's hidden to regular users)
shortcomment         The comment was too short
invalidkey           The comment key was invalid
http10notsupported   The request was made over HTTP 1.0 instead of HTTP 1.1+
invalidemail         The email address entered was invalid

These are the 5 leading causes of comment posting failures over the past month or so. Until recently, the system would only log the first defence that was tripped, leaving any other defences that might also have been tripped unrecorded. This saves on computational resources, but doesn't help the statistics I've been steadily gathering.

With the new system I implemented on the 12th June 2020, a comment is checked against all current defences - even if one of them has been tripped already, leading to some interesting new statistics. I've also implemented a quick little statistics calculation script, which is set to run every day via cron. The output thereof is public too, so you can view it here:

Failed comment statistics

Some particularly interesting things to note are the differences in the mistake histograms. There are 2 pairs thereof: one that tracks all the data, and another that only tracks the data recorded after 12th June 2020 (when I implemented the new mistake recording system).

From this, we can see that if we look at only the first mistake made, invalidkey catches more spammers out by a landslide. However, if we look at all the mistakes made, the website check wins out instead - this is because the invalidkey check happens before the website check, so it was skewing the results because the invalidkey defence is the first line of defence.

Also interesting is how comment spam numbers have grown over time in the spam-by-month histogram. Although it's a bit early to tell by that graph, there's a very clear peak around May / June, which I suspect are malicious actors attempting to gain an advantage from people who may not be able to moderate their content as closely due to recent happenings in the world.

I also notice that the overall amount of spam I receive has an upwards trend. I suspect this is due to more people knowing about my website since it's been around for longer.

Finally, I notice that in the average number of mistakes (after 2020-06-12) histogram, most spammers make at least 2 mistakes. Unfortunately there's also a significant percentage of spammers who make only a single mistake, so I can't yet relax the rules such that you need to make 2 or more mistakes to be considered a spammer.

Incidentally, it would appear that the most common pair of mistakes to make are shortcomment and website - perhaps this is an artefact of some specific scraping / spamming software? If I knew more in this area I suspect that it might be possible to identify the spammer given the mistakes they've made and perhaps their user agent.

This is, of course, a very rudimentary analysis of the data at hand. If you're interested, get in touch and I'm happy to consider sharing my dataset with you.

Avoiding accidental array mutation when iterating arrays in PHP

Pepperminty Wiki is written in PHP, and I've posted before about the search engine I've implemented for it that's powered by an inverted index. In this post, I want to talk about an anti-feature of PHP that doesn't behave the way you'd expect, and how to avoid running into the same problem I did.

To do this, let's introduce a simple example of the problem at work:

<?php
$arr = [];
for($i = 0; $i < 3; $i++) {
    $key = random_int(0, 2000);
    $arr[$key] = $i;
    echo("[init] key: $key, i: $i\n");
}

foreach($arr as $key => &$value) {
    // noop
}

echo("structure before: "); var_dump($arr);

foreach($arr as $key => $value) {
    echo("key: $key, i was $value\n");
}

echo("structure after: "); var_dump($arr);
?>

The above code initialises an associative array with 3 elements. The contents might look like this:

Key Value
469 0
1777 1
1685 2

Pretty simple so far. It then iterates over it twice: once referring to the values by reference (that's what the & there is for), and the second time referring to the items by value.

You'd expect the array to be identical before and after the second foreach loop, but you'd be wrong:

Key Value
469 0
1777 1
1685 1

Wait, what? That's very odd. What's going on here? How can a foreach loop that's iterating an array by value mutate an array? To understand why, let's take a step back for a moment. Here's another snippet:

<?php

$arr = [ 1, 2, 3 ];

foreach($arr as $key => $value) {
    echo("$key: $value\n");
}

echo("The last value was $key: $value\n");
?>

What do you expect to happen here? In Javascript, with a for..of loop and a let declaration, both variables would have fallen out of scope by now - but PHP's foreach statements don't create a new scope for their variables. Instead, they inherit the scope of their parent - e.g. the global scope in the above, or the containing function if the loop is defined inside one.

To this end, we can still access the values of both $key and $value in the above example even after the foreach loop has exited! Unexpected.

It gets better. Try prefixing $value with an ampersand & in the above example and re-running it - note that $key and $value are both still defined.
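Here's what that looks like for reference (the same snippet as above, but iterating by reference):

<?php

$arr = [ 1, 2, 3 ];

foreach($arr as $key => &$value) {
    echo("$key: $value\n");
}

// Both $key and $value are still defined here, despite the loop having exited
echo("The last value was $key: $value\n");
?>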

This leads us to why the unexpected behaviour occurs. Because of the way PHP's foreach loop is implemented, if we re-use the same variable name for $value in a subsequent loop, it ends up overwriting the value of the last item in the array.

Shockingly enough this is actually documented behaviour (see also this bug report), though I'm somewhat confused as to how it happens on the last element in the array instead of the first.

With this in mind, to avoid this problem in future if you iterate an array by reference with a foreach loop, always remember to unset() the $value, like so:

<?php
$arr = [];
for($i = 0; $i < 3; $i++) {
    $key = random_int(0, 2000);
    $arr[$key] = $i;
    echo("[init] key: $key, i: $i\n");
}

foreach($arr as $key => &$value) {
    // noop
}
unset($key); unset($value);

echo("structure before: "); var_dump($arr);

foreach($arr as $key => $value) {
    echo("key: $key, i was $value\n");
}

echo("structure after: "); var_dump($arr);
?>

By doing this, you can ensure that you don't accidentally mutate your arrays and spend weeks searching for the bug like I did.

It's language features like these that catch developers out - and being aware of the hows and whys of their occurrence can help you avoid them next time (if anyone can explain why it's the last element in the array that's affected instead of the first, I'd love to know!).

Regardless, although I'm aware of how challenging implementing a programming language is, programming language designers should take care to avoid surprising behaviour like this that catches developers off-guard.

Found this interesting? Comment below!


EmbedBox: Lightweight syntax-highlighted embeds

I was planning on posting about something else yesterday, but I wanted to show some GitLab code in a syntax-highlighted embed. When I wasn't able to figure out how to do that, I ended up writing EmbedBox.

The whole thing is best explained with an example. Have an embed:

(Can't see the above? Check out the original file here)

Pretty cool, right? The above is the default settings file for EmbedBox. Given any URL (e.g. https://raw.githubusercontent.com/sbrl/EmbedBox/master/src/settings.default.toml), it will generate a syntax-highlighted embed for it.

It does so using highlight.php to do the syntax-highlighting server-side, Stash PHP for the cache, and without any Javascript in the embed itself.

It comes with a web interface that generates the embed code given the input URL and a few other settings and shows a preview of what it'll look like.

EmbedBox is open-source too (under the Mozilla Public Licence 2.0), so you're welcome to set up your own instance!

To do so, check out the code here: https://github.com/sbrl/EmbedBox/

The installation instructions should be pretty straightforward in theory, but if you get stuck please open an issue.

Now that I've implemented EmbedBox, you can expect to see it appear in future blog posts. I'm planning to write about my organise-photos script in the near future, so expect a blog post about it soon.

Found this interesting? Got a suggestion? Want to say hi? Comment below!

Variable-length fuzzy hashes with Nilsimsa for did you mean correction

Or, why fuzzy hashing isn't helpful for improving a search engine. Welcome to another blog post about one of my special interests: search engines - specifically the implementation thereof :D

I've blogged about search engines before, in which I looked at taking my existing search engine implementation to the next level by switching to a SQLite-based key-value datastore backing and stress-testing it with ~5M words. Still not satisfied, I'm now turning my attention to query correction. Have you ever seen something like this when you make a typo in a search?

Surprisingly, this is actually quite challenging to achieve. While it's easy to determine if a word contains a typo, it's hard to determine what the correct version of the word should be. Consider a wordlist like this:

apple
orange
pear
grape
pineapple

If the user entered something like pinneapple, then it's obvious to us that the correct spelling would be pineapple - but in order to determine this algorithmically, you need an algorithm capable of determining how close 2 different words are to one another.

The most popular algorithm for this is called Levenshtein distance. Given 2 words a and b, it calculates the number of edits needed to turn a into b. For example, the edit distance between pinneapple and pineapple is 1.
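As an aside, PHP ships with a built-in levenshtein() function, so checking a single pair of words is trivial:

<?php
// levenshtein() is built into PHP - no library required
echo(levenshtein("pinneapple", "pineapple") . "\n"); // 1 - deleting the extra 'n' fixes the typo
?>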

This is useful, but it still doesn't help us very much. With this alone, we'd have to calculate the Levenshtein distance between the typo and every word in the list. This could easily run into millions of words for large wikis, so it's obviously completely impractical.

To this end, we need a better idea. In this post, I'm going to talk about my first attempt at solving this problem. I feel it's important to document failures as well as successes, so this is part 1 of a 2 part series.

The first order of business is to track down a Nilsimsa implementation in PHP - since it doesn't come built-in, and it's pretty complicated to implement. Thankfully, this isn't too hard - I found this one on GitHub.

Nilsimsa is a fuzzy hashing algorithm. This means that if you hash 2 similar words, then you'll get 2 similar hashes:

Word Hash
pinneapple 020c2312000800920004880000200002618200017c1021108200421018000404
pineapple 0204239242000042000428018000213364820000d02421100200400018080200256

If you look closely, you'll notice that the hashes are quite similar. My thinking is that if we vary the hash size, then words that are similar will have identical hashes, allowing the search space to be cut down significantly. The existing Nilsimsa implementation I've found doesn't support that though, so we'll need to alter it.

This didn't turn out to be too much of a problem. By removing some magic numbers and adding a class member variable, it seems to work like a charm:

(Can't view the above? Try this direct link.)

I removed the comparison functions since I'm not using them (yet?), and also added a static convenience method for generating hashes. If I end up using this for large quantities of hashes, I may come back to it and make it resettable, to avoid having to create a new object for every hash.

With this, we can get the variable-length hashes we wanted:

256       0a200240020004a180810950040a00d033828480cd16043246180e54444060a5
128       3ba286c0cf1604b3c6990f54444a60f5
64        02880ed0c40204b1
32        060a04f0
16        06d2
8         06

The number there is the number of bits in the hash, and the hex value is the hash itself. The algorithm defaults to 256-bit hashes. Next, we need to determine which sized hash is best. The easiest way to do this is to take a list of typos, hash the typo and the correction, and count the number of hashes that are identical.

Thankfully, there's a great dataset just for this purpose. Since it's formatted in CSV, we can download it and extract the typos and corrections in 1 go like this:

curl https://raw.githubusercontent.com/src-d/datasets/master/Typos/typos.csv | cut -d',' -f2-3 >typos.csv

There's also a much larger dataset too, but that one is formatted as JSON objects and would require a bunch of processing to get it into a format that would be useful here - and since this is just a relatively quick test to get a feel for how our idea works, I don't think it's too crucial that we use the larger dataset just yet.

With the dataset downloaded, we can run our test. First, we need to read the file in line-by-line for every hash length we want to test:

<?php
$handle = fopen("typos.csv", "r");

$sizes = [ 256, 128, 64, 32, 16, 8 ];
foreach($sizes as $size) {
    fseek($handle, 0); // Jump back to the beginning
    fgets($handle); // Skip the first line since it's the header

    while(($line = fgets($handle)) !== false) {
        // Do something with the next line here
    }
}

PHP has an inbuilt function fgets() which gets the next line of input from a file handle, which is convenient. Next, we need to actually do the hashes and compare them:

<?php

// .....

$parts = explode(",", trim($line), 2);
if(strlen($parts[1]) < 3) {
    $skipped++;
    continue;
}
$hash_a = Nilsimsa::hash($parts[0], $size);
$hash_b = Nilsimsa::hash($parts[1], $size);

$count++;
if($hash_a == $hash_b) {
    $count_same++;
    $same[] = $parts;
}
else {
    $not_same[] = $parts;
}
echo("$count_same / $count ($skipped skipped)\r");

// .....

Finally, a bit of extra logic around the edges and we're ready for our test:

<?php
$handle = fopen("typos.csv", "r");
$line_count = lines_count($handle);
echo("$line_count lines total\n");

$sizes = [ 256, 128, 64, 32, 16, 8 ];
foreach($sizes as $size) {
    fseek($handle, 0); fgets($handle); // Skip the first line since it's the header

    $count = 0; $count_same = 0; $skipped = 0;
    $same = []; $not_same = [];
    while(($line = fgets($handle)) !== false) {
        $parts = explode(",", trim($line), 2);
        if(strlen($parts[1]) < 3) {
            $skipped++;
            continue;
        }
        $hash_a = Nilsimsa::hash($parts[0], $size);
        $hash_b = Nilsimsa::hash($parts[1], $size);

        $count++;
        if($hash_a == $hash_b) {
            $count_same++;
            $same[] = $parts;
        }
        else $not_same[] = $parts;
        echo("$count_same / $count ($skipped skipped)\r");
    }

    file_put_contents("$size-same.csv", implode("\n", array_map(function ($el) {
        return implode(",", $el);
    }, $same)));
    file_put_contents("$size-not-same.csv", implode("\n", array_map(function ($el) {
        return implode(",", $el);
    }, $not_same)));

    echo(str_pad($size, 10)."→ $count_same / $count (".round(($count_same/$count)*100, 2)."%), $skipped skipped\n");
}

I'm writing the pairs that are the same and different to different files here for a visual inspection. I'm also skipping words that are less than 3 characters long, and that lines_count() function there is just a quick helper function for counting the number of lines in a file for the progress indicator (if you write a \r without a \n to the terminal, it'll reset to the beginning of the current line):

<?php
function lines_count($handle) : int {
    fseek($handle, 0);
    $count = 0;
    while(fgets($handle) !== false) $count++;
    return $count;
}

Unfortunately, the results of running the test aren't too promising. Even with the shortest hash the algorithm will generate without getting upset, only ~23% of typos generate the same hash as their correction:

7375 lines total
256       → 7 / 7322 (0.1%), 52 skipped
128       → 9 / 7322 (0.12%), 52 skipped
64        → 13 / 7322 (0.18%), 52 skipped
32        → 64 / 7322 (0.87%), 52 skipped
16        → 347 / 7322 (4.74%), 52 skipped
8         → 1689 / 7322 (23.07%), 52 skipped

Furthermore, digging deeper with an 8-bit hash, you start to get large numbers of different words that share the same hash, which isn't ideal at all.

A potential solution here would be to use the Hamming distance (basically, counting the number of bits that differ between two strings of binary) to determine which hashes are similar to each other, much like Levenshtein distance does for strings - but that still doesn't help us, as we'd then have a problem that's almost identical to the one we started with.
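For reference, calculating the Hamming distance between 2 hex-encoded hashes isn't hard - here's a quick sketch (the hamming_distance() function here is my own helper, not part of the Nilsimsa class):

<?php
// Counts the number of bits that differ between 2 equal-length hex-encoded hashes
function hamming_distance(string $hash_a, string $hash_b) : int {
    $bytes_a = hex2bin($hash_a);
    $bytes_b = hex2bin($hash_b);
    $distance = 0;
    for($i = 0; $i < strlen($bytes_a); $i++) {
        // XOR each byte pair together, then count the bits that are set
        $distance += substr_count(decbin(ord($bytes_a[$i]) ^ ord($bytes_b[$i])), "1");
    }
    return $distance;
}

echo(hamming_distance("06d2", "06f2") . "\n"); // 1 bit differs
?>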

In the second part of this mini-series, I'm going to talk about how I ultimately solved this problem. While the algorithm I ultimately used (a BK-Tree, more on them next time) is certainly not the most efficient out there (it's O(log n) if I understand it correctly), it's very simple to implement and is much less complicated than Symspell, which seems to be the most efficient algorithm that exists at the moment.

Additionally, I have been able to optimise said algorithm to return results for a 172K wordlist in ~110ms, which is fine for my purposes.

Found this interesting? Got another algorithm I should check out? Got confused somewhere along the way? Comment below!

Converting multiline text to an image in PHP

This post is late because I lost the post I had written when I tried to save it - I need to find a new markdown editor. Do let me know if you have any suggestions!

I was working on Pepperminty Wiki earlier, and as I was working on external diagram renderer support (coming soon! It's really cool) I needed to upgrade my errorimage() function to support multi-line text. Pepperminty Wiki's key feature is that it's portable and builds (with a custom build system) to a single file of PHP that's plug-and-play, so no nice easy libraries here!

Instead, said function uses GD to convert text to images. This is useful when you want to send back an error, but the client is expecting an image - because they've requested it with an <img src="" /> tag, which obviously can't display text.

To this end, I found myself wanting to add multi-line text support. This was quite an interesting task, because as errorimage() uses GD and Pepperminty Wiki needs to be portable, I can't use imagettftext() and use a nice font. Instead, it uses imagestring(), which draws things in a monospace font. This actually makes the whole proceeding much easier - especially since imagefontwidth() and imagefontheight() give us the width and height of a character respectively.

Here's what the function looked like before:

/**
 * Creates an images containing the specified text.
 * Useful for sending errors back to the client.
 * @package feature-upload
 * @param   string  $text           The text to include in the image.
 * @param   int     $target_size    The target width to aim for when creating
 *                                  the image.
 * @return  resource                The handle to the generated GD image.
 */
function errorimage($text, $target_size = null)
{
    $width = 640;
    $height = 480;

    if(!empty($target_size))
    {
        $width = $target_size;
        $height = $target_size * (2 / 3);
    }

    $image = imagecreatetruecolor($width, $height);
    imagefill($image, 0, 0, imagecolorallocate($image, 238, 232, 242)); // Set the background to #eee8f2
    $fontwidth = imagefontwidth(3);
    imagestring($image, 3,
        ($width / 2) - (($fontwidth * mb_strlen($text)) / 2),
        ($height / 2) - (imagefontheight(3) / 2),
        $text,
        imagecolorallocate($image, 17, 17, 17) // #111111
    );

    return $image;
}

...it's based on a Stack Overflow answer that I can no longer locate. It takes in a string of text, and draws an image with the specified size. This had to change too - since the text might not fit in the image. The awkward thing here is that I needed to maintain the existing support for the $target_size variable, making the code a bit messier than it needed to be.

To start here, let's define a few extra variables to hold some settings:

$width = 0;
$height = 0;
$border_size = 10; // in px; has no effect if $target_size isn't null
$line_spacing = 2; // in px
$font_size = 5; // 1 - 5

Looking better already! Variables like these will help us tune it later (I'm picky). The font size in PHP is a value from 1 to 5, with higher values corresponding to larger font sizes. The $border_size is the number of pixels around the text that we want to add as padding when we're in auto-sizing mode to make it look neater. The $line_spacing is the number of extra pixels of space we should add between lines to make the text look better.

So, about that image size. We'll need the size of a character for that:

$font_width = imagefontwidth($font_size);   // in px
$font_height = imagefontheight($font_size); // in px

....and we'll need to split the input text into a list of lines too:

$text_lines = array_map("trim", explode("\n", $text));

We use an array_map() call here to ensure we chop the whitespace off the end of each line, because strange whitespace characters lying around will result in odd characters appearing in the output image. If the target size is set, then calculating the actual size of the image is easy:

if(!empty($target_size)) {
    $width = $target_size;
    $height = $target_size * (2 / 3);
}

If not, then we'll have to do some fancier footwork to count the maximum number of characters on a line to find the width of the image, and the number of lines we have for the image height:

else {
    $height = count($text_lines) * $font_height + 
        (count($text_lines) - 1) * $line_spacing +
        $border_size * 2;
    foreach($text_lines as $line)
        $width = max($width, $font_width * mb_strlen($line));
    $width += $border_size * 2;
}

Here we don't forget about the line spacing either - to get the number of gaps between the lines, we need to take the number of lines minus one. We also add the border as an offset value to the width and height - multiplied by 2, because there's a border on both sides of the text.

Next, we need to create an image to draw the text to. This is largely the same as before:

$image = imagecreatetruecolor($width, $height);
imagefill($image, 0, 0, imagecolorallocate($image, 250, 249, 251)); // Set the background to #faf8fb

Now, we're ready to draw the text itself. This needs to now be done with a loop, because we've got multiple lines of text to draw - and imagestring() doesn't support that as we've discussed above. We also need to keep track of the index of the loop, so a temporary value is required:

$i = 0;
foreach($text_lines as $line) {
    // ....

    $i++;
}

With the loop in place, we can make the call to imagestring():

imagestring($image, $font_size,
    ($width / 2) - (($font_width * mb_strlen($line)) / 2),
    $border_size + $i * ($font_height + $line_spacing),
    $line,
    imagecolorallocate($image, 68, 39, 113) // #442772
);

This looks full of maths, but it's really quite simple. Let's break it down. Lines #2 and #3 there are the $(x, y)$ of the top-left corner at which the text should be drawn. Let's look at them in turn.

The $x$ co-ordinate (($width / 2) - (($font_width * mb_strlen($line)) / 2)) centres the text on the row. Basically, we take the centre of the image ($width / 2), and take away ½ of the text width - which we calculate by taking the number of characters on the line, multiplying it by the width of a single character, and dividing by 2.
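For example, with a hypothetical 300px-wide image and a 20-character line in a 9px-wide font, that comes out as (300 / 2) - ((9 × 20) / 2) = 150 - 90 = 60 pixels in from the left-hand edge.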

The $y$ co-ordinate ($border_size + $i * ($font_height + $line_spacing)) is slightly different, because we need to account for the border at the top of the image. We take the font height, add the line spacing, and multiply it by the index of the text line that we're drawing. Since values start from 0 here, this will have no effect for the first line of text that we process, and it'll be drawn at the top of the image. We add to this the border width, to avoid drawing it inside the border that we've allocated around the image.

Lastly, the imagecolorallocate() call there tells GD the colour that we want to draw the text in via RGB. I've added a comment there because my editor highlights certain colour formats with the actual colour they represent, which is cool.

All that's left here is to return the completed image:

return $image;

....then we can do something like this:

if(!empty($error)) {
    http_response_code(503);
    header("content-type: image/png");
    imagepng(errorimage("Error: Something went\nwrong!")); // Note: Don't ever send generic error messages like this one. It makes for a bad and frustrating user (and debugging) experience.
}

I'm including the completed upgrade to the function at the bottom of this blog post. Here's an example image that it can render:

Found this interesting? Done some refactoring of your own recently? Comment below!


(Can't see the above? Try a direct link.)

Next Gen Search, Part 1: Backend Storage

I've got a bit of a thing about full-text search engines. I've talked about one in particular before for Pepperminty Wiki, and I was thinking about it again the other day - and how I could optimise it further.

If you haven't already, I do recommend reading my previous post on the curious question - as a number of things in this post might not make sense otherwise.

Between the time I wrote that last post and now, I've done quite a bit more digging into the root causes of that ~450ms search time, and I eventually determined that most of it was actually being spent deserialising the inverted index from the JSON file it's stored in back into memory.

This is horribly inefficient and is actually taking up 90% of that query time, so I got to thinking about what I could do about it.

My solution was multi-faceted, as I also (separately) developed a new search-term analysis system (abbreviated to STAS, because it sounds cool :D) to add advanced query syntax, such as -dog to exclude pages that contain the word dog - which I folded in at the same time as fixing the existing bottleneck.

As of the time of typing, the work on STAS is still ongoing. This doesn't mean that I can't blog about the rest of it though! I've recently-ish come into contact with key-value data stores, such as Redis and LevelDB (and its successor RocksDB). They work rather like a C♯ dictionary or a browser's localStorage, in that they store values that are associated with unique keys (Redis is a bit of a special case though, for reasons that I won't get into here).

Such a data store would suit an inverted index surprisingly well. I devised a key system to represent an inverted index:

The devised key system in diagram form, explained below.

  • The first key, |termindex|, is used to store a master list of words that have entries in the page index.
  • The second key, term, is simply the word itself (e.g. cat, chicken, rocket, etc.), and stores a list of ids of the pages that contain that word.
  • The third and final key, term|pageid, is a word followed by a separator and finally the id of a page (e.g. bill|1, cat|45, boosters|69, etc.). These keys store the locations at which a particular word appears in a given document in the index. A separate id index is needed to convert between a page id and its associated name - Pepperminty Wiki provides this functionality out-of-the-box. There's a concrete example of what this might look like just below.
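To make that concrete, the store might end up containing entries along these lines (the exact shape of the values here is purely illustrative - the real serialised structures are whatever the indexer decides to store):

|termindex|   →  ["cat", "chicken", "rocket"]
cat           →  [1, 45]
cat|45        →  [3, 154, 217]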

The more I thought about it, the more I liked it. If I use a key-value data store in this manner, I can store the values as JSON objects - and then I only have to deserialise the parts of the index that I actually use. Furthermore, adding this extra layer of abstraction allows for some clever trickery to speed things up even more.

The problem here is that Pepperminty Wiki is supposed to be portable, so I try not to use any heavy external libraries or depend on odd PHP modules that not everyone will have installed. While a LevelDB extension for PHP does exist, it's not installed by default and it's a PECL module, which are awkward to install.

All isn't lost though, because it turns out that SQLite functions surprisingly well as a key-value data store:

CREATE TABLE store (
    key TEXT UNIQUE NOT NULL,
    value TEXT
);

Yes - it really is that simple! Now all we need is some code to create and interface with such a database. Some simple getters and setters should suffice!

(Can't see the above? Try a direct link.)
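If the embed above isn't visible, here's a minimal sketch of the idea (not the actual embedded class - the class and method names here are purely illustrative, and it assumes PDO with the SQLite driver):

<?php
// Illustrative only: a bare-bones string key-value store over the SQLite table above
class KeyValueStore {
    private $db;

    public function __construct(string $filename) {
        $this->db = new PDO("sqlite:$filename");
        $this->db->exec("CREATE TABLE IF NOT EXISTS store (key TEXT UNIQUE NOT NULL, value TEXT);");
    }

    /** Fetches the value associated with the given key, or null if it doesn't exist. */
    public function get(string $key) {
        $statement = $this->db->prepare("SELECT value FROM store WHERE key = :key;");
        $statement->execute([ "key" => $key ]);
        $result = $statement->fetchColumn();
        return $result === false ? null : $result;
    }

    /** Sets the given key to the given value, replacing any existing value. */
    public function set(string $key, string $value) : void {
        $statement = $this->db->prepare("INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value);");
        $statement->execute([ "key" => $key, "value" => $value ]);
    }
}
?>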

While this works, I quickly ran into several major problems:

  • Every time I read from the data store I'm decoding JSON, which is expensive
  • Every time I'm saving to the data store, I'm encoding to JSON, which is also expensive
  • I'm reading and writing the same thing multiple times, which is very expensive
  • Writing anything to the data store takes a long time

The (partial) solution here was to refactor it such that the json encoding is handled by the storage provider, and to implement a cache.

Such a cache could just be an associative array:

private $cache = [];

Then, to fetch a value, we can do this:

// If it's not in the cache, insert it
if(!isset($this->cache[$key])) {
    $this->cache[$key] = [ "modified" => false, "value" => json_decode($this->query(
        "SELECT value FROM store WHERE key = :key;",
        [ "key" => $key ]
    )->fetchColumn()) ];
}
return $this->cache[$key]["value"];

Notice how each item in the cache is also an associative array. This is so that we can flag items that have been modified in memory, such that when we next sync to disk we can write multiple changes all at once in a single batch. That turns the setter into something like this:

if(!isset($this->cache[$key])) $this->cache[$key] = [];
$this->cache[$key]["value"] = $value;
$this->cache[$key]["modified"] = true;

Very cool! Now all we need is a function to batch-write all the changes to disk. This isn't hard either:

foreach($this->cache as $key => $value_data) {
    // If it wasn't modified, there's no point in saving it, is there?
    if(!$value_data["modified"])
        continue;

    $this->query(
        "INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value)",
        [
            "key" => $key,
            "value" => json_encode($value_data["value"])
        ]
    );
}

I'll get onto the cogs and wheels behind that query() function a bit later in this post. It's one of those utility functions that are great to have around that I keep copying from one project to the next.

Much better, but still not great. Why does it still take ages to write all the changes to disk?

Well, it turns out that by default SQLite wraps every INSERT in its own transaction. If we wrap our code in an explicit transaction, we can seriously boost the speed even further:

$this->db->beginTransaction();

// Do batch writing here

$this->db->commit();

Excellent (boomed the wizard).

But wait, there's more! The PHP PDO database driver supports prepared statements, which is a fancy way of caching SQL statements and substituting values into them. We use these already, but since we only use a very limited number of SQL queries, we can squeak some additional performance out by caching them in their prepared forms (preparing them is relatively computationally expensive, after all).

This is really easy to do. Let's create another associative array to store them in:

private $query_cache = [];

Then, we can alter the query() function to look like this:

/**
 * Makes a query against the database.
 * @param   string  $sql        The (potentially parametised) query to make.
 * @param   array   $variables  Optional. The variables to substitute into the SQL query.
 * @return  \PDOStatement       The result of the query, as a PDOStatement.
 */
private function query(string $sql, array $variables = []) {
    // Add to the query cache if it doesn't exist
    if(!isset($this->query_cache[$sql]))
        $this->query_cache[$sql] = $this->db->prepare($sql);
    $this->query_cache[$sql]->execute($variables);
    return $this->query_cache[$sql]; // fetchColumn(), fetchAll(), etc. are defined on the statement, not the return value of execute()
}

If a prepared statement for the given SQL exists in the query cache already, we re-use that again instead of preparing a brand-new one. This works especially well, since we perform a large number of queries with the same small set of SQL queries. Get operations all use the same SQL query, so why not cache it?

This completes our work on the backend storage for the next-generation search system I've built for Pepperminty Wiki. Not only will it boost the speed of Pepperminty Wiki's search engine, but it's also a cool reusable component, which I can apply to other applications to give them a boost, too!

Next time, we'll take a look at generating the test data required to stress-test the system.

I also want to give an overview of the new search-term analysis system and associated changes to the page ranking system I implemented (and aspire to implement) too, but those may have to wait until another post (teaser: PageRank isn't really what I'm after).

(Can't see the above? Try a direct link.)


Generating Atom 1.0 Feeds with PHP (the proper way)

I've generated Atom feeds in PHP before, but recently I went on the hunt to discover if PHP has something like C♯'s XMLWriter class - and it turns out it does! Although poorly documented (probably why I didn't find it in the first place :P), it's actually quite logical and easy to pick up.

To this end, I thought I'd blog about how I used it to write the Atom 1.0 feed generator for recent changes on Pepperminty Wiki that I've recently implemented (coming soon in the next release!), as it's so much cleaner than the atom.gen.php that I blogged about before! It's safer too - as all the escaping is handled automatically by PHP - so there's no risk of an injection attack because I forgot to escape a character in my library code.

It ends up being a bit verbose, but a few helper methods (or a wrapper class?) should alleviate this nicely - I might refactor it at a later date.

To begin, we need to create an instance of the aptly-named XMLWriter class. It's probable that you'll need the php-xml package installed in order to use this.

$xml = new XMLWriter();
$xml->openMemory();
$xml->setIndent(true); $xml->setIndentString("\t");
$xml->startDocument("1.0", "utf-8");

In short, the above creates a new XMLWriter instance, instructs it to write the XML to memory, enables pretty-printing of the output, and writes the standard XML document header.

With the document created, we can begin to generate the Atom feed. To figure out the format (I couldn't remember from when I wrote atom.gen.php - that was ages ago :P), I ended up following this helpful guide on atomenabled.org. It seems familiar - I think I might have used it when I wrote atom.gen.php. To start, we need something like this:

<feed xmlns="http://www.w3.org/2005/Atom">
    ......
</feed>

In PHP, that translates to this:

$xml->startElement("feed");
$xml->writeAttribute("xmlns", "http://www.w3.org/2005/Atom");

$xml->endElement(); // </feed>

Next, we probably want to advertise how the Atom feed was generated. Useful for letting the curious know what a website is powered by, and for spotting usage of your code out in the wild!

Since I'm writing this for Pepperminty Wiki, I'm settling on something not unlike this:

<generator uri="https://github.com/sbrl/Pepperminty-Wiki/" version="v0.18-dev">Pepperminty Wiki</generator>

In PHP, this translates to this:

$xml->startElement("generator");
$xml->writeAttribute("uri", "https://github.com/sbrl/Pepperminty-Wiki/");
$xml->writeAttribute("version", $version); // A variable defined elsewhere in Pepperminty Wiki
$xml->text("Pepperminty Wiki");
$xml->endElement();

Next, we need to add a <link rel="self" /> tag. This informs clients as to where the feed was fetched from, and the canonical URL of the feed. I've done this:

$xml->startElement("link");
$xml->writeAttribute("rel", "self");
$xml->writeAttribute("type", "application/atom+xml");
$xml->writeAttribute("href", full_url());
$xml->endElement();

That full_url() function is from StackOverflow, and calculates the full URI that was used to make a request. As Pepperminty Wiki can be run from any directory on any server, I can't pre-determine this url - hence the complexity.

Note also that I output type="application/atom+xml" here. This specifies the type of content that can be found at the supplied URL. The idea here is that if you represent the same data in different ways, you can advertise them all in a standard way, with other formats having rel="alternate". Pepperminty Wiki does this - generating the recent changes list in HTML, CSV, and JSON in addition to the new Atom feed I'm blogging about here (the idea is to make the wiki data as accessible and easy-to-parse as possible). Let's advertise those too:

$xml->startElement("link");
$xml->writeAttribute("rel", "alternate");
$xml->writeAttribute("type", "text/html");
$xml->writeAttribute("href", "$full_url_stem?action=recent-changes&format=html");
$xml->endElement();

$xml->startElement("link");
$xml->writeAttribute("rel", "alternate");
$xml->writeAttribute("type", "application/json");
$xml->writeAttribute("href", "$full_url_stem?action=recent-changes&format=json");
$xml->endElement();

$xml->startElement("link");
$xml->writeAttribute("rel", "alternate");
$xml->writeAttribute("type", "text/csv");
$xml->writeAttribute("href", "$full_url_stem?action=recent-changes&format=csv");
$xml->endElement();

Before we can output the articles themselves, there are a few more pieces of metadata left on our laundry list - namely <updated />, <id />, <icon />, <title />, and <subtitle />. There are others in the documentation too, but aren't essential (as far as I can tell) - and not appropriate in this specific case. Here's what they might look like:

<updated>2019-02-02T21:23:43+00:00</updated>
<id>https://wiki.bobsrockets.com/?action=recent-changes&amp;format=atom</id>
<icon>https://wiki.bobsrockets.com/rocket_logo.png</icon>
<title>Bob's Wiki - Recent Changes</title>
<subtitle>Recent Changes on Bob's Wiki</subtitle>

The <updated /> tag specifies when the feed was last updated. It's unclear as to whether it's the date/time the last change was made to the feed or the date/time the feed was generated, so I've gone with the latter. If this isn't correct, please let me know and I'll change it.

The <id /> element can contain anything, but it must be a globally-unique string that identifies this feed. I've seen other feeds use the canonical url - and I've gone to the trouble of calculating it for the <link rel="self" /> - so it seems a shame to not use it here too.

The remaining elements (<icon />, <title />, and <subtitle />) are pretty self explanatory - although it's worth mentioning that the icon must be square apparently. Let's whip those up with some more PHP:

$xml->writeElement("updated", date(DateTime::ATOM));
$xml->writeElement("id", full_url());
$xml->writeElement("icon", $settings->favicon);
$xml->writeElement("title", "$settings->sitename - Recent Changes");
$xml->writeElement("subtitle", "Recent Changes on $settings->sitename");

PHP even has a built-in constant for generating a date string in the correct format required by the spec :D $settings is an object containing the wiki settings that's a parsed form of peppermint.json, and contains useful things like the wiki's name, icon, etc.

Finally, with all the preamble done, we can turn to the articles themselves. In the case of Pepperminty Wiki, the final result will look something like this:

<entry>
    <title type="text">Edit to Compute Core by Sean</title>
    <id>https://seanssatellites.co.uk/wiki/?page=Compute%20Core</id>
    <updated>2019-01-29T10:21:43+00:00</updated>
    <content type="html">&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  Sean&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Compute Core&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Tue, 29 Jan 2019 10:21:43 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 1.36kb&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +1&lt;/li&gt;
&lt;/ul&gt;</content>
    <link rel="alternate" type="text/html" href="https://seanssatellites.co.uk/wiki/?page=Compute%20Core"/>
    <author>
        <name>Sean</name>
        <uri>https://seanssatellites.co.uk/wiki/?page=Users%2FSean</uri>
    </author>
</entry>

There are a bunch of elements here that deserve attention:

  • <title /> - The title of the article. Easy peasy!
  • <id /> - Just like the id of the feed itself, each article entry needs an id too. Here I've followed the same system I used for the feed, and given the url of the page content.
  • <updated /> - The last time the article was updated. Since this is part of a feed of recent changes, I've got this information readily at hand.
  • <content /> - The content to display. If the content is HTML, it must be escaped and type="html" present to indicate this.
  • <link rel="alternate" /> Same deal as above, but on an article-by-article level. In this case, it should link to the page the article content is from. In this case, I link to the page & revision of the change in question. In other cases, you might link to the blog post in question for example.
  • <author /> - Can contain <name />, <uri />, and <email />, and should indicate the author of the content. In this case, I use the name of the user that made the change, along with a link to their user page.

Here's all that in PHP:

foreach($recent_changes as $recent_change) {
    if(empty($recent_change->type))
        $recent_change->type = "edit";

    $xml->startElement("entry");

    // Change types: revert, edit, deletion, move, upload, comment
    $type = $recent_change->type;
    $url = "$full_url_stem?page=".rawurlencode($recent_change->page);

    $content = ".......";

    $xml->startElement("title");
    $xml->writeAttribute("type", "text");
    $xml->text("$type $recent_change->page by $recent_change->user");
    $xml->endElement();

    $xml->writeElement("id", $url);
    $xml->writeElement("updated", date(DateTime::ATOM, $recent_change->timestamp));

    $xml->startElement("content");
    $xml->writeAttribute("type", "html");
    $xml->text($content);
    $xml->endElement();

    $xml->startElement("link");
    $xml->writeAttribute("rel", "alternate");
    $xml->writeAttribute("type", "text/html");
    $xml->writeAttribute("href", $url);
    $xml->endElement();

    $xml->startElement("author");
    $xml->writeElement("name", $recent_change->user);
    $xml->writeElement("uri", "$full_url_stem?page=".rawurlencode("$settings->user_page_prefix/$recent_change->user"));
    $xml->endElement();

    $xml->endElement();
}

I've omitted the logic that generates the value of the <content /> tag, as it's not really relevant here (you can check it out here if you're curious :D).

This about finishes the XML we need to generate for our feed. To extract the XML from the XMLWriter, we can do this:

$atom_feed = $xml->flush();

Then we can do whatever we want to with the generated XML!

When the latest version of Pepperminty Wiki comes out, you'll be able to see a live demo here! Until then, you'll need to download a copy of the latest master version and experiment with it yourself. I'll also include a complete demo feed below:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <generator uri="https://github.com/sbrl/Pepperminty-Wiki/" version="v0.18-dev">Pepperminty Wiki</generator>
    <link rel="self" type="application/atom+xml" href="http://[::]:35623/?action=recent-changes&amp;format=atom&amp;count=3"/>
    <link rel="alternate" type="text/html" href="http://[::]:35623/?action=recent-changes&amp;format=html"/>
    <link rel="alternate" type="application/json" href="http://[::]:35623/?action=recent-changes&amp;format=json"/>
    <link rel="alternate" type="text/csv" href="http://[::]:35623/?action=recent-changes&amp;format=csv"/>
    <updated>2019-02-03T17:25:10+00:00</updated>
    <id>http://[::]:35623/?action=recent-changes&amp;format=atom&amp;count=3</id>
    <icon>data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAMAAAAoLQ9TAAAB3VBMVEXhERHbKCjeVVXjb2/kR0fhKirdHBziDg6qAADaHh7qLy/pdXXUNzfMAADYPj7ZPDzUNzfbHx/fERHpamrqMTHgExPdHx/bLCzhLS3fVFTjT0/ibm7kRkbiLi7aKirdISHeFBTqNDTpeHjgERHYJCTVODjYQkLaPj6/AADVOTnpbW3cIyPdFRXcJCThMjLiTU3ibW3fVVXaKyvcERH4ODj+8fH/////fHz+Fxf4KSn0UFD/CAj/AAD/Xl7/wMD/EhL//v70xMT/+Pj/iYn/HBz/g4P/IyP/Kyv/7Oz0QUH/9PT/+vr/ior/Dg7/vr7/aGj/QED/bGz/AQH/ERH/Jib/R0f/goL/0dH/qan/YWH/7e3/Cwv4R0f/MTH/enr/vLz/u7v/cHD/oKD/n5//aWn+9/f/k5P/0tL/trb/QUH/cXH/dHT/wsL/DQ3/p6f/DAz/1dX/XV3/kpL/i4v/Vlb/2Nj/9/f/pKT+7Oz/V1f/iIj/jIz/r6//Zmb/lZX/j4//T0//Dw/4MzP/GBj/+fn/o6P/TEz/xMT/b2//Tk7/OTn/HR3/hIT/ODj/Y2P/CQn/ZGT/6Oj0UlL/Gxv//f3/Bwf/YmL/6+v0w8P/Cgr/tbX0QkL+9fX4Pz/qNzd0dFHLAAAAAXRSTlMAQObYZgAAAAFiS0dEAIgFHUgAAAAJcEhZcwAACxMAAAsTAQCanBgAAAAHdElNRQfeCxINNSdmw510AAAA5ElEQVQYGQXBzSuDAQCA8eexKXOwmSZepa1JiPJxsJOrCwcnuchBjg4O/gr7D9zk4uAgJzvuMgcTpYxaUZvSm5mUj7TX7ycAqvoLIJBwStVbP0Hom1Z/ejoxrbaR1Jz6nWinbKWttGRgMSSjanPktRY6mB9WtRNTn7Ilh7LxnNpKq2/x5LnBitfz+hx0qxUaxhZ6vwqq9bx6f2XXvuUl9SVQS38NR7cvln3v15tZ9bQpuWDtZN3Lgh5DWJex3Y+z1KrVhw21+CiM74WZo83DiXq0dVBDYNJkFEU7WrwDAZhRtQrwDzwKQbT6GboLAAAAAElFTkSuQmCC</icon>
    <title>Pepperminty Wiki - Recent Changes</title>
    <subtitle>Recent Changes on Pepperminty Wiki</subtitle>
    <entry>
        <title type="text">Edit to Internal link by admin</title>
        <id>http://[::]:35623/?page=Internal%20link</id>
        <updated>2019-01-29T19:55:08+00:00</updated>
        <content type="html">&lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  admin&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Internal link&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Tue, 29 Jan 2019 19:55:08 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 2.11kb&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +2007&lt;/li&gt;
&lt;/ul&gt;</content>
        <link rel="alternate" type="text/html" href="http://[::]:35623/?page=Internal%20link"/>
        <author>
            <name>admin</name>
            <uri>http://[::]:35623/?page=Users%2FInternal%20link</uri>
        </author>
    </entry>
    <entry>
        <title type="text">Edit to Main Page by admin</title>
        <id>http://[::]:35623/?page=Main%20Page</id>
        <updated>2019-01-05T20:14:07+00:00</updated>
        <content type="html">&lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  admin&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Main Page&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Sat, 05 Jan 2019 20:14:07 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 317b&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +68&lt;/li&gt;
&lt;/ul&gt;</content>
        <link rel="alternate" type="text/html" href="http://[::]:35623/?page=Main%20Page"/>
        <author>
            <name>admin</name>
            <uri>http://[::]:35623/?page=Users%2FMain%20Page</uri>
        </author>
    </entry>
    <entry>
        <title type="text">Edit to Main Page by admin</title>
        <id>http://[::]:35623/?page=Main%20Page</id>
        <updated>2019-01-05T17:53:08+00:00</updated>
        <content type="html">&lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  admin&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Main Page&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Sat, 05 Jan 2019 17:53:08 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 249b&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +31&lt;/li&gt;
&lt;/ul&gt;</content>
        <link rel="alternate" type="text/html" href="http://[::]:35623/?page=Main%20Page"/>
        <author>
            <name>admin</name>
            <uri>http://[::]:35623/?page=Users%2FMain%20Page</uri>
        </author>
    </entry>
</feed>

...this is from my local development instance.

Found this interesting? Confused about something? Want to say hi? Comment below!

Using libsodium to upgrade the comment key system

I've blogged about the comment key system I utilise on this blog to prevent spam before (see also). Today, I've given it another upgrade to make it harder for spammers to fake a comment key!

In the last post, I transformed the comment key with a number of reversible operations - including a simple XOR password system. This is, of course, very insecure - especially since an attacker knows (or can at least guess) the content of the encrypted key, making it trivial (I suspect) to guess the password used for 'encryption'.

The solution here, obviously, is to utilise a better encryption system. Since it's the 5th November and I'm not particularly keen on my website going poof like the fireworks tonight, let's do something about it! PHP 7.2+ comes with native libsodium support (those still using older versions of PHP can still follow along! Simply install the PECL module). libsodium bills itself as

A modern, portable, easy to use crypto library.

After my experiences investigating it, I'd certainly say that's true (although the official PHP documentation could do with, erm, existing). I used this documentation instead - it's quite ironic, because I have actually base64-encoded the password...

Anyway, after doing some digging I found the quick reference, which explains how you can go about accomplishing common tasks with libsodium. For my comment key system, I want to encrypt my timestamp with a password - so I wanted the sodium_crypto_secretbox() and its associated sodium_crypto_secretbox_open() functions.

This pair of functions, when used together, implement a secure symmetric key encryption system. In other words, they securely encrypt a lump of data with a password. They do, however, have 2 requirements that must be taken care of. Firstly, the password must be of a specific length. This is easy enough to accomplish, as PHP is kind enough to tell us how long it needs to be:

/**
 * Generates a new random comment key system password.
 * @return string   The base64-encoded password.
 */
function key_generate_password() {
    return base64_encode_safe(random_bytes(SODIUM_CRYPTO_SECRETBOX_KEYBYTES));
}

Easy-peasy! base64_encode_safe isn't a built-in function - it's a utility function I wrote that we'll need later. For consistency, I've used it here too. Here it is, along with its counterpart:

/**
 * Encodes the given data with base64, but replaces characters that could 
 * get mangled in transit with safer equivalents.
 * @param   mixed   $data   The data to encode.
 * @return  string  The data, encoded as transmit-safe base64.
 */
function base64_encode_safe($data) {
    return str_replace(["/", "+"], ["-", "_"], base64_encode($data));
}
/**
 * Decodes data encoded with base64_encode_safe().
 * @param   mixed   $base64     The data to decode.
 * @return  string  The data, decoded from transmit-safe base64.
 */
function base64_decode_safe($base64) {
    return base64_decode(str_replace(["-", "_"], ["/", "+"], $base64));
}

With that taken care of, we can look at the other requirement: a nonce. Standing for Number used ONCE, it's a sequence of random bytes that's used by the encryption algorithm. We don't need to keep it a secret, but we do need it to decrypt the data again at the other end, in addition to the password - and we do need to ensure that we generate a new one for every comment key. Thankfully, this is also easy to do, in a similar manner to generating a password:

$nonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);

With everything in place, we can look at upgrading the comment key generator itself. It looks much simpler now:

/**
 * Generates a new comment key.
 * It's mainly to better support verification in complex setups and 
 * serve as another annoying hurdle for spammers.
 * @param  string $pass The password to encrypt with. Should be a base64-encoded string from key_generate_password().
 * @return string       A new comment key stamped at the current time.
 */
function key_generate($pass) {
    $pass = base64_decode_safe($pass);
    $nonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);
    $encrypt = sodium_crypto_secretbox(
        strval(time()), // The thing we want to encrypt
        $nonce, // The nonce
        $pass // The password to encrypt with
    );

    return base64_encode_safe($nonce.$encrypt);
}

I bundle the nonce with the encrypted data here to ensure that the system continues to be stateless (i.e. we don't need to store any state information about a user on the server). I also encode the encrypted string with base64, as the encrypted strings contain lots of nasty characters (it's actually a binary byte array I suspect). This produces keys like this:

BOqDvr26XY9s8PhlmIZMIp6xCOZyfsh6J05S4Jp2cY3bL8ccf_oRgRMldrmzKk6RrnA=
Tt8H81tkJEqiJt-RvIstA_vz13LS8vjLgriSAvc1n5iKwHuEKjW93IMITikdOwr5-NY=
5NPvHg-l1GgcQ9ZbThZH7ixfKGqAtSBr5ggOFbN_wFRjo3OeJSWAcFNhQulYEVkzukI=

They are a bit long, but not unmanageable. In theory, I could make it a bit shorter by avoiding converting the integer output from time() to a string, but in my testing it didn't make much difference. I suspect that there's some sort of minimum length to the output string for some (probably good) reason.


php > var_dump(sodium_crypto_secretbox(time(), random_bytes(24), random_bytes(32)));
string(26) "GW$:���ߌ@]�+1b��������d%"
php > var_dump(sodium_crypto_secretbox(strval(time()), random_bytes(24), random_bytes(32)));
string(26) ":_}0H �E°9��Kn�p��ͧ��"

Now that we're encrypting comment keys properly, it isn't much good unless we can decrypt them as well! Let's do that too. The first step is to decode the base64 and re-split the nonce from the encrypted data:

$pass = base64_decode_safe($pass);
$key_enc_raw = base64_decode_safe($key);
$nonce = mb_substr($key_enc_raw, 0, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, "8bit");
$key_enc = mb_substr($key_enc_raw, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, null, "8bit");

This is derived from the example code. With the 2 separated once more, we can decrypt the key:

$key_dec = sodium_crypto_secretbox_open($key_enc, $nonce, $pass);

Apparently, according to the example code on the website I linked to above, if the return value isn't a string then the decryption failed. We should make sure we handle that when returning:

if(!is_string($key_dec))
    return null;
return intval($key_dec);

That completes the decryption code. Here it is in full:

/**
 * Decodes the given key.
 * @param  string $key  The key to decode.
 * @param  string $pass The password to decode it with.
 * @return int|null     The time that the key was generated, or null if the provided key was invalid.
 */
function key_decode($key, $pass) {
    $pass = base64_decode_safe($pass);
    $key_enc_raw = base64_decode_safe($key);
    $nonce = mb_substr($key_enc_raw, 0, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, "8bit");
    $key_enc = mb_substr($key_enc_raw, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, null, "8bit");
    $key_dec = sodium_crypto_secretbox_open($key_enc, $nonce, $pass);
    if(!is_string($key_dec)) return null;
    return intval($key_dec);
}

Explicitly returning null in key_decode() requires a small change to key_verify(), in order to prevent it from thinking that a key is really old if the decryption fails (null is treated as 0 in arithmetic operations, apparently). Let's update key_verify() to handle that:

/**
 * Verifies a key.
 * @param   string  $key        The key to decode and verify.
 * @param   string  $pass       The password to decode the key with.
 * @param   int     $min_age    The minimum required age for the key.
 * @param   int     $max_age    The maximum allowed age for the key.
 * @return  bool                Whether the key is valid or not.
 */
function key_verify($key, $pass, $min_age, $max_age) {
    $key_time = key_decode($key, $pass);
    if($key_time == null) return false;
    $age = time() - $key_time;
    return $age >= $min_age && $age <= $max_age;
}

A simple check is all that's needed. With that, the system is all updated! Time to generate a new password with the provided function and put it to use. You can do that directly from the PHP REPL (php -a):


php > require("lib/comment_system/comment_key.php");
php > var_dump(key_generate_password());
string(44) "S0bJpph5htdaKHZdVfo9FB6O4jOSzr3xZZ94c2Qcn44="

Obviously, this isn't the password I'm using on my blog :P
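
To sanity-check that everything fits together, here's a quick hypothetical round-trip from the same REPL session. Note that the real system enforces a non-zero minimum age, so I've relaxed that to 0 here purely for the demonstration - the expected results are shown as comments:

php > require("lib/comment_system/comment_key.php");
php > $pass = key_generate_password();
php > $key = key_generate($pass);
php > var_dump(key_decode($key, $pass));          // int(...) - the unix timestamp the key was generated at
php > var_dump(key_verify($key, $pass, 0, 3600)); // bool(true) - between 0 seconds and 1 hour old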

I've got a few more ideas I'd like to play around with to deter spammers even more whilst maintaining the current transparency - so you may see another post soon on this subject :-)

With everything complete, here's the complete script:

(Above: The full comment key system code. Can't see it? Check it out on GitHub Gist here.)

Found this useful? Got an improvement? Confused about something? Comment below!

Sources and Further Reading

Placeholder Image Generator

A 1920x1080 white-text-on-black placeholder image, covered with a bunch of brightly coloured debug crosses and coloured text in different (meaningful?) places.

(Above: An example output with debugging turned on from my placeholder generation service)

For quite a considerable amount of time now, I've been running my own placeholder image generation service here at starbeamrainbowlabs.com - complete with etag generation and custom colour selection. Although it's somewhat of an Easter Egg, it's not actually that hard to find if you know what you're looking for (hint: I use it on my home page, but you may need to view source to find it).

I decided to post about it now because I've just finished fixing the angle GET parameter - and it was interesting (and hard) enough to warrant a post to remind myself how I did it for future reference. Comment below if you knew about it before this blog post came out!

A randomly coloured placeholder image.

The script itself is split into 3 loose parts:

  • The initial settings / argument parsing
  • The polyfills and utility functions
  • The image generation.

My aim here was to keep the script contained within a single file - so that it fits well in a gist (hence why it currently looks a bit messy!). It's probably best if I show the code in full:

(Having trouble viewing code above? Visit it directly here: placeholder.php)

If you append ?help to the url, it will send back a plain text file detailing the different supported parameters and what they do:


(Can't see the above? View it directly here)

Aside from implementing the random value for fg_colour and bg_colour, the angle has been a huge pain. I use GD - a graphics drawing library that's bundled with practically every PHP installation ever - to draw the placeholder image, and when you ask it to draw some rotated text, it decides that it's going to pivot it around the bottom-left corner instead of the centre.

Naturally, this creates some complications. Though some people on the PHP manual page for said method (imagettftext) have attempted to correct for this (exhibits a and b), neither of their solutions worked for me. I don't know if it's because their code is just really old (13 and 5 years respectively as of the time of typing) or what.

Another randomly coloured placeholder image - this time thinner and with the text on a slight angle, with the text Enough! instead of the dimensions of the generated image.

Anyway, I finally decided recently that enough was enough - and set about fixing it myself. Basing my code on the latter of the 2 pre-existing solutions, I tried fixing it - but ended up deleting most of it and starting again. It did give me some idea as to how to solve the problem though: all I needed to do was find the centre of where the text would be drawn both without any rotation and with it, and correct for the difference (these centres are represented by the green and blue crosses respectively on the image at the top of this post).

PHP does provide a method for calculating the bounding box of some prospective text that you're thinking of drawing via imagettfbbox. Finding the centre of the box though sounded like a horrible maths-y problem that would take me ages to work through on a whiteboard first, but thankfully (after some searching around) Wikipedia had a really neat solution for finding the central point of any set of points.

It calls it the centroid, and claims that the geometric centre of a set of points is simply the average of all the points involved. It just so happens that the geometric centre is precisely what I'm after:

$$ C = \frac{a + b + c + d + \dots}{N} $$

...Where $C$ is the geometric centre of the shape as an $\binom{x}{y}$ vector, $a$, $b$, $c$, $d$, etc. are the input points as vectors, and $N$ is the number of points in question. This was nice and easy to program:

// Calculate the bounding box (and its centre) of the text when it's not rotated
$bbox = imagettfbbox($size, 0, $fontfile, $text);
$orig_centre = [
    ($bbox[0] + $bbox[2] + $bbox[4] + $bbox[6]) / 4, // Average of the 4 corner x co-ordinates
    ($bbox[1] + $bbox[3] + $bbox[5] + $bbox[7]) / 4  // Average of the 4 corner y co-ordinates
];
$text_width = $bbox[2] - $bbox[0];
$text_height = $bbox[3] - $bbox[5];

// Do the same for the text when it *is* rotated by $angle
$bbox = imagettfbbox($size, $angle, $fontfile, $text);
$rot_centre = [
    ($bbox[0] + $bbox[2] + $bbox[4] + $bbox[6]) / 4,
    ($bbox[1] + $bbox[3] + $bbox[5] + $bbox[7]) / 4
];

The even-numbered indexes of $bbox there refer to the $x$ co-ordinates of the 4 corners of the bounding box, and the odd-numbered indexes to the $y$ co-ordinates - imagettfbbox outputs all the co-ordinates in 1 single flat array for some reason. I also calculate the original width and height of the text - in order to perform an additional corrective translation later.

With these in hand, I could calculate the actual position I need to draw it (indicated by the yellow cross in the image at the top of this post):

$$ \Delta = C_{rotated} - C_{original} $$

$$ pivot = \binom{pos_x - \left( \Delta_x + \frac{text_{width}}{2} \right)}{pos_y - \Delta_y + \frac{text_{height}}{2}} $$

In short, I calculate the distance between the 2 centre points calculated above, and then find the pivot point by taking this distance plus half the size of the text from the original central position we wanted to draw the text at. Here's that in PHP:

// Calculate the x and y shifting needed
$dx = $rot_centre[0] - $orig_centre[0];
$dy = $rot_centre[1] - $orig_centre[1];

// New pivot point
$px = $x - ($dx + $text_width/2);
$py = $y - $dy + $text_height/2;
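
With the corrected pivot point in hand, the actual drawing call is then just the standard GD one - something like this sketch, where $image and $fg_colour are assumed to exist from earlier in the script:

// Draw the text rotated by $angle degrees, pivoting around the corrected point
imagettftext($image, $size, $angle, (int) round($px), (int) round($py), $fg_colour, $fontfile, $text);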

I generated a video of it in action (totally not because I thought it would look cool :P) with a combination of curl and ffmpeg:

(Having issues with the above video? Try downloading it directly, or viewing it on YouTube)

This was really quite easy to generate. In a new folder, I executed the following:

echo {0..360} | xargs -P3 -d' ' -n1 -I{} curl 'https://starbeamrainbowlabs.com/images/placeholder/?width=1920&height=1920&fg_colour=inverse&bg_colour=white&angle={}' -o {}.jpeg
ffmpeg -r 60 -i %d.jpeg -q:v 3 /tmp/text-rotation.webm

The first command downloads all the required frames (3 at a time), and the second stitches them together. The -q:v 3 bit of the ffmpeg command is of note - by default webm videos apparently have a really low quality - this corrects that. Lower is better, apparently - it goes up to about 40 I seem to remember reading somewhere. 3 to 5 is supposed to be the right range to get it to look ok without using too much disk space.

The word Conclusion in blue on a yellow background, at a 20 degree angle - generated with my placeholder image generation service :D

That's about everything I wanted to talk about to remind myself about what I did. Let me know if there's anything you're confused about in the comments below, and I'll explain it in more detail.

(Found this interesting? Comment below!)

Sources and Further Reading

Search Engine Optimisation: The curious question of efficiency

For one reason or another I found myself a few days ago inspecting the code behind Pepperminty Wiki's full-text search engine. What I found was interesting enough that I thought I'd blog about it.

Forget about that kind of Search Engine Optimisation (the horrible click-baity kind - if there's enough interest I'll blog about my thoughts there too) and cue the appropriate music - we're going on a field trip fraught with the perils of Unicode, page ids, transliteration, and more!

Firstly, I should probably mention a little about the starting point. The (personal) wiki that is exhibiting the slowness has ~75K words spread across 546 pages. Pepperminty Wiki manages to search all of this in about 2.8 seconds by way of an inverted index. If you haven't read my last post, you should do so now - it sets the stage for this one, and you'll be rather confused otherwise.

2.8 seconds is far too slow though. Let's do something about it! In order to do something about it, there are several other things that need explaining before I can show you what I did to optimise it. Let's look at Pepperminty Wiki's search system first. It's best explained with the aid of a diagram:

An SVG diagram explaining how Pepperminty Wiki's search system works. A textual explanation is given below.

In short, every page has a numerical id, which is tracked by the ids core class. The search system interacts with this during the indexing phase (that's a topic for another blog post) and during the query phase. The query phase works something like this:

  1. The inverted index is loaded from disk (in my personal wiki the inverted index is ~968k, and loads in ~128ms)
  2. The inverted index is queried for pages that match the tokenised query terms.
  3. The results returned from the query are ranked and sorted before being returned.
  4. A context is extracted from the source of each page in the results returned - just like Duck Duck Go or Google show a bit of text below the title of each result.
  5. Said context has the search terms highlighted.

It sounds complicated, but it really isn't. The complicated bit comes when I tried to optimise it. To start with, I analysed how long each of the above steps were taking. The results were quite surprising:

  • Step #1 took ~128ms on average
  • Steps #2 & #3 took ~1200ms on average
  • Step #4 took ~1500ms on average(!)
  • Step #5 took a negligible amount of time

I did this by setting headers on the returned page. Timing things in PHP is relatively easy:

$start_time = microtime(true);

// Do work here

$end_time = microtime(true);

$time_taken_ms = round(($end_time - $start_time )*1000, 3);
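
To actually get at those timings on a live page, one approach is to attach them as custom response headers - something along these lines, with a made-up header name just for illustration:

// Send the timing to the client as a custom HTTP header. This has to happen
// before any page output is sent, as headers can't be set afterwards.
header("x-time-context-extraction: {$time_taken_ms}ms");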

This gave me a general idea as to what needed attention. I was surprised to learn that the context extractor was taking most of the time. At first, I thought that my weird and probably inefficient algorithm was to blame. There's no way it should be taking 1500ms!

So I set to work rewriting it to make it more optimal. Firstly, I tried something like this. Instead of multiple sub-loops, I figured out a way to do it with just 1 for loop and a few calls to mb_stripos_all().
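
mb_stripos_all() isn't a PHP built-in - it's another helper from Pepperminty Wiki itself. The real implementation may differ, but the rough idea (name and signature assumed here) is something like this:

/**
 * Finds every occurrence of $needle in $haystack, case-insensitively.
 * Sketch only - the real helper in Pepperminty Wiki may differ.
 * @param  string $haystack The string to search through.
 * @param  string $needle   The substring to search for.
 * @return int[]            The offsets of each match found (empty array if none).
 */
function mb_stripos_all(string $haystack, string $needle) : array {
    $result = [];
    $offset = 0;
    while(($pos = mb_stripos($haystack, $needle, $offset)) !== false) {
        $result[] = $pos;
        $offset = $pos + 1;
    }
    return $result;
}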

Unfortunately, this rewrite did not have the desired effect. While it did shave about 50ms off the total time, it was far from what I'd hoped for. I tried refactoring it slightly again to use preg_match_all(), but that still didn't give me the speed boost I was after - only another 50ms or so.

To get some answers, I brought out the big guns and profiled it with XDebug.

Upon analysing the generated profile it immediately became clear what the issue was: transliteration. Transliteration is the process of removing the diacritics and other accents from a string to make it easier to compare with other strings. For example, café becomes cafe. In PHP this process is a bit funky. Here's what I do in Pepperminty Wiki:

$literator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);

$literator->transliterate($string);

(Originally from this StackOverflow answer)

Note that this requires the intl PHP extension (which should be installed & enabled by default). This does several things:

  • Converts the text to lowercase
  • Normalises it to UTF-8 (See this article for more information)
  • Casts Cyrillic characters to their Latin alphabet (i.e. a-z) phonetic equivalent
  • Removes all diacritics

In short, it preprocesses a chunk of text so that it can be easily used by the search system. In my case, I transliterate search queries before tokenising them, source texts before indexing them, and crucially: source texts before extracting contextual information.
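
As a quick demonstration (the exact behaviour can vary slightly between ICU versions), running a couple of strings through it from the REPL looks roughly like this:

php > $literator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
php > var_dump($literator->transliterate("Café"));
string(4) "cafe"
php > var_dump($literator->transliterate("Москва"));
string(6) "moskva"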

The thing about this wonderful transliteration process is that, at least in PHP, it's really slow. Thinking about it, the answer was obvious. Why bother extracting offset information from the source text all over again when the inverted index already contains that information?

The answer is: you don't. Upon refactoring the context extractor to utilise the inverted index, I managed to get it down to just ~59ms. Success!

Next up was the query system itself. 1200ms seems a bit high, so while I was at it, I analysed a profile of that as well. It turned out that a similar problem was occurring here too. Surprisingly, the page id system's getid($pagename) function was being really slow. I found 2 issues here.

Firstly, I was doing too much Unicode normalisation. In the page id system, I don't want to transliterate to remove diacritics, but I do want to make sure that all the diacritics and accents are represented in the same way.

If you didn't know, Unicode has both a single character for letters like é (e-acute), and a separate code-point for the acute accent itself, which gets merged into the previous letter during rendering. This can cause a page to acquire 2 (or even more!) seemingly identical ids in the system, which caused me a few headaches in the past! If you'd like to learn more, the article on Unicode normalisation I linked to above explains it in some detail. Thankfully, the solution is quite simple. Here's what Pepperminty Wiki does:

Normalizer::normalize($string, Normalizer::FORM_C)

This ensures that all accents and other weird characters are represented in the same way. As you might guess though, it's slow. I found that in the getid() function I was normalising both the page names I was iterating over in the index and the target page name I was looking for - on every single loop iteration. The solution here was simple:

  • Don't normalise the page names from the index - that's the job of the assign() protected method to ensure that they are suitably normalised when adding them in the first place
  • Normalise the target page name only once, and then use that when spinning through the index.
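
As an aside, here's a tiny demonstration of the 2-representations problem described above, and why FORM_C is needed in the first place (the \u{...} escape syntax requires PHP 7+):

$decomposed  = "e\u{0301}"; // 'e' followed by a combining acute accent
$precomposed = "\u{00E9}";  // 'é' as a single pre-composed code point
var_dump($decomposed === $precomposed); // bool(false) - different byte sequences
var_dump(Normalizer::normalize($decomposed, Normalizer::FORM_C) === $precomposed); // bool(true) - FORM_C composes them into the same representation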

Implementing these simple changes brought the overall search time down to 700ms. The other thing to note here is the structure of the index. I show it in the diagram, but here it is again:

  • 1: Booster
  • 2: Rocket
  • 3: Satellite

The index is basically a hash-table mapping numerical ids to their page names. This is great for when you have an id and want to know what the name of the page associated with it is, but terrible for when you want to go in the other direction, as we need to do when performing a query!
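
To make the asymmetry concrete, here's a minimal sketch using the same example data as above (getid_slow() is just a made-up name for illustration - the real getid() does rather more than this):

$index = [1 => "Booster", 2 => "Rocket", 3 => "Satellite"];

// id → name is a single O(1) array lookup....
$name = $index[2]; // "Rocket"

// ....but name → id means scanning the entire index:
function getid_slow(array $index, string $target_name) {
    foreach($index as $id => $name) {
        if($name === $target_name) return $id;
    }
    return null;
}
$id = getid_slow($index, "Rocket"); // 2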

I haven't quite decided what to do about this. Obviously, the implications on efficiency are significant whenever we need to convert a page name into its respective numerical id. The problem lies in the fact that the search query system travels in both directions: It needs to convert page ids into page names when unravelling the results from the inverted index, but it also needs to convert page names into their respective ids when searching the titles and tags in the page index (the index that contains information about all the pages on a wiki - not pictured in the diagram above).

I have several options that I can see immediately:

  • Maintain 2 indexes: One going in each direction. This would also bring a minor improvement to indexing new and updating existing content in the inverted index.
  • Use some fancy footwork to refactor the search query system to unwind the page ids into their respective page names before we search the pages' titles and tags.

While deciding what to do, I did manage to reduce the number of times I convert a page name into its respective id by only performing the conversion if I find a match in the page metadata. This brought the average search time down to ~455ms, which is perfectly fine for my needs at the moment.

In the future, I may come back to this and optimise it further - but as it stands I'm getting to the point of diminishing returns: Where every additional optimisation requires twice the amount of time to implement as the last, and only provides a marginal gain in speed.

To this end, it doesn't seem worth it to spend ages tackling this issue now. Pepperminty Wiki is written in such a way that I can come back later and turn the inner workings of any part of the system upside-down, and it doesn't have any effect on the rest of the system (most of the time, anyway.... :P).

If you do find the search system too slow after these optimisations are released in v0.17, I'd like to hear about it! Please open an issue and I'll investigate further - I certainly haven't reached the end of this particular lollipop.

Found this interesting? Learnt something? Got a better way of doing it? Comment below!
