Own your code, Part 2: The curious case of the unreliable webhook
In the last post, I talked about how to set up your own Git server with Gitea. In this one, I'm going to take a bit of a different tack - and talk about one of the really annoying problems I ran into when setting up my continuous integration server, Laminar CI.
Since I wanted to run the continuous integration server on a different machine to the Gitea server itself, I needed a way for the Gitea server to talk to the CI server. The natural choice here is, of course, a Webhook-based system.
After installing and configuring Webhook on the CI server, I set to work writing a webhook receiver shell script (more on this in a future post!). Unfortunately, it turned out that Gitea didn't like sending to my CI server very much: whether a delivery succeeded or not was seemingly random. If I hit the "Test Delivery" button enough times, it would eventually go through. My first thought was to bring up the Gitea server logs to see if they would give any additional information. They claimed that there was an i/o timeout communicating with the CI server:
Delivery: Post https://ci.bobsrockets.com/hooks/laminar-config-check: read tcp 5.196.73.75:54504->x.y.z.w:443: i/o timeout
Interesting, but not particularly helpful. If that's the case, then I should be able to get the same error with curl on the Gitea server, right?
curl https://ci.bobsrockets.com/hooks/testhook
.....wrong. It worked flawlessly. Every time.
Not to be beaten by such an annoying issue, I moved on to my next suspicion. Since my CI server is unfortunately behind NAT, I checked the NAT rules on the router in front of it to ensure that it was being exposed correctly.
Unfortunately, I couldn't find anything wrong here either! By this point, it was starting to get really rather odd. As a sanity check, I decided to look at the logs on the CI server itself, since I'm running Webhook behind Nginx (as a reverse proxy):
5.196.73.75 - - [04/Dec/2018:20:48:05 +0000] "POST /hooks/laminar-config-check HTTP/1.1" 408 0 "-" "GiteaServer"
Now that's weird. Nginx has recorded an HTTP 408 error. Looking it up reveals that it's a Request Timeout error, which has the following definition:
The server did not receive a complete request message within the time that it was prepared to wait.
Wait, what? Sounds to me like there's an argument going on between the two servers here - in which each server is claiming that the other didn't send a complete request or response.
At this point, I blamed this on a faulty HTTP implementation in Gitea, and opened an issue.
As a workaround, I ended up configuring Laminar to use a Unix socket on disk (as opposed to an abstract socket), forwarding it over SSH, and using a git hook to interact with it instead (more on how I managed this in a future post. There's a ton of shell scripting that I need to talk about first).
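For the curious, the socket forwarding part of that workaround looks something like this. It's a minimal sketch: the paths, user, and job name are hypothetical, it assumes LAMINAR_BIND_RPC in /etc/laminar.conf is set to a unix:/path-style socket, and it assumes laminarc accepts the same address syntax via the LAMINAR_HOST environment variable:
# On the Gitea server: forward a local socket to Laminar's RPC socket
# on the CI box. StreamLocalBindUnlink=yes tells ssh to replace any
# stale local socket file left over from a previous run.
ssh -N -o StreamLocalBindUnlink=yes \
    -L /tmp/laminar.sock:/var/run/laminar/laminar.sock \
    ci@ci.bobsrockets.com &

# Then a git post-receive hook can queue a job through the forwarded socket.
LAMINAR_HOST=unix:/tmp/laminar.sock laminarc queue some-job-name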
This isn't the end of this tale though! A month or two after I opened the issue, I wound up in a situation whereby I wanted to connect a GitHub repository to my CI server. Since I don't have shell access on github.com, I had to use a webhook.
When I did though, I got a nasty shock: The webhook deliveries exhibited the exact same random failures as I saw with the Gitea webhook. If I'd verified the Webhook server and cleared Gitea's HTTP implementation's name, then what else could be causing the problem?
At this point, I can only begin to speculate as to what the issue is. Personally, I suspect a bug in the port-forwarding logic of my router: perhaps it drops the first packet from a new IP address while it sets up a NAT session to forward packets to the CI server - so subsequent requests go through fine, so long as they're sent from the same IP and within the NAT session timeout. If you've got a better idea, please comment below!
Of course, I really wanted to get the GitHub repository connected to my CI server, and if the only way I could do this was with a webhook, it was time for some request-wrangling.
My solution: a PHP proxy script running on the same server as the Gitea server (since it has a PHP-enabled web server set up already). If said script eats the request and emits a 202 Accepted immediately, then it can continue trying to get a hold of the webhook on the CI server 'till the cows come home - and GitHub will never know! Genius.
PHP-FPM (the FastCGI process manager; great alongside Nginx) makes this possible with the fastcgi_finish_request() function, which both flushes the buffer and ends the request to the client, but doesn't kill the PHP script - allowing further processing to take place without the client having to wait.
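Here's a minimal sketch of the pattern in isolation - the 5-second sleep just stands in for slow work, and nothing here is specific to my setup:
<?php
// Send a complete response to the client first...
http_response_code(202);
header("content-type: text/plain");
echo("Accepted - working on it!");
// ...then detach from the client. This requires PHP-FPM; the function
// doesn't exist under other SAPIs such as mod_php or the CLI server.
fastcgi_finish_request();
// The client already has its response by this point, so slow work here
// doesn't delay them - but it does keep this FPM worker busy.
sleep(5);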
Extreme caution must be taken with this approach however, as it can easily lead to a situation where all the PHP-FPM processes in the pool (there are only ever pm.max_children of them) are busy waiting on replies from the CI server, leaving no room for other requests to be fulfilled - and a big messy pile-up forming in the queue behind them.
Warnings aside, here's what I came up with:
<?php
$settings = [
    "target_url" => "https://ci.bobsrockets.com/hooks/laminar-git-repo",
    "response_message" => "Processing laminar job proxy request.",
    "retries" => 3,
    "attempt_timeout" => 2 // in seconds, for a single attempt
];

// Rebuild the client's headers for forwarding, swapping the host header
// out for that of the CI server so that its vhost matches.
$headers = "host: ci.bobsrockets.com\r\n";
foreach(getallheaders() as $key => $value) {
    if(strtolower($key) == "host") continue;
    $headers .= "$key: $value\r\n";
}
$headers .= "\r\n";

// Capture the entire request body for forwarding later.
$request_content = file_get_contents("php://input");

// --------------------------------------------

// Send a complete 202 response and detach from the client - GitHub hears
// back immediately, while we carry on with the retry loop below.
http_response_code(202);
header("content-type: text/plain");
header("content-length: " . strlen($settings["response_message"]));
echo($settings["response_message"]);
fastcgi_finish_request();

// --------------------------------------------

function log_message($msg) {
    file_put_contents("ci-requests.log", $msg . "\n", FILE_APPEND);
}

for($i = 0; $i < $settings["retries"]; $i++) {
    $start_time = microtime(true);

    $context = stream_context_create([
        "http" => [
            "header" => $headers,
            "method" => "POST",
            "content" => $request_content,
            "timeout" => $settings["attempt_timeout"]
        ]
    ]);
    $result = file_get_contents($settings["target_url"], false, $context);

    if($result !== false) {
        log_message("[" . date("r") . "] Queued laminar job in " . (microtime(true) - $start_time)*1000 . "ms");
        break;
    }
    log_message("[" . date("r") . "] Failed to queue laminar job after " . (microtime(true) - $start_time)*1000 . "ms.");
}
I've named it autowrangler.php. A few things of note here:
- php://input is a special virtual file that's mapped internally by PHP to the client's request. By eating it with file_get_contents(), we can get the entire request body that the client has sent to us, so that we can forward it on to the CI server.
- getallheaders() lets us get a hold of all the headers sent to us by the client for later forwarding.
- I use log_message() to keep a log of the successes and failures in a log file. So far I've got a ~32% failure rate, but never more than 1 failure in a row - giving some credit to my earlier theory I talked about above.
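Giving it a poke is as simple as POSTing something at it with curl - the URL here is hypothetical, and is simply wherever you've dropped the script on your PHP-enabled web server:
curl -X POST --header "Content-Type: application/json" \
    --data '{ "test": true }' \
    https://git.bobsrockets.com/autowrangler.php

# ...then watch ci-requests.log (written to the script's working
# directory) to see whether the forwarding attempts succeeded
tail -f ci-requests.log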
This ends the tale of the recalcitrant and unreliable webhook. Hopefully you've found this an interesting read. In future posts, I want to look at how I configured Webhook, the inner workings of the git hook I mentioned above, and the collection of shell scripts I've cooked up that make my CI server tick in a way that makes it easy to add new projects quickly.
Found this interesting? Run into this issue yourself? Found a better workaround? Comment below!