Quest Get: Search large amounts of code!
(Above: A map of the Linux Kernel source code. Source: this post on medium.)
Recently I was working on a little project of mine (nope, not this for once! :P), and I needed a C♯ class I'd written a while ago. Being forgetful as I am, I had no idea which of my project I'd written it for. And so the quest began to find it! I did in the end, but it left me thinking whether there was a better way to search all my code quickly. This post is the culmination of everything I've discovered so far about the process of searching one's code.
Before I started, I already know about grep, which is built into almost every Linux system around. It's even available for Windows via the MSYS Tools. Unfortunately though, despite it's prevailance, it's not particularly good at searching large numbers of git repositories, as it keeps descending into the .git
folder and displaying a whole load of useless results.
Something had to change. After asking reddit, I was introduced to OpenGrok. Written in Java, it indexes all of your code, and provides a web interface through which you can search it. Very nice. Unfortunately, I had trouble figuring out the logistics of actually getting it to run - and discovered that it takes multiple hours to set up correctly.
Moving on, I was re-introduced to ack, written in plain-old Perl, it apparently runs practically any system that Perl does - though it's not installed by default like grep
is. Looking into it, I found it to be much like grep
- only smarter. It ignores version control directories (like the .git
folder ), and common package folders (like node_modules
) by default, and even has a system by which results can be filtered by language (with support for hash-bangs too!). The results themselves are coloured by default - making it easy to skim through quickly. Coupled with the flexible configuration file system, ack
makes for a wonderfully flexible way to search through large amounts of code quickly.
Though ack
looks good, I still didn't have a way to search through all my code that scattered across multiple devices at once, so I kept looking. The next project I found (through alternative to actually) was Text Sherlock. It positions itself as an alternative to OpenGrok that's much simpler to configure.
True to its word, I managed to get a test instance set up running from my /tmp
directory in 15 minutes - though it did take a while to index the code I had locally. It also took several seconds to consult its index when I entered a query. I suspect I could alleviate both of these issues by installing Xapian (an open-source high-performance search library), which it appears to have support for.
While the interface was cool, it didn't appear to allow me to tell it which directories not to index, so it ended trawling through all my .git
directories - just like grep
did. It also doesn't appear to multi-threaded - so it took much longer to index my code than it really needed to (I've got a solid-state drive and enough RAM for a few GBs of cache, so the indexing operation was CPU-bound, not I/O-bound).
In the end, I've rediscovered the awesome search tool ack
, and taken a look at the current state of code search tools today. While I haven't yet found precisely what I'm looking for, I'm further forward than when I started.
Other honourable mentions include GNU Global (which apparently needs several GiBs per ~300MiB of source code for its generated static HTML web interface), insight.io (an IDE-like freemium cloud product that 'understands your code'), CodeQuery (only supports C, C++, Java, Python, Ruby, Javascript, and Go), and ripgrep (rust-based program, similar to ack
and grep
, feature comparison). The official ack
website has a good page that contains more tools that are worth a look, too.
Got a cool way to search through all your code? Did this help you out? Comment below!