Go, day three

It took me a while to figure out what I wanted to do to round out my introduction to Go, and since I haven't had a lot of spare time recently, I figured I'd do something simple and pragmatic that mirrors the kind of thing I'd typically do in Ruby at work. In this case I implemented a stripped-down version of some text processing I once had to do on a large file.

The allCountries.txt file from the geonames project is a 1 gigabyte tab-delimited text file with over 8 million lines that represent places all over the world. In the past I've needed to parse this file and insert most of its contents into a database to support a geolocation feature in an application we were building. I remember that this took quite a long time, and while I know a good chunk of that time was database I/O, I was curious to see how Go would fare at just parsing the file and computing a single measure, namely the largest (by population) place in the whole file. The code simply reads in the file line by line, splits each line on the '\t' character, and keeps track of the place with the highest population. There is also some code to skip malformed lines while keeping count of them; after all, data is messy. Here is the Go code.

On my computer, running this program over the 8 million line text file completes in 1m20.840s (as measured by the Unix 'time' program). As a baseline, the Unix program wc takes 0m58.426s to count the number of lines in the file, so Go is definitely not too shabby. On a lark I also decided to write the equivalent Ruby code, just to compare both the resulting code and the execution time. Here is what that looks like.

This code runs in 3m44.464s on my machine.

I wouldn't read too much into the numbers here; it's a pretty unscientific benchmark. But the Go code did run faster, and coupled with goroutines I could easily see inserting the data into a database concurrently if that were the task.

The Go code itself is a bit more verbose than the Ruby. Error handling in Go is much more front and center than in Ruby, but it is not overly unwieldy (there are also a lot more newlines in my Go code), so overall it was comfortable to write. One may notice that I am not using a csv/tsv library to read the file. I actually started with a version of the Go code that did use the csv library (it runs slower than the code above, closer to 3 minutes), but when I wanted to write the Ruby version, Ruby's csv library choked on the file because some of the lines were malformed. I thus resorted to simply doing the split manually (which really isn't much extra code at all) to get it to work in Ruby, then changed the Go version to match.

Overall I am fairly pleased with my first foray into Go. Although it is statically typed, the syntax is not overly verbose or ceremonious (though the language does lean towards being explicit), it has some easy-to-understand higher-level concurrency tools, and it compiles quite quickly. I do, however, miss the more functional feel of something like Clojure or even Scala, though Go is not really aiming for that, and it does hit a nice sweet spot as an unabashedly imperative language.