Wide Finder: take 2

Tim Bray’s revised Wide Finder project [ongoing.org] has been ongoing for a few weeks now, and I’ve finally took the time to design and prototype an implementation.

The goal is to evaluate the performance of middle-of-the-road, not embarrassingly parralelizable tasks on modern-day multi-core hardware. Such as the Sun T2000 servers. Fittingly, the task is to parse a multi-gigabyte web server log file and compile some aggregate statistics.

The solution I came up with for the earlier iteration of the contest, coded in different versions (C++, OCaml and JoCaml) is fundamentally sound, though rather unoptimized (picking up two-and-a-half different languages in one weekend is a good way to find out how much there is to know about, say, C++ stream buffering). With the benefit of hindsight, and given that we are several weeks into the project and there are strong implementations already [wikis.sun.com], the idea is to find an unexplored niche.

Short recap of the main implementations:

  • OCaml: Fernandez is ahead of the pack again, the only solution in the 7 minutes
  • make+C+awk+sh: Perl is dead, but shell scripting is enjoying a renaissance with parallelizable tasks. 8 minutes
  • Java-based solutions: in the 13-17 minutes range are the various JVM solutions, from Java, Groovy and Scala to Fan, an interesting Ruby-like language for the JVM. Reminds me of .NET’s Boo.
  • Python, Ruby: in the 20+ minutes range. Python multiprocessing is not that efficient yet; I believe an improved Stackless Python solution might be forthcoming

I dabbled with a Common Lisp solution; it works and appears to be competitive, when tried on a partial log file. Exploring the available options for parallelism, however, revealed the disconcerting fact: no freely-available Common Lisp compilers have good multi-threading, or even multi-processing (without shared memory) on Solaris! Even worse, the SBCL incompatibility with GCC 4.3 means that even the Linux version on my Fedora machine is several months old, and does not have the threading library.

So it’s back to Java. Perusing the blogs of the Java and Scala programmers, it appears that the common complain is .. regular expressions. So the hunt was on for a good regular expression library. Joni, a port of the Ruby Oniguruma regex library to the JVM used by the JRuby project, appears ideal: low-level and supposedly very fast. Until one hits the total lack of documentation. So that’s off the table. Ended up using dk.brics.automaton, which appears to perform well enough, even when parsing Unicode strings.

The nice thing about using Java is that, if you hit a performance brick wall, chances are that many other people have been there before you. The problem I have, the need to have random access within a file (so different threads can start at different offsets (Java’s RandomAccessFile is good for this) combined with the need for buffered I/O (BufferedReader is good, but there is no RandomAccessReader !) is solved by the nice folks at Biojava.com. Great!

On my system (2 GHz Core 2, 2 GB RAM, 5400 rpm HDD, Fedora 9 x64), Ruby takes about 2.2 seconds, while my Java implementation running on OpenJDK 1.6 (64-bit) takes about 1.6-1.7 seconds with 1 thread and 1.4-1.5 seconds with 2 threads. Close to the 1.2+ seconds time that Alex reported for Python, but hey, we’re paying the Java start-up cost here.

Will update when I get an account on the test server. 40GB dataset, here I come! In the meantime, time to look for opportunities to use a JVM-based language better suited for the task. The Java code is a tad bit verbose.


Comments are closed.