Talked with a colleague about the slow single-threaded performance of my Wide Finder implementation, and we narrowed it down to two possibilities:
- Boost regular expression is not compiled?
- C++ strings (std::string) have higher overhead than null-terminated C strings
The first point can be ruled out: Boost compiles a regular expression when you assign it. As for the second point, well, reading in the file using std::getline turns out to consume the bulk of the time.
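One rough way to check where the time goes is to run the same loop twice, once with only std::getline and once with the regular expression search added. The sketch below is not my actual Wide Finder code: it uses std::regex as a stand-in for Boost, and the names PassResult, make_hit_pattern and timed_pass, as well as the pattern itself, are made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <istream>
#include <regex>
#include <sstream>
#include <string>

// Result of one pass over the input (hypothetical helper type).
struct PassResult {
    std::size_t lines;  // lines read with std::getline
    std::size_t hits;   // lines that matched the pattern (0 if none given)
    double ms;          // wall-clock time of the whole pass in milliseconds
};

// Assumed, simplified article pattern; the real Wide Finder regex may differ.
// The regex is compiled once here, not once per line.
std::regex make_hit_pattern() {
    return std::regex(R"(GET /ongoing/When/\d{3}x/(\d{4}/\d{2}/\d{2}/[^ .]+) )");
}

// Reads every line from `in`. If `pattern` is non-null, each line is also
// searched with it. Comparing a pass with and without the pattern shows
// how much of the time is spent in the reading alone.
PassResult timed_pass(std::istream& in, const std::regex* pattern) {
    PassResult result{0, 0, 0.0};
    std::string line;
    const auto start = std::chrono::steady_clock::now();
    while (std::getline(in, line)) {
        ++result.lines;
        if (pattern != nullptr && std::regex_search(line, *pattern)) {
            ++result.hits;
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    result.ms = std::chrono::duration<double, std::milli>(stop - start).count();
    return result;
}

// In-memory sanity check; run it before timing a real log file.
void sanity_check() {
    std::istringstream sample(
        "h - - [d] GET /ongoing/When/200x/2007/09/20/Wide-Finder HTTP/1.1\n"
        "h - - [d] GET /ongoing/other.png HTTP/1.1\n");
    const std::regex pattern = make_hit_pattern();
    const PassResult r = timed_pass(sample, &pattern);
    assert(r.lines == 2 && r.hits == 1);
}
```

To use it on a real log, open the file with std::ifstream, call timed_pass once with nullptr and once with the compiled pattern, and compare the two ms values. The second run will likely benefit from the file cache, so it is worth repeating each pass a few times.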
I’ve reorganized the code a bit, using a multimap rather than a vector to rank the URLs by count, with no effect on speed. With two and four threads on a dual-core Intel notebook, the performance is at least on par with Ruby.
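For the curious, ranking with a multimap can look roughly like the following. This is a minimal sketch rather than my actual code: the type aliases Counts and Ranked and the functions rank_by_count and top_n are names invented for the example, and it assumes the hit counts already sit in a std::map keyed by URL.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical aliases: URL -> hit count, and hit count -> URL.
using Counts = std::map<std::string, int>;
// std::greater keeps the highest counts first; a multimap is needed
// because several URLs can share the same count.
using Ranked = std::multimap<int, std::string, std::greater<int>>;

// Inverts the count table so that iteration order is the ranking.
Ranked rank_by_count(const Counts& counts) {
    Ranked ranked;
    for (const auto& entry : counts) {
        ranked.emplace(entry.second, entry.first);
    }
    return ranked;
}

// Returns at most n (URL, count) pairs, most popular first.
std::vector<std::pair<std::string, int>> top_n(const Counts& counts,
                                               std::size_t n) {
    std::vector<std::pair<std::string, int>> out;
    for (const auto& entry : rank_by_count(counts)) {
        if (out.size() == n) break;
        out.emplace_back(entry.second, entry.first);
    }
    return out;
}

// Tiny in-memory check of the ordering.
void ranking_sanity_check() {
    const Counts counts{{"a", 3}, {"b", 1}, {"c", 5}};
    const auto top = top_n(counts, 2);
    assert(top.size() == 2 && top[0].first == "c" && top[1].first == "a");
}
```

The vector version does the same job by copying the pairs and sorting them by count. Either way the ranking only touches the distinct URLs, which is tiny next to reading the log, so it is no surprise the change made no difference to speed.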
Alastair Rankine has a C++ implementation that is slightly faster, but it uses Boost memory-mapped IO, which I avoided for the same reason he gives as a caveat: it will not scale to files that are too large, which Tim’s log file might well be. Again, his version is not significantly faster than the Ruby code.
Moral of the story: Perl and Ruby can be faster than C++! The C implementations out there are blindingly fast, but the way they handle regular expressions is really painful.
Will turn my (limited) spare time to doing a clean JoCaml implementation — it might not be faster but it definitely will look cleaner!