Why you should conditionally promise to buy the upcoming Nokia N810 tablet

Nokia N810 tablet

  • It looks gorgeous
  • It runs Linux, and showcases what can be done with more vertical integration
  • Nokia has been improving their interaction with the developer community
  • Video camera and Skype (no Skype video support yet, though)
  • Rhapsody subscription service
  • New: Now with GPS, spacious internal storage, and sliding keyboard built-in!
  • New: More video codecs, Flash 9, Mozilla-based browser

So commercial software providers (Skype, RealNetworks) will provide Linux ports if they judge that the userbase is big enough. Which is good news.

The same thing naturally applies to Nokia itself, and sadly, in this case they do not think there is enough demand for Ogg Vorbis playback.

So if, like me, you find the product attractive, but have a personal collection of Ogg Vorbis files (or FLAC, which transcodes seamlessly to Vorbis), then this is what you can do:

  • E-mail Nokia about it
  • Inform outlets that stock the tablet (e.g. Best Buy, CompUSA)
  • Sign this pledge and pass it around

All the software for the new device will run on the N800, minus the GPS (though perhaps it is the same software that comes with the GPS kit for N800?) and the ambient light sensor, so holding back won't be that painful.


Wide Finder: OCaml and JoCaml

Spent last night getting a crash course in using OCaml to do non-functional things (hash tables, file I/O, regular expressions), and the result is now up.

The JoCaml version uses the file-partitioning trick from the C++ implementation, with each finder worker running inside a JoCaml channel; the channels share a single lock so they can update the hash table serially.

Interestingly, the current implementation does not get a speed-up when the input file is cached (Ilmari's wf.ml does). Will have to peruse his code to see what's slowing mine down.

Lesson: not all techniques for processing a file line-wise are equally good!

Wide Finder: C++ update

Talked with a colleague about the slow single-threaded performance of my Wide Finder implementation, and we narrowed it down to two possibilities:

  • The Boost regular expression is not compiled?
  • C++ strings have higher overhead than null-terminated C strings

The first point can be ruled out: Boost compiles regular expressions when you assign them. As for the second point, reading in the file using std::getline turns out to consume the bulk of the time.

I’ve reorganized the code a bit, using a multimap rather than a vector to rank the URLs by count, with no effect on speed. With two and four threads on a dual-core Intel notebook, the performance is at least on par with Ruby.

Alastair Rankine has a C++ implementation that is slightly faster, but it uses Boost memory-mapped I/O, which I avoided for the same reason he lists as a caveat: it will not scale to files that are too large. Which Tim's log file might well be. Even so, it is not significantly faster than the Ruby code.

Moral of the story: Perl and Ruby can be faster than C++! The C implementations out there are blindingly fast, but the way they handle regular expressions is really painful.

Will turn my (limited) spare time to doing a clean JoCaml implementation — it might not be faster but it definitely will look cleaner!

JoCaml

After turning in the C/C++ monster (the cleanest C code I reckon it is possible to write, hence the total lack of memory-mapped I/O and other optimizations), I turned my attention to picking a better implementation language.

Requirements:

  • Functional
  • Good support for threading
  • If possible, support for distributed computing

As it turns out, JoCaml fits the bill perfectly. It's an extension of OCaml, so it combines a rich library and a familiar syntax (well, not familiar to me, but having used both Scheme and Haskell, how different can it be?) with a very nice process calculus.

Example: here is a concurrent stack that blocks when there is nothing to pop:

let new_stack () =
  def state (s) & push (v) = state (v :: s) & reply to push
   or state (x :: s) & pop () = state s & reply x to pop in
  spawn state ([]);
  pop, push
;;

This defines a private state channel, and then exports the pop and push synchronous channels (which, to the user, behave just like ordinary functions).

And this is how you use it:

let pop, push = new_stack ();;
spawn echo(pop());;
push(1);;

Note that the echo channel will block, since pop can't return a value until the stack contains something! Once 1 is pushed onto the stack, pop returns it and '1' is printed.

More on this at the JoCaml site. And, as it turns out, there already is a JoCaml implementation of the Wide Finder, by Ilmari Heikkinen. Will have to grok the finer details from him.

Tim Bray’s Wide Finder: a minimalist implementation

Several weeks ago, Tim Bray posted his Wide Finder project: take the Ruby script that parses an Apache log file and reports the top 10 hits, and parallelize it in your language of choice.

It occurred to me a while back that this would be a perfect job for a multi-process C program taking advantage of Linux's cheap copy-on-write fork, if not for the need to merge the results. Given that, it is probably easier to write using the pthreads library instead, while still reducing inter-thread communication as much as possible.

I did not have time to touch the code until today, but now it’s done. Two key insights:

  • The input file can be partitioned into multiple chunks of roughly equal size without any communication, as long as every thread follows the same protocol: a common chunk size is used, and every thread but the first checks whether the character before its starting offset is a newline. If not, it skips ahead to the end of the current line, yielding that partial line to the previous thread.
  • When merging the hash maps, the first thread to acquire the lock does not need to insert its items one by one; it can simply set the main hash map to point at its own.

On a data set of 200 MB, with a single thread, it performs slightly slower than Ruby by wall clock:

Ruby timing

2.99user 1.88system 0:05.26elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
152inputs+0outputs (3major+853minor)pagefaults 0swaps
3.04user 1.84system 0:05.33elapsed 91%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+856minor)pagefaults 0swaps
2.84user 2.01system 0:05.19elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+856minor)pagefaults 0swaps

C++ timing, one thread

2.70user 5.21system 0:07.95elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+655minor)pagefaults 0swaps
2.78user 5.14system 0:07.96elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+655minor)pagefaults 0swaps
2.72user 5.12system 0:07.90elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+654minor)pagefaults 0swaps

With two threads, though, it runs faster, almost at Ruby speed.

C++ timing, two threads

3.33user 5.40system 0:05.51elapsed 158%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+675minor)pagefaults 0swaps
3.40user 5.33system 0:05.86elapsed 148%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+672minor)pagefaults 0swaps
3.41user 5.36system 0:05.60elapsed 156%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+674minor)pagefaults 0swaps

So, as Tim observed, the problem is not entirely IO-bound. More testing is needed, but will probably need to be done on a machine with faster IO (and more CPUs). Like Tim’s new Niagara T2 testbed.

Will edit this post later — need to run now, and wanted to get this out as soon as possible. Code is available here… now I just need to find Tim’s email, since his blog is down. Argh!

Update: switched to the new std::tr1::unordered_map. Performance seems identical with a single thread, but slightly better (within the margin of error) on two threads, which makes sense: a multithreaded run exercises the map more, because each thread's map has to be merged into the final map.