Shared nothing parallel programming

I agree strongly with Tim and Nathan’s belief in the importance of parallel computing. I’ve been following this space since 2000, when I took Gurusamy Sarathy’s initial work on making perl multi-threaded and finished it for the 5.8 release.

The initial perl threading released in 5.005 had a traditional architecture: all data was shared between all threads. The problem with this approach was that the need for continuous synchronization between threads slowed the whole machine down. For 5.8 we revised the design and settled on a completely non-shared environment by default. Each thread had its own context, with its own data space. Only explicitly shared variables were accessible between threads. This let most of the code run at full speed, only paying the synchronization cost when a shared variable was accessed.
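A minimal sketch of that model, using the threads and threads::shared modules that shipped with 5.8: every variable is private to its thread unless explicitly marked shared, and only the shared ones ever need locking.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $private = 0;           # default: each new thread gets its own copy
my $shared : shared = 0;   # explicitly shared across all threads

my @workers = map {
    threads->create(sub {
        $private++;        # touches this thread's private copy only
        lock($shared);     # synchronization cost paid only here
        $shared++;
    });
} 1 .. 4;
$_->join for @workers;

print "private = $private\n";   # still 0: children incremented their copies
print "shared  = $shared\n";    # 4: all threads updated the one shared copy
```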

I am a firm believer in the shared nothing architecture. Multithreading is hard, and the standard way to solve concurrency problems is to add mutex protection around code that is not thread-safe. A mutex allows only one thread to access a particular resource at a time. So imagine your 32-core machine running an application with 32 threads, all of which use a single mutex to control access to a vital part of the application. Every thread must continuously acquire that mutex, creating a bottleneck through which only one thread can pass at a time. Your 32 threads, on your 32-core machine, end up mostly sitting around waiting for their turn.
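A toy sketch of that bottleneck, again in 5.8-style ithreads: all 32 workers funnel through one lock, so adding threads adds waiting rather than throughput.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $counter : shared = 0;

# Every worker must acquire the same lock for every unit of work, so the
# critical section executes one thread at a time no matter how many
# cores the machine has.
my @workers = map {
    threads->create(sub {
        for (1 .. 100_000) {
            lock($counter);   # the shared bottleneck
            $counter++;
        }
    });
} 1 .. 32;
$_->join for @workers;

print "counter = $counter\n";   # 3200000, but computed mostly serially
```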

With a shared nothing architecture, you can avoid this. If your thread never has to acquire a mutex, it can run at full speed on its assigned CPU. A recent visit to IBM Almaden again underscored the importance of this for me. They showed us a Blue Gene, an awesome beast with 2048 CPUs per rack. Each node is a little computer on a chip, with ethernet networking, local interconnects and 512 MB of RAM. Almaden has two of these racks together, and to make it even cooler, you can combine 64 racks into a single system of 65536 nodes. None of these nodes share any memory, so to write software for them you have to use a shared nothing architecture.
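On Blue Gene the nodes talk to each other through message passing (MPI). As a rough sketch of the same shared nothing idea on a single machine, the workers below are plain processes that own their memory outright and interact only through explicit messages on pipes; there is nothing to lock.

```perl
use strict;
use warnings;

# Each worker is a separate process with its own address space; the
# parent only ever sees the one message a worker writes to its pipe.
my @readers;
for my $id (0 .. 3) {
    pipe(my $reader, my $writer) or die "pipe: $!";
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {                      # child: private memory, no locks
        close $reader;
        my $sum = 0;
        $sum += $_ for grep { $_ % 4 == $id } 1 .. 1_000_000;  # local work
        print {$writer} "$sum\n";         # the only communication
        close $writer;
        exit 0;
    }
    close $writer;                        # parent keeps the read end
    push @readers, $reader;
}

my $total = 0;
$total += readline($_) for @readers;
wait() for 0 .. 3;
print "total = $total\n";                 # 500000500000
```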

The important challenge is not enabling star developers to write multithreaded code; it is enabling the large army of enterprise developers out there to scale their applications to large numbers of cores. Perhaps tools like PeakStream (purchased by Google) or its remaining competitor, RapidMind, can help, but I remain doubtful. I once spent a summer reading a printout of all 16,000 lines of perl's regular expression code, marker pen in hand, looking for problematic spots. I am unconvinced a tool could have done that for me.

Radar friend Jeff Jonas made me think about this when he posted about performance on his blog. I believe this is the direction parallel computing has to go. He wrote:

Our small database footprint project had the goal of externalizing as much computation as possible off the database engine, pushing this processing into shared nothing, parallelizable pipelines. So we also did such things as externalize serialization (no more using the database engine to dole out unique record IDs) and eliminated virtually all stored procedures and triggers, placing more computational weight on these “n” wide pipeline processes instead.
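To make the externalized serialization idea concrete, here is a hypothetical sketch (the LocalIdFactory name and the bit layout are illustrative inventions, not from Jonas's post): each pipeline worker mints record IDs from its own worker number and a local counter, so it never asks the database, or any other worker, for the next ID.

```perl
use strict;
use warnings;

package LocalIdFactory;

# Hypothetical scheme: an ID is (worker_id, local_counter) packed into
# one integer (assumes a 64-bit perl). IDs stay globally unique as long
# as every pipeline worker is handed a distinct worker id, and no worker
# ever has to coordinate with the database or with its peers.
sub new {
    my ($class, $worker_id) = @_;
    return bless { worker => $worker_id, counter => 0 }, $class;
}

sub next_id {
    my ($self) = @_;
    return ($self->{worker} << 40) | $self->{counter}++;
}

package main;

# Worker 7 of the pipeline generating its first few record IDs locally.
my $ids = LocalIdFactory->new(7);
printf "record id: %d\n", $ids->next_id for 1 .. 3;
```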
