ENTRIES TAGGED "distributed"
Untangling code with flow-based programming
That’s why we live in this world where we follow this one particular [von Neumann] architecture and all the alternatives were squashed… Turing gave us this very powerful one-dimensional model, von Neumann made it into this two-dimensional address matrix, and why are we still stuck in that world? We’re fully capable of moving on to the next generation… that becomes fully three-dimensional. Why stay in this von Neumann matrix?
Dyson suggested a more biologically based template-based approach, but I wasn’t sure at the time that we were as far from three dimensions as Dyson thought. Distributed computing with separate memory spaces already can offer an additional dimension, though most of us are not normally great at using it. (I suspect Dyson would disagree with my interpretation.)
Companies that specialize in scaling horizontally—Google, Facebook, and many others—already seem to have multiple dimensions running more or less smoothly. While we tend to think of that work as only applying to specialized cases involving many thousands of simultaneous users, that extra dimension can help make computing more efficient at practically any scale above a single processor core.
Unfortunately, we’ve trained ourselves very well to the von Neumann model—a flow of instructions through a processor working on a shared address space. There are many variations in pipelines, protections for memory, and so on, but we’ve centered our programming models on creating processes that communicate with each other. The program is the center of our computing universe because it must handle all of these manipulations directly.
Retreading old topics can be a powerful source of epiphany, sometimes more so than simple extra-box thinking. I was a computer science student, of course I knew statistics. But my recent years as a NoSQL (or better stated: distributed systems) junkie have irreparably colored my worldview, filtering every metaphor with a tinge of information management.
Lounging on a half-world plane ride has its benefits, namely, the opportunity to read. Most of my Delta flight from Tel Aviv back home to Portland lacked both wifi and (in my case) a workable laptop power source. So instead, I devoured Nate Silver’s book, The Signal and the Noise. When Nate reintroduced me to the concept of statistical overfit, and relatedly underfit, I could not help but consider these cases in light of the modern problem of distributed data management, namely, operators (you may call these operators DBAs, but please, not to their faces).
When collecting information, be it for a psychological profile of chimp mating rituals, or plotting datapoints in search of the Higgs Boson, the ultimate goal is to find some sort of usable signal, some trend in the data. Not every point is useful, and in fact, any individual could be downright abnormal. This is why we need several points to spot a trend. The world rarely gives us anything clearer than a jumble of anecdotes. But plotted together, occasionally a pattern emerges. This pattern, if repeatable and useful for prediction, becomes a working theory. This is science, and is generally considered a good method for making decisions.
On the other hand, when lacking experience, we tend to over value the experience of others when we assume they have more. This works in straightforward cases, like learning to cook a burger (watch someone make one, copy their process). This isn’t so useful as similarities diverge. Watching someone make a cake won’t tell you much about the process of crafting a burger. Folks like to call this cargo cult behavior.
How Fit are You, Bro?
You need to extract useful information from experience (which I’ll use the math-y sounding word datapoints). Having a collection of datapoints to choose from is useful, but that’s only one part of the process of decision-making. I’m not speaking of a necessarily formal process here, but in the case of database operators, merely a collection of experience. Reality tends to be fairly biased toward facts (despite the desire of many people for this to not be the case). Given enough experience, especially if that experience is factual, we tend to make better and better decisions more inline with reality. That’s pretty much the essence of prediction. Our mushy human brains are more-or-less good at that, at least, better than other animals. It’s why we have computers and Everybody Loves Raymond, and my cat pees in a box.
Imagine you have a sufficient amount of relevant datapoints that you can plot on a chart. Assuming the axes have any relation to each other, and the data is sound, a trend may emerge, such as a line, or some other bounding shape. A signal is relevant data that corresponds to the rules we discover by best fit. Noise is everything else. It’s somewhat circular sounding logic, and it’s really hard to know what is really a signal. This is why science is hard, and so is choosing a proper database. We’re always checking our assumptions, and one solid counter signal can really be disastrous for a model. We may have been wrong all along, missing only enough data. As Einstein famously said in response to the book 100 Authors Against Einstein: “If I were wrong, then one would have been enough!”
Database operators (and programmers forced to play this role) must make predictions all the time, against a seemingly endless series of questions. How much data can I handle? What kind of latency can I expect? How many servers will I need, and how much work to manage them?
So, like all decision making processes, we refer to experience. The problem is, as our industry demands increasing scale, very few people actually have much experience managing giant scale systems. We tend to draw our assumptions from our limited, or biased smaller scale experience, and extrapolate outward. The theories we then tend to concoct are not the optimal fit that we desire, but instead tend to be overfit.
Overfit is when we have a limited amount of data, and overstate its general implications. If we imagine a plot of likely failure scenarios against a limited number of servers, we may be tempted to believe our biggest odds of failure are insufficient RAM, or disk failure. After all, my network has never given me problems, but I sure have lost a hard drive or two. We take these assumptions, which are only somewhat relevant to the realities of scalable systems and divine some rules for ourselves that entirely miss the point.
In a real distributed system, network issues tend to consume most of our interest. Single-server consistency is a solved problem, and most (worthwhile) distributed databases have some sense of built in redundancy (usually replication, the root of all distributed evil).
Steve Vinoski on when to make the leap to functional programming.
Functional programming has a long and distinguished heritage of great work — that was only used by a small group of programmers. In a world dominated by individual computers running single processors, the extra cost of thinking functionally limited its appeal. Lately, as more projects require distributed systems that must always be available, functional programming approaches suddenly look a lot more appealing.
Steve Vinoski, an architect at Basho Technologies, has been working with distributed systems and complex projects for a long time, first as a tentative explorer and then leaping across to Erlang when it seemed right. Seventeen years as a columnist on C, C++, and functional languages have given him a unique viewpoint on how developers and companies are deciding whether and how to take the plunge.
Highlights from our recent interview include: