Configuration and scale at Facebook

Phil Dibowitz explains the challenges Facebook faced and the results it got with Chef

At OSCON, Phil Dibowitz reminded me how little I understand about large systems – as he puts it, really large systems, systems of systems, with some similarities but with different people controlling parts. His work at Facebook explores the challenges (and opportunities) of creating tools that work across a company’s many networks and computers.

If you deal with such challenges, he’s worth listening to as a model. If not, he’s worth listening to for a sense of just how different work at this scale can be. Much of what he accomplishes can also be worthwhile at scales much smaller than the 17,000 servers he describes for Facebook at 26:42 in the session.

I talked with him in an interview:

and we’ve posted his OSCON session:

Highlights include:

  • Scale can be homogeneous or heterogeneous [in the session at 1:43]
  • The importance of granularity for heterogeneous environments [in the interview at 0:29]
  • Scalable building blocks [in the session at 4:13]
  • Configuration as data [in the session at 6:32]
  • Who runs it? Service owners plus four people [in the session at 3:36]
  • Private Chef as a testing ground, and Open-Source Chef [in the interview at 2:20 and in the session at 21:15]
  • Testing out Puppet, Chef and Spine [in the interview at 5:17] and Why Chef? [in the session at 10:27]
  • Building a workflow [in the session at 13:50]
  • Locking yourself into software by extending it with “crazy hacks” [in the interview at 8:29]
  • Where Facebook goes its own way on cookbooks [in the interview at 10:25 and in the session at 15:06]
  • The DevOps movement encouraging people to talk infrastructure [in the interview at 11:29]
  • Flexibility critical [in the session at 7:37 and 16:55]
  • Erlang and Chef: “We can probably put more thrust on that pig… but we finished this rewrite on to Erlang” [in the session at 28:58]
  • The new software could run 4000 nodes more efficiently than the old software could run zero nodes. [in the session at 31:45]
  • “At Facebook, it’s true, we test in production.” But tests have an hour. [in the session at 32:47]
  • Lessons learned [in the session at 34:48]
  • Is this possible? [in the session at 37:00]