"velocityconf" entries

Velocity NY recap

Operating on the edge and real-time performance emerged as key themes at Velocity NY.

Maybe it was the hustle and bustle of Times Square just within earshot. Maybe it was the smell of that legendary pizza on every corner. Or maybe it was having a front row seat in the financial capital of the world while the drama of a possible government default played itself out like a Broadway show. Whatever it was, Velocity New York felt markedly different than its west coast cousin.

Of course, Velocity’s inaugural east coast incarnation sported presentations on the state of the art of all the topics we’ve come to expect from the conference: nudging style sheets, hacking Javascript, battling memory usage, tuning database queries, and coaxing the network for every last ounce of performance on the Web. (And then doing it all over again for the mobile version…)

But discussions on scaling the operations of the squishy side of the organization–developing healthy team interactions, talking through the realities of the ever-elusive DevOps culture, and what it all means to scaling not just web sites but full-fledged businesses–were more present than ever.

Read more…

Velocity: Toward the real-time business

Velocity 2013 Speaker Series

I want to start by thanking John and Steve for the warm welcome. They’ve created something very amazing with Velocity, and I’m excited to be a part of it.

It might seem a bit odd to talk about What’s Next at the beginning of a conference, but I figure the best time to go to the bank and ask for a loan is when you actually have some money.

What we’ve been talking about at Velocity, especially the DevOps side of things, is only the tip of the iceberg when it comes to how businesses are changing. And that shift is from the sequential to the concurrent. It used to be that we threw things over a series of walls, from Product Management to Design, to Development, to QA, to Production, to Customer Service and so on. That was an old world of software and one-year development cycles.

Read more…

Going beyond Onload: Measuring performance that matters

Velocity 2013 Speaker Series: Focus on Web Apps, Not Web Pages

We’re not making web pages anymore; we’re building web applications. Gone are the days of a few script tags in the <head>. Apps today are a complex web of asynchronously-loaded content and functionality. In the past decade, we’ve progressed from statically-loaded HTML to AJAX-ifying all the things. However, the way we’ve been measuring real user performance of our apps hasn’t changed to reflect our new state of art.

Defining “Done”

At what point during page load do users consider an app to be “ready enough” to start using? If we use standard performance metrics, we have to choose one of the following:

1) When the HTML document has been completely loaded and parsed, but before stylesheets, images, and subframes have finished loading (DOMContentLoaded)

2) When all synchronous scripts, stylesheets, images, and subframes have finished loading (onload)

If we pick DOMContentLoaded, it quickly becomes clear that there’s no inherent correlation between the app state at that point and what a user would consider “ready.”

Read more…

Measuring Mobile Performance

Velocity 2013 Speaker Series: A Sneak Peek With WebPagetest and Appurify

Now that more companies have basic mobile strategies in place, they are turning their attention to the issue of performance.

Mobile developers are thinking about how fast their apps and mobile webpages load and—more importantly—what they can do to make them faster. Consumers have little patience for slow loading apps and their expectations are only going to get more stringent. This expectation likely contributed to Apple making changes so that apps on iOS 7 load 11% faster than on iOS 6.

The challenge is that measuring performance for mobile is not as easy as it is for web. Many of us have used tools like WebPagetest to assess website performance across different browsers/locations and pinpoint areas for improvement but fully functional, equivalent tools don’t exist yet for the mobile space.

This has left mobile developers ill equipped to create the highest-performing mobile apps and websites.
Read more…

Making Systems Operable

Velocity 2013 Speaker Series

There’s an old joke about the aviation cockpit of the future that it will contain just a pilot and a dog. The pilot will be there to watch the automation. The dog will be there to bite the pilot if he tries to touch anything.

Although they will all deny it, the majority of modern IT developers have exactly this view of automation: the system is designed to be self regulating and operators are there to watch it, not to operate it. The result is current systems are often inoperable, i.e. systems they cannot be effectively operated because their functions and capacities are hidden or inaccessible.

The conceit in the pilot-and-the-dog joke is that modern systems do not require operation, that they are autonomous. Whenever these systems are exhibited, our attention is drawn to their autonomous features. But there are no systems that actually function without operators. Even when we claim they are “unmanned”, all important systems have operators who are intimately involved in their function: UAV’s are piloted, the Mars rover is driven, the satellites are managed, surgical robots are manipulated, insulin pumps are programmed. We do not see these activities–many are performed by workers who remain anonymous–but we depend on them.

Read more…

The Joys of Static Memory JavaScript

Velocity 2013 Speaker Series

You wake up one morning to discover your team has gotten a dreaded alert: your web application is performing badly. You dig through your code, but don’t see anything that stands out, until you open up Chrome’s memory performance tools, and see this:

sawtooth01

One of your co-workers chuckles, because they realize that you’ve got a memory-related performance problem.

Read more…

Building an Alerting System That Really Works

Velocity 2013 Speaker Series

Building a high quality alerting system often feels like a dark art. Often it is hard to set the proper thresholds and it is even harder to define when an alert should be triggered or not. This results in alerts being raised too early or too late and your colleagues losing faith in the system. Once you use a structured approach to build an alerting system you will find it much easier and the alerts more predictable and precise.

Measure Selection

First you have to select proper measures to alert on. This selection is key as all other steps depend on using meaningful measures. While there seems to be an infinite number of different measures, you can categorize them into three main categories:

  • Saturation measures indicate how much of a resource is used. Examples are CPU usage or resource pool consumption.
  • Occurrence measures indicate whether a condition was met or not. A good example is errors. These measures are often presented as a rate like failed transactions per seconds.
  • Continuous measures do not have a single value at any given point in time, but instead a large number of different values. A typical example is response times. Irrespective of how small you make the sample, you will always have a large amount of values and never just one single representative value.

Read more…

Test-driven Infrastructure with Chef

Velocity 2013 Speaker Series

If you’re a System Administrator, you’re likely all too familiar with the 2:35am PagerDuty alert. “When you roll out testing on your infrastructure,” says Seth Vargo, “the number of alerts drastically decreases because you can build tests right into your Chef cookbooks.” We sat down to discuss his upcoming talk at Velocity, which promises to deliver many more restful nights for SysAdmins.

Key highlights from our discussion include:

  • There are not currently any standards regarding testing with Chef.  [Discussed at 1:09]
  • A recommended workflow that starts with unit testing  [Discussed at 2:11]
  • Moving cookbooks through a “pipeline” of testing with Test Kitchen [Discussed at 3:11]
  • In the event that something bad does make it into production, you can roll back actual infrastructure changes. [Discussed at 4:54]
  • Automating testing and cookbook uploads with Jenkins [Discussed at 5:40]

You can watch the full interview here:

 

Application Resilience in a Service-oriented Architecture

Velocity 2013 Speaker Series

Failure Isolation and Operations with Hystrix

Web-scale applications such as Netflix serve millions of customers using thousands of servers across multiple data centers. Unmitigated system failures can impact the user experience, a product’s image, and a company’s brand and, potentially, revenue. Service-oriented architectures such as these are too complex to completely understand or control and must be treated accordingly. The relationships between nodes are constantly changing as actors within the system independently evolve. Failure in the form of errors and latency will emerge from these relationships and resilient systems can easily “drift” into states of vulnerability. Infrastructure alone cannot be relied upon to achieve resilience. Application instances, as components of a complex system, must isolate failure and constantly audit for change.

At Netflix, we have spent a lot of time and energy engineering resilience into our systems. Among the tools we have built is Hystrix, which specifically focuses on failure isolation and graceful degradation. It evolved from a series of production incidents involving saturated connection and/or thread pools, cascading failures, and misconfigurations of pools, queues, timeouts, and other such “minor mistakes” that led to major user impact.

blocked-requests-640

This open source library follows these principles in protecting our systems when novel failures inevitably occur:

  • Isolate client network interaction using the bulkhead and circuit breaker patterns.
  • Fallback and degrade gracefully when possible.
  • Fail fast when fallbacks aren’t available and rapidly recover.
  • Monitor, alert and push configuration changes with low latency (seconds).

 
Restricting concurrent access to a given backend service has proven to be an effective form of bulkheading, as it limits the resource utilization to a concurrent request limit smaller than the total resources available in an application instance. We do this using two techniques: thread pools and semaphores. Both provide the essential quality of restricting concurrent access while threads provide the added benefit of timeouts so the caller can “walk away” if the underlying work is latent.

failing-dependency-640

Isolating functionality rather than the transport layer is valuable as it not only extends the bulkhead beyond network failures and latency, but also those caused by client code. Examples include request validation logic, conditional routing to different or multiple backends, request serialization, response deserialization, response validation, and decoration. Network responses can be latent, corrupted, or incompatibly changed at any time, which in turn can result in unexpected failures in this application logic.
Read more…

Ops Mythology

Velocity 2013 Speaker Series

At some point, we’ve all ended up trading horror stories over drinks with colleagues. Heads nod and shake in sympathy, and the stories get hairier as the night goes on. And while it of course feels good to get some of that dirt off your shoulder, is there a larger, better purpose to sharing war stories? I sat down with James Turnbull of Puppet Labs (@kartar) to chat about his upcoming Velocity talk about Ops mythology, and how we might be able to turn our tales of disaster into triumph.

Key highlights of our discussion include:

  • Why do we share disaster stories? What is the attraction? [Discussed at 0:40]
  • Stories are about shared experience and bonding with members of our community. [Discussed at 2:10]
  • These horror stories are like mythological “big warnings” that help enforce social order, which isn’t always a good thing. [Discussed at 4:18]
  • A preview of how his talk will be about moving away from the bad stories so people can keep telling more good stories. (Also: s’mores.) [Discussed at 7:15]

You can watch the entire interview here:

This is one of a series of posts related to the upcoming Velocity conference in Santa Clara, CA (June 18-20). We’ll be highlighting speakers in a variety of ways, from video and email interviews to posts by the speakers themselves.