ENTRIES TAGGED "netflix"

Application Resilience in a Service-oriented Architecture

Velocity 2013 Speaker Series

Failure Isolation and Operations with Hystrix

Web-scale applications such as Netflix serve millions of customers using thousands of servers across multiple data centers. Unmitigated system failures can harm the user experience, a product’s image, a company’s brand, and potentially its revenue. Service-oriented architectures at this scale are too complex to completely understand or control and must be treated accordingly. The relationships between nodes change constantly as actors within the system evolve independently. Failure in the form of errors and latency will emerge from these relationships, and resilient systems can easily “drift” into states of vulnerability. Infrastructure alone cannot be relied upon to achieve resilience. Application instances, as components of a complex system, must isolate failure and constantly audit for change.

At Netflix, we have spent a lot of time and energy engineering resilience into our systems. Among the tools we have built is Hystrix, which specifically focuses on failure isolation and graceful degradation. It evolved from a series of production incidents involving saturated connection and/or thread pools, cascading failures, and misconfigurations of pools, queues, timeouts, and other such “minor mistakes” that led to major user impact.


This open source library follows these principles in protecting our systems when novel failures inevitably occur:

  • Isolate client network interaction using the bulkhead and circuit breaker patterns.
  • Fall back and degrade gracefully when possible.
  • Fail fast when fallbacks aren’t available and rapidly recover.
  • Monitor, alert and push configuration changes with low latency (seconds).
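
As a rough illustration of the circuit breaker, fallback, and fail-fast principles above, here is a minimal sketch in plain Java. It is not Hystrix's implementation or API: the class `SimpleCircuitBreaker`, its parameters, and its consecutive-failure rule are assumptions made for this example. One known simplification is that every caller may attempt a trial request once the open period has elapsed.

```java
import java.util.function.Supplier;

/** Hypothetical, simplified circuit breaker; NOT Hystrix's actual implementation. */
final class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures before the circuit opens
    private final long openMillis;      // how long to fail fast before allowing a trial call
    private int consecutiveFailures = 0;
    private long openedAt = 0L;
    private boolean open = false;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean isOpen() {
        return open;
    }

    /** Closed circuit: always allow. Open circuit: allow a trial only after the wait period. */
    private synchronized boolean allowRequest(long now) {
        return !open || now - openedAt >= openMillis;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        open = false; // a successful call closes the circuit (rapid recovery)
    }

    private synchronized void recordFailure(long now) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = now; // (re)start the fail-fast period
        }
    }

    /** Runs the primary call if allowed; otherwise, or on failure, returns the fallback. */
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (!allowRequest(System.currentTimeMillis())) {
            return fallback.get(); // fail fast without touching the dependency
        }
        try {
            T result = primary.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure(System.currentTimeMillis());
            return fallback.get(); // degrade gracefully
        }
    }
}
```

With a threshold of two, two failing calls open the circuit, and later calls return the fallback without invoking the failing dependency until the open period has elapsed.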

 
Restricting concurrent access to a given backend service has proven to be an effective form of bulkheading, as it caps the resources that service can consume at a concurrent request limit smaller than the total resources available in an application instance. We do this using two techniques: thread pools and semaphores. Both provide the essential quality of restricting concurrent access, while thread pools provide the added benefit of timeouts, so the caller can “walk away” if the underlying work is latent.
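
The two isolation techniques can be sketched with the JDK's `java.util.concurrent` classes. This is an illustrative example, not how Hystrix is implemented: the class `Bulkheads` and its method names are hypothetical. The semaphore variant runs the work on the caller's thread and can only reject excess requests, while the thread pool variant also lets the caller stop waiting after a timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helpers illustrating bulkheading; names are assumptions for this sketch. */
final class Bulkheads {
    /** Semaphore isolation: work runs on the caller's thread; no timeout is possible. */
    static <T> T withSemaphore(Semaphore permits, Supplier<T> work, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // concurrency limit reached: reject immediately
        }
        try {
            return work.get();
        } finally {
            permits.release();
        }
    }

    /** A pool with a fixed number of threads and no queue, so excess work is rejected. */
    static ExecutorService boundedPool(int threads) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(), new ThreadPoolExecutor.AbortPolicy());
    }

    /** Thread isolation: work runs on a pool thread, and the caller waits at most timeoutMillis. */
    static <T> T withThreadPool(ExecutorService pool, Callable<T> work,
                                long timeoutMillis, Supplier<T> fallback) {
        Future<T> future;
        try {
            future = pool.submit(work);
        } catch (RejectedExecutionException e) {
            return fallback.get(); // pool saturated: reject immediately
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // the caller "walks away" from latent work
            return fallback.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback.get();
        } catch (ExecutionException e) {
            return fallback.get(); // the work itself failed
        }
    }
}
```

In this sketch, giving each backend service its own `Semaphore` or pool is what creates the bulkhead: a slow dependency can exhaust only its own permits or threads, not those of the whole application instance.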


Isolating functionality rather than the transport layer is valuable because it extends the bulkhead beyond network failures and latency to cover failures caused by client code as well. Examples of such code include request validation logic, conditional routing to different or multiple backends, request serialization, response deserialization, response validation, and decoration. Network responses can be latent, corrupted, or incompatibly changed at any time, which in turn can result in unexpected failures in this application logic.
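
To make the distinction concrete, the sketch below places request validation, the transport call, response deserialization, and response validation inside one isolated unit, so a failure in any of those steps produces the same fallback. Everything here is hypothetical: the class `IsolatedRatingLookup`, the `fetchRating` method, the rating range of 0 to 5, and the `transport` function standing in for a network client are assumptions for illustration only.

```java
import java.util.function.Function;

/** Hypothetical example of isolating a whole unit of functionality, not just the network call. */
final class IsolatedRatingLookup {
    /**
     * Looks up a rating for movieId. The transport function stands in for a network client.
     * Any failure inside the unit, including client-side logic, yields the fallback value.
     */
    static int fetchRating(String movieId, Function<String, String> transport, int fallback) {
        try {
            // Request validation
            if (movieId == null || movieId.isEmpty()) {
                throw new IllegalArgumentException("movieId must not be empty");
            }
            // Transport call (may throw, or return a corrupted or changed payload)
            String raw = transport.apply(movieId);
            // Response deserialization
            int rating = Integer.parseInt(raw.trim());
            // Response validation
            if (rating < 0 || rating > 5) {
                throw new IllegalStateException("rating out of range: " + rating);
            }
            return rating;
        } catch (RuntimeException e) {
            return fallback; // same degraded result whether the network or client code failed
        }
    }
}
```

If only the transport call were isolated, an unparseable payload would surface as an exception in the caller's own code path instead of being absorbed by the fallback.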

Commerce Weekly: Small banks lagging in mobile

Small banks struggle with mobile and a look inside Netflix's data.

As banking goes mobile, smaller banks must find a way to keep up. Also, Netflix data is deconstructed at the Strata Conference, and commerce-related highlights from the Mobile World Congress.

How Netflix handles all those devices

Netflix's Matt McCarthy on building apps that work across platforms.

Matt McCarthy explains how WebKit and A/B testing play important roles on Netflix's many apps. Plus: Platform lessons Netflix has learned that apply to other developers and companies.

How the cloud helps Netflix

Netflix's Adrian Cockcroft on the benefits of a cloud infrastructure.

Netflix moved some of its services into Amazon's cloud last year. In this interview, Netflix cloud architect Adrian Cockcroft says the move was about building a scalable product and paying down technical debt.
