ENTRIES TAGGED "resilience"

Head Games: Ego and Entrepreneurial Failure

“Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.” —Samuel Beckett

Entrepreneurial success hinges in large part on a founder’s mastery of psychology. This requires the ability to manage one’s responses to what Ben Horowitz calls “The Struggle,” that is, the emotional roller coaster of startup life. Paul DeJoe captures the ups and downs of being a startup CEO in a post reprinted in a book that I edited, Managing Startups: Best Blog Posts.

It’s all in a founder’s head: the drive to build something great; the resilience to dust yourself off when you repeatedly get knocked down; the passion powering a Reality Distortion Field that mesmerizes potential teammates, investors, and partners. But inside a founder’s head may also be delusional arrogance; an overly impulsive “ready-fire-aim” bias for action; a preoccupation with control; fear of failure; and self-doubt fueling the impostor syndrome. That’s why VC-turned-founder-coach Jerry Colonna named his blog The Monster in Your Head. In a recent interview with Jason Calacanis, Colonna does a nice job of summarizing some of the psychological challenges confronting entrepreneurs. So does a classic article by the psychoanalyst Manfred Kets de Vries: “The Dark Side of Entrepreneurship.”

Application Resilience in a Service-oriented Architecture

Velocity 2013 Speaker Series

Failure Isolation and Operations with Hystrix

Web-scale applications such as Netflix serve millions of customers using thousands of servers across multiple data centers. Unmitigated system failures can damage the user experience, a product’s image, a company’s brand, and potentially its revenue. Service-oriented architectures at this scale are too complex to completely understand or control and must be treated accordingly. The relationships between nodes change constantly as actors within the system evolve independently. Failure, in the form of errors and latency, will emerge from these relationships, and even resilient systems can easily “drift” into states of vulnerability. Infrastructure alone cannot be relied upon to achieve resilience. Application instances, as components of a complex system, must isolate failure and constantly audit for change.

At Netflix, we have spent a lot of time and energy engineering resilience into our systems. Among the tools we have built is Hystrix, which specifically focuses on failure isolation and graceful degradation. It evolved from a series of production incidents involving saturated connection and/or thread pools, cascading failures, and misconfigurations of pools, queues, timeouts, and other such “minor mistakes” that led to major user impact.

This open source library follows these principles in protecting our systems when novel failures inevitably occur:

  • Isolate client network interaction using the bulkhead and circuit breaker patterns.
  • Fall back and degrade gracefully when possible.
  • Fail fast when fallbacks aren’t available and rapidly recover.
  • Monitor, alert and push configuration changes with low latency (seconds).
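The circuit breaker, fallback, and fail-fast principles above can be sketched roughly as follows. This is a minimal illustration in Python, not Hystrix's API or its actual algorithm: the class name, the consecutive-failure threshold, and the `reset_timeout` parameter are all assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the backend."""


class CircuitBreaker:
    """Hypothetical breaker that trips after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the backend.
                return self._degrade(fallback, CircuitOpenError("circuit open"))
            # Otherwise half-open: let this one trial call through.
        try:
            result = func()
        except Exception as exc:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            return self._degrade(fallback, exc)
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result

    @staticmethod
    def _degrade(fallback, exc):
        # Degrade gracefully if a fallback exists; otherwise surface the error.
        if fallback is None:
            raise exc
        return fallback()
```

A caller would wrap each backend request, for example `breaker.call(fetch_recommendations, fallback=lambda: [])`, where `fetch_recommendations` is a hypothetical client function. While the circuit is open the fallback is returned immediately, which keeps caller threads from piling up behind a failing dependency.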

 
Restricting concurrent access to a given backend service has proven to be an effective form of bulkheading, because it caps the resources that service can consume at a concurrent request limit smaller than the total resources available in an application instance. We do this using two techniques: thread pools and semaphores. Both restrict concurrent access; thread pools add the benefit of timeouts, so the caller can “walk away” if the underlying work is latent (slow to respond).
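The two techniques can be sketched as follows, using only the Python standard library. The class names and the reject-when-full behavior are assumptions for illustration, not Hystrix's implementation. The semaphore version runs the work on the caller's thread, so it cannot time out; the thread pool version runs it on a worker thread, so the caller can stop waiting.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class BulkheadFullError(Exception):
    """Raised when the concurrent request limit for a backend is reached."""


class SemaphoreBulkhead:
    """Limits concurrency; work runs on the caller's thread (no timeout)."""

    def __init__(self, max_concurrent):
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def call(self, func):
        if not self._permits.acquire(blocking=False):
            raise BulkheadFullError("concurrent request limit reached")
        try:
            return func()
        finally:
            self._permits.release()


class ThreadPoolBulkhead:
    """Limits concurrency and lets the caller walk away after a timeout."""

    def __init__(self, max_concurrent):
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent)
        self._permits = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, timeout):
        # Reject instead of queueing when every worker is busy.
        if not self._permits.acquire(blocking=False):
            raise BulkheadFullError("concurrent request limit reached")
        try:
            future = self._pool.submit(func)
        except BaseException:
            self._permits.release()
            raise
        # The permit is held until the work really finishes, even if the
        # caller has already given up waiting for it.
        future.add_done_callback(lambda _f: self._permits.release())
        return future.result(timeout=timeout)  # raises FutureTimeout when slow
```

Note that a timed-out call still occupies its worker thread until the underlying work returns. Only the caller is freed, which is why the permit is released from the completion callback rather than when the timeout fires.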

Isolating functionality rather than the transport layer is valuable because it extends the bulkhead beyond network failures and latency to cover failures caused by client code as well. Examples of such code include request validation logic, conditional routing to different or multiple backends, request serialization, response deserialization, response validation, and decoration. Network responses can be latent, corrupted, or incompatibly changed at any time, which in turn can result in unexpected failures in this application logic.
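As a rough illustration of isolating functionality rather than transport, the sketch below puts request validation, the backend call, deserialization, and response validation inside one isolated operation, so a corrupted response degrades to a fallback instead of propagating. All names here (`fetch_raw_ratings`, `get_ratings`, `run_isolated`) are hypothetical, and the network call is faked with a truncated payload.

```python
import json


def fetch_raw_ratings(user_id):
    """Hypothetical stand-in for a network call; returns a corrupted payload."""
    return '{"user": %d, "ratings": [5, 4' % user_id  # truncated JSON


def get_ratings(user_id):
    """The whole client operation, not just the network hop."""
    if user_id <= 0:  # request validation
        raise ValueError("invalid user id")
    payload = json.loads(fetch_raw_ratings(user_id))  # response deserialization
    ratings = payload["ratings"]
    if not all(isinstance(r, int) for r in ratings):  # response validation
        raise ValueError("unexpected rating type")
    return ratings


def run_isolated(operation, fallback):
    """Run the full operation; any failure inside it degrades to the fallback."""
    try:
        return operation()
    except Exception:
        return fallback()


# The corrupted response fails in deserialization, not in the network layer,
# yet the caller still receives a degraded result.
ratings = run_isolated(lambda: get_ratings(42), fallback=lambda: [])
```

Had only the network call been wrapped, the `json.loads` failure would have escaped the isolation boundary and surfaced in the caller.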

Distributed resilience with functional programming

Steve Vinoski on when to make the leap to functional programming.

Functional programming has a long and distinguished heritage of great work, yet for most of that history only a small group of programmers used it. In a world dominated by individual computers running single processors, the extra cost of thinking functionally limited its appeal. Lately, as more projects require distributed systems that must always be available, functional programming approaches suddenly look a lot more appealing.

Steve Vinoski, an architect at Basho Technologies, has been working with distributed systems and complex projects for a long time, first as a tentative explorer and then leaping across to Erlang when it seemed right. Seventeen years as a columnist on C, C++, and functional languages have given him a unique viewpoint on how developers and companies are deciding whether and how to take the plunge.

Highlights from our recent interview include:

Read more…