ENTRIES TAGGED "monitoring"
Scale and complexity call for leaving it to specialists
As applications move from onpremise to SaaS, the scale of deployments increases by orders of magnitude (to “webscale”). At the same time, application development and operation become tightly integrated and continuous deployment brings the frequency of updates down from months to days or even hours.
The larger scale makes the health of SaaS applications mission-critical and even existential to its providers, while the frequent updates increase the risk of failures. Therefore, monitoring and root cause analysis also become mission critical functions, and more instrumentation is needed to ensure the application’s quality of service. At the company I co-founded, we see customers using extensive and often tailored instrumentation that generates massive amounts of data (think hundreds of thousands of data streams and billions of data points per day).
Current tools make collection and visualization easier but don't reduce work
New tools are raining down on system administrators these days, attacking the “monitoring sucks” theme that was pervasive just a year ago. The new tools–both open source and commercial–may be more flexible and lightweight than earlier ones, as well as more suited for the kaleidoscopic churn of servers in the cloud, making it easier to log events and visualize them. But I look for more: a new level of data integration. What if the monitoring tools for different components could send messages to each other and take over from the administrator the job of tracing causes for events?
The Havana release features metering and orchestration
I talked this week to Jonathan Bryce and Mark Collier of OpenStack to look at the motivations behind the enhancements in the Havana release announced today. We focused on the main event–official support for the Ceilometer metering/monitoring project and the Heat orchestration project–but covered a few small bullet items as well.
Velocity 2013 Speaker Series
Building a high quality alerting system often feels like a dark art. Often it is hard to set the proper thresholds and it is even harder to define when an alert should be triggered or not. This results in alerts being raised too early or too late and your colleagues losing faith in the system. Once you use a structured approach to build an alerting system you will find it much easier and the alerts more predictable and precise.
First you have to select proper measures to alert on. This selection is key as all other steps depend on using meaningful measures. While there seems to be an infinite number of different measures, you can categorize them into three main categories:
- Saturation measures indicate how much of a resource is used. Examples are CPU usage or resource pool consumption.
- Occurrence measures indicate whether a condition was met or not. A good example is errors. These measures are often presented as a rate like failed transactions per seconds.
- Continuous measures do not have a single value at any given point in time, but instead a large number of different values. A typical example is response times. Irrespective of how small you make the sample, you will always have a large amount of values and never just one single representative value.