Availability: The Art of Uptime
We’ve all seen Flickr “having a massage” – a lighthearted euphemism announcing that the site is down for maintenance, scheduled or otherwise. But downtime is a serious issue for any business that relies on websites and web applications, whether as part of the sales cycle or as the product offering itself.
Our primary goal is to minimize downtime. To do this, we must understand what causes failures – and of course, to investigate the causes, we need to know about the failures in the first place.
Sadly, all too many businesses only discover that their websites are experiencing downtime when confronted by annoyed visitors, prospects or customers.
This situation is easily avoided by using one of the hundreds of proactive website monitoring products and services available – including of course the services we offer.
Assuming we are aware that our website infrastructure is suffering from reliability issues, we can group the causes into three fundamental types: software failures, network and infrastructure failures, and hardware failures.
Software Failures

Arguably the primary cause of website downtime is unreliable or misconfigured application software. The stalwarts of web server infrastructure – Apache Web Server, Microsoft IIS, MySQL, SQL Server et al. – have been actively developed for many years and are now remarkably reliable.
However, modern web applications rely on software that has not been refined for nearly as long, and the immature state of the technology can manifest as application failures in a production environment.
Server software will often perform flawlessly until peak traffic levels are reached, at which point RAM or other resource constraints cause an application failure. This situation is particularly difficult to diagnose unless the performance of your system is constantly monitored, providing you with an audit trail of the circumstances leading to the failure.
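As a rough sketch of what such an audit trail might look like in practice (the `ResourceAuditTrail` class, probe names and sampling window below are illustrative, not part of any particular monitoring product), a small Python sampler could keep a rolling window of resource readings leading up to a failure:

```python
import shutil
import time
from collections import deque

class ResourceAuditTrail:
    """Keep a rolling window of timestamped resource readings, so the
    circumstances leading up to a failure can be reconstructed."""

    def __init__(self, probes, window=360):
        self.probes = probes                 # name -> zero-arg callable returning a number
        self.samples = deque(maxlen=window)  # oldest readings are discarded automatically

    def sample(self, now=None):
        """Take one reading from every probe and record it with a timestamp."""
        reading = {name: probe() for name, probe in self.probes.items()}
        reading["time"] = now if now is not None else time.time()
        self.samples.append(reading)
        return reading

# Example probe: free disk space in megabytes, via the standard library.
trail = ResourceAuditTrail({
    "disk_free_mb": lambda: shutil.disk_usage("/").free // 2**20,
})
trail.sample()
```

Called from a scheduler every few seconds, a sampler like this leaves you a record of memory, disk and traffic levels in the minutes before a crash, which is exactly the evidence needed to diagnose a resource-exhaustion failure.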
At a more mundane level, misconfiguration of the web server can result in its inability to correctly serve page content. Some examples include:
- Apache attempting to automatically restart after log rotation, but failing due to a broken configuration file or a file permissions error.
- Dependencies failing to start at boot time, causing the application to fail after a reboot.
- Automated backup procedures that fail to correctly restore system functionality after backup completion.
- Just plain running out of disk space!
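Several of these misconfigurations can be caught before they cause an outage. The following Python sketch shows the general idea; the configuration path and disk-space threshold are illustrative assumptions, not defaults of any particular web server:

```python
import os
import shutil

def preflight_checks(config_path="/etc/apache2/apache2.conf",
                     min_free_bytes=1 * 2**30):
    """Return a list of problems likely to make the next restart or
    log rotation fail; an empty list means all clear."""
    problems = []

    # A missing or unreadable configuration file will break an
    # automatic restart after log rotation.
    if not os.path.isfile(config_path):
        problems.append(f"config file missing: {config_path}")
    elif not os.access(config_path, os.R_OK):
        problems.append(f"config file unreadable: {config_path}")

    # Just plain running out of disk space.
    free = shutil.disk_usage("/").free
    if free < min_free_bytes:
        problems.append(f"low disk space: {free} bytes free")

    return problems
```

Run from cron before each scheduled restart or backup, a check like this turns a silent 3 a.m. failure into an actionable warning.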
As dedicated servers and VPSes continue to become cheaper, the web hosting environment will continue to trend toward these self-managed options. Unless we see a marked simplification in the administration of these systems, we can expect server misconfiguration to be responsible for an increasing share of website failures.
Network and Infrastructure Failures

The fundamentals of your web server infrastructure are, of course, electricity and an Internet connection – if the data center hosting your servers cannot provide these reliably, your website availability will suffer greatly.
At the core of the technology that comprises the Internet are IP packets. A key fact often overlooked by web developers is that IP packets are never guaranteed to arrive at their destination.

Instead, the layers that sit on top of IP – specifically TCP – must re-send packets that are deemed “lost” along the way. So while HTTP is a generally robust application protocol, failures are commonplace at the layers beneath it.
The upshot is that your hosting provider might be delivering on its 99.99% uptime guarantee, yet if 20% of your packets are being lost and re-sent, your effective availability is far lower.
Transient failures at the packet level are a mostly hidden source of frustration for your users. Long delays on page loads, timeouts and inconsistent performance hinder their use of your service, and make them less likely to return. As the site operator, it can be difficult for you to pinpoint these problems and take corrective action.
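One practical mitigation, on the client or monitoring side, is to retry with backoff so that a single lost packet or timed-out request does not register as an outage. A minimal Python sketch (the `fetch_with_retry` helper and its defaults are illustrative assumptions, not part of any standard API):

```python
import time

def fetch_with_retry(fetch, attempts=3, base_delay=0.5):
    """Call `fetch` (any zero-arg callable that raises on failure),
    retrying with exponential backoff to ride out transient
    network-level failures such as lost packets or timeouts."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:
            last_error = error
            time.sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise last_error
```

In practice `fetch` might wrap an HTTP request with an explicit timeout, e.g. `lambda: urllib.request.urlopen(url, timeout=5).read()`; the backoff keeps a transient blip from being reported as downtime while still surfacing persistent failures.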
This scenario illustrates, once again, the importance of monitoring the performance of your website infrastructure continuously, so that problems are uncovered before they can cause outright failures.
Hardware Failures

The final and least common cause of website downtime is hardware failure. Commodity server hardware is now very mature and offers a great deal of performance at low cost, so failures in the server environment are thankfully rare.

However, this cause should not be discounted – particularly as the budget server market brings pricing to new lows, and continual cost cutting becomes necessary to remain competitive.
The value of redundancy – e.g. RAID disk arrays, multiple power supplies and error-correcting RAM – quickly becomes apparent when the costs of hardware failures are considered.
An Ounce of Prevention
Having considered the risk factors affecting website availability, we can devise a strategy to minimize and, as far as possible, prevent downtime.
Recognize that failures are inevitable.
All computer systems suffer from failures, so minimizing the frequency and impact of these failures is the best course of action.
Understand what’s happening.
By proactively monitoring both performance and availability, we can better understand the issues that lead to a failure.
Ensure that your monitoring strategy provides relevant parties with rapid notification so that corrective action can be taken as quickly as possible.
When a failure does occur, the most important outcome is that you learn from the experience and can take steps to prevent the same situation from occurring again.
We offer performance and availability monitoring of your servers with a crucial difference: we help you find out why your websites experience downtime, so you can solve the problem at its source.
Fully featured monitoring is available completely free for 14 days, so sign up today to improve your website performance and availability.