Your system has a bottleneck when it has 100% utilization of a resource like memory or the processor. You eliminate this bottleneck by
On a busy computer system, when you eliminate one bottleneck, you often expose another. Our goal is to utilize all resources equally. Ideally, we have excess capacity on all of the major resources, and therefore no bottleneck exists.
Monitor both Utilization and Queues
Bottlenecks can occur when you have 100% utilization of a system resource. Although a 100% peak will cause an optimization problem, the average utilization may be much less than 100%. Most systems have periods of high utilization followed by periods of low utilization. It is important to look not only at average utilization statistics, but also at queue statistics. If processes are queuing while they wait for some resource, then these processes are being delayed by the lack of that resource. If any process is queuing, then performance will improve when you provide more of the lacking resource, for example by adding more memory.
Use tools such as System Monitor to detect bottlenecks.
The Heisenberg principle is named for the physicist Werner Heisenberg. Although Werner Heisenberg has not been well known outside of scientific circles, the recent theatrical play Copenhagen has raised his profile among the general public. The Tony Award-winning play is about the wartime meeting of physicists Werner Heisenberg and Niels Bohr.
A resource bottleneck is obvious with 100% utilization, but what about 90%, 80%, or even 50%? At 50% utilization, is the glass half empty or half full? Does 90% utilization indicate high utilization or 10% excess capacity? A good analogy to consider is utilization of a freeway. If cars on a freeway are bumper to bumper and traveling at the speed limit, we have 100% utilization. This means that we have optimal use of the freeway. This is not a problem unless another car is on the on-ramp trying to get onto the freeway. If this happens, we get a queue as cars line up on the on-ramp. The queue represents a lack of service and resource overload. Uneven usage may result in queues even when average utilization is less than 100%. This analogy shows that bottleneck detection requires the monitoring of both utilization and queues.
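The point about uneven usage can be illustrated with a minimal Python sketch. The sample data, the function names `summarize` and `has_queuing`, and the (utilization, queue length) tuple layout are all assumptions made for illustration; they are not the output format of System Monitor or any other tool.

```python
from statistics import mean

def summarize(samples):
    """Summarize (utilization_percent, queue_length) samples for one resource."""
    utilizations = [u for u, _ in samples]
    queues = [q for _, q in samples]
    return {
        "avg_utilization": mean(utilizations),
        "peak_utilization": max(utilizations),
        "avg_queue": mean(queues),
        # Number of sampling intervals in which at least one process waited.
        "intervals_with_queue": sum(1 for q in queues if q > 0),
    }

def has_queuing(summary):
    """Flag the resource if work queued in any interval, whatever the average."""
    return summary["intervals_with_queue"] > 0

# Hypothetical samples: two short bursts at 100%, otherwise mostly idle.
SAMPLES = [(20, 0), (100, 3), (30, 0), (100, 5), (25, 0), (25, 0)]

summary = summarize(SAMPLES)
print(summary)
print("queuing detected:", has_queuing(summary))
```

In this invented data set the average utilization is only 50%, yet processes queued during two of the six intervals, which the average alone would hide.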
Planning
To analyze your log data, store the log files in CSV or TSV format. These can easily be imported into a spreadsheet program. You can then use the trend analysis features of your spreadsheet program to predict usage into the future. In addition to log files, you need to know about major changes in your organization.
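The same kind of trend analysis can be sketched outside a spreadsheet. The following Python example fits a straight line to logged values by least squares and extrapolates it. The column names `week` and `disk_used_gb`, the sample figures, and the helper names are hypothetical, and a linear fit is only one possible trend model; real logs may need a different one.

```python
import csv
import io

def load_usage(csv_text):
    """Read (week, disk_used_gb) pairs from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(int(r["week"]), float(r["disk_used_gb"])) for r in reader]
    return [w for w, _ in rows], [g for _, g in rows]

def linear_trend(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical log excerpt; a real log would be read from a file instead.
SAMPLE_LOG = """week,disk_used_gb
1,100
2,110
3,120
4,130
"""

weeks, used = load_usage(SAMPLE_LOG)
slope, intercept = linear_trend(weeks, used)
forecast_week_10 = slope * 10 + intercept
print(f"growth per week: {slope:.1f} GB, forecast for week 10: {forecast_week_10:.1f} GB")
```

A forecast like this only extends past behavior, so it still has to be adjusted for known organizational changes such as new staff or new applications.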
Important metrics to measure fault tolerance are
Downtime is generally measured as MTTR/MTTF, but since it can be prohibitively expensive to increase MTTF beyond a certain point, you should spend both time and resources on managing and reducing the MTTR for your most likely and costly points of failure. Most electronic components have a distinctive "bathtub" curve that represents their failure characteristics, as shown in the following diagram. During the early life/burn-in phase of the component, it's more likely to fail; once this initial phase is over, a component's overall failure rate remains low until it reaches the end of its useful life, when the failure rate increases again.
The normal statistical failure rates for mechanical and electronic components: a characteristic "bathtub" curve
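To make the MTTR/MTTF ratio mentioned above concrete, here is a small Python sketch. The hour figures are invented for illustration, and the `availability` function uses the common steady-state formula MTTF / (MTTF + MTTR), which is not stated in the text above and is included only for comparison.

```python
def downtime_ratio(mttr_hours, mttf_hours):
    """Downtime expressed as MTTR/MTTF, the ratio used in the text."""
    return mttr_hours / mttf_hours

def availability(mttr_hours, mttf_hours):
    """Common steady-state availability formula (an addition, not from the text)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical component: fails every 1000 hours on average, takes 4 hours to repair.
baseline = downtime_ratio(4, 1000)
faster_repair = downtime_ratio(2, 1000)   # halve the MTTR
longer_life = downtime_ratio(4, 2000)     # double the MTTF

print(f"baseline downtime ratio:  {baseline:.4f}")
print(f"with MTTR halved:         {faster_repair:.4f}")
print(f"with MTTF doubled:        {longer_life:.4f}")
print(f"baseline availability:    {availability(4, 1000):.4%}")
```

Halving the repair time lowers the ratio by the same amount as doubling the time between failures, which is why reducing MTTR is often the cheaper route once further MTTF gains become expensive.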