Windows Server Troubleshooting - Analysis



2003-2006 Team Approach Limited
All rights reserved


Your system has a bottleneck when it has 100% utilization of a resource such as memory or the processor. You eliminate this bottleneck by

  • adding resources, or
  • lowering the utilization.

On a busy computer system, when you eliminate one bottleneck, you often get another bottleneck. Our goal is to try to utilize all resources equally. Ideally we have excess capacity on all of the major resources and therefore no bottleneck exists.

Monitor both Utilization and Queues

Bottlenecks can occur when you have 100% utilization of a system resource. Although a 100% peak will cause an optimization problem, the average utilization may be much less than 100%. Most systems have periods of high utilization followed by periods of low utilization. It is important not only to look at average utilization statistics, but also to look at queue statistics. If processes are queuing while waiting for some resource, then those processes are being delayed by the lack of that resource. If any processes are queuing, then performance will improve by providing more of the lacking resource, for example by adding more memory.

Monitoring tools affect performance

Use tools such as System Monitor to detect bottlenecks.
Remember the Heisenberg principle: your measuring tools may affect what you are trying to measure. For example, do not store a monitoring log file on a hard disk that is being monitored.

The Heisenberg principle is named for the physicist Werner Heisenberg. Although Heisenberg has not been well known outside of scientific circles, the recent theatrical play Copenhagen has raised his profile amongst the general public. The Tony Award-winning play is about the wartime meeting of the physicists Werner Heisenberg and Niels Bohr.

Half Full or Half Empty?

A resource bottleneck is obvious at 100% utilization, but what about 90%, 80%, or even 50%? At 50% utilization, is the glass half empty or half full? Does 90% utilization indicate high utilization or 10% excess capacity?

A good analogy to consider is utilization of a freeway. If cars on a freeway are bumper to bumper, traveling at the speed limit, we have 100% utilization. This means we have optimal use of the freeway. This is not a problem unless another car is on the on-ramp trying to get onto the freeway. If that happens, we get a queue as cars line up on the on-ramp. The queue represents a lack of service and resource overload. Uneven usage may result in queues even with less than 100% average utilization. This analogy shows that bottleneck detection requires the monitoring of both utilization and queues.
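The freeway picture can be made concrete with a toy discrete-time queue simulation (a sketch for illustration, not a Windows monitoring tool). Two workloads carry the same average load of 0.5 jobs per tick: one arrives steadily, one arrives in bursts. Both show about 50% utilization, but only the bursty workload builds a queue.

```python
import random

def simulate(arrival_prob, burst=1, service_rate=1, ticks=10_000):
    """Single-server queue: each tick, `burst` jobs arrive with probability
    `arrival_prob`; the server then completes up to `service_rate` jobs.
    Returns (utilization, average queue length after service)."""
    queue = busy_ticks = queue_total = 0
    for _ in range(ticks):
        if random.random() < arrival_prob:
            queue += burst
        if queue > 0:
            busy_ticks += 1
            queue -= min(service_rate, queue)
        queue_total += queue
    return busy_ticks / ticks, queue_total / ticks

random.seed(1)
# Steady arrivals: one job half the time -> ~50% utilization, no queue.
util, q = simulate(arrival_prob=0.5, burst=1)
# Bursty arrivals: same average load (0.5 jobs/tick) in bursts of 5
# -> same ~50% utilization, but a persistent queue while bursts drain.
util_b, q_b = simulate(arrival_prob=0.1, burst=5)
```

Average utilization alone cannot distinguish these two systems; the queue statistic can, which is why both counters matter.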



Obviously, 50% utilization is not a problem if there is no queue.



But 100% utilization is also not a problem if there is no queue!


Obviously, 100% Utilization is a problem if there is a queue.


But 50% Utilization is also a problem if there is a queue!


In capacity planning we anticipate users' needs and predict what resources we will need in the future. Capacity planning begins with recording performance statistics of our Windows network. The easiest way to collect this data is to gather performance logs. These log files become baselines that we can reference in the future. Compare current and historical log files to identify growth trends.

To analyze your log data, store the log files in CSV or TSV format. These can easily be imported into a spreadsheet program. You can then use the trend analysis features of your spreadsheet program to predict usage into the future.
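The same trend analysis a spreadsheet performs can be sketched directly in code. The counter name and weekly figures below are hypothetical; the fit is an ordinary least-squares line, the math behind a spreadsheet's TREND/LINEST features.

```python
import csv
import io

# Hypothetical weekly averages of a memory counter, as exported
# from a performance log in CSV format.
log_csv = """week,memory_pct
1,41.0
2,43.5
3,44.8
4,47.1
5,49.0
6,51.2
"""

rows = list(csv.DictReader(io.StringIO(log_csv)))
xs = [float(r["week"]) for r in rows]
ys = [float(r["memory_pct"]) for r in rows]

# Least-squares fit of y = a + b*x.
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x

# Project memory usage 26 weeks out to see when capacity runs short.
projected = a + b * 26
```

A growth rate of roughly two percentage points per week would put this hypothetical server near its limit within half a year, which is the kind of early warning baseline comparison is meant to give.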

In addition to log files, you need to know about major changes in your organization.

  • Numbers of users
  • Software upgrades
  • Network changes
  • Business direction
    • Mergers
    • Budgets
    • New activities
    • Seasonal activity

Mean Time to Failure and Mean Time to Recover

Important metrics to measure fault tolerance are

  • MTTF, mean time to failure: the average time until the device fails, and

  • MTTR, mean time to recover: the average time it takes to recover once a failure has occurred.

The downtime fraction is generally measured as MTTR/(MTTF + MTTR), but since it can be prohibitively expensive to increase MTTF beyond a certain point, you should spend both time and resources on managing and reducing the MTTR for your most likely and costly points of failure.
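These ratios are easy to work through numerically. The figures below are hypothetical, but the formulas are the standard steady-state ones: availability is MTTF/(MTTF + MTTR), and downtime is the remaining fraction of the year.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

def downtime_hours_per_year(mttf_hours, mttr_hours):
    """Expected hours of downtime in a year of continuous operation."""
    return (1 - availability(mttf_hours, mttr_hours)) * 24 * 365

# Hypothetical disk: fails on average every 50,000 hours and takes
# 4 hours to swap and restore from backup.
a = availability(50_000, 4)
yearly = downtime_hours_per_year(50_000, 4)

# Halving MTTR improves availability as much as doubling MTTF would,
# and is usually far cheaper -- hence the advice to invest in recovery.
a_fast_recovery = availability(50_000, 2)
```
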

Most electronic components have a distinctive "bathtub" curve that represents their failure characteristics, as shown in the following diagram. During the early life/burn-in phase of the component, it's more likely to fail; once this initial phase is over, a component's overall failure rate remains low until it reaches the end of its useful life, when the failure rate increases again.
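One common way to model a bathtub-shaped failure rate (an illustrative assumption, not a claim about any specific component) is to add two Weibull hazard functions: a shape parameter below 1 gives the falling burn-in rate, and a shape above 1 gives the rising wear-out rate.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) of a Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t_hours):
    # Early-life failures (shape < 1: rate falls with age) plus
    # wear-out failures (shape > 1: rate rises with age).
    # The scale parameters are arbitrary illustrative values.
    return (weibull_hazard(t_hours, shape=0.5, scale=1_000)
            + weibull_hazard(t_hours, shape=5.0, scale=10_000))

# Sample the three phases of the curve.
early = bathtub_hazard(10)      # burn-in: high and falling
mid = bathtub_hazard(2_000)     # normal aging: low and flat
late = bathtub_hazard(12_000)   # failure mode: rising again
```

Evaluating the combined hazard at an early, middle, and late time reproduces the high-low-high shape of the diagram.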

[Figure: the normal statistical failure rates for mechanical and electronic components, a characteristic "bathtub" curve, with three phases: burn-in, normal aging, and failure mode.]