Modern computer networks are complex, high-speed marvels of technology with advanced server, client, and network hardware and software working in close cooperation. They are so fast and complex, in fact, that it can sometimes be difficult to trace the source of problems that arise.
This chapter deals with methodology for identifying underlying causes of faults in a client/server network environment. It examines the complexity that can make problem identification so difficult and includes recommendations for precautionary measures that can help you get a handle on the complexity of your particular network. This chapter then provides a roadmap for identifying the underlying cause of a network problem.
NOTE: This chapter focuses on fault identification methodology. It does not describe network troubleshooting tools and how to use them. This chapter also does not describe how to correct a problem once it is identified--Chapter 32, "Repairing Common Types of Problems," does that for the most common network problems. The information required to fix other problems can be found elsewhere in this book in the relevant chapters.
NOTE: While this chapter focuses on client/server network faults, the methods described are generic and can be applied equally to peer-to-peer networks.
An initial fault report, also known as a problem report, is generally a description of one or more symptoms rather than of the underlying problem itself. The problem can be regarded as successfully located when it has been redefined in a specific hardware or software context where it can be dealt with directly.
For example, a problem might be reported initially as "Can't connect to server." After investigation, you might be able to restate the problem as "The server has a faulty network adapter" or "The client is using the incorrect frame type." In the first case, the server's network adapter needs to be replaced; in the second, the client software needs to be reconfigured. In both cases, the problem has been located (though not yet resolved).
How do you get from the initial statement of the problem, phrased in the user's terms, to a statement in your own terms of the specific locus of the underlying problem? The following sections look at a few possible approaches.
The trial-and-error approach is one way to find causes of problems--try changing a component here or a driver there and see if it all starts working again. This is a bad approach, but let's think about why it is bad.
First, there is an underlying assumption that you don't really understand the way your network works. If you did understand how the network works, then you'd make use of your insight--rather than trial and error--to approach the problem in a logical way.
Second, there is no guarantee that you'll ever locate the problem. It might be caused by a combination of factors, and unless you happen to eliminate all the contributing factors at the same time, the problem will persist.
This doesn't mean, however, that there isn't a useful element of trial and error in properly planned troubleshooting. You need to experiment with the network to some extent to find answers to your questions about the problem (for example, "Does the problem persist if I use a different connection?"), but such tests should be used as part of a methodological approach to identifying the source of the problem--they should not constitute the entire methodology.
Another approach, more commonly used in real life, is the expert system approach, where you use your expert knowledge of the network to short-circuit the troubleshooting process. For example, two users complain that they can no longer connect to the server. You can see the server from your workstation, so you immediately suspect the repeater, which you know sits between them and the server.
You can thus use your insight into your network to pounce quickly on the obscure cause of a problem that might be difficult to solve in a step-by-step way. Combined with the trial-and-error approach, troubleshooting like this can be very successful.
The trouble with this approach is that it not only uses your insight into the network but relies on that insight. If you don't have all the information, you might arrive at the wrong answer. Consider the previous example--suppose one of the two users had moved temporarily out of his office, because it was being painted, and into a colleague's office, where he brought his own workstation and connected it to the second user's thinnet strand. Given that information, you would make a very different first guess about the nature of the problem.
Generally it is best to use your expert knowledge of the network (along with trial and error) as part of a methodological approach rather than as the entire basis for the troubleshooting effort.
A more abstract approach, the problem space approach, is to regard the underlying problem as a point (or possibly a series of points) in n-dimensional space. Each of the n dimensions represents a possible parameterization of the problem in a specific hardware or software context. For example, you might have the following dimensions:
The extent of the problem can then be clearly identified by investigating the problem in the context of each of these dimensions.
The usefulness of this way of looking at the problem lies in its comprehensiveness. Define each possible dimension of your network this way before a problem arises. Whenever a problem comes up, you can check it on each dimension in turn. This means that you are unlikely to overlook some aspect of the problem or to be misled by apparently "obvious" causes.
Of course, this approach is not realistic. It is impossible to define all the possible dimensions that apply to any network. Even if you did, this approach would simply allow you to accurately identify the extent of the problem, not its underlying cause--pinpointing the cause would require a logical leap of some sort. The underlying idea, however--trying to comprehensively identify all possible problem dimensions--can be usefully incorporated into a structured troubleshooting methodology.
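To make that underlying idea concrete, here is a minimal sketch in Python. It is purely illustrative--the dimension names and the check routines are assumptions, not part of any real tool--but it shows the discipline the problem space approach encourages: every predefined dimension is checked in turn, so nothing is skipped because the cause seemed "obvious."

    # Illustrative sketch: walk a predefined checklist of problem "dimensions"
    # and record what each one shows for a given fault report.
    # The dimensions and their check routines are hypothetical.

    def check_clients(fault):
        # e.g. which workstations exhibit the symptom?
        return "only John's workstation affected"

    def check_servers(fault):
        # e.g. which servers are reachable from the affected clients?
        return "server SAL not visible"

    def check_cabling(fault):
        # e.g. which segments, hubs, or repeaters are involved?
        return "segment 2, repeater B"

    DIMENSIONS = {
        "clients": check_clients,
        "servers": check_servers,
        "cabling": check_cabling,
    }

    def map_problem_extent(fault):
        """Return the fault's footprint on each predefined dimension."""
        return {name: check(fault) for name, check in DIMENSIONS.items()}

    if __name__ == "__main__":
        for dimension, finding in map_problem_extent("Can't connect to server").items():
            print(f"{dimension:>10}: {finding}")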
The three approaches previously described can be combined in a structured way into a useful system. Avoid adopting an ad hoc approach to combining them--simply mixing the ingredients of trial and error, insight, and a long list of things to look into will not get you very far. Combine them using the following steps:
Figure 31.1
This is a sample fault report form.
NOTE: A definitively refuted hypothesis can be as useful as a definitively confirmed hypothesis. What matters is that you use a hypothesis that you can test and whose possible outcomes each eliminate a range of possible problem causes from consideration.
This sequence is illustrated in figure 31.2. In many cases, of course, the experimentation loop will not be quite so tight as illustrated; a test may produce unexpected results, which means that the extent of the problem is not quite what you originally thought. If that happens, you will have to redefine the scope of the fault and proceed again to the formulation of a new hypothesis.
This method uses elements of each approach described earlier:
Figure 31.2
This is the suggested troubleshooting process.
The underlying method here is to progressively restate the problem in a context more amenable to direct action. The problem is originally stated by the user in his or her own terms. You then investigate the problem, restating it in your terms and in light of your findings.
The following is an example to illustrate the method:
Now that the problem has been localized to the server, it can be dealt with on a direct basis--you need to establish why the server crashed, bring it back up, and so on.
Notice that the initial fault report was refined in the first three steps. John said that he couldn't log on, but it was necessary to establish whether he had tried from one or more workstations (step 2) and what had happened when he tried (step 3). At that stage, it was clear that the problem as far as John's workstation was concerned was that it couldn't see the server.
The focus then shifted to determining the extent of the problem. If any other machines could see server SAL (step 4), then the focus could shift to John's workstation and his part of the network. Since a second machine could not see the server, though, there was a good chance that the server was the source of the problem.
Step 5 confirmed this, though of course this might not have been the case--the server might have been fine, with the two workstations both affected by another network problem of some sort. The decision to check the server rather than the network in step 5 was based on the expectation that the network was unlikely to act in such a way as to prevent two clients from seeing a working server--the possibility that server SAL was down seemed more likely.
Here's another example:
After step 10, you know that the problem is with the bridge, and you can debug it.
In both of these examples, a more specific problem statement generated by a thorough fault report would have been preferable to the initial problem statement "John can't log on."
If, in the first case, it was "John gets a Server SAL not found message when he tries to log on from his workstation," you would have known right away that NETX was loading (so that his workstation could see at least one server) and that the workstation simply couldn't see SAL.
If, in the second case, it was "NETX on John's workstation gives a Cannot connect to server SAL error message," you would have known that the situation was worse, in the sense that the workstation could see no servers. Whether this is actually worse or not is unimportant here--what matters is that you get enough information at an early stage to distinguish between different symptoms.
The process of redefining the problem into a manageable context requires knowledge about various facets of the particular network that you are dealing with:
Collect information about each of these areas when the network is up and running normally. This allows you to investigate fault reports on the basis of firm information, categorized according to the contexts into which you need to resolve the problems. The following sections look at each of these areas in turn.
You need to understand the topology of a particular network if you hope to troubleshoot network problems on it. Review the structure of the network and draw two maps of it:
In both cases, show all bridges, repeaters, routers, hubs, and segments. Indicate where all servers and clients are connected (grouping large numbers of clients together for convenience).
TIP: Remember to keep both maps updated as your network grows.
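If you also want a machine-readable copy of the map alongside the drawings, very little is needed. The following sketch in Python is purely illustrative--the device names and the adjacency-list layout are assumptions, not a prescribed format--but something this simple can be searched quickly when a fault report comes in.

    # Illustrative sketch: a network map kept as a simple adjacency list.
    # Device names and connections here are made up for the example.
    NETWORK_MAP = {
        "hub-1":      ["server-SAL", "repeater-A", "ws-john", "ws-mary"],
        "repeater-A": ["hub-1", "hub-2"],
        "hub-2":      ["repeater-A", "server-HAL", "ws-paula"],
    }

    def neighbours(device):
        """List everything directly connected to a device, from either side."""
        direct = set(NETWORK_MAP.get(device, []))
        direct.update(d for d, links in NETWORK_MAP.items() if device in links)
        return sorted(direct)

    if __name__ == "__main__":
        print("server-SAL connects to:", neighbours("server-SAL"))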
A detailed inventory of all active network equipment will accelerate the troubleshooting process. You might need to check model numbers or firmware revision levels when tracing a fault, especially if you need to discuss things with the vendor. Everything proceeds with more speed and less panic if all the relevant information has been collected at an earlier, calmer stage.
Gather at least the following information for each piece of equipment:
A detailed hardware and software inventory for your servers is essential. Gather at least the following information for each server:
Gather at least the following information for each network adapter:
Gather at least the following information for each SCSI controller:
Gather at least the following information for each hard drive:
Gather at least the following information for each adapter (besides SCSI and network):
Make the compilation of this information an integral part of the installation procedure for new servers. Remember that an installation isn't worth much unless it's stable. That means that you must be able to respond quickly to the problems that will inevitably arise. Proper documentation of this sort should be regarded as an investment in the stability of the service. Keep this list up-to-date as the server configuration changes.
It might not be practical for you to keep a detailed inventory of all clients on the network. There might be too many of them, or they might be managed by someone else, meaning that the information does not normally come your way. If you are expected to provide support for these workstations, however--even if you only support their use of the network--you must have a certain amount of information about each workstation.
Gather at least the following information (if the workstations are managed by other people, insist that those people provide this information for each workstation they connect):
Gather at least the following information for each network adapter:
Information about the range of applications in use on the network will help to inform your troubleshooting efforts, but it's more difficult to define and collect than the information about hardware elements discussed in preceding sections.
A knowledge of the patterns of network use can also be helpful. You should know whether network load is likely to be sporadically or continuously heavy; intermittent problems, for example, might arise only when network traffic is unusually heavy.
Similarly, it is vital that you make yourself aware of the background of network users. In particular, talk to them about their expectations of the network, as well as their likely uses of it. You may find that their perception of the extent and purpose of the network is at odds with your own or that the level of service that they anticipate is unrealistic.
Understanding the vantage point of the user when you discuss a problem helps you get beyond their description--possibly inaccurate--of the problem. For instance, a fault report of "I can't log on" requires one response if the person reporting the problem is an experienced user but requires a different response if the user is unclear on the distinction between logging on and starting an application.
What do you do with all this information after you gather it? You must store it somewhere accessible in a format that suits your needs. Where you store it and which format you choose depends on your particular environment and preferences.
If you store the data electronically, you might find it useful to keep it on a file server so that it can be accessed from a range of locations. Remember, though, that this data will be required when the network is acting up, so storing it on a local hard drive might be more sensible. A compromise is to store the data on a server, but maintain a local copy for backup; if you do this, you must institute a procedure for making sure that the local copy is updated on a regular basis.
Deciding on a storage format is largely a matter of taste. Remember that the information is there for use in a crisis--it doesn't have to look pretty, but it has to be there! A small, uncomplicated network can be documented on a few sheets of paper in a folder. A large WAN might require a specialized database application to keep tabs on developments. In many cases, flat text files are adequate. They can be searched using a range of utilities, they are not application-specific, and there is little or no overhead to setting them up (unlike setting up a relational database).
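As an illustration of how little machinery flat text files require, here is a minimal sketch in Python. It assumes a one-file-per-device layout with "field: value" lines--an assumption for the purpose of the example, to be adapted to whatever format you actually choose.

    # Illustrative sketch: search flat-text inventory files for a keyword.
    # Assumes one file per device, each line holding "field: value".
    import sys
    from pathlib import Path

    def search_inventory(directory, keyword):
        """Yield (file name, matching line) pairs for lines containing keyword."""
        for record in Path(directory).glob("*.txt"):
            for line in record.read_text(errors="replace").splitlines():
                if keyword.lower() in line.lower():
                    yield record.name, line.strip()

    if __name__ == "__main__":
        # e.g.: python search_inventory.py ./inventory firmware
        directory, keyword = sys.argv[1], sys.argv[2]
        for name, line in search_inventory(directory, keyword):
            print(f"{name}: {line}")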
Finally, it is essential that you keep the information up-to-date. Most networks are dynamic entities, with changes occurring on a regular basis. Keeping the database up-to-date must become an integral part of the culture of the people who manage the network if the database is to serve its purpose and assist you in tracking down network faults.
Every network problem that comes to your attention appears first as a fault report. It may be very informal--"Hey, why can't I print?"--or it may be a formal report that is part of a well-defined fault reporting system. Even faults that you discover yourself can be regarded as fault reports in this sense; they are of the informal type, except that you immediately switch into troubleshooting mode and define the problem more precisely.
The quality of the initial fault report can have a major bearing on the time and effort required for you to identify the underlying cause of the problem. Sketchy or incorrect information can be a major hindrance to resolving the fault.
Spend some time defining the fault reporting procedure. It can be formal or informal, but it must be structured. If you support the network alone and deal with a small number of users, it might be adequate for you to ask a list of questions whenever a problem is reported. If you work as part of a team or deal with a large number of users, a more formal fault reporting procedure is probably appropriate--a formal procedure might include fault log numbers and a signoff system. In any case, record at least the following information for each fault:
The most useful item in terms of resolving the problem is the context. When someone reports a problem, spend some time talking to them about it and find out exactly what they were doing when the error occurred. Ask as many questions as are necessary to define the problem in your terms, rather than theirs. Users unfamiliar with the system might report inaccurately--for instance, believing that they were logged on when they were not--and even experienced users might not think of mentioning some vital facts.
Discussing the problem at some length often sheds unexpected light on the problem. Facts that to users might seem completely irrelevant ("Well, everything was fine before that little cable broke...") can have significant bearing on the problem. In particular, ask users about recent changes in their hardware and software setup and whether or not they fiddled with any cables.
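As a rough sketch of what a structured fault log entry might capture, consider the following Python fragment. The fields shown are illustrative assumptions, not a prescribed format--use whatever your own fault reporting procedure requires.

    # Illustrative sketch: append structured fault reports to a flat log file.
    # The fields are assumptions; substitute those your own procedure calls for.
    import json
    from datetime import datetime

    def log_fault(path, reporter, workstation, error_text, context):
        """Append one fault report, timestamped, as a line of JSON."""
        entry = {
            "logged_at": datetime.now().isoformat(timespec="seconds"),
            "reporter": reporter,
            "workstation": workstation,
            "error_text": error_text,   # full text, not an interpretation
            "context": context,         # what the user was doing at the time
        }
        with open(path, "a") as log:
            log.write(json.dumps(entry) + "\n")

    if __name__ == "__main__":
        log_fault("faults.log", "John", "WS-042",
                  "Server SAL not found",
                  "Logging on first thing in the morning; cable recently moved")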
One of the first decisions you must make when you receive a fault report is whether or not you trust its accuracy. This is where a good working relationship with your users is invaluable. If a user says that his workstation can see server X but not server Y, should you accept that as a fact?
The user's level of understanding of the network is obviously a factor. A naive user could mistakenly think that server X is accessible when in fact it is not, or she could be doing something wrong when she attempts to attach to server Y. If you know that the user is experienced with the procedures for attaching to servers, you might decide to accept that part of the fault report as being as factual as if you had observed it yourself.
A lot also depends on the particular user's understanding of the fault diagnosis process. Even a technically advanced user might neglect to provide information that would be useful to you in tracing the problem. A user who is familiar with the type of information you require and the way in which you use it is likely to provide most of the relevant information from the start: when the error happened, what they were doing at the time, the full text of any error message, and so on.
Users unfamiliar with your procedures, however, might provide incomplete information. Inexperienced users might not notice or understand some aspect of the problem. Experienced users might selectively omit details that they decide are not relevant to the problem at hand. In fact, inexperienced users are more likely to write down the full text of an error message that appears. More confident users are likely to interpret the message when they see it and then report their own interpretation of the message to you.
There is no hard-and-fast rule about the veracity of the fault report. Bear in mind that it could be inaccurate and, depending on your estimation of the user's understanding of the network and the fault diagnosis process, either verify each detail yourself or take it as fact. If you accept it as fact, however, make a mental note that the facts are unconfirmed and realize that you might need to backtrack to establish their accuracy at some later stage in the investigation.
One of the more intractable problem types is the intermittent fault. A user might report that his connection to the server is dropped at arbitrary intervals or that he occasionally needs to boot up twice before establishing a connection. If the problem cannot be reproduced at will, debugging it is very difficult.
The most common cause of such errors is a loose connector or a faulty cable. If an intermittent error occurs on a particular machine, replace the network adapter, cable, and connector. Check the power supply for output quality. If the problem still occurs intermittently, you can adopt a few different approaches.
One approach is to pretend that the problem happens constantly rather than intermittently, and then try to imagine what sort of underlying cause it might have. This is little more than idle speculation. It can be useful if the problem proves intractable, in that it might prompt you to consider some factor which you have neglected up to that point; however, it is not much of a practical step toward solving the problem.
A more positive method is to review the configuration of all affected machines. There might be, for example, some old, unstable drivers that execute sometimes, along with newer, stable drivers that execute more often.
Another useful approach is to watch the user while she goes through her normal daily startup procedure. Sit with her while she powers on the computer, logs on, and starts her usual applications. The problem might arise when you are present, giving you a chance to do some on-the-spot investigation. Or you may notice some aspect of her work patterns that you were not previously aware of, perhaps inspiring you to look at the fault from a new angle.
If you have no success with these methods, get the user to start logging the error so that you can see if any pattern emerges. There could be a correlation between the incidence of the error and some other event--perhaps it always occurs when the server is being backed up or when some extra demand is placed on the electrical power supply.
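A small sketch of this kind of correlation check follows. It assumes both the user's error log and your own event log are kept as simple timestamped text files--the "YYYY-MM-DD HH:MM description" layout is an assumption made for the example.

    # Illustrative sketch: flag logged intermittent errors that fall inside
    # known event windows (backups, heavy-load periods, and so on).
    # Both logs are assumed to hold "YYYY-MM-DD HH:MM description" lines.
    from datetime import datetime, timedelta

    def parse_log(path):
        """Return (timestamp, description) pairs from a simple text log."""
        entries = []
        with open(path) as f:
            for line in f:
                line = line.rstrip()
                if len(line) < 16:
                    continue
                stamp = datetime.strptime(line[:16], "%Y-%m-%d %H:%M")
                entries.append((stamp, line[16:].strip()))
        return entries

    def correlate(error_log, event_log, window_minutes=30):
        """Pair each error with any event that occurred within the window."""
        window = timedelta(minutes=window_minutes)
        events = parse_log(event_log)
        for error_time, error_text in parse_log(error_log):
            nearby = [text for time, text in events if abs(time - error_time) <= window]
            print(error_time, error_text, "<->", nearby or "no nearby event")

    if __name__ == "__main__":
        correlate("user_errors.log", "network_events.log")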
Finally, don't discount the possibility of inappropriate practices by the user. They might be logging off while in a DOS window under a networked Windows session, for example, or disconnecting and reconnecting cables at unsuitable times.
Given a network, a fault report, and a proper methodology, it should now be possible to locate the underlying problem. You also need some data, insight, time, and occasionally a bit of good luck.
The remainder of this chapter is devoted to a series of instruction sets that offer some guidance on how to find the underlying cause of some common faults. This list cannot be comprehensive, given the endless variety of computer networks and the rate at which they change. However, it should serve as an adjunct to a sound problem-solving methodology by providing pointers for many real-life situations that commonly arise. Chapter 32, "Repairing Common Types of Problems," explains how to fix the most common problems.
The following pointers are divided into client and server sections, reflecting the usual starting points of fault reports. It is assumed that vague fault reports such as "I can't log on" have been clarified to the level of "I get an Access Denied error message from the server when I try to log on."
NOTE: Remember that you can enhance the initial fault report by asking additional questions until you are confident that all relevant details have been obtained. Then you can decide whether to verify the facts stated by the user.
The majority of fault reports that are difficult to localize arise in the context of problems experienced by users of client workstations. There are a number of broad categories of fault:
The following sections discuss these broad categories in turn.
Remote Booting Workstation Will Not Boot. If a remote booting workstation will not boot, try booting the workstation from a copy of the master boot floppy. If it doesn't boot, make sure that the boot sector of the floppy disk is valid; if the boot sector is valid, either the motherboard or the floppy drive is faulty.
NOTE: Chapter 28, "Adding Diskless Workstations," contains detailed information on remote booting workstations.
If it boots from a copy of the master boot floppy, can it connect to a server? If not, treat the problem as if it's on a local booting workstation. Proceed to the "Workstation Can See Some but Not All Servers" section or a later section, as appropriate.
The remainder of this section applies only to remote booting workstations that will not remote boot but will connect to a server when booted from a copy of the master boot floppy.
Establish how far the workstation gets when remote booting. Do this by watching the monitor as the workstation attempts to remote boot.
For workstations using IPX boot PROMs, do the following should you receive an Error finding server message:
For workstations using RPL boot PROMs: If the workstation gets no response from an RPL server, the Find Frame Count (FFC) displayed at the bottom of the screen, which looks like RPL-ROM-FFC: 1, will be incremented by one every second or so.
If this happens, use the following procedure to identify the fault:
Workstation Does Not Load NETX or VLM. Check the error message displayed by NETX or VLM when it refuses to load. If it complains about the MLID, IPXODI, and the like not being loaded, or about the DOS version, then the problem is clearly linked to client configuration.
If NETX or VLM pauses at load time, and then says it can't connect to a server, check that there is at least one server running with REPLY TO GET NEAREST SERVER turned on. If not, get a server running this way before you try connecting again.
Run the network adapter's diagnostic utility to confirm that the adapter can send and receive packets. If the adapter can send and receive packets, then the problem might be due to client misconfiguration (incorrect INT, PORT, and MEM settings or incorrect frame and protocol type definitions in NET.CFG) or routing/bridging problems on the network.
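One quick check on the NET.CFG side is to compare the frame types bound on the client against the frame types you know your servers are using. The sketch below is illustrative only--the NET.CFG path and the set of server frame types are assumptions you would replace with your own values.

    # Illustrative sketch: compare FRAME lines in a client's NET.CFG with
    # the frame types known to be bound on the servers.  The path and the
    # server frame type list are assumptions for the purpose of the example.
    SERVER_FRAME_TYPES = {"ETHERNET_802.3"}   # what your servers actually bind

    def client_frame_types(net_cfg_path="NET.CFG"):
        """Collect the frame types declared in the client's NET.CFG."""
        frames = set()
        with open(net_cfg_path) as cfg:
            for line in cfg:
                parts = line.split()
                if parts and parts[0].upper() == "FRAME" and len(parts) > 1:
                    frames.add(parts[1].upper())
        return frames

    if __name__ == "__main__":
        client = client_frame_types()
        if client & SERVER_FRAME_TYPES:
            print("At least one frame type matches; look elsewhere for the fault.")
        else:
            print("No common frame type:", client, "versus", SERVER_FRAME_TYPES)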
Use this procedure to distinguish between client configuration and network difficulties as the source of the fault:
If the adapter is unable to send and receive packets (using the adapter's diagnostic utility), then the problem might be due to a faulty adapter or faulty network connection.
A procedure similar to the preceding one should help establish whether the fault is related to the client's network adapter or to the network:
Notice that a working laptop was used to decide between workstation configuration and network routing in the first case and to decide between the network adapter and the network link in the second case. Simply attaching a laptop right away to see if it worked would not have resolved the issue as clearly as the approach taken here.
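The reasoning behind the laptop test boils down to a simple decision table. The sketch below is illustrative only--the observations are things you make yourself, not things a program can measure for you--but it makes the two branches explicit.

    # Illustrative sketch: the decision logic behind the substitution test.
    # You supply the observed results; the function names the suspect.
    def diagnose(workstation_ok, laptop_ok_on_same_connection):
        """Localize the fault given the two observations from the test."""
        if workstation_ok:
            return "no fault reproduced -- treat as intermittent"
        if laptop_ok_on_same_connection:
            return "connection is good: suspect the workstation's configuration or adapter"
        return "known-good laptop also fails: suspect the network link or routing"

    if __name__ == "__main__":
        print(diagnose(workstation_ok=False, laptop_ok_on_same_connection=True))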
Workstation Can See Some but Not All Servers. If the workstation can see at least one server, then the workstation's adapter and network connection can be assumed to work. If there are some servers that it cannot see, do the following:
Workstation Can Connect to Server but the User Cannot Log On. If a workstation can connect to a server, but the user cannot log on, you need to first eliminate the most obvious possibility--improper logging on. Check that the user is attempting to log on to the correct server with the correct user ID and password. If the user can log on successfully from another workstation, then do the following:
Fault reports starting at the server are less often difficult to localize than those starting at clients. It is generally apparent if, for example, the server has crashed! There are only two categories of fault here:
It is sometimes desirable to take the network out of consideration when trying to identify the source of a fault affecting a server. In these circumstances, it is useful to connect a single workstation--perhaps a laptop--to a strand of thinnet and then connect the server to the other end. Remember to terminate both ends.
This takes the usual network out of the loop and makes the debugging process more straightforward. If the laptop can connect to the server using this setup, then the network is the likely cause of the problem; if it cannot connect, then the problem is with the server. This type of setup is referred to in the following sections as single-strand setup.
NOTE: It is important to use the usual network connector on the server's adapter. If the server has a thinnet connector but does not normally use it, then don't use it while testing! You need the test setup to accurately reflect the setup you are trying to debug, so instead use a thinnet transceiver attached to the server's usual network connector.
Server Is Running but No Workstations Can Access It. If you know the server is running but no workstations can access it, load MONITOR.NLM on the server (if it is not already loaded) and look at the network statistics.
If the incoming and outgoing packet counts are static, or if the outgoing packet count increases very slowly and the incoming packet count remains static, then the server is not communicating with the network. Try the following steps:
TIP: When changing the adapter in an EISA server to determine whether the server's original adapter is faulty, use an ordinary ISA card. This is likely to be easier to find than a spare EISA card, and it will be quicker to set up since there is no EISA configuration procedure. The poorer performance of this card won't matter because the exercise is designed to give a simple yes or no answer.
If the incoming and outgoing packet counts are increasing, then the server is able to communicate with the network. Use the single-strand setup described above to confirm that the network is at fault.
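The distinction being drawn here--static counters versus increasing counters--is easy to misjudge by eye. The sketch below shows the sampling logic; the counter readings are figures you take yourself from MONITOR.NLM a minute or so apart, and the threshold used is an assumption, not a documented value.

    # Illustrative sketch: interpret two packet-count readings taken from
    # MONITOR.NLM roughly a minute apart.  The small allowance for outgoing
    # packets is an assumed threshold, not a NetWare-defined figure.
    def interpret(rx1, tx1, rx2, tx2):
        """Decide whether the server appears to be talking to the network."""
        if rx2 == rx1 and (tx2 - tx1) <= 5:     # essentially static counters
            return "server is not communicating with the network"
        return "server is communicating; suspect the network and confirm with the single-strand setup"

    if __name__ == "__main__":
        # Example readings, one minute apart (made-up values).
        print(interpret(rx1=104223, tx1=99871, rx2=104223, tx2=99874))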
Server Is Running but Only Some Workstations Can Access It. Check whether the workstations that cannot attach are able to attach to any other servers. If not, refer to the "Workstation Does Not Load NETX or VLM" section earlier in this chapter.
If the workstations are able to attach to another server, the problem lies in either the network or the frame types being used.
Check whether frame type mismatches are responsible by doing the following:
Depending on the conclusions that you reached during the troubleshooting process, the fault is by now localized to one or more clients, one or more servers, or the network infrastructure. You may, perhaps, have completely identified the problem along the way, in which case it is just about resolved.
It is more likely that a substantial amount of investigative work remains to be done at this stage. The time spent defining the problem as explained in this chapter will help you to avoid wasting precious time looking in the wrong places.
For further information, refer to Chapter 32, "Repairing Common Types of Problems," which describes how to resolve some of the many problems that you will have to contend with.