Modern computer networks are complex, high-speed marvels of technology with advanced server, client, and network hardware and software working in close cooperation. They are so fast and complex, in fact, that it can sometimes be difficult to trace the source of problems that arise.
This chapter deals with methodology for identifying underlying causes of faults in a client/server network environment. It examines the complexity that can make problem identification so difficult and includes recommendations for precautionary measures that can help you get a handle on the complexity of your particular network. This chapter then provides a roadmap for identifying the underlying cause of a network problem.
NOTE: This chapter focuses on fault identification methodology. It does not describe network troubleshooting tools and how to use them. This chapter also does not describe how to correct a problem once it is identified--Chapter 32, "Repairing Common Types of Problems," does that for the most common network problems. The information required to fix other problems can be found elsewhere in this book in the relevant chapters.
NOTE: While this chapter focuses on client/server network faults, the methods described are generic and can be applied equally to peer-to-peer networks.
An initial fault report, also known as a problem report, is generally a description of one or more symptoms rather than of the underlying problem itself. The problem can be regarded as successfully located when it has been redefined in a specific hardware or software context where it can be dealt with directly.
For example, a problem might be reported initially as "Can't connect to server." After investigation, you might be able to restate the problem as "The server has a faulty network adapter" or "The client is using the incorrect frame type." In the first case, the server's network adapter needs to be replaced; in the second, the client software needs to be reconfigured. In both cases, the problem has been located (though not yet resolved).
How do you get from the initial statement of the problem, phrased in the user's terms, to a statement in your own terms of the specific locus of the underlying problem? The following sections look at a few possible approaches.
The trial-and-error approach is one way to find causes of problems--try changing a component here or a driver there and see if it all starts working again. This is a bad approach, but let's think about why it is bad.
First, there is an underlying assumption that you don't really understand the way your network works. If you did understand how the network works, then you'd make use of your insight--rather than trial and error--to approach the problem in a logical way.
Second, there is no guarantee that you'll ever locate the problem. It might be caused by a combination of factors, and unless you happen to eliminate all the contributing factors at the same time, the problem will persist.
This doesn't mean, however, that there isn't a useful element of trial and error in properly planned troubleshooting. You need to experiment with the network to some extent to find answers to your questions about the problem (for example, "Does the problem persist if I use a different connection?"), but such tests should be used as part of a methodological approach to identifying the source of the problem--they should not constitute the entire methodology.
Another approach, more commonly used in real life, is the expert system approach, where you use your expert knowledge of the network to short-circuit the troubleshooting process. For example, two users complain that they can no longer connect to the server. You can see the server from your workstation, so you immediately suspect the repeater, which you know sits between them and the server.
You can thus use your insight into your network to pounce quickly on the obscure cause of a problem that might be difficult to solve in a step-by-step way. Combined with the trial-and-error approach, troubleshooting like this can be very successful.
The trouble with this approach is that it not only uses your insight into the network but relies on that insight. If you don't have all the information, you might arrive at the wrong answer. Consider the previous example--suppose one of the two users had moved temporarily out of his office, because it was being painted, and into a colleague's office, where he brought his own workstation and connected it to the second user's thinnet strand. Given that information, you would make a very different first guess about the nature of the problem.
Generally it is best to use your expert knowledge of the network (along with trial and error) as part of a methodological approach rather than as the entire basis for the troubleshooting effort.
A more abstract approach, the problem space approach, is to regard the underlying problem as a point (or possibly a series of points) in n-dimensional space. Each of the n dimensions represents a possible parameterization of the problem in a specific hardware or software context. For example, you might have the following dimensions:
The extent of the problem can then be clearly identified by investigating the problem in the context of each of these dimensions.
The usefulness of this way of looking at the problem lies in its comprehensiveness. Define each possible dimension of your network this way before a problem arises. Whenever a problem comes up, you can check it on each dimension in turn. This means that you are unlikely to overlook some aspect of the problem or to be misled by apparently "obvious" causes.
Of course, this approach is not realistic. It is impossible to define all the possible dimensions that apply to any network. Even if you did, this approach would simply allow you to accurately identify the extent of the problem, not its underlying cause--pinpointing the cause would require a logical leap of some sort. The underlying idea, however--trying to comprehensively identify all possible problem dimensions--can be usefully incorporated into a structured troubleshooting methodology.
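To make that underlying idea concrete, here is a minimal sketch in Python. It is purely illustrative--the dimension names and the check routines are assumptions, not part of any real tool--but it shows the discipline the problem space approach encourages: every predefined dimension is checked in turn, so nothing is skipped because the cause seemed "obvious."

    # Illustrative sketch: walk a predefined checklist of problem "dimensions"
    # and record what each one shows for a given fault report.
    # The dimensions and their check routines are hypothetical.

    def check_clients(fault):
        # e.g. which workstations exhibit the symptom?
        return "only John's workstation affected"

    def check_servers(fault):
        # e.g. which servers are reachable from the affected clients?
        return "server SAL not visible"

    def check_cabling(fault):
        # e.g. which segments, hubs, or repeaters are involved?
        return "segment 2, repeater B"

    DIMENSIONS = {
        "clients": check_clients,
        "servers": check_servers,
        "cabling": check_cabling,
    }

    def map_problem_extent(fault):
        """Return the fault's footprint on each predefined dimension."""
        return {name: check(fault) for name, check in DIMENSIONS.items()}

    if __name__ == "__main__":
        for dimension, finding in map_problem_extent("Can't connect to server").items():
            print(f"{dimension:>10}: {finding}")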
The three approaches previously described can be combined in a structured way into a useful system. Avoid adopting an ad hoc approach to combining them--simply mixing the ingredients of trial and error, insight, and a long list of things to look into will not get you very far. Combine them using the following steps:
Figure 31.1
This is a sample fault report form.
NOTE: A definitively refuted hypothesis can be as useful as a definitively confirmed hypothesis. What matters is that you use a hypothesis that you can test and whose possible outcomes each eliminate a range of possible problem causes from consideration.
This sequence is illustrated in figure 31.2. In many cases, of course, the experimentation loop will not be quite so tight as illustrated; a test may produce unexpected results, which means that the extent of the problem is not quite what you originally thought. If that happens, you will have to redefine the scope of the fault and proceed again to the formulation of a new hypothesis.
This method uses elements of each approach described earlier:
Figure 31.2
This is the suggested troubleshooting process.
The underlying method here is to progressively restate the problem in a context more amenable to direct action. The problem is originally stated by the user in his or her own terms. You then investigate the problem, restating it in your terms and in light of your findings.
The following is an example to illustrate the method:
Now that the problem has been localized to the server, it can be dealt with on a direct basis--you need to establish why the server crashed, bring it back up, and so on.
Notice that the initial fault report was refined in the first three steps. John said that he couldn't log on, but it was necessary to establish whether he had tried from one or more workstations (step 2) and what had happened when he tried (step 3). At that stage, it was clear that the problem as far as John's workstation was concerned was that it couldn't see the server.
The focus then shifted to determining the extent of the problem. If any other machines could see server SAL (step 4), then the focus could shift to John's workstation and his part of the network. Since a second machine could not see the server, though, there was a good chance that the server was the source of the problem.
Step 5 confirmed this, though of course this might not have been the case--the server might have been fine, with the two workstations both affected by another network problem of some sort. The decision to check the server rather than the network in step 5 was based on the expectation that the network was unlikely to act in such a way as to prevent two clients from seeing a working server--the possibility that server SAL was down seemed more likely.
Here's another example:
After step 10, you know that the problem is with the bridge, and you can debug it.
In both of these examples, a more specific problem statement generated by a thorough fault report would have been preferable to the initial problem statement "John can't log on."
If, in the first case, it was "John gets a Server SAL not found message when he tries to log on from his workstation," you would have known right away that NETX was loading (so that his workstation could see at least one server) and that the workstation simply couldn't see SAL.
If, in the second case, it was "NETX on John's workstation gives a Cannot connect to server SAL error message," you would have known that the situation was worse, in the sense that the workstation could see no servers. Whether this is actually worse or not is unimportant here--what matters is that you get enough information at an early stage to distinguish between different symptoms.
The process of redefining the problem into a manageable context requires knowledge about various facets of the particular network that you are dealing with:
Collect information about each of these areas when the network is up and running normally. This allows you to investigate fault reports on the basis of firm information, categorized according to the contexts into which you need to resolve the problems. The following sections look at each of these areas in turn.
You need to understand the topology of a particular network if you hope to troubleshoot network problems on it. Review the structure of the network and draw two maps of it:
In both cases, show all bridges, repeaters, routers, hubs, and segments. Indicate where all servers and clients are connected (grouping large numbers of clients together for convenience).
TIP: Remember to keep both maps updated as your network grows.
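If you also want a machine-readable copy of the map alongside the drawings, very little is needed. The following sketch in Python is purely illustrative--the device names and the adjacency-list layout are assumptions, not a prescribed format--but something this simple can be searched quickly when a fault report comes in.

    # Illustrative sketch: a network map kept as a simple adjacency list.
    # Device names and connections here are made up for the example.
    NETWORK_MAP = {
        "hub-1":      ["server-SAL", "repeater-A", "ws-john", "ws-mary"],
        "repeater-A": ["hub-1", "hub-2"],
        "hub-2":      ["repeater-A", "server-HAL", "ws-paula"],
    }

    def neighbours(device):
        """List everything directly connected to a device, from either side."""
        direct = set(NETWORK_MAP.get(device, []))
        direct.update(d for d, links in NETWORK_MAP.items() if device in links)
        return sorted(direct)

    if __name__ == "__main__":
        print("server-SAL connects to:", neighbours("server-SAL"))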
A detailed inventory of all active network equipment will accelerate the troubleshooting process. You might need to check model numbers or firmware revision levels when tracing a fault, especially if you need to discuss things with the vendor. Everything proceeds with more speed and less panic if all the relevant information has been collected at an earlier, calmer stage.
Gather at least the following information for each piece of equipment:
A detailed hardware and software inventory for your servers is essential. Gather at least the following information for each server:
Gather at least the following information for each network adapter:
Gather at least the following information for each SCSI controller:
Gather at least the following information for each hard drive:
Gather at least the following information for each adapter (besides SCSI and network):
Make the compilation of this information an integral part of the installation procedure for new servers. Remember that an installation isn't worth much unless it's stable. That means that you must be able to respond quickly to the problems that will inevitably arise. Proper documentation of this sort should be regarded as an investment in the stability of the service. Keep this list up-to-date as the server configuration changes.
It might not be practical for you to keep a detailed inventory of all clients on the network. There might be too many of them, or they might be managed by someone else, meaning that the information does not normally come your way. If you are expected to provide support for these workstations, however--even if you only support their use of the network--you must have a certain amount of information about each workstation.
Gather at least the following information (if the workstations are managed by other people, insist that those people provide this information for each workstation they connect):
Gather at least the following information for each network adapter:
Information about the range of applications in use on the network will help to inform your troubleshooting efforts, but it's more difficult to define and collect than the information about hardware elements discussed in preceding sections.
A knowledge of the patterns of network use can also be helpful. You should know whether network load is likely to be sporadically or continuously heavy; intermittent problems, for example, might arise only when network traffic is unusually heavy.
Similarly, it is vital that you make yourself aware of the background of network users. In particular, talk to them about their expectations of the network, as well as their likely uses of it. You may find that their perception of the extent and purpose of the network is at odds with your own or that the level of service that they anticipate is unrealistic.
Understanding the vantage point of the user when you discuss a problem helps you get beyond their description--possibly inaccurate--of the problem. For instance, a fault report of "I can't log on" requires one response if the person reporting the problem is an experienced user but requires a different response if the user is unclear on the distinction between logging on and starting an application.
What do you do with all this information after you gather it? You must store it somewhere accessible in a format that suits your needs. Where you store it and which format you choose depends on your particular environment and preferences.
If you store the data electronically, you might find it useful to keep it on a file server so that it can be accessed from a range of locations. Remember, though, that this data will be required when the network is acting up, so storing it on a local hard drive might be more sensible. A compromise is to store the data on a server, but maintain a local copy for backup; if you do this, you must institute a procedure for making sure that the local copy is updated on a regular basis.
Deciding on a storage format is largely a matter of taste. Remember that the information is there for use in a crisis--it doesn't have to look pretty, but it has to be there! A small, uncomplicated network can be documented on a few sheets of paper in a folder. A large WAN might require a specialized database application to keep tabs on developments. In many cases, flat text files are adequate. They can be searched using a range of utilities, they are not application-specific, and there is little or no overhead to setting them up (unlike setting up a relational database).
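As an illustration of how little machinery flat text files require, here is a minimal sketch in Python. It assumes a one-file-per-device layout with "field: value" lines--an assumption for the purpose of the example, to be adapted to whatever format you actually choose.

    # Illustrative sketch: search flat-text inventory files for a keyword.
    # Assumes one file per device, each line holding "field: value".
    import sys
    from pathlib import Path

    def search_inventory(directory, keyword):
        """Yield (file name, matching line) pairs for lines containing keyword."""
        for record in Path(directory).glob("*.txt"):
            for line in record.read_text(errors="replace").splitlines():
                if keyword.lower() in line.lower():
                    yield record.name, line.strip()

    if __name__ == "__main__":
        # e.g.: python search_inventory.py ./inventory firmware
        directory, keyword = sys.argv[1], sys.argv[2]
        for name, line in search_inventory(directory, keyword):
            print(f"{name}: {line}")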
Finally, it is essential that you keep the information up-to-date. Most networks are dynamic entities, with changes occurring on a regular basis. Keeping the database up-to-date must become an integral part of the culture of the people who manage the network if the database is to serve its purpose and assist you in tracking down network faults.
Every network problem that comes to your attention appears first as a fault report. It may be very informal--"Hey, why can't I print?"--or it may be a formal report that is part of a well-defined fault reporting system. Even faults that you discover yourself can be regarded as fault reports in this sense; they are of the informal type, except that you immediately switch into troubleshooting mode and define the problem more precisely.
The quality of the initial fault report can have a major bearing on the time and effort required for you to identify the underlying cause of the problem. Sketchy or incorrect information can be a major hindrance to resolving the fault.
Spend some time defining the fault reporting procedure. It can be formal or informal, but it must be structured. If you support the network alone and deal with a small number of users, it might be adequate for you to ask a list of questions whenever a problem is reported. If you work as part of a team or deal with a large number of users, a more formal fault reporting procedure is probably appropriate--a formal procedure might include fault log numbers and a signoff system. In any case, record at least the following information for each fault:
The most useful item in terms of resolving the problem is the context. When someone reports a problem, spend some time talking to them about it and find out exactly what they were doing when the error occurred. Ask as many questions as are necessary to define the problem in your terms, rather than theirs. Users unfamiliar with the system might report inaccurately--for instance, believing that they were logged on when they were not--and even experienced users might not think of mentioning some vital facts.
Discussing the problem at some length often sheds unexpected light on the problem. Facts that to users might seem completely irrelevant ("Well, everything was fine before that little cable broke...") can have significant bearing on the problem. In particular, ask users about recent changes in their hardware and software setup and whether or not they fiddled with any cables.
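As a rough sketch of what a structured fault log entry might capture, consider the following Python fragment. The fields shown are illustrative assumptions, not a prescribed format--use whatever your own fault reporting procedure requires.

    # Illustrative sketch: append structured fault reports to a flat log file.
    # The fields are assumptions; substitute those your own procedure calls for.
    import json
    from datetime import datetime

    def log_fault(path, reporter, workstation, error_text, context):
        """Append one fault report, timestamped, as a line of JSON."""
        entry = {
            "logged_at": datetime.now().isoformat(timespec="seconds"),
            "reporter": reporter,
            "workstation": workstation,
            "error_text": error_text,   # full text, not an interpretation
            "context": context,         # what the user was doing at the time
        }
        with open(path, "a") as log:
            log.write(json.dumps(entry) + "\n")

    if __name__ == "__main__":
        log_fault("faults.log", "John", "WS-042",
                  "Server SAL not found",
                  "Logging on first thing in the morning; cable recently moved")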
One of the first decisions you must make when you receive a fault report is whether or not you trust its accuracy. This is where a good working relationship with your users is invaluable. If a user says that his workstation can see server X but not server Y, should you accept that as a fact?
The user's level of understanding of the network is obviously a factor. A naive user could mistakenly think that server X is accessible when in fact it is not, or she could be doing something wrong when she attempts to attach to server Y. If you know that the user is experienced with the procedures for attaching to servers, you might decide to accept that part of the fault report as being as factual as if you had observed it yourself.
A lot also depends on the particular user's understanding of the fault diagnosis process. Even a technically advanced user might neglect to provide information that would be useful to you in tracing the problem. A user who is familiar with the type of information you require and the way in which you use it is likely to provide most of the relevant information from the start: when the error happened, what they were doing at the time, the full text of any error message, and so on.
Users unfamiliar with your procedures, however, might provide incomplete information. Inexperienced users might not notice or understand some aspect of the problem. Experienced users might selectively omit details that they decide are not relevant to the problem at hand. In fact, inexperienced users are more likely to write down the full text of an error message that appears. More confident users are likely to interpret the message when they see it and then report their own interpretation of the message to you.
There is no hard-and-fast rule about the veracity of the fault report. Bear in mind that it could be inaccurate and, depending on your estimation of the user's understanding of the network and the fault diagnosis process, either verify each detail yourself or take it as fact. If you accept it as fact, however, make a mental note that the facts are unconfirmed and realize that you might need to backtrack to establish their accuracy at some later stage in the investigation.
One of the more intractable problem types is the intermittent fault. A user might report that his connection to the server is dropped at arbitrary intervals or that he occasionally needs to boot up twice before establishing a connection. If the problem cannot be reproduced at will, debugging it is very difficult.
The most common cause of such errors is a loose connector or a faulty cable. If an intermittent error occurs on a particular machine, replace the network adapter, cable, and connector. Check the power supply for output quality. If the problem still occurs intermittently, you can adopt a few different approaches.
One approach is to pretend that the problem happens constantly rather than intermittently, and then try to imagine what sort of underlying cause it might have. This is little more than idle speculation. It can be useful if the problem proves intractable, in that it might prompt you to consider some factor which you have neglected up to that point; however, it is not much of a practical step toward solving the problem.
A more positive method is to review the configuration of all affected machines. There might be, for example, some old, unstable drivers that execute sometimes, along with newer, stable drivers that execute more often.
Another useful approach is to watch the user while she goes through her normal daily startup procedure. Sit with her while she powers on the computer, logs on, and starts her usual applications. The problem might arise when you are present, giving you a chance to do some on-the-spot investigation. Or you may notice some aspect of her work patterns that you were not previously aware of, perhaps inspiring you to look at the fault from a new angle.
If you have no success with these methods, get the user to start logging the error so that you can see if any pattern emerges. There could be a correlation between the incidence of the error and some other event--perhaps it always occurs when the server is being backed up or when some extra demand is placed on the electrical power supply.
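A small sketch of this kind of correlation check follows. It assumes both the user's error log and your own event log are kept as simple timestamped text files--the "YYYY-MM-DD HH:MM description" layout is an assumption made for the example.

    # Illustrative sketch: flag logged intermittent errors that fall inside
    # known event windows (backups, heavy-load periods, and so on).
    # Both logs are assumed to hold "YYYY-MM-DD HH:MM description" lines.
    from datetime import datetime, timedelta

    def parse_log(path):
        """Return (timestamp, description) pairs from a simple text log."""
        entries = []
        with open(path) as f:
            for line in f:
                line = line.rstrip()
                if len(line) < 16:
                    continue
                stamp = datetime.strptime(line[:16], "%Y-%m-%d %H:%M")
                entries.append((stamp, line[16:].strip()))
        return entries

    def correlate(error_log, event_log, window_minutes=30):
        """Pair each error with any event that occurred within the window."""
        window = timedelta(minutes=window_minutes)
        events = parse_log(event_log)
        for error_time, error_text in parse_log(error_log):
            nearby = [text for time, text in events if abs(time - error_time) <= window]
            print(error_time, error_text, "<->", nearby or "no nearby event")

    if __name__ == "__main__":
        correlate("user_errors.log", "network_events.log")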
Finally, don't discount the possibility of inappropriate practices by the user. They might be logging off while in a DOS window under a networked Windows session, for example, or disconnecting and reconnecting cables at unsuitable times.
Given a network, a fault report, and a proper methodology, it should now be possible to locate the underlying problem. You also need some data, insight, time, and occasionally a bit of good luck.
The remainder of this chapter is devoted to a series of instruction sets that offer some guidance on how to find the underlying cause of some common faults. This list cannot be comprehensive, given the endless variety of computer networks and the rate at which they change. However, it should serve as an adjunct to a sound problem-solving methodology by providing pointers for many real-life situations that commonly arise. Chapter 32, "Repairing Common Types of Problems," explains how to fix the most common problems.
The following pointers are divided into client and server sections, reflecting the usual starting points of fault reports. It is assumed that vague fault reports such as "I can't log on" have been clarified to the level of "I get an Access Denied error message from the server when I try to log on."
NOTE: Remember that you can enhance the initial fault report by asking additional questions until you are confident that all relevant details have been obtained. Then you can decide whether to verify the facts stated by the user.
The majority of fault reports that are difficult to localize arise in the context of problems experienced by users of client workstations. There are a number of broad categories of fault:
The following sections discuss these broad categories in turn.
Remote Booting Workstation Will Not Boot. If a remote booting workstation will not boot, try booting the workstation from a copy of the master boot floppy. If it doesn't boot, make sure that the boot sector of the floppy disk is valid; if the boot sector is valid, either the motherboard or the floppy drive is faulty.
NOTE: Chapter 28, "Adding Diskless Workstations," contains detailed information on remote booting workstations.
If it boots from a copy of the master boot floppy, can it connect to a server? If not, treat the problem as if it's on a local booting workstation. Proceed to the "Workstation Can See Some but Not All Servers" section or a later section, as appropriate.
The remainder of this section applies only to remote booting workstations that will not remote boot but will connect to a server when booted from a copy of the master boot floppy.
Establish how far the workstation gets when remote booting. Do this by watching the monitor as the workstation attempts to remote boot.
For workstations using IPX boot PROMs, do the following should you receive an Error finding server message:
For workstations using RPL boot PROMs: If the workstation gets no response from an RPL server, the Find Frame Count (FFC) displayed at the bottom of the screen, which looks like RPL-ROM-FFC: 1, will be incremented by one every second or so.
If this happens, use the following procedure to identify the fault:
Workstation Does Not Load NETX or VLM. Check the error message displayed by NETX or VLM when it refuses to load. If it complains about the MLID, IPXODI, and the like not being loaded, or about the DOS version, then the problem is clearly linked to client configuration.
If NETX or VLM pauses at load time, and then says it can't connect to a server, check that there is at least one server running with REPLY TO GET NEAREST SERVER turned on. If not, get a server running this way before you try connecting again.
Run the network adapter's diagnostic utility to confirm that the adapter can send and receive packets. If the adapter can send and receive packets, then the problem might be due to client misconfiguration (incorrect INT, PORT, and MEM settings or incorrect frame and protocol type definitions in NET.CFG) or routing/bridging problems on the network.
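One quick check on the NET.CFG side is to compare the frame types bound on the client against the frame types you know your servers are using. The sketch below is illustrative only--the NET.CFG path and the set of server frame types are assumptions you would replace with your own values.

    # Illustrative sketch: compare FRAME lines in a client's NET.CFG with
    # the frame types known to be bound on the servers.  The path and the
    # server frame type list are assumptions for the purpose of the example.
    SERVER_FRAME_TYPES = {"ETHERNET_802.3"}   # what your servers actually bind

    def client_frame_types(net_cfg_path="NET.CFG"):
        """Collect the frame types declared in the client's NET.CFG."""
        frames = set()
        with open(net_cfg_path) as cfg:
            for line in cfg:
                parts = line.split()
                if parts and parts[0].upper() == "FRAME" and len(parts) > 1:
                    frames.add(parts[1].upper())
        return frames

    if __name__ == "__main__":
        client = client_frame_types()
        if client & SERVER_FRAME_TYPES:
            print("At least one frame type matches; look elsewhere for the fault.")
        else:
            print("No common frame type:", client, "versus", SERVER_FRAME_TYPES)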
Use this procedure to distinguish between client configuration and network difficulties as the source of the fault:
If the adapter is unable to send and receive packets (using the adapter's diagnostic utility), then the problem might be due to a faulty adapter or faulty network connection.
A procedure similar to the preceding one should help establish whether the fault is related to the client's network adapter or to the network:
Notice that a working laptop was used to decide between workstation configuration and network routing in the first case and to decide between the network adapter and the network link in the second case. Simply attaching a laptop right away to see if it worked would not have resolved the issue as clearly as the approach taken here.
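The reasoning behind the laptop test boils down to a simple decision table. The sketch below is illustrative only--the observations are things you make yourself, not things a program can measure for you--but it makes the two branches explicit.

    # Illustrative sketch: the decision logic behind the substitution test.
    # You supply the observed results; the function names the suspect.
    def diagnose(workstation_ok, laptop_ok_on_same_connection):
        """Localize the fault given the two observations from the test."""
        if workstation_ok:
            return "no fault reproduced -- treat as intermittent"
        if laptop_ok_on_same_connection:
            return "connection is good: suspect the workstation's configuration or adapter"
        return "known-good laptop also fails: suspect the network link or routing"

    if __name__ == "__main__":
        print(diagnose(workstation_ok=False, laptop_ok_on_same_connection=True))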
Workstation Can See Some but Not All Servers. If the workstation can see at least one server, then the workstation's adapter and network connection can be assumed to work. If there are some servers that it cannot see, do the following:
Workstation Can Connect to Server but the User Cannot Log On. If a workstation can connect to a server, but the user cannot log on, you need to first eliminate the most obvious possibility--improper logging on. Check that the user is attempting to log on to the correct server with the correct user ID and password. If the user can log on successfully from another workstation, then do the following:
Fault reports starting at the server are less often difficult to localize than those starting at clients. It is generally apparent if, for example, the server has crashed! There are only two categories of fault here:
It is sometimes desirable to take the network out of consideration when trying to identify the source of a fault affecting a server. In these circumstances, it is useful to connect a single workstation--perhaps a laptop--to a strand of thinnet and then connect the server to the other end. Remember to terminate both ends.
This takes the usual network out of the loop and makes the debugging process more straightforward. If the laptop can connect to the server using this setup, then the network is the likely cause of the problem; if it cannot connect, then the problem is with the server. This type of setup is referred to in the following sections as single-strand setup.
NOTE: It is important to use the usual network connector on the server's adapter. If the server has a thinnet connector but does not normally use it, then don't use it while testing! You need the test setup to accurately reflect the setup you are trying to debug, so instead use a thinnet transceiver attached to the server's usual network connector.
Server Is Running but No Workstations Can Access It. If you know the server is running but no workstations can access it, load MONITOR.NLM on the server (if it is not already loaded) and look at the network statistics.
If the incoming and outgoing packet counts are static, or if the outgoing packet count increases very slowly and the incoming packet count remains static, then the server is not communicating with the network. Try the following steps:
TIP: When changing the adapter in an EISA server to determine whether the server's original adapter is faulty, use an ordinary ISA card. This is likely to be easier to find than a spare EISA card, and it will be quicker to set up since there is no EISA configuration procedure. The poorer performance of this card won't matter because the exercise is designed to give a simple yes or no answer.
If the incoming and outgoing packet counts are increasing, then the server is able to communicate with the network. Use the single-strand setup described above to confirm that the network is at fault.
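The distinction being drawn here--static counters versus increasing counters--is easy to misjudge by eye. The sketch below shows the sampling logic; the counter readings are figures you take yourself from MONITOR.NLM a minute or so apart, and the threshold used is an assumption, not a documented value.

    # Illustrative sketch: interpret two packet-count readings taken from
    # MONITOR.NLM roughly a minute apart.  The small allowance for outgoing
    # packets is an assumed threshold, not a NetWare-defined figure.
    def interpret(rx1, tx1, rx2, tx2):
        """Decide whether the server appears to be talking to the network."""
        if rx2 == rx1 and (tx2 - tx1) <= 5:     # essentially static counters
            return "server is not communicating with the network"
        return "server is communicating; suspect the network and confirm with the single-strand setup"

    if __name__ == "__main__":
        # Example readings, one minute apart (made-up values).
        print(interpret(rx1=104223, tx1=99871, rx2=104223, tx2=99874))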
Server Is Running but Only Some Workstations Can Access It. Check whether the workstations that cannot attach are able to attach to any other servers. If not, refer to the "Workstation Does Not Load NETX or VLM" section earlier in this chapter.
If the workstations are able to attach to another server, the problem lies in either the network or the frame types being used.
Check whether frame type mismatches are responsible by doing the following:
Depending on the conclusions that you reached during the troubleshooting process, the fault is by now localized to one or more clients, one or more servers, or the network infrastructure. You may, perhaps, have completely identified the problem along the way, in which case it is just about resolved.
It is more likely that a substantial amount of investigative work remains to be done at this stage. The time spent defining the problem as explained in this chapter will help you to avoid wasting precious time looking in the wrong places.
For further information, refer to Chapter 32, "Repairing Common Types of Problems," which describes how to resolve some of the many problems that you will have to contend with.