IBM1S506.ADD Lockup

Tutorial Author: Marcus Riley
Revision Date: January 29, 2002

Contents

Introduction
What is Happening
Capturing a Network Trace of the Failure
Trace Analysis
Aliases and Filters
     Filters

Introduction


The goal of this tutorial is to show how difficult problems can be resolved very quickly with the right tools.  Without these tools, it might take a great deal of trial and error troubleshooting to resolve this problem.

A detailed tour of Golden Code's Network Trace and the Trace Analyzer is provided.

The test case in this tutorial is named "IBM1S506.ADD Lockup".  It is stored in the TUTORIAL\IBM1S506 directory of the TUTORIAL.ZIP.

What is Happening

The following problem occurred at one of the remote sites of a Golden Code client.
Here is an outline of the scenario:
  1. A single identifiable event occurring after installation of software updates
  2. The problem occurs at multiple sites but not all sites which have received the updates, also not all workstations in an affected site exhibit the problem.


During the pilot rollout of one of our recent update packages (a package refers to a collection of system and/or application software code changes) for Work Space On-Demand (WSOD) we encountered a situation in which some or all of the WSOD clients at several sites appeared to lock-up during the boot process.  This lock-up was evidenced by the fact that the boot bitmap containing the client logo, which is normally displayed for a few seconds while basedev drivers are loading, continued to be displayed beyond the expected time span and the boot process would not proceed beyond that point.  No similar problems had occurred in our lab during weeks of integrated system testing.

One of the smaller pilot sites was chosen as the test site for resolving the problem. Since this was a remote site with no technical support staff on site, we used the Network Trace program because WSOD is network oriented and this tool allows us to capture the entire boot process with a trace.  The command line interface of the Network Trace program makes this a simple task, we simply connected to the OS/2 server at the site using Telnet and started the trace program then instructed one of the users at the site to power on the workstation.  When the remote user saw that the boot had stalled, we stopped the trace.  After just a few minutes of tracing, we were able to remotely compress the trace file and copy it to a machine in our lab for analysis.

Capturing a Network Trace of the Failure

Successful trace analysis is largely a result of knowing where to look in a trace and understanding that which is seen.  Because we were dealing with WSOD clients with this problem, we knew that we would be able to trace each file loaded to the client over the network.  This would allow us an opportunity to detect network issues, server access issues or any problems related to file access or loading, and possibly determine which, if any was the cause of the problem. Once the trace is captured, we can use the filtering capabilities of the Trace Analyzer to view only those records in which we are interested. By reducing the amount of data to review, odds of success are more likely.

Starting the Network Trace is as easy as typing "Ntrace" and pressing "Enter" at a command prompt.  However, Network Trace allows for modifications to its operating environment through the use of command line directives.  Once started, the trace program will display output similar to that displayed below.  Program information is presented in four sections:

  1. Program licensing information
  2. Control program version and copyright information
  3. Memory buffer and mode information
  4. Trace driver version and operating mode timing resolution and adapter type
These four program information sections are followed by scrolling output of trace activity which displays cumulative data regarding the captured packets.
 
Network Trace for OS/2 Release 1.1
S/N NTO-000001-10 licensed to Internal Use Only
Golden Code Development Corporation

Network Trace Driver Control Application v1.25A
(c) Copyright Golden Code Development Corp., 1999-2000

6 64K Global Segment(s) and 4Mb Trace Buffer
Mode: Transparent, One Time, Full Frames.

Network Trace Driver v1.60A, NTRACE$  in service mode
Timing unit is millisecond
Trace Started Fri Sep 28 14:10:34 2001 for IBM Token-Ring PCI Family Adapter
 

Press ENTER to Stop...
 

The Network Trace program will continue to run, capturing packets until the enter key in pressed or until one of the program's signaling options is used to terminate execution.  Once the trace is stopped the program will go through a process of ordering the trace records to assure that they are properly time sequenced and write the file out as NTRACE.TRC (extensions are '.trc' for Token Ring '.enc' for Ethernet).
The name 'NTRACE' is assigned by default, however, you may supply a name for your output file by using the -fname option at program invocation.

Trace Analysis

This assumes that the Trace Analyzer is installed in the directory D:\JTA.  The trace file for this tutorial is stored in the TUTORIAL\IBM1S506\RESULTS directory.

The Java-based Trace Analyzer can be started with the following command line:

java.exe -jar d:\jta\TraceAnalyzer.jar d:\jta

Once loaded, the "Open" dialog box should appear.  Navigate the directory structure to the trace file and select it for opening.

At the top of the screen, there is a toolbar.  The leftmost drop-down box is a list of "reports" that can be displayed.  Select the "Overview - Network" report.

This will yield something similar to the following main screen:

Please refer the the Trace Analyzer product documentation for more details on the functioning and navigation of the user interface.  It is available online.

The first step in trace analysis is to review the entire trace, use this review as an opportunity to verify that you have captured frames from the devices you intended to capture.

  1. Verify that the MAC addresses of the network adapters associated with the workstations in question are captured in the trace.
  2. Make sure that there are no obvious network errors .  Many common errors are detected by Network Trace and are reported as such by the Trace Analyzer.
  3. Note protocols in use.  There may be detrimental traffic on your network that you were not aware of, this can be evidenced by unfamiliar or obviously inappropriate protocols.
  4. Check frame sizes and insure that they are appropriate for the topology in use.
After we have made a general review of the trace and found no glaring errors, we can focus on the elements of the trace that are most likely to yield information about our problem.
 

Aliases and Filters (an introduction)

As you can see by looking at the lower right corner of the image above, there are 3,577 records in this particular trace file.  We can use filters to narrow this down to just those records that we think will yield information about the problem at hand.  But first, we need to be able to more easily identify the records from the machines we are concerned with.  We do this by assigning aliases to take the place of the less user friendly MAC address identifiers used in the Source Address/Destination Address fields.  The alias can be any mnemonic name you would like to assign to a particular address and it will be mapped in place of that address while viewing the trace in the Trace Analyzer.
First, choose "Configure" from the Analyzer toolbar.  This option presents seven choices, the first of which is "Aliases".

After selecting "Aliases" from the "Configure" pull down menu, the Aliases configuration screen is displayed. We have chosen the "MAC Address" option as the type of alias we wish to create.  Notice that there are options for "IP Address" and "Manufacturer ID" as well.

In the image below, we have assigned the alias "Boot Server" to the MAC address which we know belongs to the WSOD server, the alias "Boot Client" has been assigned to the WSOD workstation.

The end result is the following screen which makes it much easier to identify records belonging to the two machines we wish to study.  Here the server alias "Boot Server" is displayed in each address field where the server MAC address would have appeared and the alias "Boot Client" appears in each address field where the client MAC address would have appeared.

Filters

As you can see there are many different types of records displayed in this trace: NetBIOS, RPL, LLC and SMB to name a few.  These protocols all have their place in our network environment, however we can't always take for granted that they, or the device transmitting them, are behaving properly.  We review the entire trace in order to assure ourselves that there are no obvious "bad boys" flooding the network or otherwise creating errors.  Once we are sure that there are no obvious error conditions on the network,  we can turn our attention to the SMB or Server Message Block  records in which  we are mainly interested. The SMB records contain the actual file loading commands between the server and the workstation.  There are several types of SMB records including SMB Open, SMB Read and SMB Exit.  So we will apply the built-in SMB filter which will allow us to view all SMB records regardless of the specific type.

Once the filter has been applied, the number of records displayed is reduced from  3577 to 849 and they are all SMB type records.  Now we can more easily track the access of files from the server in order to identify the problem.  First we scan the entire Filtered Trace in search of return codes or errors.  One error condition we expect to find is the ERROR_OPEN_FAILED condition.  This apparent error is caused when  OS/2 attempts to load boot drivers without the benefit of a fully qualified path to the file.  This forces OS/2 to search the boot path for the file in question and report the ERROR_OPEN_FAILED condition each time the file is not found in the searched location.  There are three locations OS/2 can search for a boot device driver during a WSOD boot, they are:

These locations, although they are on a network share, are mapped to the traditional C drive locations of C:\, C:\OS2 and C:\OS2\BOOT in a WSOD client.  The basedev file must be in one of these locations in order to load, the trace records all attempts to locate and load these files.  As we scan down the trace, the records we see are the ones we expect to see and the system appears to be behaving normally.  While reviewing the SMB records, we scan records requesting Open and Execute, Read , Close as well as Responses to them all (computers are very chatty, everything has a response) and of course the occassional ERROR_OPEN FAILED record.  Along the way, we did notice that once the client starts up as an OS/2 LAN Requester at record 2019, that all the client's Read commands are followed by not one but two response packets from the server.  This is not optimal behavior but it cannot be blamed for the problem at hand, it is simply a tuning issue or LAN Server mis-configuration. Network Trace has shown us something we weren't looking for!

Finally we reach the final Read command, we can see that this record is followed by  an SMB close command and there are no more commands requesting file opens or reads.  The final file access is IBM1S506.ADD.  After this file is closed we are able to verify that the machine is still connected to the network by the MAC frames being generated.

Use this link to see a filtered, annotated listing of the SMB records in this problem boot trace file.

The trace of this RPL boot showed no errors or unusual events on the network, but review of the files loaded during the trace revealed that the last file accessed and loaded by the workstation during the boot was the OS/2 IDE device driver IBM1S506.ADD.  Since it was known that this file had been updated during this package install, it became a prime suspect as the cause of the problem.  To confirm suspicions, the original version of the IBM1S506 driver was restored to the system, after which the workstations were able to boot successfully.

By inspecting the SMB records we determined that the last file accessed during the boot was the IBM IDE driver IBM1S506.ADD.  Knowing that this driver controls interaction between the IDE controller and the system hard drive and that this driver was one which was updated in the latest package, we had narrowed our search to this subsystem, we were then able concentrate on identifying any inconsistencies which may exist with the disk subsystem. When we dispatched a technician to the site, upon inspection, he discovered that the drives in the failing workstations were configured in a Master/Slave configuration instead of the Master w/o Slave configuration.  Even though this configuration existed all along, the previous driver had no problem handling the improperly configured drive whereas the newer driver expected to find a slave drive attached when the Master was configured this way.  This caused the system to appear to hang as the IDE controller searched for a slave drive which did not exist.  When the drives were configured in the Master w/o Slave configuration, the systems were able to boot with no problem using the new IBM1S506.ADD file.


Western Digital designs their IDE drives to meet one the of the four configurations listed below:

While WD maintains specific settings for a Single drive versus a Master drive, some other manufacturers are not so specific on this point.  Quantum and Maxtor for example make no distinction between a drive which will be used with a slave on the same channel and one which will not on most of their drives.  In other words Quantum and Maxtor Drives would be jumpered as Master in a Single drive configuration or if used as Master with a slave attached. A WD drive, on the other hand, should only be jumpered as Master if there is actually a Slave drive attached.  This situation was only made evident in our environment with the addition of a new IBM1S506.ADD driver.


© 2001 Golden Code Development Corporation.  ALL RIGHTS RESERVED