Introduction
What is Happening
Capturing a Network Trace of the Failure
Trace Analysis
Aliases and Filters
Filters
The goal of this tutorial is to show how difficult problems can be resolved
very quickly with the right tools. Without these tools, it might take
a great deal of trial-and-error troubleshooting to resolve this problem.
A detailed tour of Golden Code's Network Trace and the Trace Analyzer is provided.
The test case in this tutorial is named "IBM1S506.ADD Lockup". It is stored in the TUTORIAL\IBM1S506 directory of the TUTORIAL.ZIP.
During the pilot rollout of one of our recent update packages (a package
refers to a collection of system and/or application software code changes)
for Work Space On-Demand (WSOD) we encountered a situation in which some
or all of the WSOD clients at several sites appeared to lock up during the
boot process. This lock-up was evidenced by the fact that the boot
bitmap containing the client logo, which is normally displayed for a few
seconds while basedev drivers are loading, continued to be displayed beyond
the expected time span and the boot process would not proceed beyond that
point. No similar problems had occurred in our lab during weeks of
integrated system testing.
One of the smaller pilot sites was chosen as the test site for resolving the problem. This was a remote site with no technical support staff on site. Because WSOD is network oriented, we used the Network Trace program, which allowed us to capture the entire boot process in a trace. The command line interface of the Network Trace program makes this a simple task: we connected to the OS/2 server at the site using Telnet, started the trace program, and then instructed one of the users at the site to power on the workstation. When the remote user saw that the boot had stalled, we stopped the trace. After just a few minutes of tracing, we were able to remotely compress the trace file and copy it to a machine in our lab for analysis.
Starting the Network Trace is as easy as typing "Ntrace" and pressing "Enter" at a command prompt. However, Network Trace allows for modifications to its operating environment through the use of command line directives. Once started, the trace program will display output similar to that displayed below. Program information is presented in four sections:
Network Trace for OS/2 Release 1.1
S/N NTO-000001-10 licensed to Internal Use Only Golden Code Development Corporation
Network Trace Driver Control Application v1.25A
6 64K Global Segment(s) and 4Mb Trace Buffer
Network Trace Driver v1.60A, NTRACE$ in service mode
Press ENTER to Stop... |
The Network Trace program will continue to run, capturing packets, until
the Enter key is pressed or until one of the program's signaling options is
used to terminate execution. Once the trace is stopped, the program
orders the trace records to ensure that they are properly time sequenced
and then writes the file out as NTRACE.TRC (the extension is
'.trc' for Token Ring and '.enc' for Ethernet).
The name 'NTRACE' is assigned by default; however, you may supply a name
for your output file by using the -fname option at program invocation.
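The ordering and naming steps described above can be pictured with a minimal sketch. It is only an illustration, not the Network Trace implementation: the record layout (a timestamp paired with raw frame bytes), the function names, and the topology labels are all hypothetical. Only the default base name and the two extensions are taken from this tutorial.

```python
from operator import itemgetter

# Extensions named in this tutorial; the dictionary keys are hypothetical labels.
EXTENSIONS = {"token ring": ".TRC", "ethernet": ".ENC"}


def output_file_name(topology, base="NTRACE"):
    """Build the trace file name from a base name and the network type."""
    return base + EXTENSIONS[topology.lower()]


def time_sequence(records):
    """Return records ordered by timestamp.

    sorted() is stable, so frames with equal timestamps keep the order
    in which they were captured.
    """
    return sorted(records, key=itemgetter(0))


# Hypothetical capture buffer: (timestamp in seconds, raw frame bytes).
captured = [
    (0.030, b"frame C"),
    (0.010, b"frame A"),
    (0.020, b"frame B"),
    (0.020, b"frame B2"),
]

ordered = time_sequence(captured)
print([frame for _, frame in ordered])
print(output_file_name("Token Ring"))
```

A stable sort is used here so that frames carrying identical timestamps are not reshuffled relative to one another.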
The Java-based Trace Analyzer can be started with the following command line:
java.exe -jar d:\jta\TraceAnalyzer.jar d:\jta
Once loaded, the "Open" dialog box should appear. Navigate the directory structure to the trace file and select it for opening.
At the top of the screen, there is a toolbar. The leftmost drop-down box is a list of "reports" that can be displayed. Select the "Overview - Network" report.
This will yield something similar to the following main screen:
Please refer to the Trace Analyzer product documentation for more details on the functioning and navigation of the user interface. It is available online.
The first step in trace analysis is to review the entire trace. Use this review as an opportunity to verify that you have captured frames from the devices you intended to capture.
After selecting "Aliases" from the "Configure" pull down menu, the Aliases configuration screen is displayed. We have chosen the "MAC Address" option as the type of alias we wish to create. Notice that there are options for "IP Address" and "Manufacturer ID" as well.
In the image below, we have assigned the alias "Boot Server" to the MAC address which we know belongs to the WSOD server, and the alias "Boot Client" to the MAC address of the WSOD workstation.
The end result is the following screen, which makes it much easier to identify records belonging to the two machines we wish to study. Here the server alias "Boot Server" is displayed in each address field where the server MAC address would have appeared, and the alias "Boot Client" appears in each address field where the client MAC address would have appeared.
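Conceptually, an alias is a lookup from an address to a friendlier name, applied wherever the address would be displayed. The sketch below illustrates that idea only; it is not how the Trace Analyzer is implemented. The MAC addresses, the function names, and the normalization rule are assumptions made for the example.

```python
def normalize(mac):
    """Lower-case the address and use ':' separators so lookups are consistent."""
    return mac.lower().replace("-", ":")


# Hypothetical MAC addresses; in practice these come from the captured trace.
ALIASES = {
    normalize("40:00:00:00:00:01"): "Boot Server",
    normalize("40:00:00:00:00:02"): "Boot Client",
}


def display_address(mac, aliases=ALIASES):
    """Return the alias for a MAC address, or the address itself if none is set."""
    key = normalize(mac)
    return aliases.get(key, key)


print(display_address("40-00-00-00-00-01"))
print(display_address("40:00:00:00:00:02"))
print(display_address("10:00:5A:AA:BB:CC"))
```

Addresses without an alias fall through unchanged, which mirrors the way unaliased stations continue to appear by address in the trace listing.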
Once the filter has been applied, the number of records displayed is reduced from 3577 to 849, and they are all SMB type records. Now we can more easily track the access of files from the server in order to identify the problem. First we scan the entire filtered trace in search of return codes or errors. One error condition we expect to find is the ERROR_OPEN_FAILED condition. This apparent error occurs when OS/2 attempts to load boot drivers without the benefit of a fully qualified path to the file. OS/2 is then forced to search the boot path for the file in question, and it reports the ERROR_OPEN_FAILED condition each time the file is not found in the searched location. There are three locations OS/2 can search for a boot device driver during a WSOD boot:
Finally we reach the last Read command. We can see that this record is followed by an SMB close command and that there are no more commands requesting file opens or reads. The final file accessed is IBM1S506.ADD. After this file is closed, we are able to verify that the machine is still connected to the network by the MAC frames being generated.
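The reasoning used in this analysis (keep only the SMB records, note the expected open failures, then find the last file that was opened or read) can be sketched in a few lines. This is a simplified illustration under stated assumptions: the record fields, the sample data, the file name EXAMPLE.SYS, and the idea of restricting the filter to traffic between the two aliased stations are all hypothetical, and each record stands in for a request together with its response.

```python
def rec(src, dst, protocol, command=None, file=None, status=None):
    """Build one simplified, hypothetical trace record."""
    return {"src": src, "dst": dst, "protocol": protocol,
            "command": command, "file": file, "status": status}


C, S = "Boot Client", "Boot Server"

# Hypothetical sample trace, already time sequenced.
records = [
    rec(C, S, "SMB", "open", "EXAMPLE.SYS", "ERROR_OPEN_FAILED"),
    rec(C, S, "SMB", "open", "EXAMPLE.SYS", "OK"),
    rec(C, S, "SMB", "read", "EXAMPLE.SYS", "OK"),
    rec(C, S, "SMB", "close", "EXAMPLE.SYS", "OK"),
    rec("Other Station", S, "SMB", "open", "OTHER.DAT", "OK"),
    rec(C, S, "SMB", "open", "IBM1S506.ADD", "OK"),
    rec(C, S, "SMB", "read", "IBM1S506.ADD", "OK"),
    rec(C, S, "SMB", "close", "IBM1S506.ADD", "OK"),
    rec(C, "Broadcast", "MAC"),
]


def smb_between(records, a, b):
    """Keep only SMB records exchanged between stations a and b."""
    stations = {a, b}
    return [r for r in records
            if r["protocol"] == "SMB" and {r["src"], r["dst"]} == stations]


def open_failures(records):
    """Return the records that report ERROR_OPEN_FAILED."""
    return [r for r in records if r["status"] == "ERROR_OPEN_FAILED"]


def last_file_accessed(records):
    """Return the name of the last file successfully opened or read, if any."""
    last = None
    for r in records:
        if r["command"] in ("open", "read") and r["status"] == "OK":
            last = r["file"]
    return last


smb = smb_between(records, C, S)
print(len(smb), len(open_failures(smb)), last_file_accessed(smb))
```

In the sample data, the single ERROR_OPEN_FAILED record is followed by a successful open of the same file, which is the benign search pattern described above rather than a genuine failure.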
Use this link to see a filtered, annotated listing of the SMB records in this problem boot trace file.
The trace of this RPL boot showed no errors or unusual events on the network, but review of the files loaded during the trace revealed that the last file accessed and loaded by the workstation during the boot was the OS/2 IDE device driver IBM1S506.ADD. Since it was known that this file had been updated during this package install, it became a prime suspect as the cause of the problem. To confirm suspicions, the original version of the IBM1S506 driver was restored to the system, after which the workstations were able to boot successfully.
By inspecting the SMB records we determined that the last file accessed during the boot was the IBM IDE driver IBM1S506.ADD. Knowing that this driver controls interaction between the IDE controller and the system hard drive, and that it was one of the drivers updated in the latest package, we narrowed our search to this subsystem. We were then able to concentrate on identifying any inconsistencies in the disk subsystem. When we dispatched a technician to the site, he discovered that the drives in the failing workstations were configured in a Master/Slave configuration instead of the Master w/o Slave configuration. Even though this configuration had existed all along, the previous driver had no problem handling the improperly configured drive, whereas the newer driver expected to find a slave drive attached when the Master was configured this way. This caused the system to appear to hang as the IDE controller searched for a slave drive which did not exist. When the drives were reconfigured in the Master w/o Slave configuration, the systems were able to boot with no problem using the new IBM1S506.ADD file.
Western Digital designs their IDE drives to meet one of the four configurations listed below: