Last Updated: November 1996, converted very quickly to html July 2000

IM053 Digital Unix and Oracle7: Recipes for Disaster

DECUS Anaheim: Thursday 14 November 1996, 11:00 - 11:50 Room C1
The bad news:
  • What can happen to YOU
  • What WILL happen to you
The good news:
  • What you can do about it
  • How you can prevent it

Presenters

Gordon Schumacher, sxrgs@alaska.edu
David DeWolfe,     sxdjd@alaska.edu
Kurt Carlson,      sxkac@alaska.edu

University of Alaska
Statewide Office of Information Services, Technical Services
910 Yukon Drive
Fairbanks, Alaska 99775-6200

Abstract

ORA-0600, HSZ battery failures, AdvFS panics, CPU EXCEPTION, bus reset, oracle recovery failed? Been there, done that. This session will share some practical experience reacting to and avoiding disasters in a Digital Unix and Oracle7 environment. Recovery techniques, backup strategies, firmware and error management, patch management, configuration management, and vendor management will be discussed.

Introduction

"We're not in Kansas anymore..."
The University's transition in administrative data processing.

The University of Alaska consists of three major campuses located in Fairbanks, Anchorage and Juneau, plus 11 outlying campuses serving regional communities. The University currently serves approximately 33,000 students. The University is also a research institution and supports community service, educational, and cooperative extension programs across the state, employing approximately 5000 people.

The University's administrative computing applications provide support for Financial, Human Resource and Student Information. Programming, technical and operational support for these applications is provided by the Statewide Office of Information Services and the Statewide Office of Network Services.

Where we're coming from:

Since 1985, the University's administrative applications have been running on an IBM mainframe platform running MVS and using IDMS as the database management system. On that platform, Finance, Human Resource and the Student Information Systems run as three separate applications in dedicated databases.

For a decade and through various migrations from one operating system to the next (MVS/SP->MVS/XA->MVS/ESA), and various versions of IDMS and of the application code, this environment has provided us with a secure, robust and manageable environment (i.e., IBM's oft mentioned RAS: Reliability / Availability / Serviceability is not a meaningless acronym).

Our current "legacy system" is an IBM3090-150E running MVS/ESA 4.3, IDMS 12.0, and various layered products from IBM and third party vendors.

Where we're going:

In 1993 the decision was made to migrate to a client/server computing environment and to implement the "Banner" product from Systems & Computer Technologies (SCT) to provide an integrated application with support for all administrative statewide computing systems. Digital Equipment's Alpha product line running Digital UNIX (then OSF/1) was chosen for the operating system platform and Oracle RDBMS was chosen for the database management system. Implementation began in the summer of 1994 with the first application (Finance) "live" in July 1995. Human Resources comes live at the close of 1996 with the final application, the Student Information system, coming live after spring semester registration in January 1997.

Computing Environment:

Alpha 8400 5/440 (glacier)
4 CPU's and 6GB memory, 110GB behind HSZ40 controllers;
Digital UNIX v3.2g, AdvFS, NSR, Oracle 7.1.4/7.2.3, etc.;
Running the SCT Banner Finance, Human Resource and Student Information.
Alpha 2100A 5/300 (spike)
3 CPU's and 2GB memory, 80GB+ behind HSZ40 controllers;
Digital UNIX v3.2g, AdvFS, NSR, Oracle 7.1.4/7.2.3, etc.;
Running an Oracle instance for data warehousing and decision support.
Alpha 2100 4/200 (nugget)
1 CPU and 768MB memory, 20GB+ behind HSZ40 controller;
Digital UNIX v3.2g, AdvFS, NSR, Oracle 7.1.4/7.2.3, etc.;
Running our test Oracle instance and supporting applications development.
Two 3000-300LX systems used strictly for system and database software testing.

Speakers

The speakers all work in Technical Services for the University of Alaska Statewide Office of Information Services. Kurt Carlson has 20 years of data processing experience as a systems programmer supporting GCOS, VAX/VMS, IBM/MVS and Digital UNIX. David DeWolfe has 15 years of DP experience and provides database administration for IDMS and Oracle as well as systems support for Digital UNIX. Gordon Schumacher has 18 years of DP experience, 15 of them as a systems programmer supporting IBM/MVS, IBM/VM and Digital UNIX operating systems.

General Outline


The Oracle Story

OUR AFFECTED ORACLE ENVIRONMENT:

WHAT HAPPENED IN BRIEF:

WHAT HAPPENED NEXT:

Oracle support was called and a priority 1 TAR (Technical Assistance Request) was opened.

In the hour and a half that we waited to hear back from Oracle we formulated a general game plan:

RECOVERY ATTEMPT #1:

Oracle trace files were email'd to Oracle US support.

Support instructed us to add the following undocumented parameters to the FINP init.ora:

	event="10210 trace name context forever, level 10"
	event="10231 trace name context forever, level 10"
	event="10211 trace name context forever, level 10"
	event="10015 trace name context forever, level 10"
	_db_block_cache_protect=true
	_db_block_compute_checksums=true

These are discussed in the Oracle Press Backup and Recovery Handbook.

Support also informed us that we had corruption in 2 datafiles, #64 and #92. I informed them that there were only 33 datafiles associated with this database. They called back a while later to inform us that they had incorrectly converted addresses in the trace files, and that the actual corrupted datafiles were #16 and #23.
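
How the addresses were mis-converted is not recorded here. It is worth noticing, though, that 64 = 4 x 16 and 92 = 4 x 23, which is the pattern you would see if a file number were pulled out of a packed file/block address using a field boundary that is off by two bits. The sketch below only illustrates that arithmetic. The 32-bit layout and the 6-bit and 8-bit field widths are assumptions chosen for the example, not Oracle's documented format on this platform, and the block number 1234 is made up.

```python
# Hypothetical illustration: a 32-bit packed address holding a file number
# in the high bits and a block number in the low bits. The field widths
# used here are assumptions for the example, not a documented Oracle layout.

def pack_address(file_no, block_no, file_bits, total_bits=32):
    """Build a packed address; used here only to create sample values."""
    block_bits = total_bits - file_bits
    return (file_no << block_bits) | block_no


def split_address(addr, file_bits, total_bits=32):
    """Split a packed address into (file_number, block_number)."""
    block_bits = total_bits - file_bits
    file_no = addr >> block_bits
    block_no = addr & ((1 << block_bits) - 1)
    return file_no, block_no


TRUE_FILE_BITS = 6    # assumed width actually used when the address was packed
WRONG_FILE_BITS = 8   # assumed width used by mistake when decoding

if __name__ == "__main__":
    for real_file in (16, 23):
        addr = pack_address(real_file, 1234, TRUE_FILE_BITS)
        wrong_file, _ = split_address(addr, WRONG_FILE_BITS)
        right_file, _ = split_address(addr, TRUE_FILE_BITS)
        print(f"file {right_file} mis-decoded as {wrong_file}")
```

The practical lesson matches the narrative: a quick sanity check of a reported file number against the known datafile count (33 here) catches this kind of error immediately.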

Support then instructed us to alter the two datafiles off-line and attempt to open the database. This failed with an error 376 as the database was trying to recover transactions in objects in both of the off-line datafiles.

We were given two options at this point:

  1. dump the redo logs and send them to Oracle Support so they could attempt to figure out the problem.
  2. do point-in-time recovery to prior to the FINP crash.

We selected the second option after being informed that the first option could take several days, if not weeks.

Somewhere around here the TAR was transferred to the Australian support center.

Since we were going to do incomplete recovery we had to restore all 33 datafiles. This took approximately 3 hours. At about 10:30 PM on the 15th we initiated incomplete recovery:

	recover database until time '1996-02-15:11:45:00'

At 11:00 PM another DBA came in to monitor the recovery while the rest of us went home to get some sleep. At 12:30 AM on the 16th I was called and informed that the recovery had aborted with an (oh no!) ORA-0600 while applying archive log sequence #2101. Sequence #2101 was the active on-line redo log when Glacier crashed at 3:38 AM on the 15th.

I came back in and called Oracle UK support, to whom the TAR had been transferred.

UK support suggested doing point-in-time recovery to prior to the 3:38 AM system crash on the 15th. We informed them that that would result in the loss of about 5 hours of financial transactions, and that only as a *last* resort would we consider that. We firmly requested that other options be explored.

I prepared to restore the 33 datafiles, and informed support that it would take 3 hours. Support said they would call back.

RECOVERY ATTEMPT #2:

UK support called back and informed us that they had identified the object # and datafile that caused the recovery to fail. We could not open the database to find out what object #4173 was, but it resided in an index "data" file! This was the break we were looking for.

UK support indicated that we might be able to get past our recovery ORA-0600 by doing the following:

  1. perform cancel-based recovery up until and including archive log sequence #2100.
  2. alter the datafiles associated with the index tablespace above off-line.
  3. open the database without resetting the log files and determine what object #4173 was.
  4. drop object #4173.
  5. alter the datafiles on-line that were altered off-line above.
  6. shutdown the database.
  7. startup mount the database.
  8. initiate complete recovery.
  9. take a backup.
  10. recreate the index that was dropped above (#4173).

UK support indicated that they did not know if this would work or not, but they, and we, thought it was worth a try.

While restoring the datafiles, the TAR was transferred back to US support in Florida. They promptly questioned the plan that UK support had formulated. I *demanded* that US support contact the UK analysts who had worked on the TAR, and that they discuss/modify/verify the planned recovery attempt.

US support called back a short time later with a modified plan. We would:

  1. perform cancel-based recovery up until and including archive log sequence #2100.
  2. shutdown the database and do a backup (our idea).
  3. startup mount the database.
  4. alter the datafiles associated with the index tablespace off-line.
  5. initiate complete recovery, which only recovers on-line datafiles:
	"recover database"
  6. open the database.
  7. drop the index tablespace (drop tablespace including contents).
  8. recreate the index tablespace.
  9. recreate the indexes in that tablespace.

The cancel-based recovery succeeded, so we shut down the database and did a backup. We then started the database (startup mount), and altered the datafiles associated with the index tablespace off-line. Next we attempted complete database recovery, which was possible since we had not reset the redo logs after performing the incomplete database recovery:

	"recover database"

Since complete recovery only recovers datafiles that are on-line, archive log sequence #2101 was applied successfully, and the complete recovery finished successfully at 10:11 AM on the 16th.

At this point US support instructed us to open the database. This resulted in the following:

  ORACLE instance FINP (pid = 15)
  Error 376 encountered while recovering transaction (16,131) on object 4173.

We shut down the database, and US support instructed us to add "_offline_" to the "rollback_segments" parameter in the init.ora. We attempted to open the database again and received several messages regarding the datafiles of a tablespace being off-line, and SMON messages regarding the rollback segments. The database was open, however, and we proceeded to drop the index tablespace in question:

	"drop tablespace finlindex including contents"

We shut the database down and removed "_offline_" from the "rollback_segments" parameter from the init.ora. The database was then started and opened successfully. We again shut the database down, and removed all the undocumented init.ora parameters as well as changing MTS_SERVERS back to 70. This next startup would be a "normal" startup.

The database was then successfully started and opened, and shut down.

Then we did a backup!

After the backup we started the database and recreated the index tablespace that we had dropped. We then recreated all of the indexes in that tablespace. The SQL to recreate the indexes was generated by querying our pre-production database which, database object wise, is an exact replica of our production database. The index recreation took approximately 6 hours, as some of the indexes in question were on tables of greater than 10 million rows.
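
As a rough illustration of generating index DDL from another database's dictionary, the sketch below turns rows shaped like a join of DBA_INDEXES and DBA_IND_COLUMNS into CREATE INDEX statements. This is not the script we actually used. The row layout, the owner, table, index and column names are all hypothetical, storage clauses are omitted, and the index owner is assumed to be the same as the table owner.

```python
from collections import defaultdict

# Hypothetical rows, shaped like a join of DBA_INDEXES and DBA_IND_COLUMNS:
# (owner, index_name, table_name, uniqueness, column_position, column_name)
SAMPLE_ROWS = [
    ("APPOWNER", "IX_A", "T_A", "NONUNIQUE", 2, "COL_B"),
    ("APPOWNER", "IX_A", "T_A", "NONUNIQUE", 1, "COL_A"),
    ("APPOWNER", "PK_B", "T_B", "UNIQUE", 1, "ID"),
]


def build_index_ddl(rows, tablespace):
    """Return one CREATE INDEX statement per index, columns in key order."""
    columns_by_index = defaultdict(list)
    meta = {}
    for owner, index_name, table_name, uniqueness, position, column in rows:
        key = (owner, index_name)
        columns_by_index[key].append((position, column))
        meta[key] = (table_name, uniqueness)

    statements = []
    for key in sorted(columns_by_index):
        owner, index_name = key
        table_name, uniqueness = meta[key]
        # Sort by column_position so composite keys keep their order.
        columns = ", ".join(col for _, col in sorted(columns_by_index[key]))
        kind = "UNIQUE INDEX" if uniqueness == "UNIQUE" else "INDEX"
        statements.append(
            f"CREATE {kind} {owner}.{index_name} ON {owner}.{table_name} "
            f"({columns}) TABLESPACE {tablespace};"
        )
    return statements


if __name__ == "__main__":
    for statement in build_index_ddl(SAMPLE_ROWS, "FINLINDEX"):
        print(statement)
```

Generating the DDL from a structurally identical database only works if that database really is kept in step with production, which is one more argument for maintaining such a replica.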

The database was backed up again as part of our regular Friday night backups, and as of Saturday the 17th at 6:00 PM the application was deemed "OK" by the application owner.

The entire recovery effort took about 55 hours, with the first 31 hours being non-stop, around-the-clock.

SINCE THEN:

We have had a few system crashes.

Since our faith in Digital Unix and Oracle has been severely shaken, we now do the following after a system crash:

Since we now know enough to take these precautions (exporting the entire database) we can specifically identify corrupted objects and perform recovery on the individual datafiles in which the corrupted objects reside. While the full export takes approximately 2 hours, our last database recovery took less than 1/2 hour. We didn't even bother to call Oracle support on this one.

We opened a TAR with Oracle in an attempt to determine if Oracle did synchronous writes, read-after-write, or used other techniques to avoid this type of corruption and to gather any other info that would help us in determining what was happening. Oracle's response was to upgrade the database to a more current release.


The Digital Story

Overview

Details

Kernel 'short read'

Symptoms:

P00>>>b
[...]
OSF boot - Thu Mar 31 01:10:53 EST 1994

Loading vmunix ...
Current PAL Revision <0x10000000010111>
Switching to OSF PALcode Succeeded
New PAL Revision <0x10000000020115>
Loading into KSEG Address Space

Sizes:
text = 3261360
short read (text)		<-- Can also get 'short read (data)'

halted CPU 0

halt code = 5
HALT instruction executed
PC = 2001003c
boot failure
Environments where problem occurred:
  Digital Unix v3.0, v3.2c on 7620; boot disk AdvFS RZ28B raidset behind HSZ40
  Digital Unix v3.2d-1     on 8400; boot disk AdvFS RZ29B raidset behind HSZ40
Likely cause: Kernel (/vmunix) fragmentation

Detection:

# /sbin/showfile /*vmunix*
       Id Vol PgSz Pages  XtntType Segs SegSz  Log  Perf  File
 ed4.8001   1  16    920    simple   **    **  off  100%  genvmunix
 21c.8017   1  16   1011    simple   **    **  off   33%  vmunix
11ce.8001   1  16   1017    simple   **    **  off  100%  vmunix.Pre:CSC0626
Recovery:
Copy /vmunix to a new file, look for better performance. If necessary defragment.

Hint: always grow root partition beyond default 64MB to retain multiple kernels.
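
The detection step can be automated. The following is a minimal sketch that scans showfile output for kernels whose Perf column is below 100%. It assumes the ten-column layout shown in the sample above (Perf ninth, File tenth) and would need adjusting if showfile prints something different on your release.

```python
# Sample text copied from the showfile output shown above.
SAMPLE_SHOWFILE = """\
       Id Vol PgSz Pages  XtntType Segs SegSz  Log  Perf  File
 ed4.8001   1  16    920    simple   **    **  off  100%  genvmunix
 21c.8017   1  16   1011    simple   **    **  off   33%  vmunix
11ce.8001   1  16   1017    simple   **    **  off  100%  vmunix.Pre:CSC0626
"""


def fragmented_files(showfile_output, threshold=100):
    """Return (file, perf) pairs whose Perf percentage is below threshold."""
    results = []
    for line in showfile_output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and anything not in the assumed layout.
        if len(fields) < 10 or not fields[8].endswith("%"):
            continue
        perf = int(fields[8].rstrip("%"))
        if perf < threshold:
            results.append((fields[9], perf))
    return results


if __name__ == "__main__":
    for name, perf in fragmented_files(SAMPLE_SHOWFILE):
        print(f"{name}: extent performance {perf}%, consider recopying")
```

Run against the sample, only vmunix (33%) is flagged. A check like this is cheap enough to run after every kernel build, before the next reboot finds the problem for you.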

The 2100 affectionately known as Lemon (a.k.a., 'spike')

Chronology:

08/95	2100-5/250 received, named 'spike', 2*cpu 2*512mb
09/95	second disk installed, was D.O.A.;
	inserting that disk also fried an existing boot disk
10/95	KZPSA failed and replaced
01/96	KZPSA failed and replaced
01/96	VGA (compact qvision) adapter replaced
02/96	KZPSA failed and replaced
02/96	memory module failed and replaced
02/96	backplane replaced (believed cause of other problems)
08/96	KZPSA failed and replaced
08/96	IO module replaced (likely bad)
08/96	backplane replaced (precautionary)
Between 9/95 and 2/96 there were a great number of unusual problems with Spike which caused far more confusion than actual outages. Phantom devices would show up with a hardware initialization or boot, devices would disappear, graphics console would randomly be unusable after booting to level 3, etc. Many of these problems can be blamed on a bad backplane which was replaced in February. Prior to the backplane replacement, Spike started getting infinite loops during hardware init:
P00>>>starting console on CPU 0
breakpoint at PC 128480 desired, XDELTA not loaded

interrupt through vector 660 on CPU 0

CPU detected C-bus error

*** unexpected interrupt through vector 00000066
[...]
One of these occurred immediately after the backplane was replaced; they then ceased. Digital identified a known firmware problem in the initial KZPSA firmware from the IO module during initialization as the cause.

In addition, a bad backplane could easily have contributed to the high failure rate (many fried KZPSA's). The IO module may also have been marginal in February: a problem we saw in August (looping C-bus errors) was identical to the problem we saw in February. However, in February the problem was transient. In August the problem was hard, and the system never completed initialization despite repeated resets.

Spike, also known as lemon, is now a 2100A with 3*5/300 cpu's and 4*512mb modules. Our other 2100-4/200 has been stable and solid since installation in June 1994. We have not had fried KZPSA's on our other systems.

So, you think you have battery backup?

HSZ40 Batteries and write-back cache

HSZ40 controllers with write-back cache have battery backup to protect in-flight data from loss during a power failure. Our first and only power failure on 18 September 1995 resulted in an unbootable system. Why? One raidset had unflushed cache and the battery was dead. Resolution was to restore the raidset and the other two raidsets containing Oracle data for that instance and roll-forward. Digital's response was that there was a known bad batch of HSZ batteries. To avoid a recurrence, they suggested we consider running HSZ's in dual redundant pairs.

This was HSZ40 v2.0. A subsequent release (v2.5 or v2.7) will disable disk access when battery low conditions are detected, which will prevent possible data loss on power failure.

HSZ40 Dual Redundancy

On 29 May 1996 our 8400 started generating massive AdvFS errors to the console until the system finally crashed. What happened? One of the HSZ's of a dual redundant pair detected a bad battery and disabled disk access to avoid corruption. What didn't happen? The HSZ did not "fail" in a way that let its redundant partner assume the load... so the system panic'd.

On 26 August 1996 our 2100 'lemon' system had a similar problem which also led to a crash. The condition went undetected because that particular member of the HSZ dual-redundant pair was not connected to the console manager, so the HSZ message was missed. A subsequent problem with a bad IO module for the 2100 prevented the system from booting.

Rumor has it, a new version of HSZ40 firmware (v3.0) will cause dual-redundant failover under a bad battery condition.

We presently run seven HSZ40's, three dual redundant pairs and one single HSZ, on three different systems. While the configuration options and performance of the HSZ40 controllers have been very good, the three separate battery problems are a concern. Rumor has it that batteries manufactured after 12/95 are "good"; prior to that, it appears the batteries had a functional life of 12 to 18 months before failure. Also, Digital is now actively monitoring our batteries (we think).

Other Places with Batteries?

Prestoserve IO caching modules, for one. We sent ours back to Digital because they typically do not offer much under AdvFS and are another possible weak point during a system problem. Plus, we have HSZ cache a little further down the bus. Note that Prestoserve is not supported for DECsafe ASE at all... ASE was once in our future but was abandoned as more risk than benefit.

AdvFS panics Prevention (patches in general)

If you ran AdvFS under Digital Unix v3.0 you likely experienced a large number of AdvFS panics. Digital said, "Everything will be wonderful under v3.2b (or v3.2c)". Well, things were better under v3.2c and better still under v3.2d-1. However, we still had AdvFS panics. Each time we called one in we were told it was a known problem and a patch was provided. The third time this occurred we finally asked the obvious question:

How the devil do we get these patches on a pro-active basis
before the system panics and our users get hostile?
A support center manager told us we should acquire patches.tar.Z on a regular basis and apply the AdvFS consolidated patch. Since we started aggressively applying preventative patches last May, we have had no software induced panics. We will routinely apply the AdvFS consolidated patch, any patch identified as "potential security vulnerability", and any patch relating to a potential panic which indicates a configuration similar to ours. Have you ever heard of this service? Do you know where to find this file? Do you know how to deconsolidate the file and apply patches selectively?

To the best of our knowledge, this is not an official Digital service and is subject to change (presumably enhancement) without notice. You can find tools to automatically pull and deconsolidate the file under anonymous ftp:

ftp.alaska.edu:/pub/sois/UA_DUtools.README
ftp.alaska.edu:/pub/sois/UA_DUtools.tar.Z
These tools have been used for v3.2c, v3.2d-1, v3.2g, v4.0, and v4.0a.

Plus, starting 30 September 1996 the consolidated README for each osfv* version started appearing in the same directory as patches.tar.Z so you can grab just the README to see if you need or want to grab the whole patch kit. It is very useful to have this file on-site to grep for strings of problems you may encounter.
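
For anything beyond a one-off grep, a small script can return the patch IDs whose README entry mentions a given string. The sketch below assumes each entry begins with a "PATCH ID:" line (as in the excerpts in this paper) and runs until the next such line; the rest of the README layout is an assumption. The description lines in the sample are invented placeholders, not text from a real README.

```python
import re

PATCH_LINE = re.compile(r"\s*PATCH ID:\s*(\S+)")

# The PATCH ID lines follow the format seen in the consolidated README;
# the indented description lines are made-up placeholders for the example.
SAMPLE_README = """\
PATCH ID: OSF375-052     SUBSET(s): OSFADVFSBIN375 OSFADVFS350 OSFADVFSBIN350
  (hypothetical description) AdvFS domain panic under heavy I/O
PATCH ID: OSF375-053     SUBSET(s): OSFBIN375
  (hypothetical description) tape driver fix
"""


def split_entries(readme_text):
    """Return a list of [patch_id, entry_text], one per PATCH ID line."""
    entries = []
    for line in readme_text.splitlines():
        match = PATCH_LINE.match(line)
        if match:
            entries.append([match.group(1), line])
        elif entries:
            entries[-1][1] += "\n" + line
    return entries


def find_patches(readme_text, keyword):
    """Return patch IDs whose entry mentions keyword (case-insensitive)."""
    needle = keyword.lower()
    return [patch_id for patch_id, text in split_entries(readme_text)
            if needle in text.lower()]


if __name__ == "__main__":
    print(find_patches(SAMPLE_README, "advfs"))
```

Searching by the string from a panic message or uerf entry is usually the fastest way to find out whether a problem is already known.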

As of mid-October (through the 02 October v3.2g consolidated file), we had the following patches applied to v3.2g:

sxkac@nugget> PATCHES
+++ OSF375-047 install: sxkac ran ./OSF375-047/Install on Sat Sep 21 10:10:35 1996
+++ OSF375-049 install: sxkac ran ./OSF375-049/Install on Sat Sep 21 10:11:28 1996
+++ OSF375-350222 install: sxkac ran ./OSF375-350222/Install on Sat Sep 21 10:12:03 1996
+++ OSF375-350245 install: sxkac ran ./OSF375-350245/Install on Sat Sep 21 10:12:29 1996
+++ C960424-5207v32g install: sxkac ran ./C960424-5207v32g/Install on Sat Sep 21 10:13:50
+++ OSF375-050 install: sxrmh1 ran ./OSF375-050/Install on Wed Oct 16 14:44:15 1996
+++ OSF375-052 install: sxrmh1 ran ./OSF375-052/Install on Wed Oct 16 14:44:40 1996
+++ OSF375-055 install: sxrmh1 ran ./OSF375-055UA/Install on Wed Oct 16 14:45:02 1996
+++ OSF375-350269 install: sxrmh1 ran ./OSF375-350269/Install on Wed Oct 16 14:45:24 1996
+++ OSF375-370034 install: sxrmh1 ran ./OSF375-370034/Install on Wed Oct 16 14:45:35 1996
You will notice we have a means to track what patches were applied (including when and by whom). We suspect Digital may eventually provide this type of capability using setld for patch application. Under v3.2d-1 we had the following patches applied:
C960215-4398 C960424-4096 C960424-5207 C960424-5207_2
OSF360-012 OSF360-020 OSF360-023 OSF360-025 OSF360-027 OSF360-030
OSF360-033 OSF360-039 OSF360-041 OSF360-043 OSF360-044 OSF360-046
OSF360-350056 OSF360-350061 OSF360-350079 OSF360-350084
OSF360-350090 OSF360-350096 OSF360-350102 OSF360-350142
OSF360-350145 OSF360-350152 OSF360-350154 OSF360-350164
OSF360-350177 OSF360-350178 OSF360-350182 OSF360-350183
OSF360-350184 OSF360-350186 OSF360-350188 OSF360-350197
OSF360-350200 OSF360-350201 OSF360-350203 OSF360-350205
OSF360-350206 OSF360-350211 OSF360-350218 OSF360-350222
OSF360-350223 OSF360-350234 OSF360-350245 OSF360X-350015
Four were custom patches (three were variations of cam_tape.o). Of the others, all but two were incorporated into v3.2g. A patch *is* a change. There is some risk a patch will break something as well as fix something. By appearances, with only two patches not being incorporated in v3.2g, the success rate (quality) of patches making it to the consolidated file is very good even if the advertisement, distribution, and application tools for it are poor. Digital also does custom patches for specific customer problems; these do not make the consolidated file (likely because they get a less thorough quality review). We have yet to find a means to tap into the knowledge which generates those patches to be able to recognize if we have any similar problems.
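
The "+++" tracking lines shown earlier are regular enough to parse, which makes it easy to answer "is patch X on this system, and who put it there?". The following is a sketch only, not part of UA_DUtools: it assumes the line format seen in the PATCHES output above and keeps the timestamp as plain text, since one of the sample lines is missing its year.

```python
import re

LOG_LINE = re.compile(r"^\+\+\+ (\S+) install: (\S+) ran (\S+) on (.+)$")

# Lines copied from the PATCHES output shown earlier.
SAMPLE_LOG = """\
+++ OSF375-047 install: sxkac ran ./OSF375-047/Install on Sat Sep 21 10:10:35 1996
+++ C960424-5207v32g install: sxkac ran ./C960424-5207v32g/Install on Sat Sep 21 10:13:50
+++ OSF375-050 install: sxrmh1 ran ./OSF375-050/Install on Wed Oct 16 14:44:15 1996
"""


def parse_patch_log(log_text):
    """Return a list of dicts with patch, user, script and when fields."""
    records = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            patch, user, script, when = match.groups()
            records.append({"patch": patch, "user": user,
                            "script": script, "when": when.strip()})
    return records


def installed_patches(log_text):
    """Return the set of patch IDs recorded as installed."""
    return {record["patch"] for record in parse_patch_log(log_text)}


if __name__ == "__main__":
    for record in parse_patch_log(SAMPLE_LOG):
        print(record["patch"], "by", record["user"], "on", record["when"])
```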

For a given release we see a new consolidated file showing up every couple of weeks:

Subject:  atlanta_ftp osfv32g: Oct3osfv32g /PATCH/patches/atlanta/osfv32g_961010

PATCH ID: OSF375-052     SUBSET(s): OSFADVFSBIN375 OSFADVFS350 OSFADVFSBIN350

sdiff k  osfv32g_961010 (Oct3osfv32g) osfv32g_961003 (Sep30osfv32g):

PATCH ID: OSF375X-350023     SUBSET(s): OSFX11350               <
PATCH ID: OSF375-052     SUBSET(s): OSFADVFSBIN375 OSFADVFS350  |  PATCH ID: OSF375-047
PATCH ID: OSF375-050     SUBSET(s): OSFBIN375 OSFBIN350 OSFBIN  |  PATCH ID: OSF375-049
PATCH ID: OSF375-053     SUBSET(s): OSFBIN375                   <
PATCH ID: OSF375-370033     SUBSET(s): OSFHWBIN350 OSFHWBIN375  <
PATCH ID: OSF375-370034     SUBSET(s): OSFHWBIN375              <
PATCH ID: OSF375-350269     SUBSET(s): OSFBIN350                <
Typically there are 3 to 12 new or superseded patches in each new release. We review the new patches and schedule application across our systems for those which are applicable. The review is very quick once you are current.
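
The review itself boils down to a set comparison of the PATCH ID lines in two consolidated READMEs. Below is a minimal sketch of that comparison, assuming only the "PATCH ID: <id>" line format shown above. The two sample texts are abbreviated from the sdiff excerpt and are not complete READMEs.

```python
import re

PATCH_ID = re.compile(r"PATCH ID:\s*(\S+)")

# Abbreviated from the sdiff excerpt above: newer release, then older release.
NEW_README = """\
PATCH ID: OSF375-052     SUBSET(s): OSFADVFSBIN375 OSFADVFS350
PATCH ID: OSF375-050     SUBSET(s): OSFBIN375 OSFBIN350
PATCH ID: OSF375-053     SUBSET(s): OSFBIN375
"""
OLD_README = """\
PATCH ID: OSF375-047
PATCH ID: OSF375-049
"""


def patch_ids(readme_text):
    """Return the set of patch IDs named on PATCH ID lines."""
    return set(PATCH_ID.findall(readme_text))


def compare_releases(new_text, old_text):
    """Return (added, dropped) patch IDs between two consolidated READMEs."""
    new_ids, old_ids = patch_ids(new_text), patch_ids(old_text)
    return sorted(new_ids - old_ids), sorted(old_ids - new_ids)


if __name__ == "__main__":
    added, dropped = compare_releases(NEW_README, OLD_README)
    print("new in this release:", added)
    print("no longer listed (possibly superseded):", dropped)
```

A patch ID that disappears has usually been superseded rather than withdrawn, so dropped IDs deserve the same review as added ones.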

AdvFS panics Recovery

Well, unfortunately we once lost /usr, which happens to include /var->/usr/var and /nsr->/var/nsr, in one of those AdvFS panics. Recovery from this means rebuilding the root disk and recovering NSR. A checklist of this process is also under UA_DUtools.tar.Z. On the positive side, it was reassuring to discover our backup strategies and recovery skills were indeed adequate.

In addition, for corruption where domains#filesets are still mountable there are the relatively well known, albeit unsupported and undocumented, AdvFS utilities under /usr/field:

/usr/field/msfsck

This is the AdvFS bitfile-subsystem metadata structure checker. It verifies low-level meta-structures like the BMT, storage bitmap, and tag directories.

The file domain must be inactive to run msfsck. You also need at least one mounted fileset (this is because msfsck uses the .tags directory in the fileset to access the metadata).

To run msfsck, first 'cd' to the mount point of a mounted fileset.
Then, run "/usr/field/msfsck -t ".
/usr/field/vchkdir

This is the AdvFS directory structure checker and fixer. It verifies that the directory structure is correct and that all directory entries reference a valid file (tag) and that all files (tags) have a directory entry.

The -f flag will create symlinks in "<mount-point>/lost+found/" to all files (tags) that do not contain a directory entry; these are called lost files. The -f flag also removes 'dead' directory entries (ones that do not point to valid tags).

The -d option will delete lost files and it will delete corrupted directories.

Note, you may need to run vchkdir several times to clean up a fileset.

The file domain must be inactive to run vchkdir.
The fileset to be checked/fixed must be mounted.

To run vchkdir do "/usr/field/vchkdir <mount-point>".

/usr/field/tag2name

This program will display the full pathname of a file when only the file's tag (inode) number is known. This is mainly a debugging aid when msfsck or vchkdir report errors for specific tags.

To run do "/usr/field/tag2name <mount-point>/.tags/<tag-number>".

It's only firmware, right? (KZPSA A09 and bus resets)

Due to late shipment of our 7620 to 8400 upgrade (that never happens to anybody else, right?), we missed a Christmas-New Years installation window and deferred until early February following spring semester student registration. Because of the complexity of installing the consolidated patch kit (we had not, as yet, written tools to handle it), we decided to go from Unix v3.2c to v3.2d-1 in the same time frame for fixes relevant for 8400's. Included with the v3.2d-1 upgrade was firmware CD v3.4.

This required three changes (hardware, operating system, and firmware) in a relatively tight time frame. We have the advantage of a 2100-4/200 (nugget, not lemon) used for testing and development. This system is generally the first to get any change, typically several weeks in advance for software changes; in this case it was the same week. The firmware upgrade on the 2100-4/200 KZPSA's (A08 to A09) failed with:

	open failure firmware filename 'KZPSA_fw'
We opened a problem with Digital. They had no ideas, but we stayed with v3.2d-1 and continued with plans for the 8400. We did not worry about the firmware upgrade failure. After all, it's only firmware.

The upgrade to the 8400 and 3.2d-1/fw3.4 on both the 8400 and 2100-5/250 went smoothly the first weekend of February. However, several days later both systems started having problems where they would periodically lose access to their disks behind their HSZ40's. Digital support concurred with us that the likely problem was the KZPSA A09 firmware (the common element and indicated by bus resets showing with uerf), but offered no resolution other than trying to escalate getting us a copy of A10. Understandably, having one bad firmware release with no release notes did not make us anxious to install another. We unilaterally backed them to A08, which wasn't recommended, and restored stability in the interim until we had both the CD required for the 8400 and the opportunity to review potential impact of A10.

It is not clear that our 2100-4/200 would have seen the problem in advance had it been upgraded to KZPSA A09, but the moral is to manage changes better (fewer at once) and to limit "trust" in anything without detailed release notes.

The 2100-4/200 failed to upgrade the KZPSA firmware because we placed the CD in a bus-1 CD-drive instead of RZ6 internal to the 2100... this was an unknown limitation (i.e., bug) or under-documented feature in fw3.4 kits for the 4/200.

It's only firmware, right? (Disk firmware and bus timeouts)

On the consolidated software CD is a firmware directory containing only RZ28B and RZ25L firmware. Specifically, RZ28B 003 to 006 upgrade is required for ASE configurations, but it was not otherwise recommended (in fact, almost discouraged).

In panics (simple_lock timeouts) on 15 February, 24 April, and 28 April our 8400 hung, permanently, in 'syncing disks...'. The hang was so severe that the ctrl/p console interrupt was ineffective and the system had to be restarted. We had no further hangs during 'syncing disks...' (we did have more panics) after:

Root domain moved to RZ28M's from RZ28B's (003) on less used HSZ40 and bus;
Disabled the prestoserve module;
Second (of three) cam_tape.o patches was applied;
OSF360-018 consolidated AdvFS patches were applied;
System activity was lower at times of non-corruption panics.

We do not know specifically what fixed the 'syncing disks...' hangs, but informal word from Digital engineering was there were some known problems with bus timeouts with the RZ28B 003 firmware which could have been the culprit.

During the second of three HSZ40 battery problems we noticed some unusual HSZ behaviours which ultimately led to the suspect HSZ40 being replaced. However, Digital engineers also made recommendations that the following disk firmware upgrades be done:

RZ28	D41C|441C	to	442D
RZ28M	0466		to	0568

It was noted that these were not "mandatory" updates, but we have really started believing in pro-active problem avoidance. Since this already involved a significant percentage of our disk farm, we looked for current firmware for all of our disks and did the following:

RZ28B	0003	to	0006		 7 disks
RZ25L	0006	to	0008		 2 disks
RZ26	T386	to	T392A		 2 disks
RZ26L	440C	to	442D		 2 disks
RZ28	D41C	to	442D		29 disks
RZ28	441C	to	442D		 3 disks
RZ28M	0466	to	0616		 8 disks
RZ29B	0014	to	0016		50 disks
RZ28M	0568	to	0616		12 disks

This amounted to 115 of our 117 disks. It was a 3 month process to cycle through all of them. Pay attention to the instructions to back up your disks before upgrading; in aggregate we fried 3 disks doing upgrades. The most reliable means to update was via HSUTIL available in v2.7 HSZ40 firmware. Documented instructions for RZ26L's were inaccurate at least on 3000-300LX systems... one must wait for 5 minutes after SCU download indicates it is complete (it really isn't), or you will fry the disk. We found it was impossible to upgrade RZ26 T386's either via SCU or HSUTIL and ultimately gave up on those disks. Further disk firmware update information and examples are in UA_DUtools.tar.Z.
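
With this many disks it helps to keep the upgrade table in a form a script can check. The sketch below encodes the table above, confirms the 115-disk total, and looks up the target firmware for a given model and current revision. The function names are illustrative; how you collect each disk's model and revision (SCU, HSZ console, inventory file) is left out.

```python
# (model, current firmware, target firmware, number of disks), from the table above.
UPGRADES = [
    ("RZ28B", "0003", "0006", 7),
    ("RZ25L", "0006", "0008", 2),
    ("RZ26", "T386", "T392A", 2),
    ("RZ26L", "440C", "442D", 2),
    ("RZ28", "D41C", "442D", 29),
    ("RZ28", "441C", "442D", 3),
    ("RZ28M", "0466", "0616", 8),
    ("RZ29B", "0014", "0016", 50),
    ("RZ28M", "0568", "0616", 12),
]


def total_disks(upgrades=UPGRADES):
    """Sum the disk counts across all rows of the upgrade table."""
    return sum(count for _model, _old, _new, count in upgrades)


def target_for(model, current, upgrades=UPGRADES):
    """Return the target firmware for a disk, or None if no row applies."""
    for row_model, old, new, _count in upgrades:
        if row_model == model and old == current:
            return new
    return None


if __name__ == "__main__":
    print("disks in plan:", total_disks())
    print("RZ28 at D41C goes to:", target_for("RZ28", "D41C"))
```

A disk that returns None is either already current or not in the plan, and is worth a second look in both cases.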

'simple_lock: time limit exceeded' panics

As an introduction, we have seen at least three different flavors of simple_lock panics. However, a very common one (at least 8 sites) is triggered by volumes of soft tape errors on TZ877 tape drives.

Symptoms:

simple_lock: time limit exceeded

	pc of caller:         0xfffffc00004c172c
	lock address:         0xfffffc007f8fc7d0
 	current lock state:   0x00000000004c1605 (cpu=0,pc=...

panic (cpu 0): simple_lock: time limit exceeded
syncing disks...
[...]
root@glacier> dbx -k /vmunix	# use pc from panic
dbx version 3.11.8
(dbx) 0xfffffc00004c172c/i
[ctape_close:2674, 0xfffffc00004c172c] bsr r26, simple_lock(line262)
(dbx) quit

If it is in ctape_close and uerf shows a volume of soft errors, you have the problem. There are patches to cam_tape.o for v3.2c, v3.2d-1, and v3.2g available through CSC; they are not yet in the consolidated patch file. The problem likely exists in both v4.0 and v4.0a in some form.

In our case, a single TZ877 was generating hundreds of errors each night during the cloning run to copy NSR savesets. The patch (third revision) was effective for stopping the panics, but ultimately the TZ877 also had to be replaced to stop the soft errors. Several other sites had similar experiences.

Recommendation:

Monitor hardware errors and do pro-active reviews before problems escalate.

This was clearly a software problem, but it was hardware triggered. We send automated nightly messages summarizing uerf reports. These take the form:

#glacier  Tue Oct  8 1996
>11:13:23      3  100 CPU EXCEPTION
#glacier  Sat Oct 12 1996
>22:26:56      4  199 Bus:06 lu:49.0 R=ctape_iodone:Soft Error Detected (rec:DEC TZ877:+
#glacier  Sun Oct 13 1996
>00:12:42      5  199 Bus:06 lu:49.0 R=ctape_wfm:Soft Error Detected (rec:DEC TZ877:+

Summary:
     Total     1  100 CPU EXCEPTION
     Total     1  199 Bus:06 lu:49.0 R=ctape_iodone:Soft Error Detected (rec:DEC TZ877:+
     Total     1  199 Bus:06 lu:49.0 R=ctape_wfm:Soft Error Detected (rec:DEC TZ877:+
The programs and scripts sending these summaries are in UA_DUtools.tar.Z.
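
As an illustration of the "Summary" step only, the tally could be sketched in a few lines of shell and awk. This is not the ua_uerf code from UA_DUtools.tar.Z; the function name summarize_events is hypothetical, and it assumes event lines shaped like the sample above (a leading '>', the time, a sequence number, the event type, then the description).

```shell
# Tally event lines of the form ">HH:MM:SS  SEQ  TYPE  DESCRIPTION" from stdin.
# Hypothetical helper for illustration; not the authors' script.
summarize_events() {
    awk '
        /^>/ {
            key = $3                      # event type code, e.g. 100 or 199
            for (i = 4; i <= NF; i++)     # rebuild the description text
                key = key " " $i
            count[key]++                  # one counter per type+description
        }
        END {
            for (k in count)
                printf "Total %5d  %s\n", count[k], k
        }
    ' | sort                              # awk array order is unspecified
}
```

Header lines beginning with '#' are ignored, so a whole nightly report can be piped through the function unchanged.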

Presto-chango: AdvFS and data corruption

Our first simple_lock panic in ctape_close, on 15 February 1996, started a cascade of other problems. It occurred at 3:38am. The system hung syncing disks, was manually restarted, panic'd again shortly after 8am, then auto rebooted. The Oracle databases restarted fine. At 11am the primary Oracle database crashed and would NOT restart, as described earlier. By appearances, an index was corrupted and the active archive log had data corruption. Subsequent examination showed corruption in several other non-Oracle data files and some AdvFS structure corruption. A patch to cam_tape.o was applied to remedy the cause of the panic; it did not do anything for the resulting corruption.

Our second simple_lock panic in ctape_close occurred on 24 April 1996 and led to two subsequent panics of different types:

         %%% Wed Apr 24 05:00:00 AKDT 1996 %%%

    (1)  simple_lock: time limit exceeded

            pc of caller:         0xfffffc00004c172c
            lock address:         0xfffffc007f8fc7d0
            current lock state:   0x00000000004c1605 (cpu=0,pc=...

        panic (cpu 0): simple_lock: time limit exceeded
        syncing disks...
   ------------------------------------------------------------------------
        [...system rebooting, processing rc3.d...]

        -> /sbin/rc3.d/S90schedule start
        [...]
        ***
        *** POLYCENTER Scheduler has STARTED all requested processes
        ***   Startup script procedure exiting with success.
        ***
        SCHED-I-PARTOPENREQ, Partition (ROOT_GLACIER) was requested to Open

   (2)  trap: invalid memory read access from kernel mode

            faulting virtual address:     0x0000000000000015
            pc of faulting instruction:   0xfffffc00003f4018
            ra contents at time of fault: 0xfffffc00003f3ac4
            sp contents at time of fault: 0xffffffffb8342ad0

        panic (cpu 0): kernel memory fault
        syncing disks... done
   ------------------------------------------------------------------------
        %%% Wed Apr 24 10:00:00 AKDT 1996 %%%

   (3)  # ADVFS EXCEPTION
        Module = 2, Line = 1356
        panic (cpu 0):
        syncing disks... done
Remembering the previous problem from 15 February, we held the databases down and with a full export we were able to detect table corruption.

It was fairly obvious that the first panic (simple_lock) either directly or indirectly caused a large degree of corruption which led to the subsequent panics. The corruption included:

  1.1	/etc/auth/system/ttys
	This file was corrupted causing telnet logins to fail until recovered.
	Note, CSC said this file often gets corrupted.

  1.2	/var/sched/tmp/sched_22-Mar-96_05:32:13.exp
	This directory was corrupted, access ('ls -l') caused panic type (2).

  1.3	/usr/bin/test
	/usr/bin/tsort
	/var/sched/help/
	These showed up in the msfsck and 'vchkdir /var' crashed the system.
	Domain containing /var and /usr required restoration.

  1.4	oracle (/u??/ORACLE/FINP/data/?) FBGENL and FGBTRND
	There were corrupted rows in both tables which showed up
	with an attempted full export of the Oracle database.
	No corruption was apparent at the file system level.
	Required action was to recover and roll-forward Oracle database.

  1.5	/users/sx/sxcnb99/ipl/now/SYS1.PROCLIB_RDR400
	/users/sx/sxcnb99/ipl/now/SYS3.PROCLIB_CAS
	Both showed up in find as 'bad status' apparently mapped to
	the directory but not really there.

  1.6	/oraalogs/ORACLE/FINP/arch/*41*.dbf (or possibly 40)
	An attempt to delete this file caused panic (3).  On reboot,
	no filesets in that domain could be mounted (restore required).

The initial panic (1) hung during 'syncing disks...'. Ctrl/P from the 8400 console was ineffective; a "restart" was required. Likewise, the panic on 15 February hung during 'syncing disks...'. This is a serious concern, as syncing the disks is what flushes any pending writes to disk. We have yet to receive a response from Oracle on whether they employ synchronous writes, read-after-write, or other techniques, so we do not know if this is an exposure for Oracle. Detailed analysis from Digital showed that AdvFS does have an exposure to corruption during panic syncing. One patch was produced for this but was ineffective for SMP environments; a second patch is now out.

A third panic of this nature occurred on 28 April and the system again hung during the 'syncing disks'. However, this panic did not result in any corruption. By that point we had identified the one TZ877 used for cloning as triggering the problem, so we shut down Oracle during the cloning process. Also, we had preserved the old /usr and /var RZ28B fw 003 disks corrupted on 24 April for Digital to analyze; the root raidset was restored onto RZ28M disks. Finally, we had disabled the prestoserve module as being suspect and not very useful in our configuration. It is not known which of the above changes prevented corruption on 28 April; maybe we were just lucky for a change.

Other Oracle Corruption (not from cam_tape induced simple_lock panics)

On 21 September 1996 a bad SCSI bus termination resulted in the 8400 generating lots of AdvFS exceptions and halting (it did not even panic):

msfs_unmount failed for /u05 with error 16 during shutdown
msfs_unmount failed for /u02 with error 16 during shutdown
msfs_unmount failed for /users/sx with error 16 during shutdown
msfs_unmount failed for /users with error 16 during shutdown
msfs_unmount failed for /tmp with error 16 during shutdown
msfs_unmount failed for /var/adm with error 16 during shutdown
msfs_unmount failed for /var with error 16 during shutdown
msfs_unmount failed for /usr with error 16 during shutdown
syncing disks... 26 20 7 done
CPU 0: Halting... (transferring to monitor)

halted CPU 0
CPU 1 is not halted

halt code = 5
HALT instruction executed
PC = fffffc0000481950
P00>>>

The cause is known (bad bus termination); the results are still under investigation. Paranoia (lack of confidence that the disks had synced successfully) led us to do a full export of the Oracle databases, and a table was found to be corrupted. Recovery and roll-forward of the affected table was effective. Unfortunately, as a diagnostic aid the export (even to /dev/null) still takes over 2 hours.
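
Since the export is being used purely as a corruption check, its log can be scanned automatically afterwards. The following is a minimal sketch, assuming the export was run with a log file (for example via the exp utility's log= parameter) and that problems show up as lines carrying an ORA- or EXP- message code. The function name scan_export_log is hypothetical, and warnings will match as well as errors, so a hit means "look at this", not "this is corrupt".

```shell
# Report whether an Oracle export log contains ORA-/EXP- message codes.
# Assumption: problems appear as lines containing "ORA-" or "EXP-" plus digits.
# Returns 0 if none were found, 1 if any were printed, 2 if the log is unreadable.
scan_export_log() {
    log=$1
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
    if grep -E '(ORA|EXP)-[0-9]+' "$log"; then
        return 1          # matching lines have been printed for review
    fi
    return 0
}
```

A scheduled job could call this after each diagnostic export and mail the matching lines only when the return code is non-zero.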

Recommendations

Other sites, particularly 2100's, have reported problems via alpha-osf-managers list with hangs during disk sync's. Several other sites have had data corruption after panics, specifically one lost NSR indices with a cam_tape simple_lock and another lost ufs based Oracle files after a cdisk_strategy simple_lock panic. The primary exposure appears to be writes not making it to disk. This exposure may be higher for "database" type applications such as:

Oracle, NSR, POLYCENTER Scheduler (uses Informix), and AdvFS
where a 'lost write' may not be at end-of-file or may be improperly synchronized causing irrecoverable corruption.

We still have our 8400 set 'auto_action boot', but we have added checks within rc3 to detect whether the system was shut down normally. If the system panic'd and there is any indication the disks did not sync, rc3 does not proceed with the site defined 'init 4', which starts the Oracle databases. This has saved us twice already. Examples of rc3 and 'init 4' customization are also included in UA_DUtools.tar.Z. In addition, use of the AdvFS /usr/field utilities and/or walking the entire filesystem (via 'ls -lR / >x' or something like 'find / -atime 1 -ls >x') will identify file system corruption.
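
One simple way to build such a gate is a flag file that only the orderly shutdown path writes. The sketch below is not the rc3 customization shipped in UA_DUtools.tar.Z, and it is cruder than the check described above (it does not look for evidence of a failed disk sync). The flag path and all function names are hypothetical.

```shell
# Boot-time gate sketch: a flag file is written by the normal shutdown path
# and consumed at boot. If it is missing, the last shutdown was not clean.
CLEAN_FLAG=${CLEAN_FLAG:-/var/run/clean_shutdown}   # hypothetical path

mark_clean_shutdown() {           # call from the orderly shutdown script
    date > "$CLEAN_FLAG"
}

previous_shutdown_was_clean() {   # call early in rc3; consumes the flag
    if [ -f "$CLEAN_FLAG" ]; then
        rm -f "$CLEAN_FLAG"
        return 0
    fi
    return 1
}

start_databases_if_safe() {
    if previous_shutdown_was_clean; then
        echo "clean shutdown detected: proceeding to start databases"
        # site-specific step would go here, e.g. the 'init 4' transition
    else
        echo "unclean shutdown: holding databases down for manual checks" >&2
        return 1
    fi
}
```

Removing the flag as soon as it is read means a panic later in the same boot is also treated as unclean on the next restart.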

Also, the patch produced as a result of the 24 April event (OSF375-055) is now applied on our systems and we believe it should be effective in at least partially avoiding possible future corruption. In general, looking for and applying preventative patches to avoid panics and thereby reducing risk is "A Good Idea".

CPU EXCEPTION's and error log analysis uerf and dia

CPU EXCEPTION does not mean just the CPU board(s). More often than not it is an error logged against memory or the cache resident on the CPU board. With uerf you will see something like:

sxkac@glacier> uerf -c err,oper -o full -t s:`ua_date -uerf -7` | more
[...]
EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  100.     CPU EXCEPTION
SEQUENCE NUMBER                  3.
OPERATING SYSTEM                        DEC OSF/1
OCCURRED/LOGGED ON                      Tue Oct  8 11:13:23 1996
OCCURRED ON SYSTEM                      glacier
SYSTEM ID                 x0005000C     CPU TYPE:  DEC 7000
SYSTYPE                   x00000000
PROCESSOR COUNT                  2.
PROCESSOR WHO LOGGED      x00000000

----- UNIT INFORMATION -----

UNIT CLASS                              CPU
'uerf -Z' will do a raw record dump; DECevent provides the knowledge to interpret it. DECevent shows quite a lot more (albeit with pages of output to scan):
sxkac@glacier> dia -i cpu -o full -t s:`ua_date -uerf -7` | more
[...]
Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number             3.
Timestamp of occurrence              08-OCT-1996 11:13:23
Host name                            glacier

System type register      x0000000C  AlphaServer 8x00
Number of CPUs (mpnum)    x00000002
CPU logging event (mperr) x00000000

Event validity                    1. O/S claims event is valid
Event severity                    5. Low Priority			<==
Entry type                      100. CPU Machine Check Errors

CPU Minor class                   4. 620 System Correctable Error
[...]
TLBER                     x00440000  CORRECTABLE READ DATA ERROR	<==
                                     DATA SYNDROME 2
[...]
The previous is obviously a correctable error, which can be ignored unless such errors start occurring regularly or frequently. However, DECevent also lies:
sxkac@spike> dia -i cpu -o full -t s:`ua_date -uerf -20` | more
[...]
Logging OS                        2. Digital UNIX
System Architecture               2. Alpha
Event sequence number             2.
Timestamp of occurrence              26-SEP-1996 17:38:34
Host name                            spike

System type register      x00000009  AlphaServer 2x00
Number of CPUs (mpnum)    x00000002
CPU logging event (mperr) x00000001

Event validity                    1. O/S claims event is valid
Event severity                    1. Severe Priority			<==
Entry type                      100. CPU Machine Check Errors

CPU Minor class                   2. 660 Entry

-- ENTRY FRAME FOLLOWS --
Frame ID                  x00000022  Machine Check Frame
[...]
Machine Check Error Code  x00000202  CPU Detected Unrecoverable Error
[...]
-- ENTRY FRAME FOLLOWS --
Frame ID                  x00000008  Memory Frame

Memory Module ID          x00000003
Error Register 1          x0000000000040001				<==
                                     [Even] Error Summary
                                     [Even] EDC Corr Error		<==
[...]
The above is a correctable memory error which is erroneously reported as severe. The only way to know is to look closely and ask Digital. This (x040001 on a 2100) is a single bit correctable memory error which can be ignored unless it happens in volume or often.

As stated previously, the primary value is automatically filtering the error log to something quickly scannable for anomalies, such as:

Subject: spike ua_uerf from 25-sep-1996,00:00:00

#spike    Thu Sep 26 1996
>17:38:34  2 100 CPU EXCEPTION    
>17:38:34  3 100 CPU EXCEPTION    

#spike    Tue Oct  1 1996
>18:15:05  4 199 Bus:02 lu:20.3 R=cdisk_bbr_done:::cdisk_bbr: BBR disabled bad block: 7918698

#spike  Sat Oct  5 1996
#spike  Sat Oct  5 1996 07:48:54  26 301 SHUTDOWN|halted by root:#!> 2100A and CPU upgrade
#spike  Sat Oct  5 1996 12:21:17   0 300 STARTUP   

Summary:
  Total 2 100 CPU EXCEPTION    
  Total 1 199 Bus:02 lu:20.3 R=cdisk_bbr_done:::cdisk_bbr: BBR disabled bad block number: 7918698
  Total 1 301 SYSTEM SHUTDOWN  
  Total 1 300 SYSTEM STARTUP   

26-SEP-1996 17:38:34 spike  1. Severe Priority Memory Module ID x03 Error Reg 1 x0000000000040001
26-SEP-1996 17:38:34 spike  1. Severe Priority Memory Module ID x03 Error Reg 1 x0000000000040001

                                     cdisk_bbr: BBR disabled bad block number: 
Device Locator              x000401  Port    =   1. 
                                     Target  =   4. 
                                     LUN     =   0. 
Some sites will pump error log entries into an immediate filter for reporting. At a minimum daily summaries are "A Good Idea". In addition, monthly summaries can also provide good management information:
#spike  Sat Sep 21 1996
#spike  Sat Sep 21 1996 11:46:07  48 301 SHUTDOWN|halted by root:Starting fw 3.7 du 3.2g upgrade
#spike  Sat Sep 21 1996 13:09:26   0 300 STARTUP
#spike  Sat Sep 21 1996 13:32:47   1 301 SHUTDOWN|halted by root:Final boot DU 3.2g fw 3.7 clc3.1
#spike  Sat Sep 21 1996 13:58:02   0 300 STARTUP
#spike  Sat Oct  5 1996
#spike  Sat Oct  5 1996 07:48:54  26 301 SHUTDOWN|halted by root:#!> 2100A and CPU upgrade
#spike  Sat Oct  5 1996 12:21:17   0 300 STARTUP

Summary:
  Total 42 199 Bus:01 lu: 8.0 R=ctape_move_tape:Hard Error Detected:DEC TZ877:+
  Total 10 100 CPU EXCEPTION
  Total  3 199 Bus:01 lu: 9.1 R=changer_check_status:::Recovered error
  Total  2 199 Bus:02 lu:17.0 R=cdisk_check_sense:::Event - Unit Attention
  Total  3 301 SYSTEM SHUTDOWN
  Total  3 300 SYSTEM STARTUP
  Total  1 199 Bus:02 lu:20.3 R=cdisk_bbr_done:::cdisk_bbr: BBR disabled bad block: 8322460
  Total  1 199 Bus:02 lu:20.3 R=cdisk_bbr_done:::cdisk_bbr: BBR disabled bad block: 7918698
  Total 34 199 Bus:01 lu: 9.0 R=ctape_iodone:Soft Error Detected (rec:DEC TZ877:+
  Total  2 199 Bus:01 lu: 9.0 R=ctape_wfm:Soft Error Detected (rec:DEC TZ877:+
Sample scripts and programs supporting this are in UA_DUtools.tar.Z.
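
The rule of thumb that correctable errors can be ignored unless they occur "in volume" suggests a count threshold over the 'Total' lines of a summary. A minimal sketch follows, assuming summary lines shaped like the ones above ('Total', a count, the event type, the description). The function name flag_frequent_events and the default threshold of 10 are arbitrary choices for illustration, not values from our scripts.

```shell
# Print summary lines whose count meets or exceeds a threshold.
# Input: lines of the form "Total N TYPE DESCRIPTION" on stdin.
# The threshold is a site decision; the default of 10 is arbitrary.
flag_frequent_events() {
    threshold=${1:-10}
    # "+ 0" forces numeric comparison of the count and the limit
    awk -v limit="$threshold" '$1 == "Total" && $2 + 0 >= limit + 0'
}
```

Run against the monthly summary above with a threshold of 10, this would pass through the TZ877 hard and soft error lines and the CPU EXCEPTION line, and drop the rest.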

'doconfig' doesn't: The phantom XMI bus

After upgrading from a 7620 to 8400, we had to retain the old XMI bus cage for several months. Finally, it was removed. However, the first time we attempted a 'doconfig' without '-c HOST' we got:

*** PERFORMING KERNEL BUILD ***

A log file listing special device files is located in /dev/MAKEDEV.log
        Working....Mon Sep  9 16:12:49 AKDT 1996

*** WARNING ***
An error has occurred during system configuration.  A partial listing
of the error log file (./errs) follows:

env - COMP_HOST_ROOT=/ COMP_TARGET_ROOT=/ /bin/cc -std0 -EL -I -I. -I..
[...]
The XMI still existed in the /sys/conf/HOST file, but the kernel could not be rebuilt. The workaround was:
# doconfig -c BOGUS
[...]
# mv /sys/GLACIER	/sys/GLACIER.old
# mv /sys/BOGUS		/sys/GLACIER
# mv /sys/conf/GLACIER 	/sys/conf/GLACIER.old
# mv /sys/conf/BOGUS 	/sys/conf/GLACIER
# ed /sys/conf/GLACIER		s/BOGUS/GLACIER/
# mv /sys/conf/GLACIER.list /sys/conf/GLACIER.list.old
# mv /sys/conf/BOGUS.list   /sys/conf/GLACIER.list
# shutdown -r

'doconfig' doesn't: Missing tapes

We rarely do a 'doconfig' without '-c HOST'. Why? 'doconfig' likes to number mt* sequentially while we prefer to name tapes logically:

	mt0,mt1,mt2,mt3	are TZ877's
	mt4*		are 4mm 
	mt8*		are 3480's
	mt9		is (fortunately only one) 9-track reel
The only reason for a simple 'doconfig' is a significant bus architecture change. This also saves one from repetitive raising of maxusers; most other kernel parameters are best set in /etc/sysconfigtab.

RAID-5: is it any good?

Failure detection (and configuration management)

Is RAID-5 behind an HSZ40 any good? Almost too good: the first disk failure we had went undetected for three days. The SW800 disk cabinet is in a frequently unoccupied room. One day somebody looked in the cabinet and saw an amber light on a disk. If you implement an HSZ40 controller, you must also implement a means to monitor its console. It is another computer which serves disks. The hszterm utility (subset SWACLI11A) provides a means to do this, for example:

sxkac@nugget> sudo hszterm -f /dev/rrz17c <<!HEOF >hsz_3n.961015
> show this full
> show unit full
> show raidset full
> show device full
> !HEOF
sxkac@nugget> sudo hszterm -f /dev/rrz17c "run fmu" "show last most" >hsz_3n.961015.err
Placing variations of the above commands in a scheduled job and simple scripts to grep key lines and sdiff, i.e.:
sdiff $TODAY $YESTERDAY | grep -E ' < | > | \| ' | mailx [...]
can alert you to changes or problems.
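
The comparison step might be wrapped as below. This is a sketch only: it uses plain diff rather than the sdiff pipeline shown above, the function name report_changes is hypothetical, and it assumes the two arguments are listings already captured with hszterm on consecutive days.

```shell
# Compare today's and yesterday's saved controller listings and print only
# the lines that differ. Uses diff(1) instead of the sdiff pipeline.
# Returns 0 when nothing changed, 1 when differences were printed.
report_changes() {
    today=$1
    yesterday=$2
    if diff "$yesterday" "$today" > /dev/null; then
        return 0                          # identical: stay silent
    fi
    echo "Changes between $yesterday and $today:"
    diff "$yesterday" "$today" | grep '^[<>]'   # '<' = yesterday, '>' = today
    return 1
}
```

Because the function is silent and returns 0 when nothing changed, a scheduled job can send mail only when the return code is non-zero.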

However, not all HSZ errors are logged via FMU (many are not) and very few HSZ errors actually get logged to the host errlog. We connect HSZ consoles, along with all host system serial consoles, to POLYCENTER Console Manager so we have a means to play back and review any events which occurred. This has proved invaluable a number of times in diagnosing and reconstructing problems.

Performance (Raidsets on HSZ40)

Read performance has been fine. Aside from the recurrent bad battery problems, write performance on raidsets using write-back cache has been adequate most of the time. However, we have noticed several times where Oracle has bogged down. We traced one to very high (unusually so) write activity against TEMP table space and we traced another to the db writer being bottlenecked. What exactly triggered these is still under investigation. Our suspicion is there is some activity occurring which should not be occurring during prime hours; tracking it is a challenge.

Watching performance of HSZ disks is also a challenge. iostat is effectively useless, so we wrote a modified version that interprets the device names:

sxkac@glacier> uak_iosts 5 3
      tty    rz25    rz26    rz27    rz58    rz59    rz60   rzb57   rzc57     cpu
 tin tout bps tps bps tps bps tps bps tps bps tps bps tps bps tps bps tps  us ni sy id
   9 1106  46   3   9   0 106   4  26   0  13   0  56   3  18   0  97   4  16  0 10 74
  23 3927          13   2  19   2 593  68  40   1  61   8          26   3  60  0 16 24
  20 3387                  40   3 408  40  38   1  77   4          46   4  74  0 12 14
Alan Rollow's monitor utility is also an excellent means to monitor disk IO on-line.

Software destruction (etc.)

In the last 18 months we have fully restored large Oracle databases five times, restored root disks three times, and done various other partial restores (AdvFS domains, individual Oracle datafiles, index rebuilds, miscellaneous files). Events causing these have been AdvFS panics, other panics, apparent Oracle problems, bad HSZ batteries, and other hardware problems. RAID-5 has protected us from loss due to a disk failure (we have had three), but is ineffective for faulty software or ineffective hardware recovery which are by all appearances much greater risks.

Run-away processes (oh no, what happened to my CPU?)

Several sites, including us, have at times had problems with run-away processes consuming the majority of CPU resources. Often they have been orphaned processes which loop after a hangup. The most common one for us was a menu application coded "improperly" to not check for errors. It would issue an fgets() and not check for error, then re-issue the fgets() repetitively looking for an 'exit' or other valid command. Since the terminal (telnet connection) was disconnected, the loop was endless, brutally efficient (tight and fast), and made worse with the IO attempts in consuming 'system' cpu. The following code segment remedied the application:

[...]
  if (!fgets(menu_selection, MAXINPUT, stdin))
  {
    if (ferror(stdin))  housekeeping_exit (0, 0);
    eof++;
    if (2 > eof)
    {
        fprintf(stderr, "\t\007Another  will exit!\007");
        fflush (stderr);
        sleep (5);      /* ensure pause before screen re-paint */
    }
    else                housekeeping_exit (0, 0);
  }
  else  eof = 0;
[...]
Abnormal telnet disconnects are a common problem with long-distance communications in places like Alaska. The number of looping orphans was a very small fraction of the abnormal disconnects. By appearances, in some situations the HUP signal would get lost triggering this problem. Digital's response was there was likely a faulty pc-telnet implementation causing this, but we saw it in at least three common telnet implementations (on both pc's and mac's). We were convinced there were conditions where HUP's were lost as we occasionally saw similar orphans from Digital X-terminal applications. We have not seen any since v3.2d-1.
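
Looping orphans of this kind can also be hunted from outside the application. The sketch below filters a process listing for processes whose parent is init (PPID 1) and whose accumulated CPU time exceeds a limit. It assumes input columns of PID, PPID, TIME, and command, with TIME in [[DD-]HH:]MM:SS form; the function name find_orphan_cpu_hogs is hypothetical. Legitimate daemons also have PPID 1, so the output is a list of candidates to inspect, not a kill list.

```shell
# List processes adopted by init (PPID 1) whose CPU time is at least LIMIT
# seconds. Reads "PID PPID TIME COMMAND" lines on stdin, for example from:
#   ps -e -o pid,ppid,time,comm | find_orphan_cpu_hogs 600
find_orphan_cpu_hogs() {
    limit=${1:-600}
    awk -v limit="$limit" '
        # Convert [[DD-]HH:]MM:SS to seconds.
        function cpu_seconds(t,    parts, n, i, days, secs) {
            days = 0
            if (index(t, "-")) {          # optional leading day count
                split(t, parts, "-")
                days = parts[1]
                t = parts[2]
            }
            n = split(t, parts, ":")
            secs = 0
            for (i = 1; i <= n; i++)
                secs = secs * 60 + parts[i]
            return days * 86400 + secs
        }
        # Skip the header (non-numeric PID); keep orphans over the limit.
        $1 ~ /^[0-9]+$/ && $2 == 1 && cpu_seconds($3) >= limit + 0 {
            print $1, $3, $4
        }
    '
}
```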

One method to combat this problem is to add soft CPU limits to /etc/profile (korn or bourne, use /etc/csh.login and 'limit' for csh):

ulimit -St 600     # soft limit - limit process CPU time to 10 min
If you choose to do this, be sure to *NOT* set it for root, Oracle, or other userids which are expected to have large or unlimited consumption, otherwise you will find things terminating due to signal 24.
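
A profile fragment that applies the limit selectively might look like the following sketch. The exempt list (root and oracle) and the function names are assumptions for illustration; a real list would name every server-type userid on the system.

```shell
# Fragment in the style of /etc/profile: apply a soft CPU limit to ordinary
# logins only. The exempt list below is a site-specific assumption.
is_exempt_user() {
    case "$1" in
        root|oracle) return 0 ;;     # hypothetical exempt accounts
        *)           return 1 ;;
    esac
}

apply_cpu_limit() {
    # Only non-exempt users get the 10 minute soft limit.
    is_exempt_user "$1" || ulimit -St 600
}

# In /etc/profile this might be invoked as:
#   apply_cpu_limit "$LOGNAME"
```

Keeping the exemption test in its own function makes it easy to reuse for other per-user settings in the same file.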

Another common limit imposed on korn shell users is:

export  TMOUT=1800 # 30 minute timeout
This is effective only for the korn shell itself, not for programs executing under it. As with ulimit, be sure to *NOT* set it for server type userids. For Oracle in particular, you will get SQL*Net disconnects if this is set and you are running MTS (multi-threaded server).

To 4.0 or not to 4.0?

Planned change and configuration management is the best way to avoid problems. Sound practices like reducing the number of concurrent changes so cause and effect for any subsequent problems are more obvious seem like common sense. However, supporting multiple systems and having a number of items to update (operating systems, preventative or remedial patches, databases, applications, layered products, firmware at multiple levels, hardware, network adapters, disk controllers, etc.) as well as increasing demands for 24x7 availability for parts or all of the configuration lead to strong pressure to cut corners and merge activities. Be careful, very careful.

Problems are like flowers, they like to occur in bunches (particularly near changes):

  (A)	17 Sep 1995	AdvFS panic corrupts /usr (/var,/)
	18 Sep 1995	Power failure
	18 Sep 1995	HSZ40 battery bad, raidset corrupted

  (B)	31 Jan 1996	2100 cannot load KZPSA firmware off bus-1 CD
	02 Feb 1996	7620->8400,  DU v3.2c->v3.2d-1, fw 3.4 cd
	05 Feb 1996	2100/8400 lose HSZ disks due to KZPSA A09
	12 Feb 1996	2100 C-bus loops... backplane replaced
	15 Feb 1996	8400 simple-lock panic corrupts Oracle

  (C)	19 Aug 1996	2100 KZPSA fails
	26 Aug 1996	HSZ40 battery fails, 2100 crashes
	26 Aug 1996	2100 IO module fails preventing reboot
In the case of (B) in particular, cause and effect relationships were initially indecipherable as different problems overlaid each other masking the symptoms.

Reduce your risk by controlling the changes to the environment. Do firmware updates in advance of software updates when possible and stage upgrades in phases whenever it is reasonable. Small changes are less risky than big changes.

We went to v3.2d-1 shortly after it was available. While a "fix only" release should be relatively safe, we heard on several support calls that patches available for v3.2c had not yet been ported to v3.2d-1. Digital is improving on cycling patches to all supported releases, but don't be caught as the first one with a new release.

In going from DU v3.0 to v3.2c, we found POLYCENTER Performance Manager inoperable pending a patch. Even preventative patches can cause problems: one applied to v3.2d-1 broke POLYCENTER HSM, requiring a reinstall of both HSM and SCSI-CAM (CLC). If one read the early release information for DU v4.0, there were many layered products initially unsupported under v4.0. Being early to a new release incurs risk of something not yet being there.

Digital Unix v4.0 is a major upgrade. Our strategy is to let other non-production systems find the initial bugs. General rule of thumb is to wait at least 6 months after general release for a production system. Already v4.0a is available (and is the migration path for v3.2g which is our current version). Besides, the current level of our Oracle database and application software package are not vendor validated against v4.x. We have one 3000 on v4.0a, we will follow with another 3000 and the test/development 2100-4/200 sometime in first quarter of 1997 as staff resources permit.


Vendor Management

Is "Technical Support" an oxymoron?

The Mushroom Theory:

Customers are mushrooms... keep them in the dark,
toss in some manure, and they will flourish?
Certainly not all Digital and Oracle support technicians reflect this corporate attitude; there are some good ones. Unfortunately, the corporate attitude is sometimes infectious and it appears that some engineers or engineering managers may treat the support centers in this manner at times.

Where is the technical information?
(particularly the preventative kind)

Have you heard the echo "this is a known problem"? Are you asking why you feel like you are the last one to know?

While it may be "bad marketing" to acknowledge problems exist, it is excremental support to not let your customers know so they can avoid them.

Don't settle for just getting it working again, insist on root cause analysis and preventative measures (for you and for your neighbors).

Pursue the problems behind the problems.
Yes, if a system panics you do need to stop the panics to avoid the outages. However, if the panic causes further damage (either actual or extending the outage for data validation), there is a second problem to pursue.

Make them answer:

(Network): I'm mad as hell and I'm not going to take this any more.

What you can do about it

Proactive system management

Document, Document, Document

Network

There are many capable and knowledgeable people "out there" and many ways of getting in contact with them. The following are some good resources.

What you can do about it

Maintain your sense of humor