Gordon Schumacher, sxrgs@alaska.edu
David DeWolfe, sxdjd@alaska.edu
Kurt Carlson, sxkac@alaska.edu
University of Alaska Statewide Office of Information Services, Technical Services
910 Yukon Drive
Fairbanks, Alaska 99775-6200
ORA-0600, HSZ battery failures, AdvFS panics, CPU EXCEPTION, bus reset, oracle recovery failed? Been there, done that. This session will share some practical experience reacting to and avoiding disasters in a Digital Unix and Oracle7 environment. Recovery techniques, backup strategies, firmware and error management, patch management, configuration management, and vendor management will be discussed.
The University of Alaska consists of three major campuses located in Fairbanks, Anchorage and Juneau, plus 11 outlying campuses serving regional communities. The University currently serves approximately 33,000 students. The University is also a research institution and supports community service, educational, and cooperative extension programs across the state employing approximately 5000.
The University's administrative computing applications provide support for Financial, Human Resource and Student Information. Programming, technical and operational support for these applications is provided by the Statewide Office of Information Services and the Statewide Office of Network Services.
Since 1985, the University's administrative applications have been running on an IBM mainframe platform running MVS and using IDMS as the database management system. On that platform, Finance, Human Resource and the Student Information Systems run as three separate applications in dedicated databases.
For a decade and through various migrations from one operating system to the next (MVS/SP->MVS/XA->MVS/ESA), and various versions of IDMS and of the application code, this environment has provided us with a secure, robust and manageable environment (i.e., IBM's oft mentioned RAS: Reliability / Availability / Serviceability is not a meaningless acronym).
Our current "legacy system" is an IBM3090-150E running MVS/ESA 4.3, IDMS 12.0, and various layered products from IBM and third party vendors.
In 1993 the decision was made to migrate to a client/server computing environment and to implement the "Banner" product from Systems & Computer Technologies (SCT) to provide an integrated application with support for all administrative statewide computing systems. Digital Equipment's Alpha product line running Digital UNIX (then OSF/1) was chosen for the operating system platform and Oracle RDBMS was chosen for the database management system. Implementation began in the summer of 1994 with the first application (Finance) "live" in July 1995. Human Resources comes live at the close of 1996 with the final application, the Student Information system, coming live after spring semester registration in January 1997.
Alpha 8400 5/440 (glacier)
4 CPU's and 6GB memory, 110GB behind HSZ40 controllers;
Digital UNIX v3.2g, AdvFS, NSR, Oracle 7.1.4/7.2.3, etc.;
Running the SCT Banner Finance, Human Resource and Student Information.

Alpha 2100A 5/300 (spike)
3 CPU's and 2GB memory, 80GB+ behind HSZ40 controllers;
Digital UNIX v3.2g, AdvFS, NSR, Oracle 7.1.4/7.2.3, etc.;
Running an Oracle instance for data warehousing and decision support.

Alpha 2100 4/200 (nugget)
1 CPU and 768mb memory, 20GB+ behind HSZ40 controller;
Digital UNIX v3.2g, AdvFS, NSR, Oracle 7.1.4/7.2.3, etc.;
Running our test Oracle instance and supporting applications development.

Two 3000-300LX systems used strictly for system and database software testing.
The speakers all work in Technical Services for the University of Alaska Statewide Office of Information Technology. Kurt Carlson has 20 years data processing experience as a systems programmer supporting GCOS, VAX/VMS, IBM/MVS and Digital UNIX. David DeWolfe has 15 years DP experience and provides database administration for IDMS and Oracle as well as systems support for Digital UNIX. Gordon Schumacher has 18 years DP experience, 15 of them as a systems programmer supporting IBM/MVS, IBM/VM and Digital UNIX operating systems.
Oracle support was called and a priority 1 TAR (Technical Assistance Request) was opened.
In the hour and a half that we waited to hear back from Oracle we formulated a general game plan:
Oracle trace files were email'd to Oracle US support.
Support instructed us to add the following undocumented init.ora parms to the FINP init.ora:
event="10210 trace name context forever, level 10"
event="10231 trace name context forever, level 10"
event="10211 trace name context forever, level 10"
event="10015 trace name context forever, level 10"
_db_block_cache_protect=true
_db_block_compute_checksums=true
These are discussed in the Oracle Press Backup and Recovery Handbook.
Support also informed us that we had corruption in 2 datafiles, #64 and #92. I informed them that there were only 33 datafiles associated with this database. They called back a while later to inform us that they had incorrectly converted addresses in the trace files, and that the actual corrupted datafiles were #16 and #23.
Support then instructed us to alter the two datafiles off-line and attempt to open the database. This failed with an error 376 as the database was trying to recover transactions in objects in both of the off-line datafiles.
We were given two options at this point:
We selected the second option after being informed that the first option could take several days, if not weeks.
Somewhere around here the TAR was transferred to the Australian support center.
Since we were going to do incomplete recovery we had to restore all 33 datafiles. This took approximately 3 hours. At about 10:30 PM on the 15th we initiated incomplete recovery:
recover database until time "1996-02-15:11:45:00"
At 11:00 PM another DBA came in to monitor the recovery while the rest of us went home to get some sleep. At 12:30 AM on the 16th I was called and informed that the recovery had aborted with an (oh no!) ORA-0600 while applying archive log sequence #2101. Sequence #2101 was the active on-line redo log when Glacier crashed at 3:38 AM on the 15th.
I came back in and called Oracle UK support, to whom the TAR had been transferred.
UK support suggested doing point-in-time recovery to prior to the 3:38 AM system crash on the 15th. We informed them that that would result in the loss of about 5 hours of financial transactions, and that only as a *last* resort would we consider that. We firmly requested that other options be explored.
I prepared to restore the 33 datafiles, and informed support that it would take 3 hours. Support said they would call back.
UK support called back and informed us that they had identified the object # and datafile that caused the recovery to fail. We could not open the database to find out what object #4173 was, but it resided in an index "data" file! This was the break we were looking for.
UK support indicated that we might be able to get past our recovery ORA-0600 by doing the following:
1. perform cancel-based recovery up until and including archive log sequence #2100.
2. alter the datafiles associated with the index tablespace above off-line.
3. open the database without resetting the log files and determine what object #4173 was.
4. drop object #4173.
5. alter the datafiles on-line that were altered off-line above.
6. shutdown the database.
7. startup mount the database.
8. initiate complete recovery.
9. take a backup.
10. recreate the index that was dropped above (#4173).
UK support indicated that they did not know if this would work or not, but they, and we, thought it was worth a try.
While restoring the datafiles, the TAR was transferred back to US support in Florida. They promptly questioned the plan that UK support had formulated. I *demanded* that US support contact the UK analysts who had worked on the TAR, and that they discuss/modify/verify the planned recovery attempt.
US support called back a short time later with a modified plan. We would:
1. perform cancel-based recovery up until and including archive log sequence #2100.
2. shutdown the database and do a backup (our idea).
3. startup mount the database.
4. alter the datafiles associated with the index tablespace off-line.
5. initiate complete recovery, which only recovers on-line datafiles: "recover database"
6. open the database.
7. drop the index tablespace (drop tablespace including contents).
8. recreate the index tablespace.
9. recreate the indexes in that tablespace.
The cancel-based recovery succeeded, so we shut down the database and did a backup. We then started the database (startup mount), and altered the datafiles associated with the index tablespace off-line. Next we attempted complete database recovery, which was possible since we had not reset the redo logs after performing the incomplete database recovery:
"recover database"
Since complete recovery only recovers datafiles that are on-line, archive log sequence #2101 was applied successfully, and the complete recovery finished successfully at 10:11 AM on the 16th.
At this point US support instructed us to open the database. This resulted in the following:
ORACLE instance FINP (pid = 15) Error 376 encountered while recovering transaction (16,131) on object 4173.
We shut down the database, and US support instructed us to add "_offline_" to the "rollback_segments" parameter in the init.ora. We attempted to open the database again and received several messages regarding the datafiles of a tablespace being off-line, and SMON messages regarding the rollback segments. The database was open, however, and we proceeded to drop the index tablespace in question:
"drop tablespace finlindex including contents"
We shut the database down and removed "_offline_" from the "rollback_segments" parameter in the init.ora. The database was then started and opened successfully. We again shut the database down, removed all the undocumented init.ora parameters, and changed MTS_SERVERS back to 70. This next startup would be a "normal" startup.
The database was then successfully started and opened, and then shut down.
After the backup we started the database and recreated the index tablespace that we had dropped. We then recreated all of the indexes in that tablespace. The SQL to recreate the indexes was generated by querying our pre-production database which, database object wise, is an exact replica of our production database. The index recreation took approximately 6 hours, as some of the indexes in question were on tables of greater than 10 million rows.
The database was backed up again as part of our regular Friday night backups, and as of Saturday the 17th at 6:00 PM the application was deemed "OK" by the application owner.
The entire recovery effort took about 55 hours, with the first 31 hours being non-stop, around-the-clock.
We have had a few system crashes.
Since our faith in Digital Unix and Oracle has been severely shaken, we now do the following after a system crash:
EXP-00014: error on row 102232 of table TBRMISD
EXP-00222: System error message 2
EXP-00008: ORACLE error 1578 encountered
ORA-01578: ORACLE data block corrupted (file # 46, block # 7818)
ORA-01110: data file 46: '/u08/ORACLE/FINP/data/datasat03FINP.dbf'
Since we now know enough to take these precautions (exporting the entire database) we can specifically identify corrupted objects and perform recovery on the individual datafiles in which the corrupted objects reside. While the full export takes approximately 2 hours, our last database recovery took less than 1/2 hour. We didn't even bother to call Oracle support on this one.
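As an illustration of how the export log can be scanned for corrupted blocks, here is a minimal Python sketch. It is not part of our tools. It assumes the log contains ORA-01578 and ORA-01110 lines in exactly the format shown above; the function name find_corrupt_blocks and the SAMPLE_LOG variable are hypothetical.

```python
import re

# Patterns follow the ORA-01578 / ORA-01110 line formats shown in the
# export log above; other Oracle releases may word these differently.
CORRUPT_BLOCK = re.compile(r"ORA-01578:.*?\(file # (\d+), block # (\d+)\)")
DATAFILE = re.compile(r"ORA-01110: data file (\d+): '([^']+)'")

def find_corrupt_blocks(log_text):
    """Return (file_number, block_number, datafile_path_or_None) tuples."""
    # Map datafile number -> path from the ORA-01110 lines.
    paths = {int(num): path for num, path in DATAFILE.findall(log_text)}
    return [(int(f), int(b), paths.get(int(f)))
            for f, b in CORRUPT_BLOCK.findall(log_text)]

SAMPLE_LOG = "\n".join([
    "EXP-00014: error on row 102232 of table TBRMISD",
    "EXP-00222: System error message 2",
    "EXP-00008: ORACLE error 1578 encountered",
    "ORA-01578: ORACLE data block corrupted (file # 46, block # 7818)",
    "ORA-01110: data file 46: '/u08/ORACLE/FINP/data/datasat03FINP.dbf'",
])

for file_no, block_no, path in find_corrupt_blocks(SAMPLE_LOG):
    print(file_no, block_no, path)
```

Knowing the datafile number and path is what lets recovery be limited to the affected datafiles rather than the whole database.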
We opened a TAR with Oracle in an attempt to determine if Oracle did synchronous writes, read-after-write, or used other techniques to avoid this type of corruption and to gather any other info that would help us in determining what was happening. Oracle's response was to upgrade the database to a more current release.
Kernel 'short read'
Symptoms:
P00>>>b
[...]
OSF boot - Thu Mar 31 01:10:53 EST 1994
Loading vmunix ...
Current PAL Revision <0x10000000010111>
Switching to OSF PALcode Succeeded
New PAL Revision <0x10000000020115>
Loading into KSEG Address Space
Sizes:
text = 3261360
short read (text)      <-- Can also get 'short read (data)'
halted CPU 0
halt code = 5
HALT instruction executed
PC = 2001003c
boot failure
Environments where problem occurred:
Digital Unix v3.0, v3.2c on 7620; boot disk AdvFS RZ28B raidset behind HSZ40
Digital Unix v3.2d-1 on 8400; boot disk AdvFS RZ29B raidset behind HSZ40
Likely cause: Kernel (/vmunix) fragmentation
Detection:
# /sbin/showfile /*vmunix*
Id Vol PgSz Pages XtntType Segs SegSz Log Perf File
ed4.8001 1 16 920 simple ** ** off 100% genvmunix
21c.8017 1 16 1011 simple ** ** off 33% vmunix
11ce.8001 1 16 1017 simple ** ** off 100% vmunix.Pre:CSC0626
Recovery:
Hint: always grow root partition beyond default 64mb to retain multiple kernels.
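The detection step lends itself to a periodic check. The following Python sketch parses showfile output in the column layout shown above and reports files whose Perf value is below 100%. It assumes, as the listing suggests, that a Perf below 100% marks a fragmented file. The function name fragmented_files and the sample variable are hypothetical and not part of UA_DUtools.

```python
def fragmented_files(showfile_output, threshold=100):
    """Return (file_name, perf_percent) pairs with Perf below threshold.

    Assumes the showfile layout above: Perf is the next-to-last
    column (for example '33%') and the file name is the last column.
    """
    flagged = []
    for line in showfile_output.splitlines():
        fields = line.split()
        # Skip blank lines and the 'Id Vol PgSz ...' header line.
        if len(fields) < 10 or fields[0] == "Id":
            continue
        perf = fields[-2].rstrip("%")
        if perf.isdigit() and int(perf) < threshold:
            flagged.append((fields[-1], int(perf)))
    return flagged

SAMPLE_SHOWFILE = "\n".join([
    "Id Vol PgSz Pages XtntType Segs SegSz Log Perf File",
    "ed4.8001 1 16 920 simple ** ** off 100% genvmunix",
    "21c.8017 1 16 1011 simple ** ** off 33% vmunix",
    "11ce.8001 1 16 1017 simple ** ** off 100% vmunix.Pre:CSC0626",
])

for name, perf in fragmented_files(SAMPLE_SHOWFILE):
    print("fragmented kernel:", name, "perf", perf)
```

In the listing above this flags only vmunix, at 33%.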
The 2100 affectionately known as Lemon (a.k.a., 'spike')
Chronology:
08/95 2100-5/250 received, named 'spike', 2*cpu 2*512mb
09/95 second disk installed, was D.O.A.; inserting that disk also fried an existing boot disk
10/95 KZPSA failed and replaced
01/96 KZPSA failed and replaced
01/96 VGA (compact qvision) adapter replaced
02/96 KZPSA failed and replaced
02/96 memory module failed and replaced
02/96 backplane replaced (believed cause of other problems)
08/96 KZPSA failed and replaced
08/96 IO module replaced (likely bad)
08/96 backplane replaced (precautionary)
Between 9/95 and 2/96 there were a great number of unusual problems with Spike which caused far more confusion than actual outages. Phantom devices would show up with a hardware initialization or boot, devices would disappear, graphics console would randomly be unusable after booting to level 3, etc. Many of these problems can be blamed on a bad backplane which was replaced in February. Prior to the backplane replacement, Spike started getting infinite loops during hardware init:
P00>>>starting console on CPU 0
breakpoint at PC 128480 desired, XDELTA not loaded
interrupt through vector 660 on CPU 0
CPU detected C-bus error
*** unexpected interrupt through vector 00000066
[...]
One of these occurred immediately after the backplane was replaced, they then ceased. Digital identified a known firmware problem in the initial KZPSA firmware from the IO module during initialization as the cause.
In addition, a bad backplane could easily have contributed to the high failure rate (many fried KZPSA's). The IO module may also have been marginal in February: a problem we saw in August (looping C-bus errors) was identical to the problem we saw in February. However, in February the problem was transient. In August the problem was hard, and the system never completed initialization despite repeated resets.
Spike, also known as lemon, is now a 2100A with 3*5/300 cpu's and 4*512mb modules. Our other 2100-4/200 has been stable and solid since installation in June 1994. We have not had fried KZPSA's on our other systems.
So, you think you have battery backup?
HSZ40 Batteries and write-back cache
HSZ40 controllers with write-back cache have battery backup to protect in-flight data from loss during a power failure. Our first and only power failure on 18 September 1995 resulted in an unbootable system. Why? One raidset had unflushed cache and the battery was dead. Resolution was to restore the raidset and the other two raidsets containing Oracle data for that instance and roll-forward. Digital's response on this was that there was a known bad batch of HSZ batteries. They suggested for avoidance we consider running HSZ's in dual redundant pairs.
This was HSZ40 v2.0; a subsequent release (v2.5 or v2.7) will disable disk access when battery low conditions are detected. This will prevent possible data loss on power failure.
HSZ40 Dual Redundancy
On 29 May 1996 our 8400 started generating massive AdvFS errors to the console until the system finally crashed. What happened? One of the HSZ's of a dual redundant pair detected a bad battery and disabled disk access to avoid corruption. What didn't happen? The HSZ did not "fail" so its redundant partner could assume the load... so the system panic'd.
On 26 August 1996 our 2100 'lemon' system had a similar problem which also led to a crash. The condition went undetected because that particular member of the HSZ dual-redundant pair was not connected to console manager, so the HSZ message was missed. A subsequent problem with a bad IO module for the 2100 prevented the system from booting.
Rumor has it, a new version of HSZ40 firmware (v3.0) will cause dual-redundant failover under a bad battery condition.
We presently run seven HSZ40's, three dual redundant pairs and one single HSZ, on three different systems. While the configuration options and performance of the HSZ40 controllers have been very good, the three separate battery problems are a concern. Rumor has it that batteries manufactured after 12/95 are "good"; prior to that it appears the batteries had a functional life of 12 to 18 months before failure. Also, Digital is now actively monitoring our batteries (we think).
Other Places with Batteries?
Prestoserve IO caching modules, for one. We sent ours back to Digital as not offering much, typically, under AdvFS and being another possible weak point during a system problem. Plus, we have HSZ cache a little further down the bus. Note prestoserve is not supported for DECsafe ASE at all... ASE was once in our future but was abandoned as more risk than benefit.
AdvFS panics Prevention (patches in general)
If you ran AdvFS under Digital Unix v3.0 you likely experienced a large number of AdvFS panics. Digital said, "Everything will be wonderful under v3.2b (or v3.2c)". Well, things were better under v3.2c and better still under v3.2d-1. However, we still had AdvFS panics. Each time we called one in we were told it was a known problem and a patch was provided. The third time this occurred we finally asked the obvious question:
How the devil do we get these patches on a pro-active basis before the system panics and our users get hostile?
A support center manager told us we should acquire patches.tar.Z on a regular basis and apply the AdvFS consolidated patch. Since we started aggressively applying preventative patches last May, we have had no software induced panics. We will routinely apply the AdvFS consolidated patch, any patch identified as "potential security vulnerability", and any patch relating to a potential panic which indicates a configuration similar to ours.
Have you ever heard of this service? Do you know where to find this file? Do you know how to deconsolidate the file and apply patches selectively?
To the best of our knowledge, this is not an official Digital service and is subject to change (presumably enhancement) without notice. You can find tools to automatically pull and deconsolidate the file under anonymous ftp:
ftp.alaska.edu:/pub/sois/UA_DUtools.README
ftp.alaska.edu:/pub/sois/UA_DUtools.tar.Z
These tools have been used for v3.2c, v3.2d-1, v3.2g, v4.0, and v4.0a.
Plus, starting 30 September 1996 the consolidated README for each osfv* version started appearing in the same directory as patches.tar.Z so you can grab just the README to see if you need or want to grab the whole patch kit. It is very useful to have this file on-site to grep for strings of problems you may encounter.
As of mid-October (through the 02 October v3.2g consolidated file),
we had the following patches applied to v3.2g:
sxkac@nugget> PATCHES
+++ OSF375-047 install: sxkac ran ./OSF375-047/Install on Sat Sep 21 10:10:35 1996
+++ OSF375-049 install: sxkac ran ./OSF375-049/Install on Sat Sep 21 10:11:28 1996
+++ OSF375-350222 install: sxkac ran ./OSF375-350222/Install on Sat Sep 21 10:12:03 1996
+++ OSF375-350245 install: sxkac ran ./OSF375-350245/Install on Sat Sep 21 10:12:29 1996
+++ C960424-5207v32g install: sxkac ran ./C960424-5207v32g/Install on Sat Sep 21 10:13:50
+++ OSF375-050 install: sxrmh1 ran ./OSF375-050/Install on Wed Oct 16 14:44:15 1996
+++ OSF375-052 install: sxrmh1 ran ./OSF375-052/Install on Wed Oct 16 14:44:40 1996
+++ OSF375-055 install: sxrmh1 ran ./OSF375-055UA/Install on Wed Oct 16 14:45:02 1996
+++ OSF375-350269 install: sxrmh1 ran ./OSF375-350269/Install on Wed Oct 16 14:45:24 1996
+++ OSF375-370034 install: sxrmh1 ran ./OSF375-370034/Install on Wed Oct 16 14:45:35 1996
You will notice we have a means to track what patches were applied (including when and by whom).
We suspect Digital may eventually provide this type of capability using setld for patch application.
Under v3.2d-1 we had the following patches applied:
C960215-4398 C960424-4096 C960424-5207 C960424-5207_2
OSF360-012 OSF360-020 OSF360-023 OSF360-025 OSF360-027 OSF360-030
OSF360-033 OSF360-039 OSF360-041 OSF360-043 OSF360-044 OSF360-046
OSF360-350056 OSF360-350061 OSF360-350079 OSF360-350084
OSF360-350090 OSF360-350096 OSF360-350102 OSF360-350142
OSF360-350145 OSF360-350152 OSF360-350154 OSF360-350164
OSF360-350177 OSF360-350178 OSF360-350182 OSF360-350183
OSF360-350184 OSF360-350186 OSF360-350188 OSF360-350197
OSF360-350200 OSF360-350201 OSF360-350203 OSF360-350205
OSF360-350206 OSF360-350211 OSF360-350218 OSF360-350222
OSF360-350223 OSF360-350234 OSF360-350245 OSF360X-350015
Four were custom patches (three were variations of cam_tape.o).
Of the others, all but two were incorporated into v3.2g.
A patch *is* a change.
There is some risk a patch will break something as well as fix something.
By appearances, with only two patches not being incorporated in v3.2g,
the success rate (quality) of patches making it to the consolidated file is very good even if the advertisement,
distribution, and application tools for it are poor.
Digital also does custom patches for specific customer problems,
these do not make the consolidated file (likely because they get a less thorough quality review).
We have yet to find a means to tap into the knowledge which generates those patches
to be able to recognize if we have any similar problems.
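The "+++" install log format in the PATCHES listing above is simple enough to summarize mechanically. The Python sketch below is illustrative only and is not the tool in UA_DUtools.tar.Z. It assumes each entry has the form "+++ <patch> install: <user> ran <script> on <date>"; the function names parse_patch_log and patches_by_user are hypothetical.

```python
import re
from collections import defaultdict

# One log entry, following the '+++ ... install: ...' lines shown above.
ENTRY = re.compile(r"^\+\+\+ (\S+) install: (\S+) ran (\S+) on (.+)$")

def parse_patch_log(log_text):
    """Return one dict per log entry: patch, user, script, when."""
    entries = []
    for line in log_text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            patch, user, script, when = match.groups()
            entries.append({"patch": patch, "user": user,
                            "script": script, "when": when})
    return entries

def patches_by_user(entries):
    """Group patch IDs by the user who installed them."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry["user"]].append(entry["patch"])
    return dict(grouped)

SAMPLE_PATCH_LOG = "\n".join([
    "+++ OSF375-047 install: sxkac ran ./OSF375-047/Install on Sat Sep 21 10:10:35 1996",
    "+++ OSF375-049 install: sxkac ran ./OSF375-049/Install on Sat Sep 21 10:11:28 1996",
    "+++ OSF375-050 install: sxrmh1 ran ./OSF375-050/Install on Wed Oct 16 14:44:15 1996",
])

print(patches_by_user(parse_patch_log(SAMPLE_PATCH_LOG)))
```

The date is kept as an unparsed string because at least one entry in our log is truncated before the year.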
For a given release we see a new consolidated file showing up every couple weeks:
Subject: atlanta_ftp osfv32g: Oct3osfv32g /PATCH/patches/atlanta/osfv32g_961010
PATCH ID: OSF375-052 SUBSET(s): OSFADVFSBIN375 OSFADVFS350 OSFADVFSBIN350
sdiff k osfv32g_961010 (Oct3osfv32g) osfv32g_961003 (Sep30osfv32g):
PATCH ID: OSF375X-350023 SUBSET(s): OSFX11350 <
PATCH ID: OSF375-052 SUBSET(s): OSFADVFSBIN375 OSFADVFS350 | PATCH ID: OSF375-047
PATCH ID: OSF375-050 SUBSET(s): OSFBIN375 OSFBIN350 OSFBIN | PATCH ID: OSF375-049
PATCH ID: OSF375-053 SUBSET(s): OSFBIN375 <
PATCH ID: OSF375-370033 SUBSET(s): OSFHWBIN350 OSFHWBIN375 <
PATCH ID: OSF375-370034 SUBSET(s): OSFHWBIN375 <
PATCH ID: OSF375-350269 SUBSET(s): OSFBIN350 <
Typically there are 3 to 12 new or superseded patches in each new release.
We review the new patches and schedule application across our systems for those which are applicable.
The review is very quick once you are current.
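The comparison of two consolidated README files can be sketched in a few lines of Python. This is an illustration, not the script that produces our sdiff messages. It assumes each patch in a README is introduced by a line beginning with "PATCH ID:", as in the sdiff output above; the function names patch_ids and compare_readmes are hypothetical.

```python
import re

# Assumes each patch entry starts with 'PATCH ID: <id>' on its own line.
PATCH_ID = re.compile(r"^\s*PATCH ID:\s*(\S+)", re.MULTILINE)

def patch_ids(readme_text):
    """Return the set of patch IDs named in a consolidated README."""
    return set(PATCH_ID.findall(readme_text))

def compare_readmes(old_text, new_text):
    """Return (added, dropped) patch ID lists, each sorted."""
    old_ids, new_ids = patch_ids(old_text), patch_ids(new_text)
    return sorted(new_ids - old_ids), sorted(old_ids - new_ids)

OLD_README = "\n".join([
    "PATCH ID: OSF375-047 SUBSET(s): OSFADVFSBIN375",
    "PATCH ID: OSF375-049 SUBSET(s): OSFBIN375",
])
NEW_README = "\n".join([
    "PATCH ID: OSF375-052 SUBSET(s): OSFADVFSBIN375 OSFADVFS350 OSFADVFSBIN350",
    "PATCH ID: OSF375-050 SUBSET(s): OSFBIN375 OSFBIN350",
    "PATCH ID: OSF375-053 SUBSET(s): OSFBIN375",
])

added, dropped = compare_readmes(OLD_README, NEW_README)
print("new:", added)
print("no longer listed:", dropped)
```

A patch that is no longer listed has usually been superseded, so the two lists together give the review set for that release. The SUBSET(s) values in the two sample READMEs are abbreviated for the example.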
AdvFS panics Recovery
Well, unfortunately we once lost /usr, which happens to include /var->/usr/var and /nsr->/var/nsr, in one of those AdvFS panics. Recovery from this means rebuilding the root disk and recovering NSR. A checklist of this process is also under UA_DUtools.tar.Z. On the positive side, it was reassuring to discover our backup strategies and recovery skills were indeed adequate.
In addition, for corruption where domains#filesets are still mountable there are the relatively well known, albeit unsupported and undocumented, AdvFS utilities under /usr/field:
/usr/field/msfsck
This is the AdvFS bitfile-subsystem metadata structure checker. It verifies low-level meta-structures like the BMT, storage bitmap, and tag directories.
The file domain must be inactive to run msfsck. You also need at least one mounted fileset (this is because msfsck uses the .tags directory in the fileset to access the metadata).
To run msfsck, first 'cd' to the mount point of a mounted fileset.
Then, run "/usr/field/msfsck -t".

/usr/field/vchkdir
This is the AdvFS directory structure checker and fixer. It verifies that the directory structure is correct and that all directory entries reference a valid file (tag) and that all files (tags) have a directory entry.
The -f flag will create symlinks in "<mount-point>/lost+found/" to all files (tags) that do not contain a directory entry; these are called lost files. The -f flag also removes 'dead' directory entries (ones that do not point to valid tags).
The -d option will delete lost files and it will delete corrupted directories.
Note, you may need to run vchkdir several times to clean up a fileset.
The file domain must be inactive to run vchkdir.
The fileset to be checked/fixed must be mounted.
To run vchkdir do "/usr/field/vchkdir <mount-point>".

/usr/field/tag2name
This program will display the full pathname of a file when only the file's tag (inode) number is known. This is mainly a debugging aid when msfsck or vchkdir report errors for specific tags.
To run do "/usr/field/tag2name <mount-point>/.tags/<tag-number>".
It's only firmware, right? (KZPSA A09 and bus resets)
Due to late shipment of our 7620 to 8400 upgrade (that never happens to anybody else, right?), we missed a Christmas-New Years installation window and deferred until early February following spring semester student registration. Because of the complexity of installing the consolidated patch kit (we had not, as yet, written tools to handle it), we decided to go from Unix v3.2c to v3.2d-1 in the same time frame for fixes relevant for 8400's. Included with the v3.2d-1 upgrade was firmware CD v3.4.
This required three changes (hardware, operating system, and firmware) in a relatively tight time frame. We have the advantage of a 2100-4/200 (nugget, not lemon) used for testing and development. This system is generally the first to get any change, typically several weeks in advance for software changes, in this case it was same week. The firmware upgrade on the 2100-4/200 KZPSA's (A08 to A09) failed with:
open failure firmware filename 'KZPSA_fw'
We opened a problem with Digital, they had no ideas, but we stayed with v3.2d-1 and continued with plans for the 8400. We did not worry about the firmware upgrade failure. After all, it's only firmware.
The upgrade to the 8400 and 3.2d-1/fw3.4 on both the 8400 and 2100-5/250 went smoothly the first weekend of February. However, several days later both systems started having problems where they would periodically lose access to their disks behind their HSZ40's. Digital support concurred with us that the likely problem was the KZPSA A09 firmware (the common element and indicated by bus resets showing with uerf), but offered no resolution other than trying to escalate getting us a copy of A10. Understandably, having one bad firmware release with no release notes did not make us anxious to install another. We unilaterally backed them to A08, which wasn't recommended, and restored stability in the interim until we had both the CD required for the 8400 and the opportunity to review potential impact of A10.
It is not clear that our 2100-4/200 would have seen the problem in advance had it been upgraded to KZPSA A09, but the moral is to manage changes better (fewer at once) as well as limiting "trust" for anything without detailed release notes.
The 2100-4/200 failed to upgrade the KZPSA firmware because we placed the CD in a bus-1 CD-drive instead of RZ6 internal to the 2100... this was an unknown limitation (i.e., a bug) or under-documented feature in fw3.4 kits for the 4/200.
It's only firmware, right? (Disk firmware and bus timeouts)
On the consolidated software CD is a firmware directory containing only RZ28B and RZ25L firmware. Specifically, RZ28B 003 to 006 upgrade is required for ASE configurations, but it was not otherwise recommended (in fact, almost discouraged).
In panics (simple_lock timeouts) on 15 February, 24 April, and 28 April our 8400 hung, permanently, in 'syncing disks...'. The hang was so severe that the ctrl/p console interrupt was ineffective and the system had to be restarted. We had no further hangs during 'syncing disks...' (we did have more panics) after:
Root domain moved to RZ28M's from RZ28B's (003) on less used HSZ40 and bus;
Disabled the prestoserve module;
Second (of three) cam_tape.o patches was applied;
OSF360-018 consolidated AdvFS patches were applied;
System activity was lower at times of non-corruption panics.
We do not know specifically what fixed the 'syncing disks...' hangs, but informal word from Digital engineering was there were some known problems with bus timeouts with the RZ28B 003 firmware which could have been the culprit.
During the second of three HSZ40 battery problems we noticed some unusual HSZ behaviours which ultimately led to the suspect HSZ40 being replaced. However, Digital engineers also made recommendations that the following disk firmware upgrades be done:
RZ28   D41C|441c to 442D
RZ28M  0466 to 0568
It was noted that these were not "mandatory" updates, but we have really started believing in pro-active problem avoidance. Since this already involved a significant percentage of our disk farm, we looked for current firmware for all of our disks and did the following:
RZ28B  0003 to 0006     7 disks
RZ25L  0006 to 0008     2 disks
RZ26   T386 to T392A    2 disks
RZ26L  440C to 442D     2 disks
RZ28   D41C to 442D    29 disks
RZ28   441C to 442D     3 disks
RZ28M  0466 to 0616     8 disks
RZ29B  0014 to 0016    50 disks
RZ28M  0568 to 0616    12 disks
This amounted to 115 of our 117 disks. It was a 3 month process to cycle through all of them. Pay attention to the instructions to back up your disks before upgrading; in aggregate we fried 3 disks doing upgrades. The most reliable means to update was via HSUTIL, available in v2.7 HSZ40 firmware. Documented instructions for RZ26L's were inaccurate, at least on 3000-300LX systems... one must wait for 5 minutes after SCU download indicates it is complete (it really isn't), or you will fry the disk. We found it was impossible to upgrade RZ26 T386's either via SCU or HSUTIL and ultimately gave up on those disks. Further disk firmware update information and examples are in UA_DUtools.tar.Z.
'simple_lock: time limit exceeded' panics
As an introduction, we have seen at least three different flavors of simple_lock panics. However, a very common one (at least 8 sites) is triggered by volumes of soft tape errors on TZ877 tape drives.
Symptoms:
simple_lock: time limit exceeded
pc of caller: 0xfffffc00004c172c
lock address: 0xfffffc007f8fc7d0
current lock state: 0x00000000004c1605 (cpu=0,pc=...
panic (cpu 0): simple_lock: time limit exceeded
syncing disks...
[...]
root@glacier> dbx -k /vmunix      # use pc from panic
dbx version 3.11.8
(dbx) 0xfffffc00004c172c/i
[ctape_close:2674, 0xfffffc00004c172c] bsr r26, simple_lock(line262)
(dbx) quit
If it is in ctape_close and uerf shows a volume of soft errors, you have the problem. There are patches to cam_tape.o for v3.2c, v3.2d-1, and v3.2g available through CSC; they are not yet in the consolidated patch file. The problem likely exists in both v4.0 and v4.0a in some form.
In our case, a single TZ877 was generating hundreds of errors each night during the cloning run to copy NSR savesets. The patch (third revision) was effective for stopping the panics, but ultimately the TZ877 also had to be replaced to stop the soft errors. Several other sites had similar experiences.
Recommendation:
Monitor hardware errors and do pro-active reviews before problems escalate.
This was clearly a software problem, but it was hardware triggered.
We send automated nightly messages summarizing uerf reports, these take the form:
#glacier Tue Oct 8 1996
>11:13:23 3 100 CPU EXCEPTION
#glacier Sat Oct 12 1996
>22:26:56 4 199 Bus:06 lu:49.0 R=ctape_iodone:Soft Error Detected (rec:DEC TZ877:+
#glacier Sun Oct 13 1996
>00:12:42 5 199 Bus:06 lu:49.0 R=ctape_wfm:Soft Error Detected (rec:DEC TZ877:+
Summary:
Total 1 100 CPU EXCEPTION
Total 1 199 Bus:06 lu:49.0 R=ctape_iodone:Soft Error Detected (rec:DEC TZ877:+
Total 1 199 Bus:06 lu:49.0 R=ctape_wfm:Soft Error Detected (rec:DEC TZ877:+
The programs and scripts sending these summaries are in UA_DUtools.tar.Z.
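A "Summary:" section like the one in the example can be produced with a few lines of awk. The following is only a minimal sketch and not the script shipped in UA_DUtools.tar.Z; the function name summarize_events is hypothetical, and it assumes event lines begin with '>' followed by the time, sequence number, event type, and description.

```sh
#!/bin/sh
# Hypothetical helper (not the UA_DUtools script): read filtered error-log
# lines of the form
#   >HH:MM:SS <seq> <type> <description...>
# and print one "Total <count> <type> <description>" line per distinct event,
# in order of first appearance. Lines not starting with '>' are ignored.
summarize_events() {
    awk '
        /^>/ {
            # Key on event type plus description (fields 3..NF).
            key = $3
            for (i = 4; i <= NF; i++) key = key " " $i
            if (!(key in count)) order[++n] = key
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf "Total %d %s\n", count[order[i]], order[i]
        }'
}
```

Usage would be along the lines of 'summarize_events < filtered.log', where filtered.log is whatever file holds the filtered day's events. Runs of blanks inside a description are collapsed to single spaces by the field rejoin.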
Presto-chango: AdvFS and data corruption
Our first simple_lock panic in ctape_close on 15 February 1996 started a cascade of other problems. It occurred at 3:38am. The system hung syncing disks, was manually restarted, panic'd again shortly after 8am, then auto rebooted. The Oracle databases restarted fine. At 11am the primary Oracle database crashed and would NOT restart; this was described earlier. By appearances, an index was corrupted and the active archive log had data corruption. Subsequent examination showed corruption in several other non-Oracle data files and some AdvFS structure corruption. A patch to cam_tape.o was applied to remedy the cause of the panic; it did not do anything for the resulting corruption.
Our second simple_lock panic in ctape_close occurred on 24 April 1996 and led to two subsequent panics of different types:
%%% Wed Apr 24 05:00:00 AKDT 1996 %%%
(1) simple_lock: time limit exceeded
pc of caller: 0xfffffc00004c172c
lock address: 0xfffffc007f8fc7d0
current lock state: 0x00000000004c1605 (cpu=0,pc=...
panic (cpu 0): simple_lock: time limit exceeded
syncing disks...
------------------------------------------------------------------------
[...system rebooting, processing rc3.d...]
-> /sbin/rc3.d/S90schedule start
[...]
***
*** POLYCENTER Scheduler has STARTED all requested processes
*** Startup script procedure exiting with success.
***
SCHED-I-PARTOPENREQ, Partition (ROOT_GLACIER) was requested to Open
(2) trap: invalid memory read access from kernel mode
faulting virtual address: 0x0000000000000015
pc of faulting instruction: 0xfffffc00003f4018
ra contents at time of fault: 0xfffffc00003f3ac4
sp contents at time of fault: 0xffffffffb8342ad0
panic (cpu 0): kernel memory fault
syncing disks... done
------------------------------------------------------------------------
%%% Wed Apr 24 10:00:00 AKDT 1996 %%%
(3) # ADVFS EXCEPTION
Module = 2, Line = 1356
panic (cpu 0):
syncing disks... done
Remembering the previous problem from 15 February, we held the databases down and
with a full export we were able to detect table corruption.
It was fairly obvious that the first panic (simple_lock) either directly or indirectly caused a large degree of corruption which led to the subsequent panics. The corruption included:
1.1 /etc/auth/system/ttys
This file was corrupted causing telnet logins to fail until recovered.
Note, CSC said this file often gets corrupted.
1.2 /var/sched/tmp/sched_22-Mar-96_05:32:13.exp
This directory was corrupted, access ('ls -l') caused panic type (2).
1.3 /usr/bin/test
/usr/bin/tsort
/var/sched/help/
These showed up in the msfsck and 'vchkdir /var' crashed the system.
Domain containing /var and /usr required restoration.
1.4 oracle (/u??/ORACLE/FINP/data/?) FBGENL and FGBTRND
There were corrupted rows in both tables which showed up
with an attempted full export of the Oracle database.
No corruption was apparent at the file system level.
Required action was to recover and roll-forward Oracle database.
1.5 /users/sx/sxcnb99/ipl/now/SYS1.PROCLIB_RDR400
/users/sx/sxcnb99/ipl/now/SYS3.PROCLIB_CAS
Both showed up in find as 'bad status' apparently mapped to
the directory but not really there.
1.6 /oraalogs/ORACLE/FINP/arch/*41*.dbf (or possibly 40)
An attempt to delete this file caused panic (3). On reboot,
no filesets in that domain could be mounted (restore required).
The initial panic (1) hung during 'syncing disks...'. Ctrl/P from the 8400 console was ineffective; a "restart" was required. Likewise, the panic on 15 February hung during 'syncing disks...'. Logically, this is a serious concern as syncing the disks is going to flush any pending writes to disk. We have yet to receive a response from Oracle on whether they employ synchronous writes, read-after-write, or other techniques, so we do not know if this is an exposure for Oracle. Detailed analysis from Digital showed that AdvFS does have an exposure during panic syncing to cause corruption; one patch was produced for this but was ineffective for SMP environments, and a second patch is now out.
A third panic of this nature occurred on 28 April and the system again hung during the 'syncing disks'. However, this panic did not result in any corruption. By that point we had identified the one TZ877 used for cloning as triggering the problem, so we shut down Oracle during the cloning process. Also, we had preserved the old /usr and /var RZ28B fw 003 disks corrupted on 24 April for Digital to analyze, the root raidset was restored onto RZ28M disks. Finally, we had disabled the prestoserve module as being suspect and not very useful in our configuration. It is not known which of the above changes prevented corruption on 28 April, maybe we were just lucky for a change.
Other Oracle Corruption (not from cam_tape induced simple_lock panics)
On 21 September 1996 a bad scsi bus termination resulted in the 8400 generating lots of AdvFS exceptions and halting (it did not even panic):
msfs_unmount failed for /u05 with error 16 during shutdown
msfs_unmount failed for /u02 with error 16 during shutdown
msfs_unmount failed for /users/sx with error 16 during shutdown
msfs_unmount failed for /users with error 16 during shutdown
msfs_unmount failed for /tmp with error 16 during shutdown
msfs_unmount failed for /var/adm with error 16 during shutdown
msfs_unmount failed for /var with error 16 during shutdown
msfs_unmount failed for /usr with error 16 during shutdown
syncing disks... 26 20 7 done
CPU 0: Halting... (transferring to monitor)
halted CPU 0
CPU 1 is not halted
halt code = 5
HALT instruction executed
PC = fffffc0000481950
P00>>>
The cause is known (bad bus termination); the results are still under investigation. Paranoia (lack of disks successfully syncing) caused us to do a full export of the Oracle databases and a table was found to be corrupted. Recover and roll forward of the affected table was effective. Unfortunately, as a diagnostic aid the export (even to /dev/null) still takes over 2 hours.
Recommendations
Other sites, particularly 2100's, have reported problems via the alpha-osf-managers list with hangs during disk syncs. Several other sites have had data corruption after panics; specifically, one lost NSR indices with a cam_tape simple_lock and another lost ufs based Oracle files after a cdisk_strategy simple_lock panic. The primary exposure appears to be writes not making it to disk. This exposure may be higher for "database" type applications such as Oracle, NSR, POLYCENTER Scheduler (uses Informix), and AdvFS, where a 'lost write' may not be at end-of-file or may be improperly synchronized, causing irrecoverable corruption.
We still have our 8400 set 'auto_action boot', but we have added checks within rc3 to detect whether the system was shut down normally and do not proceed with the site defined 'init 4', which starts Oracle databases, if the system panic'd and there is any indication the disks did not sync. This has saved us twice already. Examples of rc3 and 'init 4' customization are also included in UA_DUtools.tar.Z. In addition, use of the AdvFS /usr/field utilities and/or walking the entire filesystem (via 'ls / -lR >x' or something like 'find / -atime 1 -ls >x') will identify file system corruption.
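One simple way to detect an abnormal shutdown from rc3 is a flag file that is created once the system is up and removed only by an orderly shutdown. The sketch below illustrates that idea only; it is not our actual rc3 customization (that is in UA_DUtools.tar.Z), it does not examine whether the disks synced, and the flag file path and function names are hypothetical.

```sh
#!/bin/sh
# Hypothetical flag file: present while the system is running, removed
# only by an orderly shutdown. Path is an assumption; override as needed.
RUN_FLAG=${RUN_FLAG:-/var/adm/system_running.flag}

# Call once startup has completed.
mark_running() { : > "$RUN_FLAG"; }

# Call from the orderly shutdown path.
mark_clean_shutdown() { rm -f "$RUN_FLAG"; }

# Succeeds (status 0) only if the previous shutdown removed the flag.
previous_shutdown_clean() { [ ! -f "$RUN_FLAG" ]; }

# Intended use near the end of rc3 (shown as comments, not executed here):
#   if previous_shutdown_clean; then
#       mark_running
#       init 4          # site defined run level that starts Oracle
#   else
#       echo "abnormal shutdown detected: databases held down for validation"
#   fi
```

The check must run before mark_running is called, otherwise every boot looks like a crash. After a held boot, the flag is cleared by hand (mark_clean_shutdown) once the file systems and databases have been validated.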
Also, the patch produced as a result of the 24 April event (OSF375-055) is now applied on our systems and we believe it should be effective in at least partially avoiding possible future corruption. In general, looking for and applying preventative patches to avoid panics and thereby reducing risk is "A Good Idea".
CPU EXCEPTION's and error log analysis uerf and dia
CPU EXCEPTION does not mean just the CPU board(s),
more often than not it is an error logged against memory or the cache resident on the CPU board.
With uerf you will see something like:
sxkac@glacier> uerf -c err,oper -o full -t s:`ua_date -uerf -7` | more
[...]
EVENT CLASS ERROR EVENT
OS EVENT TYPE 100. CPU EXCEPTION
SEQUENCE NUMBER 3.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Tue Oct 8 11:13:23 1996
OCCURRED ON SYSTEM glacier
SYSTEM ID x0005000C CPU TYPE: DEC 7000
SYSTYPE x00000000
PROCESSOR COUNT 2.
PROCESSOR WHO LOGGED x00000000
----- UNIT INFORMATION -----
UNIT CLASS CPU
'uerf -Z' will do a raw record dump,
DECevent provides the knowledge to interpret it.
DECevent shows quite a lot more (albeit it produces pages of output to scan):
sxkac@glacier> dia -i cpu -o full -t s:`ua_date -uerf -7` | more
[...]
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 3.
Timestamp of occurrence 08-OCT-1996 11:13:23
Host name glacier
System type register x0000000C AlphaServer 8x00
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x00000000
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority <==
Entry type 100. CPU Machine Check Errors
CPU Minor class 4. 620 System Correctable Error
[...]
TLBER x00440000 CORRECTABLE READ DATA ERROR <==
DATA SYNDROME 2
[...]
The previous is obviously a correctable error which can be ignored unless such errors start occurring regularly or frequently.
However, DECevent also lies:
sxkac@spike> dia -i cpu -o full -t s:`ua_date -uerf -20` | more
[...]
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 2.
Timestamp of occurrence 26-SEP-1996 17:38:34
Host name spike
System type register x00000009 AlphaServer 2x00
Number of CPUs (mpnum) x00000002
CPU logging event (mperr) x00000001
Event validity 1. O/S claims event is valid
Event severity 1. Severe Priority <==
Entry type 100. CPU Machine Check Errors
CPU Minor class 2. 660 Entry
-- ENTRY FRAME FOLLOWS --
Frame ID x00000022 Machine Check Frame
[...]
Machine Check Error Code x00000202 CPU Detected Unrecoverable Error
[...]
-- ENTRY FRAME FOLLOWS --
Frame ID x00000008 Memory Frame
Memory Module ID x00000003
Error Register 1 x0000000000040001 <==
[Even] Error Summary
[Even] EDC Corr Error <==
[...]
The above is a correctable memory error which is erroneously reported as severe.
The only way to know is to look closely and ask Digital.
This (x040001 on a 2100) is a single bit correctable memory error which
can be ignored unless it happens in volume or often.
As stated previously, the primary value is automatically filtering the
error log to something quickly scannable for anomalies, such as:
Subject: spike ua_uerf from 25-sep-1996,00:00:00
#spike Thu Sep 26 1996
>17:38:34 2 100 CPU EXCEPTION
>17:38:34 3 100 CPU EXCEPTION
#spike Tue Oct 1 1996
>18:15:05 4 199 Bus:02 lu:20.3 R=cdisk_bbr_done:::cdisk_bbr: BBR disabled bad block: 7918698
#spike Sat Oct 5 1996
#spike Sat Oct 5 1996 07:48:54 26 301 SHUTDOWN|halted by root:#!> 2100A and CPU upgrade
#spike Sat Oct 5 1996 12:21:17 0 300 STARTUP
Summary:
Total 2 100 CPU EXCEPTION
Total 1 199 Bus:02 lu:20.3 R=cdisk_bbr_done:::cdisk_bbr: BBR disabled bad block number: 7918698
Total 1 301 SYSTEM SHUTDOWN
Total 1 300 SYSTEM STARTUP
26-SEP-1996 17:38:34 spike 1. Severe Priority Memory Module ID x03 Error Reg 1 x0000000000040001
26-SEP-1996 17:38:34 spike 1. Severe Priority Memory Module ID x03 Error Reg 1 x0000000000040001
cdisk_bbr: BBR disabled bad block number:
Device Locator x000401 Port = 1.
Target = 4.
LUN = 0.
Some sites will pump error log entries into an immediate filter for reporting.
At a minimum daily summaries are "A Good Idea".
In addition, monthly summaries can also provide good management information:
#spike Sat Sep 21 1996
#spike Sat Sep 21 1996 11:46:07 48 301 SHUTDOWN|halted by root:Starting fw 3.7 du 3.2g upgrade
#spike Sat Sep 21 1996 13:09:26 0 300 STARTUP
#spike Sat Sep 21 1996 13:32:47 1 301 SHUTDOWN|halted by root:Final boot DU 3.2g fw 3.7 clc3.1
#spike Sat Sep 21 1996 13:58:02 0 300 STARTUP
#spike Sat Oct 5 1996
#spike Sat Oct 5 1996 07:48:54 26 301 SHUTDOWN|halted by root:#!> 2100A and CPU upgrade
#spike Sat Oct 5 1996 12:21:17 0 300 STARTUP
Summary:
Total 42 199 Bus:01 lu: 8.0 R=ctape_move_tape:Hard Error Detected:DEC TZ877:+
Total 10 100 CPU EXCEPTION
Total 3 199 Bus:01 lu: 9.1 R=changer_check_status:::Recovered error
Total 2 199 Bus:02 lu:17.0 R=cdisk_check_sense:::Event - Unit Attention
Total 3 301 SYSTEM SHUTDOWN
Total 3 300 SYSTEM STARTUP
Total 1 199 Bus:02 lu:20.3 R=cdisk_bbr_done:::cdisk_bbr: BBR disabled bad block: 8322460
Total 1 199 Bus:02 lu:20.3 R=cdisk_bbr_done:::cdisk_bbr: BBR disabled bad block: 7918698
Total 34 199 Bus:01 lu: 9.0 R=ctape_iodone:Soft Error Detected (rec:DEC TZ877:+
Total 2 199 Bus:01 lu: 9.0 R=ctape_wfm:Soft Error Detected (rec:DEC TZ877:+
Sample scripts and programs supporting this are in UA_DUtools.tar.Z.
'doconfig' doesn't: The phantom XMI bus
After upgrading from a 7620 to 8400, we had to retain the old XMI bus cage for several months. Finally, it was removed. However, the first time we attempted a 'doconfig' without '-c HOST' we got:
*** PERFORMING KERNEL BUILD ***
A log file listing special device files is located in /dev/MAKEDEV.log
Working....Mon Sep 9 16:12:49 AKDT 1996
*** WARNING ***
An error has occurred during system configuration. A partial listing
of the error log file (./errs) follows:
env - COMP_HOST_ROOT=/ COMP_TARGET_ROOT=/ /bin/cc -std0 -EL -I -I. -I..
[...]
The XMI still existed in the /sys/conf/HOST file, but the kernel could not be rebuilt.
The workaround was:
# doconfig -c BOGUS
[...]
# mv /sys/GLACIER /sys/GLACIER.old
# mv /sys/BOGUS /sys/GLACIER
# mv /sys/conf/GLACIER /sys/conf/GLACIER.old
# mv /sys/conf/BOGUS /sys/conf/GLACIER
# ed /sys/conf/GLACIER
s/BOGUS/GLACIER/
# mv /sys/conf/GLACIER.list /sys/conf/GLACIER.list.old
# mv /sys/conf/BOGUS.list /sys/conf/GLACIER.list
# shutdown -r
'doconfig' doesn't: Missing tapes
We rarely do a 'doconfig' without '-c HOST'. Why? 'doconfig' likes to number mt* sequentially while we prefer to name tapes logically:
mt0,mt1,mt2,mt3 are TZ877's
mt4* are 4mm
mt8* are 3480's
mt9 is (fortunately only one) 9-track reel
The only reason for a simple 'doconfig' is a significant bus architecture change. This also saves one from repetitive raising of maxusers; most other kernel parameters are best set in /etc/sysconfigtab.
RAID-5: is it any good?
Failure detection (and configuration management)
Is RAID-5 behind an HSZ40 any good? Almost too good, the first disk failure we had went undetected for three days. The SW800 disk cabinet is in a frequently unoccupied room. One day somebody looked in the cabinet and saw an amber light on a disk. If you implement an HSZ40 controller, you must also implement a means to monitor its console. It is another computer which serves disks. The hszterm utility (subset SWACLI11A) provides a means to do this, for example:
sxkac@nugget> sudo hszterm -f /dev/rrz17c <<!HEOF >hsz_3n.961015
> show this full
> show unit full
> show raidset full
> show device full
> !HEOF
sxkac@nugget> sudo hszterm -f /dev/rrz17c "run fmu" "show last most" >hsz_3n.961015.err
Placing variations of the above commands in a scheduled job and simple scripts to grep key lines and sdiff, i.e.:
sdiff $TODAY $YESTERDAY | grep -E ' < | > | \| ' | mailx [...]
can alert you to changes or problems.
However, not all HSZ errors are logged via FMU (many are not) and very few HSZ errors actually get logged to the host errlog. We connect HSZ consoles, along with all host system serial consoles, to POLYCENTER Console Manager so we have a means to play back and review any events which occurred. This has proved invaluable a number of times in diagnosing and reconstructing problems.
Performance (Raidsets on HSZ40)
Read performance has been fine. Aside from the recurrent bad battery problems, write performance on raidsets using write-back cache has been adequate most of the time. However, we have noticed several times where Oracle has bogged down. We traced one to very high (unusually so) write activity against TEMP table space and we traced another to the db writer being bottlenecked. What exactly triggered these is still under investigation. Our suspicion is there is some activity occurring which should not be occurring during prime hours; tracking it is a challenge.
Watching performance of HSZ disks is also a challenge; iostat is effectively useless,
so we wrote a modified version that interprets the device names:
sxkac@glacier> uak_iosts 5 3
tty rz25 rz26 rz27 rz58 rz59 rz60 rzb57 rzc57 cpu
tin tout bps tps bps tps bps tps bps tps bps tps bps tps bps tps bps tps us ni sy id
9 1106 46 3 9 0 106 4 26 0 13 0 56 3 18 0 97 4 16 0 10 74
23 3927 13 2 19 2 593 68 40 1 61 8 26 3 60 0 16 24
20 3387 40 3 408 40 38 1 77 4 46 4 74 0 12 14
Alan Rollow's monitor utility is also an excellent means to monitor disk IO on-line.
Software destruction (etc.)
In the last 18 months we have fully restored large Oracle databases five times, restored root disks three times, and done various other partial restores (AdvFS domains, individual Oracle datafiles, index rebuilds, miscellaneous files). Events causing these have been AdvFS panics, other panics, apparent Oracle problems, bad HSZ batteries, and other hardware problems. RAID-5 has protected us from loss due to a disk failure (we have had three), but is ineffective for faulty software or ineffective hardware recovery which are by all appearances much greater risks.
Run-away processes (oh no, what happened to my CPU?)
Several sites, including us, have at times had problems with run-away processes consuming the majority of CPU resources. Often they have been orphaned processes which loop after a hangup. The most common one for us was a menu application coded "improperly" to not check for errors. It would issue an fgets() and not check for error, then re-issue the fgets() repetitively looking for an 'exit' or other valid command. Since the terminal (telnet connection) was disconnected, the loop was endless, brutally efficient (tight and fast), and made worse with the IO attempts in consuming 'system' cpu. The following code segment remedied the application:
[...]
if (!fgets(menu_selection, MAXINPUT, stdin))
{
if (ferror(stdin)) housekeeping_exit (0, 0);
eof++;
if (2 > eof)
{
fprintf(stderr, "\t\007Another will exit!\007");
fflush (stderr);
sleep (5); /* ensure pause before screen re-paint */
}
else housekeeping_exit (0, 0);
}
else eof = 0;
[...]
Abnormal telnet disconnects are a common problem with long-distance communications in places like Alaska.
The number of looping orphans was a very small fraction of the abnormal disconnects.
By appearances, in some situations the HUP signal would get lost triggering this problem.
Digital's response was there was likely a faulty pc-telnet implementation causing this,
but we saw it in at least three common telnet implementations (on both pc's and mac's).
We were convinced there were conditions where HUP's were lost as we occasionally
saw similar orphans from Digital X-terminal applications.
We have not seen any since v3.2d-1.
One method to combat this problem is to add soft CPU limits to /etc/profile (korn or bourne, use /etc/csh.login and 'limit' for csh):
ulimit -St 600 # soft limit - limit process CPU time to 10 min
If you choose to do this, be sure to *NOT* set it for root, Oracle, or other userids which are expected to have large or unlimited consumption, otherwise you will find things terminating due to signal 24.
Another common limit imposed on korn shell users is:
export TMOUT=1800 # 30 minute timeout
This is effective only for the korn shell, not for programs executing. As with ulimit, be sure to *NOT* set it for server type userids; for Oracle in particular you will get SQL*Net disconnects if this is set and you are running MTS (multi-threaded server).
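The exemption for root, Oracle, and other server type userids can be coded directly into /etc/profile. The following is a sketch only: the exempt list and both function names are hypothetical, and the limits are the same 10 minute CPU and 30 minute idle values discussed above.

```sh
#!/bin/sh
# Hypothetical list of userids that must NOT receive the limits;
# adjust to the server type accounts at your site.
EXEMPT_USERS="root oracle"

# Succeeds (status 0) if the given userid is in the exempt list.
is_exempt_user() {
    for u in $EXEMPT_USERS; do
        [ "$u" = "$1" ] && return 0
    done
    return 1
}

# Apply the soft CPU limit and the shell idle timeout to ordinary users only.
apply_login_limits() {
    if ! is_exempt_user "$1"; then
        ulimit -St 600              # soft limit: 10 minutes CPU per process
        TMOUT=1800; export TMOUT    # 30 minute idle timeout for the shell
    fi
}

# In /etc/profile this would be invoked as (not executed here):
#   apply_login_limits "$LOGNAME"
```

Because the limit is soft, an individual user can still raise it up to the hard limit for a legitimately long job.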
To 4.0 or not to 4.0?
Planned change and configuration management is the best way to avoid problems. Sound practices like reducing the number of concurrent changes so cause and effect for any subsequent problems are more obvious seem like common sense. However, supporting multiple systems and having a number of items to update (operating systems, preventative or remedial patches, databases, applications, layered products, firmware at multiple levels, hardware, network adapters, disk controllers, etc.) as well as increasing demands for 24x7 availability for parts or all of the configuration lead to strong pressure to cut corners and merge activities. Be careful, very careful.
Problems are like flowers, they like to occur in bunches (particularly near changes):
(A) 17 Sep 1995 AdvFS panic corrupts /usr (/var,/)
    18 Sep 1995 Power failure
    18 Sep 1995 HSZ40 battery bad, raidset corrupted
(B) 31 Jan 1996 2100 cannot load KZPSA firmware off bus-1 CD
    02 Feb 1996 7620->8400, DU v3.2c->v3.2d-1, fw 3.4 cd
    05 Feb 1996 2100/8400 lose HSZ disks due to KZPSA A09
    12 Feb 1996 2100 C-bus loops... backplane replaced
    15 Feb 1996 8400 simple-lock panic corrupts Oracle
(C) 19 Aug 1996 2100 KZPSA fails
    26 Aug 1996 HSZ40 battery fails, 2100 crashes
    26 Aug 1996 2100 IO module fails preventing reboot
In the case of (B) in particular, cause and effect relationships were initially indecipherable as different problems overlaid each other, masking the symptoms.
Reduce your risk by controlling the changes to the environment. Do firmware updates in advance of software updates when possible and stage upgrades in phases whenever it is reasonable. Small changes are less risky than big changes.
We went to v3.2d-1 shortly after it was available. While a "fix only" release should be relatively safe, we heard on several support calls that patches available for v3.2c had not yet been ported to v3.2d-1. Digital is improving on cycling patches to all supported releases, but don't be caught as the first one with a new release.
In going from DU v3.0 to v3.2c, we found POLYCENTER Performance Manager inoperable pending a patch. Even preventative patches can cause problems, one applied to v3.2d-1 broke POLYCENTER HSM requiring a reinstall of both HSM and SCSI-CAM (CLC). If one read the early release information for DU v4.0, there were many layered products initially unsupported under v4.0. Being early to a new release incurs risk of something not yet being there.
Digital Unix v4.0 is a major upgrade. Our strategy is to let other non-production systems find the initial bugs. General rule of thumb is to wait at least 6 months after general release for a production system. Already v4.0a is available (and is the migration path for v3.2g which is our current version). Besides, the current level of our Oracle database and application software package are not vendor validated against v4.x. We have one 3000 on v4.0a, we will follow with another 3000 and the test/development 2100-4/200 sometime in first quarter of 1997 as staff resources permit.
Customers are mushrooms... keep them in the dark,
toss in some manure, and they will flourish?
Certainly not all Digital and Oracle support technicians reflect this corporate attitude; there are some good ones. Unfortunately, the corporate attitude is sometimes infectious and it appears that some engineers or engineering managers may treat the support centers in this manner at times.
Where is the technical information?
(particularly the preventative kind)
Have you heard the echo "this is a known problem", are you asking why you feel like you are the last one to know?
While it may be "bad marketing" to acknowledge problems exist, it is excremental support to not let your customers know so they can avoid them.
Don't settle for just getting it working again, insist on root cause analysis and preventative measures (for you and for your neighbors).
Pursue the problems behind the problems.
Yes, if a system panics you do need to stop the panics to avoid the outages.
However, if the panic causes further damage (either actual or extending the outage for data validation),
there is a second problem to pursue.
Make them answer:
(Network): I'm mad as hell and I'm not going to take this any more.
The importance of addressing the standard issues involved in providing adequate backup of one's production applications cannot be over-stressed. In order to be able to respond to the disasters covered in the previous portions of this session, a robust procedure for handling data was necessary. Basic areas that should be addressed are:
- use of RAID and redundant hardware to protect file systems from "single point of failure",
- development and scheduling of regular full and incremental backups of all data,
- multiple copies of backups in multiple locations, and
- access to the backups when they are needed.
All of our Digital UNIX file systems reside on RAID 5 disks controlled by HSZ40 controllers providing immediate failover in the event of a disk crash with no application outages. Additionally we are configured with dual redundant controllers and dual power supplies. At this point we have not implemented DECSafe ASE due to the perceived complexity of the configuration and the feeling that it would not provide us with any significant increase in availability.
Backups are written to TZ877 DLT jukeboxes. Backup jobs are run via POLYCENTER Scheduler such that full backups of all systems are performed once per week with subsequent incrementals nightly. Additionally, each Oracle instance is backed up separately and the logs required for recovery are duplicated in multiple file domains and on multiple media, i.e., at all times there are at least 3 copies of each log on magnetic disk, 2 in one file domain and one in a different domain, and 1 additional copy on optical disk. [Does this sound paranoid??]
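Keeping multiple copies of each log is only useful if the copies are actually there and intact. A scheduled check along the following lines can confirm that. This is a sketch under assumptions: the function name and the directory arguments are hypothetical, and it is not one of the scripts we run.

```sh
#!/bin/sh
# Hypothetical check: verify_log_copies PRIMARY_DIR COPY_DIR...
# For every regular file in PRIMARY_DIR, confirm that each COPY_DIR holds a
# byte-identical file of the same name. Prints "MISSING <path>" for each
# copy that is absent or differs; returns 1 if anything was reported.
verify_log_copies() {
    primary=$1; shift
    status=0
    for f in "$primary"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        for d in "$@"; do
            # cmp -s is silent and fails if the copy is missing or different.
            if ! cmp -s "$f" "$d/$name"; then
                echo "MISSING $d/$name"
                status=1
            fi
        done
    done
    return $status
}
```

Run from a scheduled job with the primary archive directory first and the duplicate directories after it, mailing any output to the technical staff.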
All backups are automatically "cloned" nightly to a separate volume group and the "clone" tapes are taken to off-site storage each morning.
Backup copies of data are of no use to an installation if they cannot be reliably located for use in a recovery situation. In order to ensure that we can have confidence in the validity of the backup copies we automatically track the status of the jobs which produce them. Backup job status messages are sent to Tech Services staff and operations staff. Additionally the NSR indexes are printed when updated, enabling recovery in the event that the disk containing the index is itself destroyed.
In order to minimize negative impact on our production applications and increase our ability to perform problem determination, it is helpful, when at all possible, to introduce changes in the operating environment hardware, software and configuration by means of a controlled and progressively staged implementation methodology.
For operating system, layered product and database software, multiple hosts can aid in providing staged, well tested implementation for system, layered product and database upgrades. For application upgrades multiple oracle instances and source control can provide controlled migration of changes with the opportunity to non-destructively test changes prior to implementation in production.
In our installation we currently operate with two internal test platforms on which Tech Services can implement changes without affecting the rest of the organization, one applications test platform where the only staff potentially affected are the applications programmers, and finally our two production platforms: one supports our production applications instance plus an additional "pre-production" Oracle instance for end-user verification of application changes, the other supports a production data warehousing application.
The usual procedure we use locally is to initially install a software upgrade/patch on moka and java (our Alpha 3000's running minimal and non-critical oracle apps) and run them there for a period of time to verify stability. From there the upgrade is installed on nugget, an Alpha 2100 used for application development and the location of our TEST Oracle instance. Assuming existing applications, layered software and database applications function without problems, changes are installed on spike (Alpha 2100 running our production DSD system) and finally to glacier - our primary production server.
In the interest of reducing the impact of inter-dependencies of hardware and software on production systems we maintain our final testing area (pre-production) on the same host as production.
Use of change and problem control affords a mechanism for tracking what is changing in one's environment and what the effects of the change may be. Used effectively it can help identify potential conflicts and provide risk assessment. It can also be useful after the fact in providing aids in problem resolution by showing "what changed" and by linking problems that occur with the changes that caused them. We are currently in the process of converting from a mainframe based product (IBM's Information Management) to an Oracle based problem/change/configuration management system (Prolin's ITSM).
Ongoing creation and maintenance of documentation of your configuration and local procedures is critical to being successful in recovering from the types of disasters we've been describing. Documentation can potentially be done
- proactively as procedures are developed and implemented,
- regularly produced via scheduled jobs and tools that log activity, and
- reactively to capture information produced by ad hoc activities.
Additionally it pays to be sure that the documentation produced is readily available when needed for an emergency.
If the person who set up and documented a system or procedure gets "run over by a truck" (or leaves for other reasons...) others should be able to take over relatively easily.
Obviously, one should document procedures that one knows will potentially be needed, e.g., database recovery, and have them in place and tested before they are actually needed. Document what has been done to provide the ability to recover and how an actual recovery should be done. Test your procedures when they are changed and periodically to verify currency.
Automated documentation of system configuration provides a current view of your environment. Regularly produced listings of changes to system files can aid in problem determination (i.e., "what changed").
Additionally, use of products that trap and retain system console output can be useful for recording configuration changes and providing a record of what happened to a system when events happen after hours. We currently use POLYCENTER Console Manager for both providing central access to multiple systems and logging system updates (e.g., product installs, kernel rebuilds) as they are performed.
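A "what changed" listing for system files needs nothing more than a checksum snapshot taken on a schedule and compared with the previous one. The sketch below assumes the POSIX cksum and diff utilities; the function names and any file list passed to them are hypothetical.

```sh
#!/bin/sh
# Hypothetical helpers for a scheduled "what changed" report.

# snapshot FILE... : print "<checksum> <size> <path>" for each file.
snapshot() { cksum "$@"; }

# what_changed OLD NEW : print the snapshot lines that differ between two
# snapshot files ('<' marks the old state, '>' the new). Prints nothing
# if the snapshots are identical.
what_changed() {
    diff "$1" "$2" | grep '^[<>]'
}
```

A nightly job would write today's snapshot of the chosen system files to a dated file, run what_changed against yesterday's, and mail any output.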
Document newly developed emergency procedures as they occur, e.g., when you have to restore from a backup that is no longer accessible via normal NSR recover procedures, document how it was done - someone will undoubtedly have the need to do this again.
Make sure documentation will be accessible when needed. Documentation, no matter how good, is useless if it is inaccessible. We currently maintain our documentation on two hosts as well as multiple paper copies and provide web access.
There are many capable and knowledgeable people "out there" and many ways of getting in contact with them. The following are some good resources.
- Digital UNIX Information Sources
  alpha-osf-managers
  comp.unix.osf.misc
  www.service.digital.com
  www.digital.com
  DECUS
  ftp:atlanta.service.digital.com/pub/patches
  DSN and DSN ITS (Digital Support Network Interactive Text Search)
  networker mail list (subscribe to majordomo@iphase.com)
- Oracle Information Sources
  Backup and Recovery Handbook (Oracle Press)
  comp.databases.oracle.*
  Oracle OpenWorld
  Oracle support notes CD
  www.oracle.com and Oracle Support Link (req. "Metal" support)
- Freely Available Digital UNIX Software Index:
  www.digital.com/info/misc/pub-domain-osf1.txt.html
- Freely available software from University of Alaska:
  ftp.alaska.edu:/pub/randy/
  zuausr Enhanced security userid maintenance programs
  proc_info Get process information
  perf_mon_tools/
  syd-3.2a Enhancements to syd
  sys_stat-1.2 Show|log system performance statistics across network
  ftp.alaska.edu:/pub/sois/
  UA_DUtools Various system management scripts, programs, and docs