Created 24 June 1998 by kcarlson, no updates anticipated.
General Information
Index
Available Handouts
Monday
Tuesday
Wednesday
Thursday
Friday
Unrelated Tidbits
General Information http://www.cug.org/:
From ARSC the following people attended:
Virginia Bedford, Nathan Bills, Richard Burton, Kurt Carlson, Liam Forbes, Sergei Maurits, Barbara Horner-Miller, Guy Robinson
ARSC was one of the best-represented sites (it was very nice to be able to cover all appropriate sessions rather than having to be in three places at once). CUG is a small conference relative to other vendor conferences I have attended (DECUS and SHARE over the last decade); the small size was an advantage for having direct conversations with both Cray/SGI employees and other sites.
Attached is a list of the handouts I received there. Also attached are some "highlights" from over 40 pages of notes which I took. See me or any of the others for details or discussion.
Monday
08:30 Scheduling and Configuration Tuning for the T3E
11:00 Performance Optimization for the Origin 2000
14:00 General Session: Welcome
16:00 Open Meeting: Performance and Evaluation
Tuesday
09:00 SGI/Cray Monitoring Tools
09:30 Industry Directions in Storage
11:00 General Session: SGI/Cray Corporate Vision
14:00 OpenMP Programming Model
14:30 OpenMP: a Multitasking and Autotasking Perspective
15:00 Update of System Management Software for Large Origin Systems
16:00 DCE/DFS
16:30 Performance Tips for GigaRing Disk IO
17:00 Cray Product Installation and Configuration
18:30 BOF: T3E
Wednesday
09:00 Scalability and Performance of Distributed I/O on Massively
Parallel Processors
09:30 Distributed Supercomputing Services in a Heterogeneous Environment
10:00 Cray Networking Update
11:00 General Session: CUG Elections & Keynote Benz
14:00 SGI/Cray Service Report
14:30 SGI/Cray Q&A Panel
16:00 Cellular IRIX: Plans & Status
16:30 Cray T90 versus Tera MTA
19:00 CUG Night Out: Mercedes Museum
Thursday
09:00 The Age-Old Question of How to Balance Batch and Interactive
09:30 MISER: User Level Job Scheduler
10:00 Serving a Demanding Client While Short on Resources
11:00 General Session: SGI/Cray Service Report, etc.
14:00 Mass Storage at the NCSA: DMF and Convex UniTree
14:30 The State of Security for UNICOS & IRIX
15:00 IRIX Accounting Limits and UDB Functionality
16:00 Program Steering Committee Structure
16:30 SIG: High Performance Computing
18:00 Program Committee Review
Friday
09:00 IRIX: Getting It All Together
09:30 Integrating an Origin2000 into a Cray Data Center
10:00 Origin Craylink Partitioning
11:00 General Session: Keynote & SGI/Cray Hardware Report
Available Handouts:
Note: much of this should be available on the forthcoming proceedings CD; the proceedings may also appear on the web (TBD).
General:
Attended:
Picked up:
Monday
08:30 Scheduling and Configuration Tuning for the T3E
Jim Grindle, SGI/Cray, Mgr U/mk Engineering jsg@cray.com
Tutorial covering GRM (global resource manager) and psched (political scheduler).
psched consists of GS (Gang Scheduler), LB (load balancer), and MUSE (multi-layered user scheduling environment). GS & LB have seen lots of fixes with 2.0.3 of U/mk.
Futures:
- looking at 'exit' for GRM making external requests;
- GRM queue: prioritization, starvation
- single PE applications
- optimization
11:00 Performance Optimization for the Origin 2000
Jeff Brooks, SGI/Cray Benchmarking Department, jpb@sgi.com
See handout.
Note: on the Origin, a floating-point divide by zero does not cause a program fault by default.
Note comparisons for tuning parallel code for Origin.
Note F90 option: -PHASE:flist on (for debugging, presently undocumented).
See information on OpenMP (recommended standard).
14:00 Welcome University of Stuttgart
E. Messerschmid, Prorektor Research and Technology
14:20 Parallel Numerical Simulations of Environmental Phenomena
Gabriel Wittum, Director, Institute for Computer Applications, University of Stuttgart
15:05 CUG Report
Gary Jensen, CUG President, UIUC/NCSA
Sam Milosevich, CUG Vice-President, Eli Lilly
16:00 Open Meeting: Performance and Evaluation
Chair: Jeff Kuehn
Slated to "disappear" as major SIG (Focus).
P&E has many submissions, about 1/2 rejected; criteria:
- quality
- appropriateness (some passed to other SIGs)
- competition for available slots
Tuesday
09:00 SGI/Cray Monitoring Tools
Randy Lambertus, SGI/Cray, rl@sgi.com
Proposed 'future' integration of support tools (i.e., vaporware):
SSS: System Support Software, 3 command sets:
- System Support Manager: Command Module, Decision Support, Monitor & Notify, Event Handler, Support Database
- System Group Manager: Manage multiple systems... Group event tracking, Config mgt., Availability monitor, Notification based on group system events.
- System Support Console: gui and ascii interfaces; launch and configure; control notifiers and reports.
Proposed to be available next year for Irix; written for NT as well.
Intent is to provide for U/mk (unknown: "Efforts all directed to IRIX now").
09:30 Industry Directions in Storage
Mike Anderson, SGI/Cray
Seagate & IBM are primary players in high performance disks.
Quantum bought out by Matsushita (sp).
Market dominated by desktop (70% of units), roughly 17% is high performance and 13% mobile.
CD-RW likely to take over CD market by 2000-2001.
Industry has not accepted IBM SSA disks.
Fibre channel-0wid has industry acceptance.
Capacities growing (expect 40gb drives by end of 1998).
MTBF measurement varies: for some it is when 2/3 have failed; for some it is measured by returned failed drives (many of which are just thrown away, so by that measurement they're still ticking). Useful life expectancy is 5 years, but economic life may be less than actual life.
LTO (Linear Tape Open) (www.lto-technology.com) is a new emerging media type/standard... near term expect 100gb capacity, expect 800gb futures.
Super-DLT (100gb/cart) also should be out by 1999, SGI will support when available.
STK Eagle will be released June 1998, SGI will need 4 months for validation testing once released.
11:00 SGI/Cray Research Corporate Vision
Rick Belluzzo, CEO, SGI
Data management; Visualization; Computation.
Want to dominate "Time to insight" (modeling and simulation).
Operational changes to improve efficiency.
Change business model (profitability).
Execution & results: clear responsibilities and accountability.
See convergence of vector and traditional.
6 key industries: Mfg., Gov., Ent./Media, Energy, Science, Com.
11:30 SGI/Cray Research Corporate Operations Report
Beau Vrolyk, SGI
Review of SV1 announcements...
Already sold 20m system, orders for 500 processors already.
Entry price of .5m (cheaper), based on J90 technology.
J90->SV1->SV1e->SV2.
Targeted core markets.
5x performance of J90.
5x price performance improvements over J90.
14:00 OpenMP Programming Model
Ramesh Menon, SGI/Cray menon@sgi.com
See: http://www.sgi.com/Technology/OpenMP
and http://www.openmp.org
Motivation: no portable standard for shared memory parallelism, each vendor had proprietary SMP.
SGI is leading, joined by HP, Intel, Sun, IBM, DEC, etc.
Presently a consortium, incorporating as a non-profit.
Fortran v1.0 spec came out 10/97ish.
C/C++ v1.0 spec due out 8/98ish.
Validation suite for Fortran planned.
Salient features: fine & coarse parallelism; incremental parallelization; provide access to strengths of shared memory (e.g., avoid message passing); exploit cache coherent scalable hardware.
Interoperability: can mix with MPI and PVM.
shmem & pthreads not supported in initial version.
Due to architecture, will NOT be supported ever on T3E.
14:30 OpenMP: a Multitasking and Autotasking Perspective
Neal Gaarder, SGI/Cray
Direction towards OpenMP: standard, preferred alternative.
All IRIX compilers (7.2).
PVP 10.0.0.3 and PE 3.1.
Not for YMP (10.0 required) or T3E.
Conditional compilation:
#ifdef _OPENMP
!$, c$, *$ directives.
Conversion to OpenMP:
gradual (intermixing directives).
Paper has more details.
15:00 Update of System Management Software for Large Origin Systems
Dan Higgins, SGI/Cray
Data center quality and HPC functionality into IRIX.
Share II (fairshare) in IRIX 6.5.
Miser in 6.5 (miser API a future).
Checkpoint/Restart: CPR 1.0 in Irix 6.4.
CPR 1.1 w/6.4 update (pthreads and fixes).
CPR 1.2 w/6.5 (shmem).
Resource limits (udb-like capabilities).
Accounting 1H99 Cray-style project ids & reporting (csa).
Enhancements to Array Services (Irix clusters).
NQE: Daryl Coulthart dbc@cray.com:
"NQE will be stabilized at current release (3.3)"
Actively pursuing a partner (need to define scope).
Cray did NQE because they had to.
"NQE is mature".
Alternatives now: Codine, PVS, LSF, ...
16:00 DCE/DFS
Distributed Computing: data exchange; data consistency; transparent access to other services; single system image.
DCE/DFS: structured name space; central admin; concise auth model; env for appl development; a truly distributed file system.
Core services: threads; RPC; CDS (Cell Directory Services); Security; GDS (Global Directory Services); DTS (Distributed Time Service).
Delegation of security space.
16:30 Performance Tips for GigaRing Disk IO
Kent Koeninger, SGI/Cray
IPN:
Limit daisy chaining to a depth of 2.
Striping and chaining reduces performance.
JBOD gets ~50% of peak (recommend use of RAID).
DA-302 35mb/s sustained.
FCN:
240 mb/s read 160mb/s write, RAID (set of 5).
Sustained 48ish read, 32ish write.
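The FCN figures above are consistent with a simple division: 240 and 160 mb/s spread over a RAID set of 5 give 48 and 32 mb/s. The notes do not say that the "sustained" numbers are per-member rates, so the short sketch below only checks the arithmetic under that assumption; the function name is illustrative.

```python
def per_member_rate(aggregate_mb_s, members):
    """Aggregate RAID-set rate divided evenly over its members.

    Assumption (not stated in the session notes): the sustained figures
    are the aggregate figures divided by the number of drives in the set.
    """
    return aggregate_mb_s / members


read_rate = per_member_rate(240, 5)   # read figure from the notes
write_rate = per_member_rate(160, 5)  # write figure from the notes
print(read_rate, write_rate)
```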
17:00 Cray Product Installation and Configuration
Scott Grabow, SGI/Cray grabowc@cray.com
Binary release for mainline system (complete replacement); upgrade time has gone from 9ish hours to 3ish hours.
Unicos has 3 package types: Executable; Relocatable (.o); Source (source is buildable).
Unicos/mk has 2 types.
Installation is the same (CIT) with Unicos and U/mk.
Recommend use of CIT which comes with CD, not version on SWS.
NOT recommended with CIT: root prefix (rootdev); also, do not modify the VIF (verification information file).
Similarities between U & U/mk:
prep, install, configure;
doc flow & text similar;
similar packaging;
pre-installed;
async product installation is less complex.
Upgrade install process:
- Install: install, verify, unicos.pstload or unicosmk.pstload.
- List revised and updated modules in this revision (optional, source release only).
- [re]-Apply local mods.
- Configuration tasks.
Where to look for problems:
PackageName.log, config.new, cit.log, cit.misc.log.
If there are problems and an SPR is needed, include:
log files from /tmp/cit.*; sysinfo file; initial or upgrade install; going from release X to release Y; was a single source upgrade being performed; was backup used; options to commands given.
Note, I spent half an hour discussing issues with Scott after the session.
Also see the handouts for this session.
18:30 BOF: T3E (Jim Grindle, SGI)
Wednesday
09:00 Scalability and Performance of Distributed I/O on Massively Parallel Processors
Peter W. Haas, RUS haas@rus.uni-stuttgart.de
ucca.rus.uni-stuttgart.de
use guest/paragon see READMECUG98.
09:30 Distributed Supercomputing Services in a Heterogeneous Environment
Peter Morreale, NCAR
Basic principle is network mass storage.
Implementation is ms* commands (msls, msrcp, msXXX).
Added functions like msopen().
DCE based implementation.
10:00 Cray Networking Update
Michael Langer, SGI/Cray mlanger@cray.com
New features and futures:
J90->T3E 50mb/s (HIPPI 30ish mb/s).
Need 2.0.3 (5/98), need 2.0.4 T3E->T3E (11/98), need SWS-ION 3.0 (6/98), need Unicos 10.0.0.2 (5/98).
Improvement over NFS3.
Supported in 2.0.3 and 9.3+.
gb ethernet (in 6.4, soon 6.5); ATM OC-12 (~9/98); HIPPI 6500/ST (~6/99 and ~12/99 for Scheduled Transfer); IPv6 (~6/99); Network Node Manager.
11:00 CUG Elections
11:40 Keynote Address:
25 Years of Computer Aided Engineering at Daimler Benz in Stuttgart, Germany
Michael Heib, Manager HWW
Does IT for others: 50% split of public (University) vs. industry at HWW.
Objectives:
More power at same cost; less operating personnel; smaller infrastructure costs; ability to solve very large problems; bi-directional knowledge transfer between industry and University.
Academic & Industry: 2 very different cultures, took time to get it together.
12:25 CUG Election Results
Other board members:
Treasurer: Barbara Horner-Miller
Americas Director: Barry Sharp (Boeing)
Asia/Pacific Dir: Shigeki Myajli (sp)
Past Pres: Gary Jensen
14:00 SGI/Cray Service Report
Bob Brooks, SGI/Cray
"trying to become customer focused"
"predictable, consistent, responsive, pro-active"
quicker pattern recognition
improved escalation management
merge CRInform and Supportfolio (by this time next year)
14:30 SGI/Cray Q&A Panel
Moderator: Charlie Clark, SGI/Cray
Dave Kiefer - Dir of Engineering (T90)
Sylvia Crain - Mgr Cray Prog Env.
Tom Boyle - Dir Technical Support
Kathy Nottingham - Mkt Mgr for SSO (Strategic Software Org)
Jim Grindle - Mgr Unicos Op Sys
TB:
T90 very bad MTBF (bad batch), not enough inventory... increased 10x.
T3E: Redundant PE's built in (map in & out, requires reboot).
KN:
Panic avoidance... working on it;
U/mk 2.0.2 & 2.0.3 warm boot (PE) continuing effort.
DK:
DOA: 10%, now targeting 5%.
3 Causes: NTF (no trouble found)... bad contact alignment; Chip failure; Loose leaves.
JG:
"we're not happy with it, working on it"
Top priority: Critical... site escalation process (through field service);
Critical, Urgent Major, Minor, Design.
Consistency on SPR comments (some dormant a long time).
Unicos: 250 Major, 150 minor, 75 design, backlog of 50 critical.
U/mk: backlog of 15 critical
TB:
test package for taurus cables, preventative replacements being done.
KN:
Old process was patches tested separately, with patches on patches or patches not tested together.
New process will be integration: quarterly updates in patch sets, introduced with IRIX 6.5.
IRIX 6.5 much more reliable with initial release than previous versions.
Why? More exposure testing done with 500 beta sites over 6 weeks, tripled system types tested on (particularly larger systems)... result is not as many patches and those will be better integrated.
JG:
Yes, it's supposed to be.
Pieces were possibly missed in gigaring environment.
Need to check support plans for SV1.
TB:
Need to do this consistently between IRIX and Unicos.
End of this year consistent field notice.
Don't have answer yet with security notification process.
DK:
Early O200's had numerous problems (missing parts and DOA's); mfg and design changes were made subsequently to address this;
Power supply problems (now being proactively replaced).
TB:
Parallelizing file system checks (looking into).
TB:
Defect tracking system... replaces Cray method, first time on SGI.
16:00 Cellular IRIX: Plans & Status
Gabriel Broner, SGI/Cray broner@cray.com
Running in house now.
Will be the resultant operating system for all (move from dual expertise to less duplication with more applications available).
Support for large systems: 64 to 4000 CPU's: fault tolerance, reliability, high-end features like checkpoints and accounting.
Support for server systems: 4-64 CPUS: general purpose workload; fault containment; different requirements from large.... constant availability.
Architecture: scalability and fault containment. Blah blah blah.
16:30 Cray T90 versus Tera MTA
Jay Boisseau, SDSC
Preliminary results, more at SuperComputing.
See papers.
19:00 CUG Night Out: Mercedes Museum
Interesting...
Thursday
09:00 The Age-Old Question of How to Balance Batch and Interactive
Barry Sharp, BCS barry.sharp@boeing.com
See paper.
Made kernel mods to facilitate.
09:30 MISER: User Level Job Scheduler
Nawaf Bitar, SGI/Cray
IRIX only (6.5 and later), NQE 3.3 integrated with MISER.
Scheduling is by CPU (wall clock estimate) and memory.
Goal: Predictable batch completion times, balance batch & interactive loads.
Deterministic scheduling (avoid static partitions which waste resources).
Flexibility and customer overrides.
Allows quasi-static partitioning (guaranteed completion times).
User must specify CPU and memory requirements; MISER looks for best fit (first fit) for scheduling (the guarantee); a job can finish early if slots are available.
Opportunistic: time not taken against wall clock when running early (pre-emptable).
Futures:
Better exception handling (undersized jobs are rudely terminated, adding signals);
User/Site-provided policies (first fit isn't always best).
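To make the first-fit idea in these notes concrete, here is a minimal sketch of reserving CPU and memory over time slots. It is not MISER's actual algorithm or interface; the Job fields, the slot granularity, and all function names are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int        # CPUs requested by the user
    memory_mb: int   # memory requested by the user
    wall_slots: int  # wall-clock estimate, in whole time slots (assumed unit)


def first_fit_start(job, free_cpus, free_mem):
    """Return the earliest slot where the job fits for its whole estimate,
    or None if it does not fit anywhere in the planning window."""
    horizon = len(free_cpus)
    for start in range(horizon - job.wall_slots + 1):
        window = range(start, start + job.wall_slots)
        if all(free_cpus[t] >= job.cpus and free_mem[t] >= job.memory_mb
               for t in window):
            return start
    return None


def reserve(job, start, free_cpus, free_mem):
    """Subtract the job's request from every slot it occupies."""
    for t in range(start, start + job.wall_slots):
        free_cpus[t] -= job.cpus
        free_mem[t] -= job.memory_mb


def schedule(jobs, total_cpus, total_mem, horizon):
    """Place jobs in submission order; returns {job name: start slot or None}."""
    free_cpus = [total_cpus] * horizon
    free_mem = [total_mem] * horizon
    placement = {}
    for job in jobs:
        start = first_fit_start(job, free_cpus, free_mem)
        if start is not None:
            reserve(job, start, free_cpus, free_mem)
        placement[job.name] = start
    return placement


jobs = [Job("a", 4, 100, 2), Job("b", 4, 100, 1),
        Job("c", 8, 100, 1), Job("d", 2, 900, 1)]
print(schedule(jobs, total_cpus=8, total_mem=1000, horizon=4))
```

Because each job takes the first window that fits, a later small job can land ahead of an earlier large one, which illustrates why the speaker noted that first fit is not always the best policy.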
10:00 Serving a Demanding Client While Short on Resources
Manfred Stolle, ZIB
See: http://www.zib.de/rz/backup-service/CUG/dmscp
2000 lines C code.
See: http://www.zib.de/rz/backup-service/CUG/dmtaperepair
perl script, must be configured at site to work.
11:00 SGI/Cray Service Report
Denice Gibson: Senior VP, Strategic Software Organization
More RAS, Interoperability, Performance
Supercomputing API by 12/98
Source compatibility IRIX and Unicos
Data Center Resource Mgmt Unicos->IRIX
Highly Scalable I/O
"SGI Computing without boundaries"
"big data, big compute, big graphics, high availability"
Mike Booth, VP Engineering
CAT: Cray Aligned Technical
vector technology, big compute, large scalability
10.0 updates every 3-6 months, no need for an 11.0
support for SV1
enhanced resiliency and reliability
believe (direction) different processor needed for very high end (sv2 vs. sn2)
SV2 operating sub-system will be focused on scaling and IO performance hooked to IRIX for broader base of functionality.
Commodity processors (IA64) scaled up.
...
Ken Coleman, SGI/Cray Senior VP Customer & Professional Services
Allegedly surveying closed incidents, reported 25% return rate... ARSC (and others) reported that they'd never been surveyed or even heard of this.
11:30 SGI/Cray Joint Software Report
Mike Booth, SGI, Denice Gibson, SGI/Cray
14:00 Mass Storage at the NCSA: DMF and Convex UniTree
Jeff Iterstrup, NCSA
Purge policy: non-small files (>1-2mb) will last only 6 hours
Reference notes for more information (configurations).
14:30 The State of Security for UNICOS & IRIX
Jay McCauley, SGI/Cray mccauley@cray.com
IRIX 6.5: Auditing, ACL's, least privilege required
ACL's: xfs: 'ls -D' to display ACL
default ACL on directory controls new file creation;
extending NFS3 for IRIX<->IRIX (turned on with mount option).
Least privilege needed: avoid setuid exploitations:
sub-divide root into ~40 distinct capabilities.
Did privilege wrapping in ~70 existing setuid wrappers with this.
Transparent in IRIX 6.5 (Don't have to use it).
Classes of capabilities: discretionary, mandatory, sys admin, net admin.
Capability inheritance.
3 Modes:
Classical: uid 0 all powerful (no longer supported in 6.5);
Augmented: uid 0 still privileged, capabilities honored;
Strict: uid 0 not privileged... trusted IRIX.
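The difference between the three modes can be sketched as a single permission check. This is an illustrative model only, not the IRIX kernel logic or API; the function, the Mode enum, and the capability name are hypothetical, and treating capabilities as ignored in Classical mode is an assumption drawn from "uid 0 all powerful".

```python
from enum import Enum


class Mode(Enum):
    CLASSICAL = "classical"  # uid 0 all powerful
    AUGMENTED = "augmented"  # uid 0 still privileged, capabilities honored
    STRICT = "strict"        # uid 0 not privileged


def is_permitted(mode, uid, held_capabilities, needed_capability):
    """Decide whether a privileged operation is allowed under each mode."""
    if mode is Mode.CLASSICAL:
        # Assumption: only the superuser check applies.
        return uid == 0
    if mode is Mode.AUGMENTED:
        # Either being root or holding the specific capability is enough.
        return uid == 0 or needed_capability in held_capabilities
    # STRICT: root gets nothing for free; the capability must be held.
    return needed_capability in held_capabilities


# Hypothetical capability name, for illustration only.
CAP = "EXAMPLE_CAP_MOUNT"
print(is_permitted(Mode.STRICT, 0, set(), CAP))
print(is_permitted(Mode.AUGMENTED, 1000, {CAP}, CAP))
```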
SGI did formal review of all privileged programs.
Futures (guessing 2/99ish availability):
pluggable auth modules (PAM), also gives UDB functionality;
base technology for single signon;
API same as Linux.
15:00 IRIX Accounting Limits and UDB Functionality
Jay McCauley, SGI/Cray mccauley@cray.com
Cray style accounting (CSA) & udb limits in early 1999.
Requirements:
tools to manage large configurations; richer accounting facilities, udb for limits mechanism.
Architecture:
New database subset of udb; Initialization via PAM module; kernel enforcement.
Features:
Based on "job" container vs. individual process.
Partial list: cpu time, memory, vm, file size, open files, #threads, core file size, ...
Provide data capture with basic reduction and reporting.
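The practical difference between a limit on a "job" container and a limit on each individual process can be shown with a small sketch. The data layout, function names, and numbers below are hypothetical and are not taken from the IRIX design; they only illustrate why a job-level limit catches usage that per-process limits miss.

```python
def processes_over_limit(cpu_seconds_by_pid, limit):
    """Per-process view: list the pids whose own usage exceeds the limit."""
    return [pid for pid, used in cpu_seconds_by_pid.items() if used > limit]


def job_over_limit(cpu_seconds_by_pid, limit):
    """Job-container view: the limit applies to the sum over all processes."""
    return sum(cpu_seconds_by_pid.values()) > limit


# Hypothetical job made of three processes, each under a 100-second limit.
job = {101: 40, 102: 35, 103: 30}
print(processes_over_limit(job, 100))  # no single process is over
print(job_over_limit(job, 100))        # but the job as a whole is
```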
16:00 Program Steering Committee Structure
12 SIGs to 5 super-SIGs with Group Chairs and Focus Chairs
Operations: Dan Drobnis
User Services: Leslie Southern
Mass Storage: Laney Kalsrud
Networking: Hans Mandt
Unicos: Ingeborg Weidl
Irix: Open
Security: Open
Software Tools:
Compilers: Open
Applications: Larry Eversole
Visualization: Open
Performance: Open
Notes:
Each group will have one or more official SGI liaison.
Guidelines for long term to be drafted by VP with 5 GC's.
Chairs not performing will be replaced (HOW?).
16:30 SIG: High Performance Computing
8 participants, most with performance interest. Many viewed performance here as application or algorithms oriented vs. capacity planning and data center management which may be covered by Group 1 (not clear). Lots of TBD's.
18:00 Program Committee Review
No session evaluation forms. No conference evaluation forms... needs to be remedied.
Friday
09:00 IRIX: Getting It All Together
Daryl Grunau, LANL
See notes, reviewed several problems and site experiences.
09:30 Integrating an Origin2000 into a Cray Data Center
Chuck Keagle, BCS chuck.keagle@boeing.com
See notes and paper.
Flexlm keeps open socket, can cause checkpoint to fail.
10:00 Origin Craylink Partitioning
Steve Whitney, SGI/Cray
See notes.
11:10 Parallel and Distributed Development and Simulation of Atmospheric Models
V. Mastrangelo and I. Mehilli, CNAM-University; et al.
11:40 SGI/Cray Hardware Report and Hardware Futures
Steve Oberlin, SGI/Cray
SN2 scalar (commodity)
SV2 vector: new architecture, not binary compatible, 6xT90 speed, 20x price/perf
12:40 CUG Next Steps: CUG 99 in Minneapolis, MN
John Sell, MSC
12:50 Closing Remarks
CUG President: Sandy Haerer haerer@ucar.edu
Unrelated Tidbits (before and after CUG):
In London: Les Miserables, Miss Saigon, Chicago, and Beauty and the Beast.
Also visited the British Museum and National Gallery and numerous bookstores, and visited a friend in Skelmersdale (northeast of Liverpool). Noted that Europeans have a fascination for World Cup 98 (soccer), which started in France as I arrived.
Afterwards: Leonberg, Ludwigsburg, and Esslingen visited (outside of Stuttgart)... strongest recommendation is for Esslingen.