[16510] in Athena Bugs


Ultra10 hard hangs / broken deadman / 105181

daemon@ATHENA.MIT.EDU (John Hawkinson)
Sat Nov 28 10:20:55 1998

Date: Sat, 28 Nov 1998 10:20:49 -0500
To: bugs@MIT.EDU
Cc: tlyu@MIT.EDU
From: John Hawkinson <jhawk@MIT.EDU>

[[ cc to release-team thought about and omitted ]]

Tom initiated a discussion on -c sipb on Tuesday wherein he found a
reliable way to hang his Ultra10 (something to do with the Flash3
player). I recommended he boot a deadman kernel ("set snooping=1" in
/etc/system) to see if he could instrument anything. This caused the
machine to panic at boot time (not very helpful).
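
For reference, a minimal sketch of the /etc/system change described
above. Only the "set snooping=1" line comes from this report; the
comment lines are illustrative, and the backup name /etc/system.orig
mentioned below is a hypothetical choice, not something from Tom's
machine.

```
* /etc/system excerpt; lines beginning with '*' are comments.
* Enable the deadman timer so a hard hang ends in a panic
* instead of a silent freeze.
set snooping=1
```

If an unmodified copy was saved first (say, as /etc/system.orig), an
interactive "boot -a" from the PROM prompts for the name of the system
file, and the saved copy can be named there to come up without the
change.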

Investigating in the cluster, I do indeed see the same problem. (HINT:
when mucking with /etc/system, always make a copy first so you can
recover with boot -a.) According to Sun's bug database, it looks to be:

| Bug Id: 4119338
|  Category: kernel
|  Subcategory: other
|  State: closed
|  Synopsis: deadman kernel doesn't work on Sol2.6/sunfire
|  Description:
| Customer tried deadman kernel(/etc/system:snooping=1) on sol2.6/sunfire, but
| system was paniced in the middle of the boot after 'snooping=1' set-up.
| 
| panic[cpu0]/thread=0x3002be80: Kernel panic at trap level 2, trap reason 0x2
| TL=0x1 TT=0x4e TICK=0x80000008cf4228d9
| 	  TPC=0x1002cc24 TnPC=0x1002cc28 TSTATE=0x4400001e04
| TL=0x2 TT=0x68 TICK=0x80000008cf4228b7
| 	  TPC=0x10008c8c TnPC=0x10008c90 TSTATE=0x4400001504
| kadb: Called from within the PROM, exiting...
| Type  'go' to resume
| {0} ok 
| 
| I tried reproduce this trouble but not reproduced.
| 
|  Work around:
| 
| 	  Integrated in releases: 
|  Duplicate of: 4080160
|  Patch id: 105181-06
|  See also: 
|  Summary:

105181 is, of course, jhawk's favorite Sun patch ;-) It's the kernel
jumbo fix-everything-and-still-have-time-for-tea patch, and it rates
high with me since I just got a bugfix of my own into -10.

Changes since -05, which is what Athena has installed:

| Problem Description:			
| 					  
| 4170500 solaris ntp_adjtime broken, useless for PPS sync of the system clock
| 4151480 under Solaris 2.6, adb reports wrong information for o registers for v9
| arch					
| 4147079 stubs mechanism for modules is faulty 
| 4139770 fcntl() returns EINVAL error in BCP mode when NFS file is read
| 4131439 deadlock_panic from pi_willto	
| 4118425 sfmmu_tsb_miss() may get a recursive mutex panic
| 4117624 if lockd is restarted, clients receiving signals have problems with
| locks					
| 4108806 rename of automounted directory results in panic
| 					  
| 4169916 Excessive ECC errors		
| 4174959 System hard hangs w/ oracle causing sigbcmd or hostint not to break to
| 'ok'					
| 					  
| (from 105181-09)			
| 					  
| 4162055 invalid socket return error code: ECONNRESET should be ECONNREFUSED
| 4151212 system crashes in page table steal
| 4148073 successful fork() sometimes does not return zero in mt/multi lwp child
| proc					
| 4141788 system hangs due to pagefault loop in shared memory
| 4122617 device driver providing devmap not unloadable because of leaked hold
| count					
| 4122292 multithreaded httpd process deadlock during cfork()
| 4119745 realitexpire() algorithm is too slow when system time is changed
| 4107724 implement workarounds for spitfire errata 32 and 54
| 4065248 UFS caching can adversely affect application performance
| 					  
| (from 105181-08)			
| 					  
| 4144929 kernel patch 105181-05 causes dbx on setuid programs to get EBUSY
| 4127499 SunFire should not be as verbose in printing CE ECC messages
| 4098732 recursive mutex enter in kstrgetmsg()
| 4043763 in MT applications, close() blocks if fd is in use by another thread
| 					  
| (from 105181-07)			
| 					  
| 4132927 open system call does not audit if creat bit set.
| 4125580 system panics in cstat64 with type prvnodeops vnode
| 4122408 Backup performance with Netbackup 3.0 is far below expectations.
| 4119498 HSI/P - Performance problem upto 200 m/s interframe delay.
| 4115951 Diskless Ultra-1s unable to perform system crash dump across network
| 					  
| (from 105181-06)			
| 					  
| 4080160 tickint_clnt_add miscalculates interval between handler calls
| 4089777 processes can hang or crash while forking with ISM on sun4u
| 4098645 setcontext() uses >25% of the stack & segkp_fault: accessing
| redzone panic.				
| 4102334 sunfire PDB node panics with xc_one() timeout,  no core
| 4119193 ASSERT() panic due to race condition in /proc-supported watchpoints
| 					  
| 4134357 availrmem not being reduced during Starfire memory detach
| 					  
| 4136544 getting "flusher thread" hang during dr drain
| 4137584 CE reporting incorrect P numbers


Anyhow, perhaps this is obviated by the discussion in release-77[1525] et
seq:

| [1525]  daemon@ATHENA.MIT.EDU (Jonathon Weiss) Release_7.7_team 10/23/98 00:55 (18 lines)
| Subject: patches patch release?
| From: Jonathon Weiss <jweiss@MIT.EDU>

But perhaps it's not. 

Could a January patch-release please consider taking 105181-10?

As a practical matter, I think that 105181 is one of the few patches
for which it's important to stay current with new revs (in fact,
looking at the patch report, I can't see any others whose numbers I
can remember nearly so well).

Thanks.

Tom, I'm busy enough lately that I'm not likely to be able to install
105181-10 on an Ultra10 and try to debug your initial problem. There's
a copy of the patch in

/afs/sipb.mit.edu/service/patches/SunOS/5.6/sandbox/105181-10.tar.Z

--jhawk
