[1118] in linux-net channel archive

Oops: SCSI tape, Networking, 1.3.28, 1.2.13

daemon@ATHENA.MIT.EDU (Grant R. Guenther)
Tue Sep 19 16:52:00 1995

From: "Grant R. Guenther" <guenther@empress.com>
To: Linus.Torvalds@helsinki.fi, iialan@iifeak.swan.ac.uk,
        linux-kernel@vger.rutgers.edu, linux-net@vger.rutgers.edu,
        linux-scsi@vger.rutgers.edu
Date: Tue, 19 Sep 1995 09:12:49 -0400 (EDT)

First the hardware configuration:

	i486/33, 16M
	AHA 1542
	Archive Python DAT tape drive
	WD8013 ethercard
	Panasonic CDrom on SMS (Lasermate clone) card
	IDE disks: Conner CFS1275A, Seagate 94354-230

Software:

	Slackware 2.0? and 2.3
	Kernels 1.2.13 and 1.3.28

Operational context:

	This machine runs a network backup script every night.  The
	script uses rsh to run cpio on about 30 machines, capturing and
        storing output onto the DAT (it's a bit more complex than
        that - but the rest is just local processing through a pipeline
        of filters.)

	The machine is a workstation during the day, but it is rebooted
        and X is not active when the backup runs.

Ancient problem:

	This system worked about 80% of the time; rather too often
        the ST tape driver would lock up, leaving processes in D wait.
        
New problem:

	Starting early last week, the system began to panic at the same
        time every night.  This was running on an old Slackware and
        kernel 1.2.13.  After the oops, the networking would be dead.

	(There's no obvious correlation between the panic and the
        state of the backup process.  The amount of data varies
        radically on each machine every day, but the trouble seems to 
        happen at about the same time.)

        Once before - in another life - I encountered something like this
        that turned out to be corrupt binaries, so I decided to eliminate
        that possibility by completely reinstalling Linux with Slackware 2.3
        and the latest kernel (since I noticed that some fixes have gone
        into the ST driver recently).  Result: NO CHANGE!  At 2am this
	morning I got the following oops:

 Unable to handle kernel paging request at virtual address ce7659b8
 current->tss.cr3 = 00101000, [r3 = 00101000
 *pde = 00000000
 Oops: 0000
 EIP:    0010:0013e5c7
 EFLAGS: 00010206
 eax: 8310176d   ebx: 00308b4c   ecx: 83100088   edx: 001c720c
 esi: 00308b0c   edi: 00e31214   ebp: 001c720c   esp: 001bb8d8
 ds: 0018   es: 0018   fs: 002b   gs: 0018   ss: 0018
 Process swapper (pid: 0, process nr: 0, stackpage=001b9aa0)
 Stack: 00308b20 003ee858 00308b4c 00e31214 00e31214 00141664 001416a5 00e31214 
        001c720c 00308b4c 00000001 1bdb2bc0 00e31214 003ee844 00e312a0 00000000 
        00308b4c 001c720c 00146828 bc24d16f 9d1132d3 00e31214 003ee844 1bdb2bc0 
 Call Trace: 00141664 001416a5 00146828 001a3f2e 001a0214 0013e32b 00138806 
        00117276 0010a78d 001b0018 001b0018 00109934 0010a809 00109443 
 Code: 28 88 4b 42 66 8b 46 02 86 c4 66 39 45 3e 73 29 6a 00 55 53 
 Aiee, killing interrupt handler
 Unable to handle kernel paging request at virtual address c0001004
 current->tss.cr3 = 00101000, [r3 = 00101000
 *pde = 00102067
 *pte = 00000000
 Oops: 0002
 EIP:    0010:0011b492
 EFLAGS: 00010046
 eax: 00000000   ebx: 00000000   ecx: 001da118   edx: 001e8000
 esi: fffff000   edi: 00001000   ebp: 00000000   esp: 001bb7c4
 ds: 0018   es: 0018   fs: 002b   gs: 0018   ss: 0018
 Process swapper (pid: 0, process nr: 0, stackpage=001b9aa0)
 Stack: 00001000 00102004 00001000 00400000 001da100 001e6002 00000202 00118cd0 
        00001000 00000000 001bbaa0 00000000 001bc110 001bb89c 00400000 00101000 
        00000000 40000000 00101000 0011d273 001bc110 00000000 40000000 001bc110 
 Call Trace: 00118cd0 0011d273 0011573d 00115957 0010af1e 0010ac65 02000000 
        01800000 00110153 0010fefa 0010ff08 0010a9cb 00140018 0013e5c7 00141664 
        001416a5 00146828 001a3f2e 001a0214 0013e32b 00138806 00117276 0010a78d 
        001b0018 001b0018 00109934 0010a809 00109443 
 Code: 89 4f 04 8b 90 18 a1 1d 00 89 17 89 7a 04 89 b8 18 a1 1d 00 
 kfree of non-kmalloced memory: 001bbae0, next= 00000000, order=0
 kfree of non-kmalloced memory: 001bbad0, next= 00000000, order=0
 kfree of non-kmalloced memory: 001bbf04, next= 00000000, order=0
 idle task may not sleep
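
For anyone reading the two "Oops:" values in the dump: on i386 that number is the page-fault error code, whose low three bits say whether the page was present, whether the access was a write, and whether the CPU was in user mode. The sketch below decodes just those bits; the function name is made up for illustration and nothing here comes from the kernel source.

```python
def decode_oops_code(code_hex: str) -> str:
    """Describe the low three bits of an i386 page-fault error code."""
    code = int(code_hex, 16)
    # bit 0: 0 = page not present, 1 = protection violation
    cause = "protection violation" if code & 0x1 else "page not present"
    # bit 1: 0 = read access, 1 = write access
    access = "write" if code & 0x2 else "read"
    # bit 2: 0 = kernel (supervisor) mode, 1 = user mode
    mode = "user" if code & 0x4 else "kernel"
    return f"{mode}-mode {access}, {cause}"


# The two codes that appear in the dump above.
for code in ("0000", "0002"):
    print(code, "->", decode_oops_code(code))
```

Read this way, the first fault is a kernel-mode read of an unmapped address and the second is a kernel-mode write to one, which matches the zero *pde and *pte values printed with each.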

Here's what ksymoops has to say:

EIP: 13e5c7 T _ip_queue_xmit+87/220
Trace: 141664 t _tcp_send_ack+264/2d0
Trace: 1416a5 t _tcp_send_ack+2a5/2d0
Trace: 146828 T _tcp_rcv+2388/23d0
Trace: 1a3f2e t _wd_block_input+be/100
Trace: 1a0214 t _pty_write+60/180
Trace: 13e32b T _ip_rcv+43b/4f0
Trace: 138806 T _net_bh+116/160
Trace: 117276 T _do_bottom_half+3e/a8
Trace: 10a78d t handle_bottom_half+d/20
Trace: 1b0018 t _aha1542_intr_handle+380/4a0
Trace: 1b0018 t _aha1542_intr_handle+380/4a0
Trace: 109934 T _sys_idle+44/50
Trace: 10a809 T _system_call+59/a0
Trace: 109443 T _start_kernel+1b3/1c0

EIP: 11b492 T _free_pages+ca/1c0
Trace: 118cd0 T _zap_page_range+120/1c0
Trace: 11d273 T _exit_mmap+63/b0
Trace: 11573d t _exit_mm+21/50
Trace: 115957 T _do_exit+4b/c0
Trace: 10af1e T _die_if_kernel+2b2/2e0
Trace: 10ac65 T _page_fault+155/15c
Trace: 2000000 
Trace: 1800000 
Trace: 110153 T _do_page_fault+24b/2d0
Trace: 10fefa T _si_meminfo+1aa/1b8
Trace: 110153 T _do_page_fault+24b/2d0
Trace: 10a9cb t error_code+4b/60
Trace: 140018 T _icmp_rcv+18/160
Trace: 13e5c7 T _ip_queue_xmit+87/220
Trace: 141664 t _tcp_send_ack+264/2d0
Trace: 1416a5 t _tcp_send_ack+2a5/2d0
Trace: 146828 T _tcp_rcv+2388/23d0
Trace: 1a3f2e t _wd_block_input+be/100
Trace: 1a0214 t _pty_write+60/180
Trace: 13e32b T _ip_rcv+43b/4f0
Trace: 138806 T _net_bh+116/160
Trace: 117276 T _do_bottom_half+3e/a8
Trace: 10a78d t handle_bottom_half+d/20
Trace: 1b0018 t _aha1542_intr_handle+380/4a0
Trace: 1b0018 t _aha1542_intr_handle+380/4a0
Trace: 109934 T _sys_idle+44/50
Trace: 10a809 T _system_call+59/a0
Trace: 109443 T _start_kernel+1b3/1c0
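
As a rough illustration of what the symbol+offset columns mean, here is a minimal sketch of the nearest-lower-symbol lookup that this kind of decoding relies on. It is not ksymoops itself. The three start addresses are back-computed from the lines above (address minus offset), not read from a real System.map, and the names SYMBOLS and resolve are made up for the example.

```python
import bisect

# Start addresses derived from the output above, e.g. 13e5c7 - 87 = 13e540.
# A real lookup would load every symbol from the kernel's symbol map.
SYMBOLS = sorted([
    (0x11B3C8, "_free_pages"),
    (0x13E540, "_ip_queue_xmit"),
    (0x141400, "_tcp_send_ack"),
])
STARTS = [addr for addr, _ in SYMBOLS]


def resolve(addr: int) -> str:
    """Return 'symbol+offset' for the nearest symbol at or below addr."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i < 0:
        return f"{addr:x} (below first known symbol)"
    start, name = SYMBOLS[i]
    return f"{name}+{addr - start:x}"


for a in (0x13E5C7, 0x141664, 0x1416A5, 0x11B492):
    print(f"{a:x} {resolve(a)}")
```

With only three symbols in the table, any address above a symbol's start resolves to that symbol, so this is only meaningful for the addresses listed. It also shows why an arbitrary stack word that happens to look like a kernel address still gets a symbol name attached.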

Does anybody care to suggest what is going on?

I can't say for sure, but this *may* coincide with the installation
of a FreeBSD machine on the same network.  (The network is already very
heterogeneous: Linux, HP-UX, AIX, Solaris, SunOS, WfWg, NT, OSF-1, AUX, 
you get the idea ...)  We started installing our 1st FreeBSD machine at 
about the same time. It's 2.1.0-950726-SNAP.  I'm going to see what 
happens if I take it off-line tonight.

All suggestions are welcome; thanks in advance!

----------------------------------------------------------------------------
Grant R. Guenther, System Administrator                 guenther@empress.com
Empress Software, 3100 Steeles Ave. E., Markham, Ont.           905-513-8888
----------------------------------------------------------------------------
