[272] in Zephyr_Bugs
Re: zephyr is hosed...
marc@ATHENA.MIT.EDU (marc@ATHENA.MIT.EDU)
Tue Apr 23 19:41:29 1991
Well, I might as well tell the whole story.
At about 1:00-1:10 I realized zephyr was extremely unhappy. I started
looking into it. b11zephyr was braindumping to arilinn, so they both
showed up as SERV_TARDY. neskaya was handling all the traffic. That
is, until neskaya grew so huge that it hit the 16meg hard limit,
malloc lost, and neskaya logged a dying gasp before depositing a 17meg
core dump in /usr/tmp. As soon as I realized this, I tried to find
someone to deal with this. Anne Salemme and Mark Rosenstein weren't
around, and the ops pager was busy. Jerry Larivee called Aida in E40
and told her to find someone and to call me. I talked to Young Sul,
and advised him to kill of the zephyrd on arilinn. This freed up the
daemon on b11zephyr, and it began to deal with the messages it had
queued up. Fortunately, no state seemed to be lost. Lucien called me
soon after, and I advised giving b11zephyr some time to settle down,
and then to start up neskaya again. This was done, b11zephyr bdumped
to neskaya, and since then things have been fine.
More details:
1) arilinn was running with debug on. This means it was generating on
the order of two 500K compressed log files an hour. In fact, it
logged several lines for each piece of the brain dump. So, the dump
was taking a long time. It had gone for over an hour when the arilinn
zephyrd was killed. In other words, b11zephyr and arilinn were
basically out of commission while the bdump was occurring.
2) zephyr server dynamics make 6.003 look like Saturday in the Park on
the Fourth of July. It is wise to look at *everything* before doing
anything. Lucien, I don't want to be condescending, but I want to
make sure you understand what you were about to do. Starting a server
on neskaya w/o touching arilinn or b11zephyr would have been bad.
There would have been no state when neskaya started up, because it
would have no way to get a brain dump from anywhere. Then when
(finally) the bdump ended, the data would have been completely
different, and most likely, all state from both servers would have
vanished.
3) Useful stull from log on arilinn:
Apr 23 12:58:05 arilinn zephyrd[5696]: clt_ack type sent for 1808 to 18.72.0.248/2103
Apr 23 12:58:05 arilinn zephyrd[5696]: packet sent
Apr 23 12:58:05 arilinn zephyrd[5696]: disp:ACKED 'zephyr_admin' '1' 'DUMP_AVAIL' notice to '' from 'b11zephyr.MIT.EDU' 18.145.0.7/2103/2103
Apr 23 12:58:05 arilinn zephyrd[5696]: server_dispatch
Apr 23 12:58:05 arilinn zephyrd[5696]: bdump_get: bdump v1 avail 18.145.0.7
Apr 23 12:58:05 arilinn zephyrd[5696]: bdump_get: connected
Apr 23 12:58:05 arilinn zephyrd[5696]: get new tickets: 672068246 672425885 672397100
Apr 23 12:58:06 arilinn zephyrd[5696]: bdump_get: SendKerberosData ok
Apr 23 12:58:07 arilinn zephyrd[5696]: bdump recv loop
Apr 23 12:58:07 arilinn zephyrd[5696]: brl fdready
Apr 23 12:58:07 arilinn zephyrd[5696]: disp:ACKED 'zephyr_admin' '' 'IHEARDYOU' notice to '' from 'neskaya.MIT.EDU' 18.72.0.248/2103/2103
Apr 23 12:58:07 arilinn zephyrd[5696]: server_dispatch
Apr 23 12:58:07 arilinn zephyrd[5696]: bdump_offer
Apr 23 12:58:07 arilinn zephyrd[5696]: bdump_offer: address is 18.70.0.161/1748Apr 23 12:58:07 arilinn zephyrd[5696]: bdump:HMCTL 'ZEPHYR_CTL' 'HM' 'BOOT' 'b11zephyr.MIT.EDU' ''
Apr 23 12:58:07 arilinn zephyrd[5696]: boot 18.70.0.82 (server 18.145.0.7)
Apr 23 12:58:07 arilinn zephyrd[5696]: hostm_flush 18.70.0.82
Apr 23 12:58:07 arilinn zephyrd[5696]: hostm_flush clt_dereg 18.70.0.82/1478
Apr 23 12:58:08 arilinn zephyrd[5696]: sabet@ATHENA.MIT.EDU/3
Apr 23 12:58:08 arilinn zephyrd[5696]: taken@ATHENA.MIT.EDU/2
...
So, there was a brain dump occurring. From the log on b11zephyr:
Apr 23 12:57:48 b11zephyr zephyrd[197]: subscr unauth celine@ATHENA.MIT.EDU ops
Apr 23 13:42:14 b11zephyr zephyrd[197]: subscr. register hqbv@ATHENA.MIT.EDU/18.71.0.20/0 failed: Illegal port specified in subscription
There's 45 minutes of no activity. That's between when the dump
started, and when arilinn was killed.
The core file: -rw-r--r-- 1 root 17830912 Apr 23 13:33 core
So there was 10 minutes with zero service, and about 5-10 more when
service was sporadic as b11zephyr recovered. By 3pm, stuff was
mostly normal.