[262] in linux-net channel archive

Re: lcp-echo-failure doesn't work?

daemon@ATHENA.MIT.EDU (Michael H. Warfield)
Tue May 2 20:07:38 1995

To: linux-net@vger.rutgers.edu, linux-serial@vger.rutgers.edu,
        linux-ppp@vger.rutgers.edu, ale@cc.gatech.edu
Date: Tue, 2 May 1995 17:20:31 -0400 (EDT)
From: "Michael H. Warfield" <mhw@wittsend.com>
Cc: mhw@wittsend.com (Michael H. Warfield)
In-Reply-To: <m0s6NMn-000Q2yC@bill.pharmcomp.com> from "Dan Hollis" at May 2, 95 12:12:00 pm

Hello all!

	Because I have seen complaints like this on the net, ppp, and serial
lists as well as both the niksula and vger listservers, I'm going to post
this to all of the lists where I have seen these complaints.  The complaints
on the individual lists may have seemed few, but in combination indicate
a problem that many of us have run into.  I have also seen some complaints
similar to this in some of the newsgroups, but I didn't save them (came
in before I started experiencing this problem) and I haven't stuck my nose
in the groups for a while.

> Howdy. We're using ppp 2.1.2c to bridge our internal network to the internet.
> We're using Linux 1.2.4 with a Boca BB-1008 serial board.

> The BB-1008 unfortunately does not have DTR or CD control. I have tried to
					----- Extreme Bummer -----
> use the lcp-echo-interval and lcp-echo-failure options to detect if the line
> is dropped, but it doesn't seem to be working.

	This little goodie has been complained about in the past.  I just
recently discovered why it's broken on Linux and fixed it for myself.
I've now had it in operation long enough to verify that, yes this is the
cause of the failure, and yes the fix corrects the problem.  It doesn't
fix the problem with losing the PPP link, but at least I can now detect
the fact that the link is dead.

	My setup is using an internal modem which does support nice things
like DTR, DSR, and CD.  However, I kept running into a problem where my
PPP link would die.  I tried adding the lcp-echo-interval and lcp-echo-failure
and they had no effect at all.  This turned out to be due to a design flaw
when pppd is compiled for Linux.

> Here's how our script is calling pppd:

> /usr/lib/ppp/pppd -detach /dev/$DEVICE 38400 crtscts defaultroute
> lcp-echo-interval 5 lcp-echo-failure 12 mru 296

> This should have pppd drop the connection after 1 minute of lcp failures,
> right? It doesn't work. pppd just sits there forever and never detects the
> line being down. Am I doing something wrong?

	Well yes and no...  On Linux it will drop the line after 1 minute
of NO RECEIVED TRAFFIC.  If it hasn't seen any traffic in the
lcp-echo-interval, it will send out an lcp-echo packet to cause some received
traffic.  Now here is where the flaw comes in.  If your failure causes you
to be unable to SEND but still able to RECEIVE, the ppp daemon will never
send an lcp-echo packet and will never detect the link is down if you have a
reasonably active link!

	The code for the pppd has an "#ifdef LINUX" around some code which
only sends the lcp-echo if the last packet (any packet at all) was received
more than lcp-echo-interval seconds ago.  This seems logical.  ( :-( )
After all, why waste time sending an echo packet when you can see network
traffic?  WRONG!  This opens up an entire failure mode which the pppd can
no longer detect.  He no longer can detect a failure in the transmit side
of the link if there is sufficient traffic coming in on the receive side.
There are plenty of UDP packets and ICMP packets to meet this requirement.
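
	To make the failure mode concrete, here is a minimal sketch in C of
the two echo policies.  It is a toy model and NOT the pppd source: the
struct, the function name, and the one-second simulation step are all
assumptions made up for illustration.  It assumes the transmit side dies
at t = 0 while unrelated inbound packets keep arriving.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not pppd source: every name here is made up. */
struct echo_cfg {
    int  interval;    /* lcp-echo-interval, in seconds */
    int  failure;     /* lcp-echo-failure, unanswered echoes tolerated */
    bool skip_if_rx;  /* true = behaviour described for the Linux build */
};

/*
 * Simulate a link whose transmit side is dead from t = 0 while unrelated
 * inbound traffic arrives every rx_period seconds (0 = no inbound traffic).
 * Returns the second at which the link is declared dead, or -1 if that
 * never happens within horizon seconds.
 */
int detect_dead_tx(struct echo_cfg cfg, int rx_period, int horizon)
{
    assert(cfg.interval > 0 && cfg.failure > 0);

    int pending = 0;  /* echo requests sent and never answered */
    int last_rx = 0;  /* time of the most recent inbound packet */

    for (int t = 0; t <= horizon; t++) {
        if (rx_period > 0 && t % rx_period == 0)
            last_rx = t;              /* some packet, not an echo reply */

        if (t % cfg.interval != 0)
            continue;                 /* echo timer has not fired yet */

        if (cfg.skip_if_rx && t - last_rx < cfg.interval)
            continue;                 /* recent traffic: no echo is sent */

        if (pending >= cfg.failure)
            return t;                 /* too many unanswered echoes */

        pending++;                    /* send an echo; no reply will come */
    }
    return -1;
}
```

	In this model, with the settings from the quoted script (interval 5,
failure 12) and one inbound packet per second, the unconditional policy
gives up at 60 seconds, while the skip-if-traffic policy returns -1 no
matter how long the horizon is.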

	In my case I had the thing configured to send out lcp-echo packets
every two seconds and fail after missing 4 of them.  There was still
enough nominal traffic for my named daemon, routed, and other sundry
denizens that the pppd could never detect a link down due to a transmit
failure!  This drove me NUTS (yes, yes, all my friends will tell you that
it's not a drive, just a short putt) trying to figure out why I could not
send a packet, but the pppd sat there thinking everything was just fine.

	By changing that "#ifdef LINUX" to "#ifdef HELLFREEZES" (just
for testing, without deleting something I might not want to delete) and
recompiling, the pppd now can detect failed links!  8 seconds off the air
and he exits, permitting me to automagically restart the link.  YEAH!!!
:-) :-)
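
	For anyone unfamiliar with the trick: guarding a block with a macro
that is never defined compiles it out without deleting it.  The sketch
below only illustrates that idea.  The function name and its arguments
are hypothetical stand-ins, not what the check in lcp.c actually looks
like.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the check described above; not pppd code. */
bool should_send_echo(int secs_since_last_rx, int echo_interval)
{
    assert(echo_interval > 0);
#ifdef HELLFREEZES  /* never defined, so this block is compiled out */
    if (secs_since_last_rx < echo_interval)
        return false;  /* old behaviour: recent traffic suppresses the echo */
#else
    (void)secs_since_last_rx;  /* unused once the check is disabled */
#endif
    return true;  /* always probe the link when the echo timer fires */
}
```

	Building the same file with -DHELLFREEZES would bring the old
behaviour back, which is why the original lines are left in place.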

	At least that now keeps my server from sitting there dead until I
manually kill the pppd to restart the link.  I went from having to do
that once or twice a day to never.  Double YEAH!!!  :-) :-) !

	Sorry I don't have exact diffs, but the change is to the "#ifdef"
in the "LcpEchoCheck" routine in lcp.c for the pppd sources.  It's around
line 1500 or thereabouts.

	This does NOT fix the problem with the PPP link dropping dead
periodically!  I have seen this complained about by several individuals
over the last several months.  The usual remark they get is "oh you must
be having a modem problem".  Wrong again.  At this point, I have been able
to confirm that the modem is in an operational state.  Flow control is
NOT in a blocking state, the modem continues to receive data fine (implying
that the modem has not gone brain-dead and the error correcting link is
still fully functional) and that interrupts are proceeding properly.
The higher level code seems to have lost track of what it is doing and is
no longer able to send data!

	I've mentioned this to a couple of others, including Alan Cox
when I was looking at some things in the latest network snapshots.  So
far, the best I can figure out is that it is some sort of timing problem.
It seems to primarily occur, on my system, when there are a high number of
connections being established and broken (you guessed it - WWW and Netscape).

	As has been reported by others, this problem has seemed to come
and go over a large number of patch levels.  Some have indicated that
it got much worse around 1.1.6[345] or thereabouts.  I can confirm that.
It also seemed to get somewhat better with the 1.2.4 version but I can't
find anything in those patches that would account for it.  It seems much
worse on a 28.8 link than a 14.4 link (I have one of each), but that may
be a bogus observation since I don't have the same level of Web traffic
over the 14.4 link; just telnet, ftp, and smtp and at a lower frequency.

	So many things seem to affect the frequency of the problem and
the problem is so infrequent now that it is difficult to say what
makes it worse and what makes it better.  At one point, I told Alan
that I thought a change he made had "fixed" the problem.  That baffled
him and I soon noticed that the problem returned on its own given
sufficient time.  I've acquired quite a taste for crow when I jump the
gun like that and I appreciated Alan putting up with some of my ramblings.

	Right now, it's just a minor annoyance, since pppd can now detect
the problem and recover from it.  It would be nice to figure out where
the problem is and fix it though.  I've been pawing through serial code,
ppp code, and network code but have come up empty-handed.  Anyone with some
more suggestions as to where to look, let me know and I'll check 'em out.

> -Dan
> .----------------------------------------------.
> |Dan Hollis -- Pharmacy Computer Services, Inc.|
> |dhollis@pharmcomp.com - (503)476-3139 ext. 215|
> `----------------------------------------------'
> 

	Regards,
	Mike
-- 
 Michael H. Warfield	| (404) 925-8248	| mhw@WittsEnd.com
  (The Mad Wizard)	| NIC whois: MHW9	| mathcs.emory.edu!wittsend!mhw
An optimist believes we live in the best of all possible worlds.
A pessimist is sure of it!                      | http://www.wittsend.com/mhw/