[267] in linux-net channel archive
Re: lcp-echo-failure doesn't work?
daemon@ATHENA.MIT.EDU (Al Longyear)
Wed May 3 15:43:54 1995
Date: Wed, 3 May 1995 11:18:47 -0700
To: "Michael H. Warfield" <mhw@wittsend.com>, linux-net@vger.rutgers.edu,
ale@cc.gatech.edu, linux-serial@vger.rutgers.edu
From: longyear@netcom.com (Al Longyear)
At 05:20 PM 5/2/95 -0400, Michael H. Warfield wrote:
>
> Because I have seen complaints like this on the net, ppp, and serial
>lists as well as both the niksula and vger listservers, I'm going to post
>this to all of the lists where I have seen these complaints. The complaints
>on the individual lists may have seemed few, but in combination indicate
>a problem that many of us have run into. I have also seen some complaints
>similar to this in some of the newsgroups, but I didn't save them (came
>in before I started experiencing this problem) and I haven't stuck my nose
>in the groups for a while.
>
>> Howdy. We're using ppp 2.1.2c to bridge our internal network to the internet.
>> We're using Linux 1.2.4 with a Boca BB-1008 serial board.
>
>> The BB-1008 unfortunately does not have DTR or CD control. I have tried to
> ----- Extreme Bummer -----
>> use the lcp-echo-interval and lcp-echo-failure options to detect if the line
>> is dropped, but it doesn't seem to be working.
>
> This little goodie has been complained about in the past. I just
>recently discovered why it's broken on Linux and fixed it for myself.
>I've now had it in operation long enough to verify that, yes this is the
>cause of the failure, and yes the fix corrects the problem. It doesn't
>fix the problem with losing the PPP link, but at least I can now detect
>the fact that the link is dead.
>
> My setup is using an internal modem which does support nice things
>like DTR, DSR, and CD. However, I kept running into a problem where my
>PPP link would die. I tried adding the lcp-echo-interval and lcp-echo-failure
>and they had no effect at all. This turned out to be due to a design flaw
>when pppd is compiled for Linux.
>
>> Here's how our script is calling pppd:
>
>> /usr/lib/ppp/pppd -detach /dev/$DEVICE 38400 crtscts defaultroute
>> lcp-echo-interval 5 lcp-echo-failure 12 mru 296
>
>> This should have pppd drop the connection after 1 minute of lcp failures,
>> right? It doesn't work. pppd just sits there forever and never detects the
>> line being down. Am I doing something wrong?
>
> Well yes and no... On Linux it will drop the line after 1 minute
>of NO RECEIVED TRAFFIC. If it hasn't seen any traffic in the
>lcp-echo-interval, it will send out an lcp-echo packet to cause some received
>traffic. Now here is where the flaw comes in. If your failure causes you
>to be unable to SEND but still able to RECEIVE, the ppp daemon will never
>send an lcp-echo packet and will never detect the link is down if you have a
>reasonably active link!
I grant you that receiving frames will not cause the echo 'solicitation'
frame to be sent. It always upset me that Morningstar would send echo frames
and then disconnect if I did not respond to them in time. (And that is
difficult when the modem was in the process of sending a 2500 byte ftp-data
frame.)
It did not make sense to me, the person who put the code into the lcp.c
module many months back, that "if I was sending data and the remote was
receiving data then how can it declare me dead as if I was not sending it
anything?"
And, it still does not make sense. If you have a problem with sending
frames, then you have a problem sending frames. There is no simple solution.
You need to find why the networking software is in this strange mode. If you
can't receive frames then it is probable that the peer has disconnected and
the inability to send frames will result; but you won't receive the frames
either.
(The lcp-echo-request frame is not TCP. There is no ack frame which is
required prior to its buffer being released. Of course, there may be other
outstanding TCP frames, but that is a different matter.)
There was a bug in the lcp-echo-failure logic. It was calling the wrong
procedure. The code was originally written for ppp-2.0 and when the version
went to 2.1, the procedure to disconnect the link was not updated. (I did not
know that it was changed.)
The result is that when pppd tried to terminate the link, it would call the
wrong procedure. This would take down the lcp layer, but it would not
disconnect. That is an old bug which was corrected in the 'b' version of the
code.
> The code for the pppd has a compiler "#ifdef LINUX" around some
>code which only sends the lcp-echo if the last packet (any packet at all)
>received was older than lcp-echo-interval ago. This seems logical. ( :-( )
>After all, why waste time sending an echo packet when you can see network
>traffic. WRONG! This opens up an entire failure mode which the pppd can
>no longer detect. He no longer can detect a failure in the transmit side
>of the link if there is sufficient traffic coming in on the receive side.
>There are plenty of UDP packets and ICMP packets to meet this requirement.
But I still contend that the peer is alive if you are receiving frames. Why
you are in the position of not being able to send it frames is a different
matter. That is worth investigation. It may be a problem much more basic to
the networking software in general.
The '#ifdef LINUX' around the code was because the other ports of pppd did
not support the ioctl to read the time information. It did not make sense to
disable the feature for them, but they lacked the ability to determine the
time since the last non-ppp frame was received.
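A minimal sketch of the two policies being argued over here, for readers who want to see the difference in behavior. It is a toy model, not the lcp.c source: the struct, the function names, and the way idle time is approximated by a fixed receive gap are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical model of the echo timer; names are illustrative, not pppd's. */
struct echo_cfg {
    int interval;  /* lcp-echo-interval, in seconds */
    int failures;  /* lcp-echo-failure, unanswered requests tolerated */
};

/*
 * Decide whether to send an echo request on a timer tick.
 * idle_rx: seconds since any frame was last received.
 * skip_if_rx: nonzero selects the Linux behavior described above,
 * where recent receive traffic suppresses the request.
 */
int should_send_echo(const struct echo_cfg *cfg, int idle_rx, int skip_if_rx)
{
    assert(cfg->interval > 0);
    if (skip_if_rx && idle_rx < cfg->interval)
        return 0;
    return 1;
}

/*
 * Simulate a link that cannot transmit (so no echo reply ever arrives)
 * but still receives a frame every rx_gap seconds. Returns the number of
 * seconds until the link is declared dead, or -1 if that never happens
 * within max_ticks timer ticks.
 */
int seconds_until_dead(const struct echo_cfg *cfg, int rx_gap,
                       int skip_if_rx, int max_ticks)
{
    int pending = 0;  /* echo requests sent without a reply */
    int tick;

    for (tick = 1; tick <= max_ticks; tick++) {
        if (!should_send_echo(cfg, rx_gap, skip_if_rx)) {
            pending = 0;  /* receive traffic counts as proof of life */
            continue;
        }
        if (pending >= cfg->failures)
            return tick * cfg->interval;
        pending++;
    }
    return -1;
}
```

With the interval of 5 and failure count of 12 from the command line quoted earlier, this model gives up after roughly a minute when nothing is received, and never gives up while a frame arrives every second and the receive check is enabled. That is the failure mode described in the quoted text.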
> At least that now prevents my server from dropping dead until I
>manually kill the pppd to restart the link. I went from having to do
>that once or twice a day to never. Double YEAH!!! :-) :-) !
I can accept the prima facie evidence that it solved the problem. However,
the underlying cause is that you could not send frames. This is wrong.
> Sorry I don't have exact diff's but the change is to the "#ifdef"
>in the "LcpEchoCheck" routine in lcp.c for the pppd sources. It's around
>line 1500 or there abouts.
There is only one #ifdef in the procedure LcpEchoCheck. It is not hard to
spot. Just search for the procedure and you should see the #ifdef.
> This does NOT fix the problem with the PPP link dropping dead
>periodically! I have seen this complained about by several individuals
>over the last several months. The usual remark they get is "oh you must
>be having a modem problem".
Ok. Let me agree with them.
The pppd process has no special code in it which says "Oh, we have been
running for x minutes. It is time to disconnect now." There is no such code
in either the daemon or the drivers (by drivers, I include the tty logic).
The pppd process will log the nature of its termination for the normal
conditions of HUP and INT. If the pppd process should terminate for some
other reason (swap failure, etc.) then the link will simply go down as the
file to the tty is closed and the tty drivers will drop DTR signal due to
HUPCL being set on the tty.
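For anyone who wants to confirm that HUPCL is actually set on their port, here is a small sketch using the standard termios calls. The helper names are made up for this example and nothing here is taken from pppd.

```c
#include <assert.h>
#include <stddef.h>
#include <termios.h>

/* Returns 1 if the HUPCL flag is set in the given terminal settings. */
int hupcl_is_set(const struct termios *t)
{
    assert(t != NULL);
    return (t->c_cflag & HUPCL) != 0;
}

/*
 * Query an open descriptor. Returns 1 if HUPCL is set, 0 if it is clear,
 * and -1 if the settings cannot be read (for example, fd is not a tty).
 */
int fd_has_hupcl(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) < 0)
        return -1;
    return hupcl_is_set(&t);
}
```

If fd_has_hupcl() returns 0 for the modem port, closing the tty would not be expected to drop DTR, and the modem could stay on-line after the process exits.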
The pppd process uses ALARM for timers. It uses IO for asynchronous read
operations. Other than those four (HUP, INT, ALARM, and IO), no other
signals are caught by the
pppd process. If an invalid signal is sent to the process, then it is
possible that the DFL action would be to terminate the pppd process. This
would occur without a log message.
If the problem was an unknown signal, then it is easy to find. Write a
signal handler for all signals and print the signal number which was
received. (They could all be the same routine as the signal number is the
first parameter to the signal handler.) Let the pppd process replace this
handler with the proper one for the four signals which it expects to process.
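The catch-all handler described above could look something like the following sketch. The function and variable names are invented for this example, the fallback value for NSIG is an assumption, and the code is not part of pppd.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

#ifndef NSIG
#define NSIG 65  /* assumption: signal numbers 1..64, as on Linux */
#endif

/* Number of the most recent signal seen by the catch-all (0 = none). */
volatile sig_atomic_t last_signal = 0;

/*
 * Catch-all handler: record the signal number and report it on stderr.
 * Only write() is called, since it is async-signal-safe.
 */
void report_signal(int signo)
{
    char msg[] = "caught signal 000\n";

    last_signal = signo;
    msg[14] = (char)('0' + (signo / 100) % 10);
    msg[15] = (char)('0' + (signo / 10) % 10);
    msg[16] = (char)('0' + signo % 10);
    if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
        /* nothing useful can be done about a failed write here */
    }
}

/* Install fn as the handler for one signal. Returns 0 on success. */
int set_handler(int signo, void (*fn)(int))
{
    struct sigaction sa;

    assert(fn != NULL);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fn;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/*
 * Point every catchable signal at report_signal. SIGKILL, SIGSTOP, and
 * any number the system rejects are skipped. Returns how many signals
 * were hooked.
 */
int install_catch_all(void)
{
    int signo, hooked = 0;

    for (signo = 1; signo < NSIG; signo++)
        if (set_handler(signo, report_signal) == 0)
            hooked++;
    return hooked;
}
```

The idea would be to call install_catch_all() first and then call set_handler() again for the four expected signals with their real handlers. This is a diagnostic aid only: a synchronous fault such as SIGSEGV will simply recur if the handler returns.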
I too have been seeing these "pppd disconnects the modem" messages. It has
prompted people to try to run pppd from init and all sorts of kludgey (IMHO)
logic.
The reason that I say 'kludgey logic' is that the common solution seems to
always be "let's put a band-aid on the problem and make it work" rather than
the more logical question of "Why is this happening?" I have yet to see a
message along the lines "when x happens and then y happens, the link is
disconnected." And by that statement, it must be something which is
reproducible or well defined. There has been nothing which seems to help
solve the problem.
If the modem does disconnect at the remote site, there is usually a reason.
It may be the failure to retrain. It may be due to a lack of response. It
may be due to an impossible condition in their TCP stack due to a bogus
frame being generated by Linux. I do not know. It will probably take some
cooperation with the system at the other end of the telephone.
It could be that you have call waiting and received an incoming call. It
could be that the caller had call waiting and received a call. It could be
that the modem had a momentary fluctuation on the DCD signal. It could be
many things. None of these are certain.
Many 'modern' modems include a status display. My ZyXEL has "ATI2". This
will show the statistics and the reason why the modem disconnected for the
last call. It may be of help to look at this information when the modem
disconnects for no apparent reason.
[There was a bug which was recently pointed out to me about the vj header
compression and that it did not set the toss condition correctly. This may
cause problems with the peer's TCP stack. I do not know.]
There is the echo-failure logic. However, when that goes 'off', a message is
written to the system log for that reason and the link is terminated.
I, too, have had pppd disconnect me. The trace has always shown my failure
to respond to the lcp-echo frame from Morningstar's ppp which we run on the
peer system. That is logical. That is a valid reason. I don't happen to like
it, but it is a logical reason. There is none of this "and now something
strange happens . . . ."
>Wrong again. At this point, I have been able
>to confirm that the modem is in an operational state. Flow control is
>NOT in a blocking state, the modem continues to receive data fine (implying
>that the modem has not gone brain-dead and the error correcting link is
>still fully functional) and that interrupts are proceeding properly.
>The higher level code seems to have lost track of what it is doing and is
>no longer able to send data!
Ok. Then why is it happening? There must be a reason. It just does not
happen on its own.
> I've mentioned this to a couple of others, including Alan Cox
>when I was looking at some things in the latest network snapshots. So
>far, the best I can figure out is that it is some sort of timing problem.
>It seems to primarily occur, on my system, when there are a high number of
>connections being established and broken (you guessed it - WWW and Netscape).
I appreciate that. However, the breaking of connections and the making of
connections should not affect the PPP driver. It affects the nature of the
IP frames over the wire. The disconnection of the modem (caused by pppd, the
Linux tty driver, or the remote modem) is a failure at the link level. It is
not a function of the IP frames which are flowing over the link.
> Right now, it's just a minor annoyance, since pppd can now detect
>the problem and recover from it. It would be nice to figure out where
>the problem is and fix it though. I've been pawing through serial code,
>ppp code, and network code but have come up empty handed. Anyone with some
>more suggestions as to where to look, let me know and I check'm out.
You have discovered the same logic which I did. There is no obvious solution
to the problem because there is no definition of the problem other than the
sporadic symptoms that seem to occur.
Please, take this with a 'ton of salt', and recognize that this is not a
suitable answer . . . .
One possibility (far fetched) is that you are sending the sequence +++
in the data of an IP frame. This is the modem attention signal, and many
modems which do not use a 'guard time' would drop out of the on-line state
when they see it. A solution would be to change the attention character to
0x7D (}). This character is *never* sent as a pair. It is always sent as the
sequence 0x7D 0x5D. If it never occurs as a pair then it can not occur as a
triple, and the modem will not see its escape sequence.
Again, that is far fetched but not beyond the realm of possibility. I,
personally, have never seen this to be a problem. However, I use a ZyXEL
modem which does 'adaptive timing' and is not susceptible to the problem.
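The claim that 0x7D never appears twice in a row can be checked with a small sketch of PPP's asynchronous byte stuffing. It assumes the default case where the flag (0x7E), the escape (0x7D), and all control characters below 0x20 are escaped. The function names are made up for this example and are not from the ppp sources.

```c
#include <assert.h>
#include <stddef.h>

#define PPP_FLAG   0x7e  /* frame delimiter */
#define PPP_ESCAPE 0x7d  /* escape character */
#define PPP_TRANS  0x20  /* value XORed into an escaped byte */

/*
 * Byte-stuff len bytes from in into out, which must hold at least
 * 2 * len bytes (cap is its capacity). Returns the number of bytes written.
 */
size_t ppp_stuff(const unsigned char *in, size_t len,
                 unsigned char *out, size_t cap)
{
    size_t i, n = 0;

    assert(cap >= 2 * len);
    for (i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (c == PPP_FLAG || c == PPP_ESCAPE || c < 0x20) {
            out[n++] = PPP_ESCAPE;
            out[n++] = (unsigned char)(c ^ PPP_TRANS);
        } else {
            out[n++] = c;
        }
    }
    return n;
}

/* Length of the longest run of consecutive bytes equal to byte. */
size_t longest_run(const unsigned char *buf, size_t len, unsigned char byte)
{
    size_t i, run = 0, best = 0;

    for (i = 0; i < len; i++) {
        run = (buf[i] == byte) ? run + 1 : 0;
        if (run > best)
            best = run;
    }
    return best;
}
```

Stuffing three 0x7D bytes yields 0x7D 0x5D three times over, so the longest run of 0x7D on the wire is one. Three '+' characters pass through unchanged, which is why the default attention sequence can show up inside a frame.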
On a more personal note, let me say that I do not like the current state of
the ppp logic. It is not stable. It dies for no apparent reason. The last
problem associated with the device getting a transmit buffer when the tty
had closed the channel was not obvious. Unfortunately, all of the obvious
problems have been found and solved.
I am frustrated because I can not find the answer to the disconnections. I
do not like shipping 'bad' code, if 'bad' is the word for it. I wish that it
was not so. However, there is only so much which I can do to solve the
problem when I do not know what the problem is other than vague statements.
It bothers the engineer in me to have to agree with them but offer them no
solution.
--
Al Longyear longyear@netcom.com longyear@sii.com
The public pgp 2.6 key is available by fingering longyear@netcom.com.