[4194] in testers
Re: emacs 20.3/20.4 beta core dump
daemon@ATHENA.MIT.EDU (John Hawkinson)
Sun Jun 6 23:36:11 1999
Date: Sun, 6 Jun 1999 23:36:02 -0400 (EDT)
Message-Id: <199906070336.XAA27626@x15-cruise-basselope.mit.edu>
To: Richard Stallman <rms@gnu.org>
Cc: Greg Hudson <ghudson@MIT.EDU>, testers@MIT.EDU
In-Reply-To: <199906061836.MAA29101@wijiji.santafe.edu> from rms, and
<199906062229.SAA00938@r93aag001866.sbo-smr.ma.cable.rcn.com>
from ghudson, and also "[4139] in testers"
From: John Hawkinson <jhawk@MIT.EDU>
Hi,
Greg failed to cc me on his reply to you, but let me try to
fill in the information he didn't provide in response to your inquiry.
[Incidently, is it normal practice to drop gnu-emacs-bug from the cc
on this sort of thing?]
I think it also didn't mention that the buffer in question is
operating under selective display. This appears to be the source of
the bug, though perhaps this was obvious from the line numbering in
compute_motion()...
Preferatory note: I just observed that I did all the below debugging
against 20.3 instead of the 20.4 beta, this was my oversight. If you'd
prefer I repeat it against 20.4 I can do so.
| What is the immediate cause of the call to abort?
| Please examine the data that affect this statement:
|
| if (charpos < BUF_BEG (b) || charpos > BUF_Z (b))
| abort ();
|
| I suspect that charpos is too large, since it is larger
| than the upper-bound that was specified for compute_motion.
| But is that true?
Yes, charpos is one greater then it ought to be:
(gdb) print charpos
$2 = 18523
(gdb) print b->zv
$3 = 18522
(gdb) print charpos - b->zv
$4 = 1
| The next step is to run this under GDB and make it fail in a live
| process. Then you will hit a breakpoint when abort is called.
|
| Then you can pop off innermost the stack frames including this one
|
| #6 0x14aa20 in compute_motion (from=18165, fromvpos=0, fromhpos=0,
| did_motion=0, to=18379, tovpos=32768, tohpos=32768, width=79, hscroll=0,
| tab_offset=0, win=0x4324d8) at indent.c:1418
|
| then put a breakpoint at compute_motion, and jump back to
| the call to compute_motion on line 1697.
OK. Slightly different backtrace under gdb since we didn't go through
kill():
(gdb) where
#0 abort () at emacs.c:274
#1 0x118ebc in buf_charpos_to_bytepos (b=0x4e1a40, charpos=18523)
at marker.c:139
#2 0x14aa20 in compute_motion (from=18378, fromvpos=0, fromhpos=0,
did_motion=0, to=18522, tovpos=32768, tohpos=32768, width=79, hscroll=0,
tab_offset=0, win=0x4324d8) at indent.c:1418
#3 0x14c1cc in vmotion (from=18522, vtarget=-35, w=0x4324d8) at indent.c:1697
#4 0x7c414 in Frecenter (arg=35) at window.c:3183
(gdb) frame 2
#2 0x14aa20 in compute_motion (from=18378, fromvpos=0, fromhpos=0,
did_motion=0, to=18522, tovpos=32768, tohpos=32768, width=79, hscroll=0,
tab_offset=0, win=0x4324d8) at indent.c:1418
1418 pos_byte = CHAR_TO_BYTE (pos);
(gdb) return
Make compute_motion return now? (y or n) y
#0 0x14c1cc in vmotion (from=18522,
vtarget=-35, w=0x4324d8) at indent.c:1697
1697 pos = *compute_motion (XFASTINT (prevline), 0,
(gdb) jump 1697
Continuing at 0x14c14c.
Breakpoint 3, compute_motion (from=18378, fromvpos=0, fromhpos=0,
did_motion=0, to=18522, tovpos=32768, tohpos=32768, width=79, hscroll=0,
tab_offset=0, win=0x4324d8) at indent.c:1028
1028 register int hpos = fromhpos;
| Then step thru compute_motion and see why it advances the position
| past the upper limit.
Naively one might try to use watchpoints (though this always loses for me).
Sure enough, we seem to have some sort of lossage:
(gdb) watch pos
Watchpoint 6: pos
(gdb) c
Continuing.
Watchpoint 6: pos
Old value = 0
New value = 18378
compute_motion (from=18378, fromvpos=0, fromhpos=0, did_motion=0, to=18522,
tovpos=32768, tohpos=32768, width=79, hscroll=0, tab_offset=0,
win=0x4324d8) at indent.c:1088
1088 pos_byte = prev_pos_byte = CHAR_TO_BYTE (from);
(gdb) c
Continuing.
Watchpoint 6 deleted because the program has left the block in
which its expression is valid.
0xef406840 in _mutex_lock ()
(gdb) where
#0 0xef406840 in _mutex_lock ()
(gdb) c
Continuing.
Breakpoint 1, abort () at emacs.c:274
274 kill (getpid (), SIGABRT);
(gdb)
I'll try to make sure that makes its way to the gdb maintainers at
some point [Greg: I know cfox has seen this under 8.3, so it's not an
isolated instance].
Anyhow, single-stepping reveals that we lost the first time
we reach the selective display code:
(gdb) list 1407,1421
1407 if (selective > 0
1408 && indented_beyond_p (pos, pos_byte, selective))
1409 {
1410 /* If (pos == to), we don't have to take care of
1411 selective display. */
1412 if (pos < to)
1413 {
1414 /* Skip any number of invisible lines all at once */
1415 do
1416 {
1417 pos = find_before_next_newline (pos, to, 1) + 1;
1418 pos_byte = CHAR_TO_BYTE (pos);
1419 }
1420 while (pos < to
1421 && indented_beyond_p (pos, pos_byte, selective));
specifically after a few iterations of the do() group, we find ourselves at
line 1417 with pos set to 18485. to (the end of the buffer) is 18522. Executing
this find_before_next_newline() sets 'pos' to 18523, which is of course one
beyond the end of the buffer. CHAR_TO_BYTE() causes the core dump.
The obvious question is "Why it it adding 1 to find_before_next_newline()'s
result". The answer might be "Because find_before_next_newline() should always
return a position immediately followed by a newline, and the end of the buffer
is not necessarily a newline." Or not?
Stepping through find_before_next_newline():
853 {
854 int shortage;
855 int pos = scan_buffer ('\n', from, to, cnt, &shortage, 1);
856
857 if (shortage == 0)
858 pos--;
859
860 return pos;
861 }
shortage is indeed set to 1. Is the sense of this test wrong? I would assume
that if we found a shortage (scan buffer failed to find a newline), that we
want to return one less character than otherwise? At least, that would seem
right if we're going to indescriminatenly add 1 to its result and and if it
returns the last character in the buffer on a failure.
For some reason I cannot seem to make gdb let me edit the comparison
instruction to negate the sense of the test -- I think it's just
that I never properly figured out how to edit the running code under
gdb. Anyhow, setting a breakpoint at 857 and negating shortage,
emacs seems to behave much better. Unfortunately, with gdb outputting
all this stuff and hitting breakpoint constantly, it's really really
slow ;-)
I'm not wonderfully confident that this is the correct fix, though. I
may be misunderstanding the problem. I think I must be wrong, because
M-> doesn't take me to the end of the buffer as it should.
[...time passes...]
I got gdb to patch the code but making a copy of the executable and
reloading it with 'set write on'; I had hoped this was not necessary.
Reversing the sense of the test on line 857 certainly makes the
problem go away, but it breaks emacs severely. The most clear
symptom being that end-of-line takes me to the start of the next line
instead of the end of the current line.
I suppose another test might be for pos to be decremented in all
cases. This seems to work much better. There's no obvious screwiness
if I do this. I don't know if it might destabilize something important,
though. Greg, for your reference, the patch ends up being:
(gdb) set *(int*)(&find_before_next_newline+60)=0x01000000
[...more time passes...]
No, this patch doesn't work right either, this causes us to
lose before the beginning of the buffer. Oh well. Perhaps
removing the +1 from line 1417...
(gdb) set *(int*)(&compute_motion+3412)=0x01000000
Nope, this seems to cause infinite loops. Oh well.
Your advice appreciated.
Thanks.
--jhawk