[294] in linux-net channel archive
Re: csum_partial_copyffs (1.3.0) loses big on Pentium
daemon@ATHENA.MIT.EDU (Tom May)
Tue May 9 13:36:14 1995
Date: Tue, 9 May 1995 09:20:04 -0700
From: ftom@netcom.com (Tom May)
To: mea@mea.cc.utu.fi
CC: linux-net@vger.rutgers.edu
In-reply-to: <95May9.131703+0200eet_dst.69-4+16@mea.utu.fi>
>> Hi,
>>
>> It looks like the new function csum_partial_copyffs() in the 1.3.0 net
>> code is a win on a 486, but it is an extremely bad lose on a Pentium.
>....
>> And, if you paid any attention to those results, you may have noticed
>> that the 486 is running the old code *FASTER* than the Pentium on the
>> "mixed" and "large" packet tests. If anybody can explain what's going
>> on, please do.
> I do venture a guess that it is about pipeline stall.
> That is, the Pentium has two integer units, which share a
> common register file. Now if unit 1 is changing register
> X, unit 2 must delay an instruction needing data from that
> register, until data arrives there.
In the worst case the second execution unit will never be used and all
instructions will be executed by the main execution unit, which should
give performance similar to a 486. But here the 486 performance is
measurably better, and by a larger factor than the 66/60 clock ratio.
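
A rough sketch of that bound, assuming the 66/60 figure means a 66 MHz
486 against a 60 MHz Pentium, and that a Pentium stuck in one pipe
matches the 486 cycle for cycle. The helper names are hypothetical and
the timings are whatever the benchmark reports, in any common unit:

```c
#include <assert.h>

/* Hypothetical helper: the largest speed advantage the 486 should show
 * if the Pentium never used its second pipe and otherwise needed the
 * same number of cycles as the 486 (an assumption, not a measurement). */
double clock_ratio_bound(double mhz_486, double mhz_pentium)
{
    assert(mhz_486 > 0.0 && mhz_pentium > 0.0);
    return mhz_486 / mhz_pentium;
}

/* Returns 1 when the measured Pentium/486 time ratio is larger than the
 * clock ratio alone can account for, 0 otherwise. */
int slowdown_exceeds_bound(double t_pentium, double t_486,
                           double mhz_486, double mhz_pentium)
{
    assert(t_486 > 0.0);
    return (t_pentium / t_486) > clock_ratio_bound(mhz_486, mhz_pentium);
}
```

With 66 and 60 the bound is 1.1, so any result where the Pentium takes
more than about 10% longer than the 486 needs some explanation other
than an idle second pipe.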
> This is also why the original (it is still the same, I think)
> bogomips-loop produces surprisingly low figures for Pentiums.
> There is a proven speed enhancement from adding a couple of NOPs
> to the loop so that P5 won't go to pipeline stall.
Looking at the code I would say that adding a nop simply allows the
Pentium's branch-prediction circuitry to work. In my experiments, it
can't predict which way a jumped-to jump will go so it assumes it will
fall through. When it doesn't fall through you get a slow jump.
Adding a nop, although it takes an extra cycle, allows the jump to be
predicted so it will execute more quickly for a total increase in
execution speed.
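
A minimal sketch of the two loop shapes being compared, for anyone who
wants to time them. This is not the kernel's delay code: the function
names are hypothetical, the asm is GCC-style inline assembly for x86,
and the position of the nop (between the decrement and the branch) is
an assumption. On other compilers or CPUs it falls back to a plain C
countdown with the same result.

```c
#include <assert.h>
#include <limits.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))

/* Tight countdown: decrement, then branch back while the result is
 * still non-negative. Returns the final counter value. */
unsigned long delay_plain(unsigned long loops)
{
    assert(loops <= (unsigned long)LONG_MAX);
    __asm__ __volatile__(
        "1:\n\t"
        "dec %0\n\t"
        "jns 1b"
        : "+r" (loops)
        :
        : "cc");
    return loops;
}

/* Same countdown with one nop of padding before the branch.
 * The nop placement is an assumption made for illustration. */
unsigned long delay_padded(unsigned long loops)
{
    assert(loops <= (unsigned long)LONG_MAX);
    __asm__ __volatile__(
        "1:\n\t"
        "dec %0\n\t"
        "nop\n\t"
        "jns 1b"
        : "+r" (loops)
        :
        : "cc");
    return loops;
}

#else

/* Portable fallback: same countdown semantics, no padding possible. */
unsigned long delay_plain(unsigned long loops)
{
    volatile long n;
    assert(loops <= (unsigned long)LONG_MAX);
    n = (long)loops;
    do {
        n--;
    } while (n >= 0);
    return (unsigned long)n;
}

unsigned long delay_padded(unsigned long loops)
{
    return delay_plain(loops);
}

#endif
```

Both functions do the same work and end with the counter at -1; only
the instruction layout differs. Timing each with a large count on a 486
and on a Pentium would show whether the extra cycle for the nop is paid
back by a predicted branch.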
> ... but you Assembler Hackers knew this already, didn't you ?
I thought so . . .
Tom.