[739] in java-interest
Interpreting a more efficient bytecode
daemon@ATHENA.MIT.EDU (sbk@netcom.com)
Tue Jul 18 00:17:42 1995
From: sbk@netcom.com
To: java-interest@java.Eng.Sun.COM
Date: Mon, 17 Jul 1995 20:02:56 -0700 (PDT)
Cc: sbk@netcom.com ()
Speaking of code generation and how it relates to platform
independence, what is the thinking on speeding up interpretation by
interpreting some intermediate form from a code generator phase
instead of plain Java bytecodes?
This would be platform independent. It would also not incur the cost
of transforming the output of the code generator into machine code,
which might make it an attractive scheme for speeding up
interpretation in general.
The reason interpreting the output of a code generator might make a
difference is that the type-state requirement of Java bytecodes
implies that the stack depth is known statically. All the stack
operations are probably quickly factored out during code generation,
and the interpreter can profitably use this information.
For instance, a byte code sequence like this:
<html>
<hr><pre>
aload_1
iload 4
aload_2
iload 4
iaload
aload_3
iload 4
iaload
iadd
iastore
iinc 4 1
iload 4
</pre><hr>
</html>
(which could be generated as the inside of a loop that goes
> for (i=0; i<10000; i++) a[i] = b[i] + c[i]; :-)
can be replaced with this equivalent sequence of 3 address
instructions (since the stack depth is known)
<html>
<hr><pre>
S1 := V1
S2 := V4
S3 := V2
S4 := V4
S3 := S3[S4]
S4 := V3
S5 := V4
S4 := S4[S5]
S3 := S4 + S3
S1[S2] := S3
V4 += 1
S1 := V4
</pre><hr>
</html>
The assumption is that the Vx variables are the variable registers of
the Java VM. Stack variables are Sx, and simply serve as temporaries.
Using the usual techniques (copy propagation and dead-code
elimination) to optimize out the redundant operations:
<html>
<hr><pre>
S3 := V2[V4]
S4 := V3[V4]
S3 := S4 + S3
V1[V4] := S3
V4 += 1
</pre><hr>
</html>
The original sequence of 12 instructions is down to 5.
Finally, one could also potentially save on the opcode dispatch cost
by adding call threading a la the Forth community, but this is clearly
an optimization independent of the code generator. Something like this
might be portable.
<html>
<hr><pre>
typedef union _vminsn *(*HandlerProcT)(union _vminsn *);

typedef union _vminsn
{
    int *addr;
    int const_val;
    HandlerProcT handler;
} VmInsn;

while (1) pc = pc->handler(pc);
</pre><hr>
</html>
pc points to a buffer of VmInsn's that are either handler procedures
or operand/result addresses for the handlers; this buffer is created
at some point from the code generator's output.
An iadd handler, for example, might look like:
<html>
<hr><pre>
VmInsn *iadd(VmInsn *pc)
{
    *((pc+1)->addr) = *((pc+2)->addr) + *((pc+3)->addr);
    return pc + 4;
}
</pre><hr>
</html>
Anyway, having a caffeine overdose and staying up late after downing
too much cola trying to beat the heat over the weekend, I hacked up a
small C-based interpreter that does the above and tested it on an
array-addition test that adds two arrays into a third several
times. Array bounds checking is also done.
Here are the results on a Sparc 10 running 5.4. As a point of
reference, I've also added the result from Thomas Hickey's
bytecode->C++ translator, which runs native after compiling the
generated C++, though the translator itself does no optimizations.
<html>
<hr><pre>
C, gcc 2.5.6 -O         :  2.0u 0.0s 0:02 86% 0+0k 0+0io 0pf+0w
bytecode->C++           : 10.0u 0.0s 0:10 94% 0+0k 0+0io 0pf+0w
using intermediate code : 19.0u 0.0s 0:20 92% 0+0k 0+0io 0pf+0w
java interpreter        : 60.0u 0.0s 1:03 94% 0+0k 0+0io 0pf+0w
</pre><hr>
</html>
So the gist seems to be that for the (relatively small) cost of
converting Java bytecode to a better intermediate format, and
changing the interpretation mechanism, interpretation could be
improved a fair bit without needing to generate native code.
-KB-
PS: If anyone is interested, I can mail them the source. Don't want to
put it on the net -- like the soda that caused it to be written, it
has little nutritional content and burps at unexpected moments.
-
Note to Sun employees: this is an EXTERNAL mailing list!
Info: send 'help' to java-interest-request@java.sun.com