BYTECODE TECHNIQUES
This issue is about bytecode. Programmers coding in the Java(tm)
programming language rarely view the compiled output of their
programs. This is unfortunate, because the output, Java bytecode,
can provide valuable insight when debugging or troubleshooting
performance problems. Moreover, the JDK makes viewing bytecode easy.
This tip shows you how to view and interpret Java bytecode. It presents
the following topics related to bytecode:
* Getting Started With javap
* How Bytecode Protects You From Memory Bugs
* Analyzing Bytecode to Improve Your Code
This tip was developed using Java(tm) 2 SDK, Standard Edition,
v 1.3.
This issue of the JDC Tech Tips is written by Stuart Halloway,
a Java specialist at DevelopMentor (http://www.develop.com/java).
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
GETTING STARTED WITH JAVAP
Most Java programmers know that their programs are not typically
compiled into native machine code. Instead, the programs are
compiled into an intermediate bytecode format that is executed by
the Java(tm) Virtual Machine*. However, relatively few
programmers have ever seen bytecode because their tools do not
encourage them to look. Most Java debugging tools do not allow
step-by-step execution of bytecode; they either show source code
lines or nothing.
Fortunately, the JDK(tm) provides javap, a command-line tool
that makes it easy to view bytecode. Let's see an example:
public class ByteCodeDemo {
    public static void main(String[] args) {
        System.out.println("Hello world");
    }
}
After you compile this class, you could open the .class file in a
hex editor and translate the bytecodes by referring to the virtual
machine specification. Fortunately, there is an easier way: javap
disassembles the bytecodes into human-readable mnemonics. You can
get a bytecode listing by passing the '-c' flag to javap as follows:
javap -c ByteCodeDemo
You should see output similar to this:
public class ByteCodeDemo extends java.lang.Object {
public ByteCodeDemo();
public static void main(java.lang.String[]);
}
Method ByteCodeDemo()
0 aload_0
1 invokespecial #1
4 return
Method void main(java.lang.String[])
0 getstatic #2
3 ldc #3
5 invokevirtual #4
8 return
From just this short listing, you can learn a lot about bytecode.
Begin with the first instruction in the main method:
0 getstatic #2
The initial integer is the offset of the instruction in the method.
So the first instruction begins with a '0'. The mnemonic for the
instruction follows the offset. In this example, the 'getstatic'
instruction pushes a static field onto a data structure called the
operand stack. Later instructions can reference the field in this
data structure. Following the getstatic instruction is the field
to be pushed. In this case the field to be pushed is "#2", which
refers to the static field System.out. If you examined the
bytecode directly, you would see that the field information is not
embedded directly in the instruction. Instead, like all constants
used by a Java class, the field information is stored in a shared
pool. Storing field information in a constant pool reduces the
size of the bytecode instructions. This is because the
instructions only have to store the integer index into the
constant pool instead of the entire constant. In this example,
the field information is at location #2 in the constant pool.
The order of items in the constant pool is compiler dependent,
so you might see a number other than '#2.'
After analyzing the first instruction, it's easy to guess the
meaning of the other instructions. The 'ldc' (load constant)
instruction pushes the constant "Hello world" onto the operand
stack. The 'invokevirtual' instruction invokes the println method,
which pops its two arguments from the operand stack. Don't forget
that an
instance method such as println has two arguments: the obvious
string argument, plus the implicit 'this' reference.
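To make the push and pop sequence concrete, here is a toy model of
the four instructions in main. It is only a sketch: the class name
OperandStackSketch, the run method, and the array standing in for the
constant pool are all invented for illustration, and a real virtual
machine is not implemented this way.

```java
import java.io.PrintStream;
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of how main's four instructions use the operand stack.
// Illustration only; every name here is hypothetical.
final class OperandStackSketch {

    // Replays the modeled instructions, printing through 'out', and
    // returns the number of values left on the stack at the end.
    static int run(PrintStream out) {
        // Stand-in constant pool. A real pool holds symbolic field and
        // method references, not live objects.
        Object[] constantPool = new Object[5];
        constantPool[2] = out;            // stands in for System.out
        constantPool[3] = "Hello world";  // the string constant

        Deque<Object> operandStack = new ArrayDeque<Object>();

        // 0 getstatic #2: push the value of the static field
        operandStack.push(constantPool[2]);
        // 3 ldc #3: push the string constant
        operandStack.push(constantPool[3]);
        // 5 invokevirtual #4: pop the string argument, then the
        // receiver (the implicit 'this'), and call println
        String argument = (String) operandStack.pop();
        PrintStream receiver = (PrintStream) operandStack.pop();
        receiver.println(argument);
        // 8 return: the stack is empty again
        return operandStack.size();
    }

    public static void main(String[] args) {
        System.out.println("stack depth at return: " + run(System.out));
    }
}
```

Note that the argument comes off the stack before the receiver,
because the receiver was pushed first.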
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
HOW BYTECODE PROTECTS YOU FROM MEMORY BUGS
The Java programming language is frequently touted as a "secure"
language for internet software. Given that the code looks so
much like C++ on the surface, where does this security come
from? It turns out that an important aspect of security is the
prevention of memory-related bugs. Computer criminals exploit
memory bugs to sneak malicious code into otherwise safe programs.
Java bytecode is a first line of defense against this sort of
attack, as the following example demonstrates:
public float add(float f, int n) {
    return f + n;
}
If you add this function to the previous example, recompile it, and
run javap, you should see bytecode similar to this:
Method float add(float, int)
0 fload_1
1 iload_2
2 i2f
3 fadd
4 freturn
At the beginning of a Java method, the virtual machine places
method parameters in a data structure called the local variable
table. As its name suggests, the local variable table also
contains any local variables that you declare. In this example,
the method begins with three local variable table entries, one for
each of the three arguments to the add method. Slot 0 holds the
this reference, while slots 1 and 2 hold the float and int
arguments, respectively.
Before the method can manipulate the variables, it must load (push)
them onto the operand stack. The first instruction, fload_1,
pushes the float at slot 1 onto the operand stack. The second
instruction, iload_2, pushes the int at slot 2 onto the operand
stack. The interesting thing about these instructions is in the 'i'
and 'f' prefixes, which illustrate that Java bytecode instructions
are strongly typed. If the type of an argument does not match the
type of the bytecode, the VM will reject the bytecode as unsafe.
Better still, the bytecodes are designed so that these type-safety
checks need only be performed once, at class load time.
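The listing also contains an i2f instruction, which converts the int
on top of the stack to a float so that fadd sees two floats. The
following sketch wraps the add method in a hypothetical class named
AddDemo (the class and its main method are not part of the original
example) to show the effect of that conversion at run time.

```java
// AddDemo is a hypothetical wrapper class used only for illustration.
final class AddDemo {

    // Same method as in the text. The int parameter is converted to
    // float (the i2f instruction) before the addition (fadd).
    public float add(float f, int n) {
        return f + n;
    }

    public static void main(String[] args) {
        AddDemo demo = new AddDemo();

        // An ordinary call: 2 becomes 2.0f, so this prints 3.5
        System.out.println(demo.add(1.5f, 2));

        // 16777217 is 2^24 + 1, which a float cannot represent
        // exactly, so the conversion rounds it to 16777216.0f.
        // This prints true.
        System.out.println(demo.add(0.0f, 16777217) == 16777216.0f);
    }
}
```

The second call is a reminder that the conversion is a real operation
with real consequences, not just a relabeling of the same bits.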
How does this type-safety enhance security? If an attacker could
trick the virtual machine into treating an int as a float, or vice
versa, it would be easy to corrupt calculations in a predictable
way. If these calculations involved bank balances, the security
implications would be obvious. More dangerous still would be
tricking the VM into treating an int as an Object reference. In
most scenarios, this would crash the VM, but an attacker needs to
find only one loophole. And don't forget that the attacker doesn't
have to search by hand--it would be pretty easy to write a program
that generated billions of permutations of bad byte codes, trying
to find the lucky one that compromised the VM.
Another case where bytecode safeguards memory is array
manipulation. The 'aastore' and 'aaload' bytecodes operate on
Java arrays of object references, and they always check array
bounds. These bytecodes throw an ArrayIndexOutOfBoundsException if
the index falls outside the array. Perhaps the most important
checks of all apply
to the branching instructions, for example, the bytecodes that
begin with 'if.' In bytecode, branching instructions can only
branch to another instruction within the same method. The only
way to transfer control outside a method is to return, throw an
exception, or execute one of the 'invoke' instructions. Not only
does this close the door on many attacks, it also prevents nasty
bugs caused by dangling references or stack corruption. If you have
ever had a system debugger open your program to a random location
in code, you're familiar with these bugs.
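The array bounds check described above is easy to observe. In this
sketch, the class BoundsCheckDemo and the method storeIsRejected are
hypothetical names chosen for illustration. The store into a String
array is the kind of statement that compiles to an aastore
instruction.

```java
// BoundsCheckDemo is a hypothetical class used only for illustration.
final class BoundsCheckDemo {

    // Attempts to store at 'index' and reports whether the virtual
    // machine rejected the store with an exception.
    static boolean storeIsRejected(String[] array, int index) {
        try {
            array[index] = "x";  // a store into an array of references
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String[] names = new String[2];
        System.out.println(storeIsRejected(names, 1));   // last valid slot
        System.out.println(storeIsRejected(names, 2));   // one past the end
        System.out.println(storeIsRejected(names, -1));  // before the start
    }
}
```

The first call succeeds and the other two are rejected. The check is
performed when the instruction executes, so it applies no matter
which compiler or tool produced the class file.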
The critical point to remember about all of these checks is that
they are made by the virtual machine at the bytecode level, not
just by the compiler. A compiler for a language such as C++ might
prevent some of the memory errors discussed above, but its
protection applies only at the source code level. Operating
systems will happily load and execute any machine code, whether
the code was generated by a careful C++ compiler or a malicious
attacker. In short, C++ is object-oriented only at the source code
level, whereas Java's object-oriented features extend down to the
compiled code.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
ANALYZING BYTECODE TO IMPROVE YOUR CODE
The memory and security protections of Java bytecode are there for
you whether you notice them or not, so why bother looking at the
bytecode? In many cases, knowing how the compiler translates your
code into bytecode can help you write more efficient code, and can
sometimes even prevent insidious bugs. Consider the
following example:
// return the concatenation str1+str2
String concat1(String str1, String str2) {
    return str1 + str2;
}
// append str2 to str1
void concat2(StringBuffer str1, String str2) {
    str1.append(str2);
}
Try to guess how many function calls each method requires to
execute. Now compile the methods and run javap. You should see
output like this:
Method java.lang.String concat1(java.lang.String, java.lang.String)
0 new #5
3 dup
4 invokespecial #6
7 aload_1
8 invokevirtual #7
11 aload_2
12 invokevirtual #7
15 invokevirtual #8
18 areturn
Method void concat2(java.lang.StringBuffer, java.lang.String)
0 aload_1
1 aload_2
2 invokevirtual #7
5 pop
6 return
The concat1 method performs an object allocation (new), a
constructor call (invokespecial), and three invokevirtual calls.
That is quite a bit more work than the concat2 method, which makes
only a single invokevirtual call. Most
Java programmers have been warned that because Strings are
immutable it is more efficient to use StringBuffers for
concatenation. Using javap to analyze this makes the point in
dramatic fashion. If you are unsure whether two language
constructs are equivalent in performance, you should use javap
to analyze the bytecode. Beware of the just-in-time (JIT)
compiler, though. Because the JIT compiler recompiles the
bytecodes into native machine code, it can apply additional
optimizations that your javap analysis does not reveal. Unless
you have the source code for your virtual machine, you need to
supplement your bytecode analysis with performance benchmarks.
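A benchmark for this case can be very small. The sketch below is a
crude, single-shot timing with no warm-up, so treat its numbers as a
rough indication only. The class ConcatTiming and its method names
are invented for illustration, and System.nanoTime assumes a release
newer than the SDK used for this tip.

```java
// Crude timing sketch. ConcatTiming and its methods are hypothetical
// names, not part of the original example.
final class ConcatTiming {

    // Builds a string of 'count' x characters with the + operator.
    static String withString(int count) {
        String result = "";
        for (int i = 0; i < count; i++) {
            result = result + "x";
        }
        return result;
    }

    // Builds the same string by appending to a single StringBuffer.
    static String withStringBuffer(int count) {
        StringBuffer buffer = new StringBuffer();
        for (int i = 0; i < count; i++) {
            buffer.append("x");
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        int count = 20000;

        long start = System.nanoTime();
        String a = withString(count);
        long stringNanos = System.nanoTime() - start;

        start = System.nanoTime();
        String b = withStringBuffer(count);
        long bufferNanos = System.nanoTime() - start;

        System.out.println("same result:  " + a.equals(b));
        System.out.println("String +:     " + stringNanos + " ns");
        System.out.println("StringBuffer: " + bufferNanos + " ns");
    }
}
```

Run it several times and on the virtual machine you actually deploy
to, because the results depend heavily on the JIT compiler.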
A final example illustrates how examining bytecode can help
prevent bugs in your application. Create two classes as
follows. Make sure they are in separate files.
public class ChangeALot {
    public static final boolean debug=false;
    public static boolean log=false;
}
public class EternallyConstant {
    public static void main(String[] args) {
        System.out.println("EternallyConstant beginning execution");
        if (ChangeALot.debug)
            System.out.println("Debug mode is on");
        if (ChangeALot.log)
            System.out.println("Logging mode is on");
    }
}
If you run the class EternallyConstant you should get the message:
EternallyConstant beginning execution
Now try editing the ChangeALot file, modifying the debug and log
variables to both be true. Recompile only the ChangeALot file.
Run EternallyConstant again, and you will see the following
output:
EternallyConstant beginning execution
Logging mode is on
What happened to the debugging mode? Even though you set debug to
true, the message "Debug mode is on" didn't appear. The answer is
in the bytecode. Run javap on the EternallyConstant class, and you
will see this:
Method void main(java.lang.String[])
0 getstatic #2
3 ldc #3
5 invokevirtual #4
8 getstatic #5
11 ifeq 22
14 getstatic #2
17 ldc #6
19 invokevirtual #4
22 return
Surprise! While there is an 'ifeq' check on the log field, the
code does not check the debug field at all. Because the debug
field was marked final and initialized with a constant, the
compiler knew that the field could never change at runtime.
Therefore, it removed the 'if' statement entirely. This is a very
useful
optimization indeed, because it allows you to embed debugging
code in your application and pay no runtime penalty when the
switch is set to false. Unfortunately, this optimization can
lead to major compile-time confusion. If you change a final field,
you have to remember to recompile any other class that might
reference the field. That's because the 'reference' might have
been optimized away. Java development environments do not always
detect this subtle dependency, something that can lead to very
odd bugs. So, the old C++ adage is still true for the Java
environment: "When in doubt, rebuild all."
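If you want a final flag that other classes always read at run time,
one possible approach is to initialize it with something other than
a constant expression. The sketch below uses hypothetical names
(ChangeALotSketch, debugAtRuntime, RuntimeFlagDemo). It relies on the
language rule that a method call is not a constant expression, so the
compiler should not copy the value into other classes. You can
confirm this for your own compiler with javap.

```java
// Hypothetical variation on ChangeALot, for illustration only.
final class ChangeALotSketch {
    // Initialized with a literal: a compile-time constant, so
    // compilers may copy its value into classes that use it.
    public static final boolean debug = false;

    // Initialized by a method call: still final, but not a
    // compile-time constant, so users read the field at run time.
    public static final boolean debugAtRuntime =
        Boolean.parseBoolean("false");
}

// Hypothetical client class, analogous to EternallyConstant.
final class RuntimeFlagDemo {
    public static void main(String[] args) {
        System.out.println("RuntimeFlagDemo beginning execution");
        if (ChangeALotSketch.debug)           // may be compiled away
            System.out.println("Debug mode is on");
        if (ChangeALotSketch.debugAtRuntime)  // checked at run time
            System.out.println("Runtime debug mode is on");
    }
}
```

The trade-off is that you give up the optimization: the check on
debugAtRuntime costs a field read every time it executes.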
Knowing a little bytecode is a valuable assist to any programmer
coding in the Java programming language. The javap tool makes it
easy to view bytecodes. Occasionally checking your code with javap
can be invaluable in improving performance and catching
particularly elusive bugs.
There is substantially more complexity to bytecode and the VM
than this tip can cover. To learn more, read Inside the Java
Virtual Machine by Bill Venners.
. . . . . . . . . . . . . . . . . . . . . . .
* As used in this document, the terms "Java virtual machine"
or "JVM" mean a virtual machine for the Java platform.