BYTECODE TECHNIQUES

This issue is about bytecode. Programmers coding in the Java(tm) 
programming language rarely view the compiled output of their 
programs. This is unfortunate, because the output, Java bytecode, 
can provide valuable insight when debugging or troubleshooting 
performance problems. Moreover, the JDK makes viewing bytecode easy. 
This tip shows you how to view and interpret Java bytecode. It presents 
the following topics related to bytecode:
 
         * Getting Started With javap 
         * How Bytecode Protects You From Memory Bugs
         * Analyzing Bytecode to Improve Your Code
                
This tip was developed using Java(tm) 2 SDK, Standard Edition, 
v 1.3.


This issue of the JDC Tech Tips is written by Stuart Halloway,
a Java specialist at DevelopMentor (http://www.develop.com/java).


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

GETTING STARTED WITH JAVAP


Most Java programmers know that their programs are not typically 
compiled into native machine code. Instead, the programs are 
compiled into an intermediate bytecode format that is executed by 
the Java(tm) Virtual Machine*. However, relatively few 
programmers have ever seen bytecode because their tools do not 
encourage them to look. Most Java debugging tools do not allow 
step-by-step execution of bytecode; they either show source code 
lines or nothing.  


Fortunately, the JDK(tm) provides javap, a command-line tool 
that makes it easy to view bytecode. Let's see an example:


public class ByteCodeDemo {
    public static void main(String[] args) {
        System.out.println("Hello world");
    }
}

After you compile this class, you could open the .class file in a 
hex editor and translate the bytecodes by referring to the virtual 
machine specification. Using javap is far easier: it is a 
command-line disassembler that converts the bytecodes into 
human-readable mnemonics. You can get a bytecode listing by 
passing the '-c' flag to javap as follows: 

javap -c ByteCodeDemo

You should see output similar to this: 

public class ByteCodeDemo extends java.lang.Object {
    public ByteCodeDemo();
    public static void main(java.lang.String[]);
}
Method ByteCodeDemo()
   0 aload_0
   1 invokespecial #1 
   4 return
Method void main(java.lang.String[])
   0 getstatic #2 
   3 ldc #3 
   5 invokevirtual #4 
   8 return

From just this short listing, you can learn a lot about bytecode.  
Begin with the first instruction in the main method:

   0 getstatic #2 
   
The initial integer is the offset of the instruction in the method, 
so the first instruction begins with a '0'. The mnemonic for the 
instruction follows the offset. In this example, the 'getstatic' 
instruction pushes the value of a static field onto a data 
structure called the operand stack. Later instructions can use the 
value from this data structure. Following the getstatic mnemonic 
is the field to be pushed. In this case the field is identified as 
"#2", which here refers to the static field System.out. If you 
examined the bytecode directly, you would see that the field 
information is not embedded directly in the instruction. Instead, 
like all constants used by a Java class, the field information is 
stored in a shared pool called the constant pool. Storing field 
information in the constant pool reduces the size of the bytecode 
instructions, because the instructions only have to store an 
integer index into the constant pool instead of the entire 
constant. In this example, the field information is at location #2 
in the constant pool. The order of items in the constant pool is 
compiler dependent, so you might see a number other than '#2'. 

After analyzing the first instruction, it's easy to guess the 
meaning of the other instructions. The 'ldc' (load constant) 
instruction pushes the constant "Hello world" onto the operand 
stack. The 'invokevirtual' instruction invokes the println method, 
which pops its two arguments from the operand stack. Don't forget 
that an instance method such as println has two arguments: the 
obvious string argument, plus the implicit 'this' reference.  
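
The stack traffic described above can be imitated in ordinary Java 
code. The following is only a sketch, not how the virtual machine 
is implemented. The OperandStackSketch class and its run method are 
hypothetical names, a java.util.ArrayDeque stands in for the 
operand stack, and a private PrintStream stands in for System.out 
so that the printed text can be captured. It assumes a JDK newer 
than the 1.3 release used elsewhere in this tip, because it relies 
on generics and ArrayDeque.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.ArrayDeque;
import java.util.Deque;

class OperandStackSketch {
    // Imitates getstatic, ldc, and invokevirtual from main and
    // returns the text that println wrote.
    static String run() {
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        // Assumption: this stream is a stand-in for System.out.
        PrintStream out = new PrintStream(captured, true);

        Deque<Object> operandStack = new ArrayDeque<>();
        operandStack.push(out);            // getstatic: push the field's value
        operandStack.push("Hello world");  // ldc: push the string constant

        // invokevirtual: pop the arguments, last pushed first.
        String argument = (String) operandStack.pop();
        PrintStream receiver = (PrintStream) operandStack.pop(); // implicit 'this'
        receiver.println(argument);

        // println returns void, so nothing is left on the stack.
        if (!operandStack.isEmpty()) {
            throw new IllegalStateException("operand stack should be empty");
        }
        return captured.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Note that the receiver is pushed first and popped last, which is 
why getstatic appears before ldc in the listing.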

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
HOW BYTECODE PROTECTS YOU FROM MEMORY BUGS

The Java programming language is frequently touted as a "secure" 
language for internet software. Given that the code looks so 
much like C++ on the surface, where does this security come 
from? It turns out that an important aspect of security is the 
prevention of memory-related bugs. Computer criminals exploit 
memory bugs to sneak malicious code into otherwise safe programs.  
Java bytecode is a first line of defense against this sort of 
attack, as the following example demonstrates: 

    public float add(float f, int n) {
        return f + n;
    }

If you add this method to the previous example, recompile it, and 
run javap, you should see bytecode similar to this:

Method float add(float, int)
   0 fload_1
   1 iload_2
   2 i2f
   3 fadd
   4 freturn

At the beginning of a Java method, the virtual machine places 
method parameters in a data structure called the local variable 
table. As its name suggests, the local variable table also 
contains any local variables that you declare. In this example, 
the method begins with three local variable table entries, one for 
each of the three arguments to the add method. Slot 0 holds the 
'this' reference, while slots 1 and 2 hold the float and int 
arguments, respectively.  

To manipulate the variables, the method must first load (push) 
them onto the operand stack. The first instruction, fload_1, 
pushes the float at slot 1 onto the operand stack. The second 
instruction, iload_2, pushes the int at slot 2 onto the operand 
stack. The interesting thing about these instructions is the 'i' 
and 'f' prefixes, which illustrate that Java bytecode instructions 
are strongly typed. If the type of an argument does not match the 
type of the bytecode, the VM will reject the bytecode as unsafe.  
Better still, the bytecodes are designed so that these type-safety 
checks need only be performed once, at class load time.  
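
To make the local variable table and the typed instructions more 
concrete, here is a rough model of the add listing in plain Java. 
It is a sketch under stated assumptions, not the VM's real 
mechanism. TypedStackSketch is a hypothetical class, an Object 
array stands in for the local variable table, an ArrayDeque stands 
in for the operand stack, and each cast plays the role of an 
instruction's type requirement. One important difference: the 
casts fail while the code is running, whereas the VM rejects 
mistyped bytecode before it ever runs. The sketch also assumes a 
JDK with generics and autoboxing, which is newer than 1.3.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class TypedStackSketch {
    // Imitates: fload_1, iload_2, i2f, fadd, freturn.
    // locals[0] would hold 'this'; the sketch never reads it.
    static float add(Object[] locals) {
        Deque<Object> operandStack = new ArrayDeque<>();

        Float f = (Float) locals[1];      // fload_1 accepts only a float
        operandStack.push(f);
        Integer n = (Integer) locals[2];  // iload_2 accepts only an int
        operandStack.push(n);

        // i2f: replace the int on top of the stack with a float.
        int top = (Integer) operandStack.pop();
        operandStack.push(Float.valueOf((float) top));

        // fadd: pop two floats and push their sum.
        float right = (Float) operandStack.pop();
        float left = (Float) operandStack.pop();
        operandStack.push(Float.valueOf(left + right));

        return (Float) operandStack.pop(); // freturn
    }

    public static void main(String[] args) {
        System.out.println(add(new Object[] { null, 1.5f, 2 }));
    }
}
```

Passing a float where the int belongs, for example 
new Object[] { null, 1.5f, 2.0f }, makes the cast for iload_2 throw 
a ClassCastException, which loosely mirrors the VM refusing an 
iload on a slot that holds a float.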

How does this type-safety enhance security? If an attacker could 
trick the virtual machine into treating an int as a float, or vice 
versa, it would be easy to corrupt calculations in a predictable 
way. If these calculations involved bank balances, the security 
implications would be obvious. More dangerous still would be 
tricking the VM into treating an int as an Object reference. In 
most scenarios, this would crash the VM, but an attacker needs to 
find only one loophole. And don't forget that the attacker doesn't 
have to search by hand--it would be pretty easy to write a program 
that generated billions of permutations of bad byte codes, trying 
to find the lucky one that compromised the VM.

Another case where bytecode safeguards memory is array 
manipulation. The 'aastore' and 'aaload' bytecodes operate on 
Java arrays of object references, and they always check array 
bounds. These bytecodes throw an ArrayIndexOutOfBoundsException if 
the index falls outside the bounds of the array. Perhaps the most 
important checks of all apply to the branching instructions, for 
example, the bytecodes that begin with 'if'. In bytecode, 
branching instructions can only branch to another instruction 
within the same method. The only way to transfer control outside a 
method is to return, throw an exception, or execute one of the 
'invoke' instructions. Not only does this close the door on many 
attacks, it also prevents nasty bugs caused by dangling references 
or stack corruption. If you have ever had a system debugger open 
your program to a random location in code, you're familiar with 
these bugs.
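
The array bounds check is easy to observe from source code. In the 
sketch below, BoundsCheckSketch and storeIsRejected are 
hypothetical names chosen for illustration. The assignment to an 
element of a String array is the kind of statement that javac 
compiles to an 'aastore' instruction.

```java
class BoundsCheckSketch {
    // Attempts a store into an array of references and reports
    // whether the VM refused it.
    static boolean storeIsRejected(String[] names, int index) {
        try {
            names[index] = "x";   // compiled to an aastore
            return false;         // the store succeeded
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;          // the index was out of bounds
        }
    }

    public static void main(String[] args) {
        String[] names = new String[2];
        System.out.println(storeIsRejected(names, 1)); // valid index
        System.out.println(storeIsRejected(names, 2)); // one past the end
    }
}
```

The check happens on every store, so an out-of-range index raises 
an exception instead of silently overwriting neighboring memory.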

The critical point to remember about all of these checks is that 
they are made by the virtual machine at the bytecode level, not 
just by the compiler. A compiler for a language such as C++ might 
prevent some of the memory errors discussed above, but its 
protection applies only at the source code level. Operating 
systems will happily load and execute any machine code, whether 
the code was generated by a careful C++ compiler or a malicious 
attacker. In short, C++ is object-oriented only at the source code 
level, whereas Java's object-oriented features extend down to the 
compiled code.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
ANALYZING BYTECODE TO IMPROVE YOUR CODE

The memory and security protections of Java bytecode are there for 
you whether you notice them or not, so why bother looking at the 
bytecode? In many cases, knowing how the compiler translates your 
code into bytecode can help you write more efficient code, and can 
sometimes even prevent insidious bugs. Consider the 
following example:

    // return the concatenation str1+str2
    String concat1(String str1, String str2) {
        return str1 + str2;
    }
 
    // append str2 to str1
    void concat2(StringBuffer str1, String str2) {
        str1.append(str2);
    }

Try to guess how many method calls each method requires to 
execute. Now compile the methods and run javap. You should see 
output like this:

Method java.lang.String concat1(java.lang.String, java.lang.String)
   0 new #5 
   3 dup
   4 invokespecial #6 
   7 aload_1
   8 invokevirtual #7 
  11 aload_2
  12 invokevirtual #7 
  15 invokevirtual #8 
  18 areturn

Method void concat2(java.lang.StringBuffer, java.lang.String)
   0 aload_1
   1 aload_2
   2 invokevirtual #7 
   5 pop
   6 return

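The concat1 method executes five costly instructions: a new (which 
allocates a StringBuffer), an invokespecial (which calls its 
constructor), and three invokevirtuals (two append calls and one 
toString call). That is quite a bit more work than the concat2 
method, which makes only a single invokevirtual call. Most 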
Java programmers have been warned that because Strings are 
immutable it is more efficient to use StringBuffers for 
concatenation. Using javap to analyze this makes the point in 
dramatic fashion.  If you are unsure whether two language 
constructs are equivalent in performance, you should use javap 
to analyze the bytecode. Beware of the just-in-time (JIT) 
compiler, though. Because the JIT compiler recompiles the 
bytecodes into native machine code, it can apply additional 
optimizations that your javap analysis does not reveal. Unless 
you have the source code for your virtual machine, you need to 
supplement your bytecode analysis with performance benchmarks.
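
One way to read the concat1 listing is to write the same steps out 
by hand. The sketch below does that. ConcatSketch and 
concat1Expanded are hypothetical names, and the expansion is an 
approximation of the listing rather than the compiler's exact 
output. Newer compilers may translate the '+' operator differently 
(for example, with StringBuilder or an invokedynamic call), so 
javap output from a current JDK might not match the listing above.

```java
class ConcatSketch {
    // The source form from the tip.
    static String concat1(String str1, String str2) {
        return str1 + str2;
    }

    // Roughly what the concat1 listing does, written out by hand.
    static String concat1Expanded(String str1, String str2) {
        StringBuffer buffer = new StringBuffer(); // new, dup, invokespecial
        buffer.append(str1);                      // invokevirtual
        buffer.append(str2);                      // invokevirtual
        return buffer.toString();                 // invokevirtual, areturn
    }

    // Reuses the caller's buffer, so a single call is enough.
    static void concat2(StringBuffer str1, String str2) {
        str1.append(str2);
    }

    public static void main(String[] args) {
        System.out.println(concat1("byte", "code"));
        System.out.println(concat1Expanded("byte", "code"));
        StringBuffer buffer = new StringBuffer("byte");
        concat2(buffer, "code");
        System.out.println(buffer);
    }
}
```

All three calls in main produce the same text. The difference is 
in how much temporary work each one performs to get there.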

A final example illustrates how examining bytecode can help 
prevent bugs in your application. Create two classes as 
follows. Make sure they are in separate files.

public class ChangeALot {
    public static final boolean debug=false;
    public static boolean log=false;
}

public class EternallyConstant {
    public static void main(String [] args) {
        System.out.println("EternallyConstant beginning execution"); 
        if (ChangeALot.debug)
            System.out.println("Debug mode is on");
        if (ChangeALot.log)
            System.out.println("Logging mode is on");
    }
}
 
If you run the class EternallyConstant you should get the message:

    EternallyConstant beginning execution

Now try editing the ChangeALot file, modifying the debug and log 
variables to both be true. Recompile only the ChangeALot file.  
Run EternallyConstant again, and you will see the following 
output:

    EternallyConstant beginning execution
    Logging mode is on
    
What happened to the debugging mode? Even though you set debug to 
true, the message "Debug mode is on" didn't appear. The answer is 
in the bytecode. Run javap on the EternallyConstant class, and you 
will see this: 

Method void main(java.lang.String[])
   0 getstatic #2 
   3 ldc #3 
   5 invokevirtual #4 
   8 getstatic #5 
  11 ifeq 22
  14 getstatic #2 
  17 ldc #6 
  19 invokevirtual #4 
  22 return
  
Surprise! While there is an 'ifeq' check on the log field, the 
code does not check the debug field at all. Because the debug 
field was marked final and initialized with a constant, the 
compiler knew that the debug field could never change at runtime. 
Therefore, it optimized the 'if' statement by removing it 
entirely. This is a very useful optimization, because it allows 
you to embed debugging code in your application and pay no runtime 
penalty when the switch is set to false. Unfortunately, this 
optimization can lead to major confusion at compile time. If you 
change a final field, you have to remember to recompile any other 
class that might reference the field, because the 'reference' 
might have been optimized away. Java development environments do 
not always detect this subtle dependency, which can lead to very 
odd bugs. So the old C++ adage still holds for the Java 
environment: "When in doubt, rebuild all."
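
The effect can be pictured by placing the source next to what the 
compiled main effectively does. This is a sketch for illustration 
only. InlinedConstantSketch, asWritten, and asCompiled are 
hypothetical names, the messages are collected in a list instead 
of being printed, and the two parameters stand for the values that 
ChangeALot holds after it alone has been recompiled. The sketch 
assumes a JDK with generics.

```java
import java.util.ArrayList;
import java.util.List;

class InlinedConstantSketch {
    // What the source of EternallyConstant says: both flags are read.
    static List<String> asWritten(boolean currentDebug, boolean currentLog) {
        List<String> lines = new ArrayList<>();
        lines.add("EternallyConstant beginning execution");
        if (currentDebug) {
            lines.add("Debug mode is on");
        }
        if (currentLog) {
            lines.add("Logging mode is on");
        }
        return lines;
    }

    // What the stale EternallyConstant.class effectively does. The
    // debug test is gone because debug was the constant false when
    // the class was compiled, so currentDebug is never read.
    static List<String> asCompiled(boolean currentDebug, boolean currentLog) {
        List<String> lines = new ArrayList<>();
        lines.add("EternallyConstant beginning execution");
        if (currentLog) { // the getstatic and ifeq pair in the listing
            lines.add("Logging mode is on");
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(asWritten(true, true));
        System.out.println(asCompiled(true, true));
    }
}
```

With both flags set to true, asWritten reports three messages 
while asCompiled reports only two, which matches the output seen 
after recompiling only ChangeALot.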

Knowing a little bytecode is a valuable aid to any programmer 
coding in the Java programming language. The javap tool makes it 
easy to view bytecodes. Occasionally checking your code with javap 
can be invaluable in improving performance and catching 
particularly elusive bugs.

There is substantially more complexity to bytecode and the VM 
than this tip can cover. To learn more, read Inside the Java 
Virtual Machine by Bill Venners.  

.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .

* As used in this document, the terms "Java virtual machine" 
  or "JVM" mean a virtual machine for the Java platform.