Do you remember watching those movies where a hacker would sit down at a terminal, type a bunch of jargon into a prompt, and within seconds the screen suddenly would flash "ACCESS GRANTED!" in large flashing letters?
With a few magic keystrokes, the entire system would somehow just start transferring funds to foreign bank accounts, or spit out data to an external disk drive. Well, that might account for the money laundering and espionage back in the 1980s, but what about all of those illegal patches for software, illegal registration codes, and common software piracy issues today? This article takes a look at how you can protect your applications from attack by applying obfuscation techniques to convolute your source code.
Although it may not be obvious yet, understanding how compilers work is one of the linchpins of protecting your source code. We all know that the full, unobfuscated source code to a piece of software outlines every detail of its implementation, including things like how it knows if you have entered a valid registration number to officially purchase the product; how it determines if a valid password has been entered for a particular user; how it determines if you're playing a song on an authorized machine; and so on. A compiler simply takes that piece of source code, ensures that it meets the grammar rules of a particular language, and transforms it into the mysterious binary file that executes on your machine.
In all, there are four basic steps in the compilation process. For each step, I'll give an analogy from the English language.
Lexical Analysis: Tokenizes the source code and determines if individual tokens are valid.
In English, the malformed sentence, "Are delicious the burritos" consists of four words delimited by spaces that can be found in a standard English dictionary.
In code, the lexical analyzer tokenizes code based on pattern-matching rules defined by regular expressions. The regular expressions define the reserved words, identifiers, syntax punctuation, and so on.
Parsing: Repeatedly calls the lexical analyzer and determines if the sequence of tokens in the source code is valid based on grammar rules.
In English, "Are delicious the burritos" doesn't follow standard grammar rules (unless you're Yoda), but "The burritos are delicious" is a well-formed sentence and does follow standard grammar rules.
In code, grammar rules are enforced along similar lines. For example, := 1 + foo; is not a valid statement in Java even though each of the tokens is recognizable.
Semantic Analysis: Ensures that the sequence of tokens makes sense based on context-sensitive information.
In English, "Burritos go to the gym five times a week" is a well-formed sentence following grammar rules, but doesn't make any sense.
In code, if a unique procedure declaration requires a single floating point value as a parameter and you give it a string value, the semantics don't make sense.
Code Generation: Translates the source code into machine code. Optional optimizations can occur just before code generation if particular compiler flags are set.
From English to Spanish, "The burritos are very delicious" becomes "Los burritos son muy deliciosos".
In code, once you have proper lexical analysis, grammar structure, and semantic analysis, a straightforward transformation can change the source code to machine or assembly code. (Machine code is simply a 1-to-1 transformation from assembly code.)
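To make the first two steps concrete, here is a minimal sketch of a regular-expression lexer and a tiny grammar check, written by hand for illustration. The class name TinyLexer, the token names, and the one-rule grammar (an identifier, "=", operands joined by "+", then ";") are all assumptions of mine; a real Java compiler is far more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy lexer and parser; not how javac is implemented.
class TinyLexer {
    // One named group per token class, each preceded by optional whitespace.
    private static final Pattern TOKEN = Pattern.compile(
        "\\s*(?:(?<NUMBER>\\d+)|(?<IDENT>[A-Za-z_]\\w*)|(?<OP>[:=+;]))");

    // Lexical analysis: split the source into tokens or fail on an unknown character.
    static List<String> tokenize(String source) {
        String input = source.trim();
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unrecognized character at index " + pos);
            }
            if (m.group("NUMBER") != null) {
                tokens.add("NUMBER");
            } else if (m.group("IDENT") != null) {
                tokens.add("IDENT");
            } else {
                tokens.add(m.group("OP")); // punctuation is its own token type
            }
            pos = m.end();
        }
        return tokens;
    }

    private static boolean isOperand(String token) {
        return token.equals("NUMBER") || token.equals("IDENT");
    }

    // Parsing: accept only IDENT "=" operand ("+" operand)* ";"
    static boolean isAssignment(List<String> t) {
        int n = t.size();
        if (n < 4 || !t.get(0).equals("IDENT") || !t.get(1).equals("=")
                || !t.get(n - 1).equals(";")) {
            return false;
        }
        // Between "=" and ";" the tokens must alternate: operand, "+", operand, ...
        for (int i = 2; i < n - 1; i++) {
            String tok = t.get(i);
            boolean ok = (i % 2 == 0) ? isOperand(tok) : tok.equals("+");
            if (!ok) {
                return false;
            }
        }
        return n % 2 == 0; // the expression must end on an operand
    }

    public static void main(String[] args) {
        System.out.println(isAssignment(tokenize("foo = 1 + bar;"))); // prints true
        System.out.println(isAssignment(tokenize(":= 1 + foo;")));    // prints false
    }
}
```

Notice that ":= 1 + foo;" gets through the lexer without complaint, since every character belongs to a known token; it is the grammar check that turns it away, just like "Are delicious the burritos".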

An overview of the compilation process.
It's important to understand how the first three steps protect us from creating possibly disastrous executable code in the code-generation step, and to understand the unique role of each step in this modular process.
A full explanation of compilers and language theory would involve a digression to the theory of computation, span several textbooks, and be a painful (but enlightening) education, so we'll skip it for now. But if you're interested in compiler theory or get the itch to try creating your own compiler or interpreter, check out dinosaur.compilertools.net. It is a good starting point for the very same tools used to develop many of the compilers and virtual machines in use today.
Lex and Yacc (or their GNU counterparts Flex and Bison) can completely automate the lexing and parsing steps, which would otherwise be extremely taxing. Semantic analysis is the step where human brain power must kick into overdrive, but tools like Memphis can help. Once you get past that phase, you'll find that the code-generation phase is cake. Rest assured that if you can successfully create your own compiler/interpreter for even a very small subset of a language, then a lot of things in the world will never appear the same. Proceed with extreme caution (grin).
The moral of the story: the compilation process is fairly mechanical and machine code is simply a binary translation of source code; there's no black magic involved, though it might seem so sometimes. Machine code can be transformed back into somewhat readable source code again with enough blood, sweat, and tears. You don't have to be a compiler guru to protect your source code, but knowing how compilation works does give you the fundamental understanding of why protection techniques like obfuscation can be very effective against hacking.
Most people understand how encryption works, but obfuscation might raise a few eyebrows. Both of these techniques are used for securing information, but each serves a slightly different purpose. Let's take a moment to review each of them.
Encryption is often used to protect information being stored or communicated as a message. The technique we're all familiar with is a simple cipher where one letter is exchanged for another. For example, consider a cipher where each letter becomes its successor in the alphabet. A message that says "hello" would then become "ifmmp". While simple ciphers might sometimes be tedious to solve by hand, they stand no chance against statistical analysis or even someone willing to spend enough time to work through it during a boring meeting or a plane ride. More advanced ciphers such as Blowfish, however, can be very secure because they encrypt the data in a way that's much more difficult to reverse.
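The successor cipher described above fits in a few lines of Java. This is a minimal sketch; the class name ShiftCipher and the choice to pass non-letters through untouched are my own assumptions.

```java
// Hypothetical helper illustrating a simple letter-shift cipher.
class ShiftCipher {
    // Shifts each ASCII letter by `key` positions, wrapping around the alphabet.
    static String shift(String text, int key) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                out.append((char) ('a' + Math.floorMod(c - 'a' + key, 26)));
            } else if (c >= 'A' && c <= 'Z') {
                out.append((char) ('A' + Math.floorMod(c - 'A' + key, 26)));
            } else {
                out.append(c); // leave digits, spaces, and punctuation alone
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(shift("hello", 1));  // prints ifmmp
        System.out.println(shift("ifmmp", -1)); // prints hello
    }
}
```

With only 25 useful keys, an attacker can simply try every shift and read off the one that produces English, which is why this kind of cipher is a toy.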
Another widely known encryption technique you should be aware of is RSA encryption. In short, it allows others to encrypt and send you information by using a public key (a pair of numbers) that you have made known to the entire world--but once the message is encrypted, it can only be decrypted with a private key that you keep secret. A nice overview of the RSA encryption process can be found here. And to think that mathematicians used to get laughed at for studying prime numbers all of those years.
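To see the public/private split in action, here is a toy RSA round trip using java.math.BigInteger. The primes 61 and 53 and the exponent 17 are a common textbook example and are absurdly small; the class name ToyRsa is my own. Treat it as a sketch of the arithmetic only, since real RSA uses enormous primes and padding schemes that are omitted here.

```java
import java.math.BigInteger;

// Toy RSA with tiny textbook primes; for illustration only, never for real use.
class ToyRsa {
    static final BigInteger P = BigInteger.valueOf(61);   // secret prime
    static final BigInteger Q = BigInteger.valueOf(53);   // secret prime
    static final BigInteger N = P.multiply(Q);            // public modulus, 3233
    static final BigInteger E = BigInteger.valueOf(17);   // public exponent
    static final BigInteger PHI =
        P.subtract(BigInteger.ONE).multiply(Q.subtract(BigInteger.ONE)); // 3120
    static final BigInteger D = E.modInverse(PHI);        // private exponent, 2753

    // Anyone who knows the public pair (N, E) can encrypt.
    static BigInteger encrypt(BigInteger message) {
        return message.modPow(E, N);
    }

    // Only the holder of D can decrypt.
    static BigInteger decrypt(BigInteger cipher) {
        return cipher.modPow(D, N);
    }

    public static void main(String[] args) {
        BigInteger message = BigInteger.valueOf(65);
        BigInteger cipher = encrypt(message);
        System.out.println(cipher);          // prints 2790
        System.out.println(decrypt(cipher)); // prints 65
    }
}
```

The public half is the pair of numbers (N, E). Recovering D from them means factoring N back into P and Q, which is trivial for 3233 and believed to be infeasible for moduli that are hundreds of digits long.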
If you haven't already, convince yourself that no encryption technique is bulletproof. All encryptions can be reversed given the right key(s). With that said, the guiding principle of cryptography is deterrence: to make the chance of obtaining the key by brute force or sophomoric techniques so minuscule that it might as well be impossible. This is usually accomplished by making the total number of possible combinations that must be tried by brute force methods an extremely large number. Simple ciphers have their place in the world for fooling spam bots and with secret decoder rings, but industrial techniques are desirable for any real application these days. Although we won't digress into an in-depth cryptography example here, you should check out MulleCipher, a very nice framework with complete source code and an example application that's easily adaptable for your projects.
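A quick back-of-the-envelope calculation shows how fast the number of combinations gets out of hand. The sketch below assumes an attacker who can test one billion keys per second; that rate, the class name KeySpace, and the key sizes in main are illustrative assumptions rather than measurements of any real hardware.

```java
import java.math.BigInteger;

// Hypothetical estimate of brute-force effort for a key of a given length.
class KeySpace {
    private static final BigInteger SECONDS_PER_YEAR =
        BigInteger.valueOf(365L * 24 * 60 * 60);

    // Number of distinct keys of the given bit length: 2^bits.
    static BigInteger keyCount(int bits) {
        return BigInteger.ONE.shiftLeft(bits);
    }

    // Whole years needed to try every key at the given guess rate.
    static BigInteger yearsToTryAll(int bits, long guessesPerSecond) {
        return keyCount(bits)
            .divide(BigInteger.valueOf(guessesPerSecond))
            .divide(SECONDS_PER_YEAR);
    }

    public static void main(String[] args) {
        long rate = 1_000_000_000L; // assumed: one billion guesses per second
        for (int bits : new int[] {40, 56, 128}) {
            System.out.println(bits + "-bit key: " + keyCount(bits)
                + " keys, about " + yearsToTryAll(bits, rate)
                + " years to try them all");
        }
    }
}
```

At that assumed rate a 40-bit key space is exhausted in well under a year, while a 128-bit key space works out to something on the order of 10^22 years. That gap is the deterrence at work.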
The mechanical process that can be followed to obtain source code from machine code is what leads us to obfuscation: convoluting the code so much that even if you have its source, you still probably won't be able to understand it. We'll look at some automated tools to accomplish obfuscation, but let's first consider a very simple example to nail down the concept. A typical "Hello World!" program in Java looks like the following:
public class HelloWorld {
    public static void main(String args[]) {
        System.out.println("Hello World!");
    }
}
So far, so good. Now, let's make things a little more interesting. Would you be able to determine what the following piece of code does without running it? I've simply added a couple of loops and conditionals.
public class HelloWorld {
    public static void main(String args[]) {
        double d1 = 0.0134654879927;
        double d2 = 0.0234987519084;
        for (int i1 = 0; i1 < 72; i1++) {
            d1 = d2 + 0.00000001020102;
        }
        for (int i1 = 0; i1 < 59; i1++) {
            d2 = d1 + 0.00000001120102;
        }
        //System.out.println(d1+d2);
        if ((d1+d2) > 0.04699753441986 ) {
            System.out.println("Hello World!");
        }
        else if ((d1+d2) < 0.04699753441186) {
            System.out.println("Goodbye World!");
        }
        //This chain of alternatives could go on for a
        //long time...
    }
}
Generic variable names, some annoying loops, and a couple of conditionals sure can make a difference! For the cost of determining what this simple piece of code does, are you even willing to paste it in and run it? Would you be willing to pull out a calculator and do the arithmetic? What if you could only use your brain and no additional tools? Somewhere along the line, you'd reach a point where the reward wouldn't be worth the effort anymore and give up. Along these same lines, the vast majority of software pirates won't spend 500 hours reverse engineering and patching a simple $10 shareware application. A little deterrence will go a long way.
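If you do want to check the arithmetic without squinting at floating-point output, the sketch below redoes it with BigDecimal so that every digit is exact. It relies on one observation about the listing above: each loop assigns the same value on every pass, so a single pass gives the same result. The class name ObfuscatedHelloCheck is my own.

```java
import java.math.BigDecimal;

// Hypothetical checker that reproduces the obfuscated program's arithmetic exactly.
class ObfuscatedHelloCheck {
    static final BigDecimal UPPER = new BigDecimal("0.04699753441986");
    static final BigDecimal LOWER = new BigDecimal("0.04699753441186");

    // d1 + d2 after both loops; the initial value of d1 is overwritten and never used.
    static BigDecimal sum() {
        BigDecimal d2 = new BigDecimal("0.0234987519084");
        BigDecimal d1 = d2.add(new BigDecimal("0.00000001020102")); // first loop
        d2 = d1.add(new BigDecimal("0.00000001120102"));            // second loop
        return d1.add(d2);
    }

    // Mirrors the if / else-if chain of the obfuscated program.
    static String message() {
        BigDecimal total = sum();
        if (total.compareTo(UPPER) > 0) {
            return "Hello World!";
        } else if (total.compareTo(LOWER) < 0) {
            return "Goodbye World!";
        }
        return ""; // neither branch fires
    }

    public static void main(String[] args) {
        System.out.println(sum());     // prints 0.04699753541986
        System.out.println(message()); // prints Hello World!
    }
}
```

The sum clears the first threshold by exactly 0.000000001, so after all that noise the program still just prints "Hello World!".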
The University of Arizona has developed a very nice Java tool called Sandmark that is used to test and study software watermarking. As part of that study, it implements many well-known obfuscation algorithms and provides a GUI for a tool called Soot, an optimization tool that can also decompile bytecode. These two tools together allow you to take Java code, obfuscate it, and then decompile it to see the effect of the obfuscations. Let's get started with an example to put all of this together:
Download Sandmark. Get the executable sandmark.jar file and all the supporting jar files: BCEL.jar, bloat-1.0.jar, dynamicjava.jar, and junit.jar. Place these files in /Library/Java/Home.
Download Soot. You'll want the three precompiled jar files: sootclasses-2.2.1.jar, jasminclasses-2.2.1.jar, and polyglotclasses-1.3.jar available from the main page. Place them in /Library/Java/Home.
Download the jDecompile script. Add the directory that contains the script to your $PATH by typing export PATH=$PATH:/path/to/directory, where /path/to/directory is the folder holding jDecompile (entries in $PATH are directories, not individual files). Also, make the script executable with chmod u+x jDecompile. If you decide to use this tool a lot, you'll want to permanently add it to your path by modifying your .bashrc file. Make sure you're running Sandmark from the same Terminal you used to place this script in your path or else Sandmark won't find the script and you'll get errors.
To start up Sandmark, navigate to the sandmark.jar file with Terminal and execute it by typing java -jar sandmark.jar. The toolbar up top expands with a button on the far right, and each tab has its own specialized help menu, which actually is pretty helpful. For instance, the jDecompile script you downloaded is an adaptation of a script from the "Decompile" tab that I tailored for OS X.
Sandmark has an easy-to-use GUI and a great help system.
To do a quick obfuscation of source code, you can choose a particular obfuscation algorithm from the "Obfuscate" tab, or let Sandmark apply a variety of obfuscation algorithms from the "Quick Protect" tab.
For an overview of the algorithms available in Sandmark, check here. The only caveat is that Sandmark expects a jar file. (If you'd like an overview of creating and working with jar files before jumping into an example with Sandmark, check here.)
Save the following Java code to a file called IfElseDemo.java:
public class IfElseDemo {
    public static void main(String[] args) {
        int testscore = 76;
        char grade;
        if (testscore >= 90) {
            grade = 'A';
        } else if (testscore >= 80) {
            grade = 'B';
        } else if (testscore >= 70) {
            grade = 'C';
        } else if (testscore >= 60) {
            grade = 'D';
        } else {
            grade = 'F';
        }
        System.out.println("Grade = " + grade);
    }
}
Let's apply an obfuscation algorithm to this simple example and then decompile the obfuscated bytecode to see the difference.
Compile IfElseDemo in Terminal.
Save a file called IfElseDemo.java containing the example code above.
Type javac IfElseDemo.java to compile.
Type java IfElseDemo to verify that the code runs.
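Create a plain-text file called mainClass and place this line in it: "Main-Class: IfElseDemo" (no quotes). End the line with a newline, or the jar tool may not parse it properly.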
Back on the command line, type jar cmf mainClass IfElseDemo.jar IfElseDemo.class.
Verify that the jar file is created and type java -jar IfElseDemo.jar to verify that it executes properly.
Obfuscate the IfElseDemo.jar in Sandmark.
Choose the "Obfuscate" tab, and select the "Merge Local Integers" algorithm. Since our example code is primarily dependent upon integers for its logic, this looks like a good choice.
Name the output file IfElseDemo_obfuscated.jar.
Click on "Obfuscate".
Verify that IfElseDemo_obfuscated.jar exists and execute it with java -jar IfElseDemo_obfuscated.jar.
Decompile IfElseDemo_obfuscated.jar with Sandmark to see the difference.
Choose the "Decompile" tab by extending the tabs with the arrow button on the far right.
Choose the IfElseDemo_obfuscated.jar as your input file.
Type "IfElseDemo" (no quotes) into the "Class" text box.
Leave the "Classpath" text box blank.
Click on "Decompile".
If all goes well, a preview of the obfuscated source code opens up that is quite a bit harder to understand. If you have trouble with the decompiling portion, make sure your path is set correctly for the Terminal window in which you're running Sandmark.
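If you'd rather script the packaging step than type the jar command by hand, the standard java.util.jar classes can build the same kind of archive. This is a minimal sketch under my own assumptions: the class name JarBuilder and its build method are hypothetical, and it handles a single class file in the default package, which is all the IfElseDemo example needs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical stand-in for "jar cmf mainClass IfElseDemo.jar IfElseDemo.class".
class JarBuilder {
    static void build(Path classFile, String mainClass, Path jarFile) throws IOException {
        Manifest manifest = new Manifest();
        Attributes attributes = manifest.getMainAttributes();
        // Without a Manifest-Version the main attributes are not written out.
        attributes.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attributes.put(Attributes.Name.MAIN_CLASS, mainClass);

        try (JarOutputStream out =
                 new JarOutputStream(Files.newOutputStream(jarFile), manifest)) {
            // Store the class at the root of the jar (default package).
            out.putNextEntry(new JarEntry(classFile.getFileName().toString()));
            Files.copy(classFile, out);
            out.closeEntry();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes IfElseDemo.class already exists in the current directory.
        build(Paths.get("IfElseDemo.class"), "IfElseDemo", Paths.get("IfElseDemo.jar"));
    }
}
```

The resulting file should run with java -jar IfElseDemo.jar just like the one made on the command line, and it can be fed to Sandmark in the same way.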
Although this example doesn't unlock any of the secrets of the universe, it does illustrate how effective obfuscation can be for even a simple example. Now imagine applying various obfuscation techniques to thousands of lines of more complex code.
If you take a look at the algorithms Sandmark offers, you'll notice that there are scores of confusing possibilities. Refactoring inheritance hierarchies, introducing confusing arithmetic operations, and introducing buggy variations of existing code blocks that never get executed are just a few of the possibilities. Keep in mind that you might want to just obfuscate the sensitive portions of your code, because the obfuscation can impose size and performance penalties. The penalties may or may not make a difference; it's a trade-off you have to measure and consider.
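To give a feel for two of those ideas, confusing arithmetic and buggy code that never runs, here is a hand-written sketch applied to the grading logic from IfElseDemo. It is my own illustration, not the output of any Sandmark algorithm, and the class and method names are hypothetical. The trick is a condition that is always true but doesn't look it: the product of two consecutive integers is always even.

```java
// Hand-written illustration of an always-true guard hiding a bogus code path.
class OpaqueGrade {
    // Obfuscated variant: same results as plainGrade, harder to read.
    static char grade(int testscore) {
        int x = testscore * 7 + 3;       // meaningless-looking arithmetic
        if ((x * (x + 1)) % 2 == 0) {    // always true: x and x + 1 differ in parity
            if (testscore >= 90) return 'A';
            if (testscore >= 80) return 'B';
            if (testscore >= 70) return 'C';
            if (testscore >= 60) return 'D';
            return 'F';
        } else {
            // Never executed: a deliberately wrong variant to mislead the reader.
            return testscore > 50 ? 'A' : 'B';
        }
    }

    // The readable original, kept here for comparison.
    static char plainGrade(int testscore) {
        if (testscore >= 90) return 'A';
        if (testscore >= 80) return 'B';
        if (testscore >= 70) return 'C';
        if (testscore >= 60) return 'D';
        return 'F';
    }

    public static void main(String[] args) {
        System.out.println("Grade = " + grade(76)); // prints Grade = C
    }
}
```

The extra multiply and branch cost next to nothing here, but layered across a large program this sort of padding is exactly where the size and performance penalties come from.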
In a world where everyone follows license agreements and no one wants to reverse engineer government secrets, obfuscation techniques wouldn't be of much use. Since we don't live in the Shire, however, security measures have their place and are just one of the many things that keep the world spinning. Hopefully, you now have a better feel for the compilation process and understand how obfuscation is a powerful tool you can use to protect your code from exploitation and hacking.
Matthew Russell is a computer scientist from middle Tennessee and serves Digital Reasoning Systems as the Director of Advanced Technology. Hacking and writing are two activities essential to his renaissance man regimen.
Copyright © 2009 O'Reilly Media, Inc.