
Protect Your Source Code: Obfuscation 101

by Matthew Russell

Do you remember watching those movies where a hacker would sit down at a terminal, type a bunch of jargon into a prompt, and within seconds the screen suddenly would flash "ACCESS GRANTED!" in large flashing letters?

With a few magic keystrokes, the entire system would somehow just start transferring funds to foreign bank accounts, or spit out data to an external disk drive. Well, that might account for the money laundering and espionage back in the 1980s, but what about all of those illegal patches for software, illegal registration codes, and common software piracy issues today? This article takes a look at how you can protect your applications from attack by applying obfuscation techniques to convolute your source code.

Overview of Compilation

Although it may not be obvious yet, understanding how compilers work is one of the linchpins of protecting your source code. We all know that the full, unobfuscated source code to a piece of software outlines every detail of its implementation, including things like how it knows if you have entered a valid registration number to officially purchase the product; how it determines if a valid password has been entered for a particular user; how it determines if you're playing a song on an authorized machine; and so on. A compiler simply takes that piece of source code, ensures that it meets the grammar rules of a particular language, and transforms it into the mysterious binary file that executes on your machine.

In all, there are four basic steps in the compilation process. For each step, I'll give an analogy from the English language.

  • Lexical Analysis: Tokenizes the source code and determines if individual tokens are valid.

    • In English, the malformed sentence, "Are delicious the burritos" consists of four words delimited by spaces that can be found in a standard English dictionary.

    • In code, the lexical analyzer tokenizes code based on pattern-matching rules defined by regular expressions. The regular expressions define the reserved words, identifiers, syntax punctuation, and so on.

  • Parsing: Repeatedly calls the lexical analyzer and determines if the sequence of tokens in the source code is valid based on grammar rules.

    • In English, "Are delicious the burritos" doesn't follow standard grammar rules (unless you're Yoda), but "The burritos are delicious" is a well-formed sentence and does follow standard grammar rules.

    • In code, grammar rules are enforced along similar lines. For example, := 1 + foo; is not a valid statement in Java even though each of the tokens is recognizable.

  • Semantic Analysis: Ensures that the sequence of tokens makes sense based on context-sensitive information.

    • In English, "Burritos go to the gym five times a week" is a well-formed sentence following grammar rules, but doesn't make any sense.

    • In code, if a unique procedure declaration requires a single floating point value as a parameter and you give it a string value, the semantics don't make sense.

  • Code Generation: Translates the source code into machine code. Optional optimizations can occur just before code generation if particular compiler flags are set.

    • From English to Spanish, "The burritos are very delicious" becomes "Los burritos son muy deliciosos".

    • In code, once you have proper lexical analysis, grammar structure, and semantic analysis, a straightforward transformation can change the source code to machine or assembly code. (Machine code is simply a 1-to-1 transformation from assembly code.)
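The lexical-analysis step described above can be sketched in a few lines of Java. This is only a minimal illustration, not how any production compiler is written: the class name ToyLexer, the four token categories, and their regular expressions are all assumptions made up for this example. It shows that the statement := 1 + foo; passes lexical analysis, because every token is recognizable, even though a parser would later reject it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy lexer; the token categories and patterns are illustrative only.
class ToyLexer {
    // One named group per token category. Leading whitespace is skipped,
    // then the first alternative that matches at the current position wins.
    private static final Pattern TOKEN = Pattern.compile(
        "\\s*(?:(?<NUMBER>\\d+)|(?<IDENT>[A-Za-z_]\\w*)|(?<ASSIGN>:=)|(?<PUNCT>[+;]))");

    // Splits the source into labeled tokens, or throws if a character
    // matches none of the patterns (a lexical error).
    static List<String> tokenize(String source) {
        String text = source.trim();
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        int pos = 0;
        while (pos < text.length()) {
            m.region(pos, text.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unrecognized input near offset " + pos);
            }
            if (m.group("NUMBER") != null) {
                tokens.add("NUMBER(" + m.group("NUMBER") + ")");
            } else if (m.group("IDENT") != null) {
                tokens.add("IDENT(" + m.group("IDENT") + ")");
            } else if (m.group("ASSIGN") != null) {
                tokens.add("ASSIGN");
            } else {
                tokens.add("PUNCT(" + m.group("PUNCT") + ")");
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [ASSIGN, NUMBER(1), PUNCT(+), IDENT(foo), PUNCT(;)]
        System.out.println(tokenize(":= 1 + foo;"));
    }
}
```

Deciding that this token sequence is not a legal statement is the parser's job, which is why the two steps are kept separate.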

An overview of the compilation process.

It's important to understand how the first three steps protect us from creating possibly disastrous executable code in the code-generation step, and to understand the unique role of each step in this modular process.

A full explanation of compilers and language theory would involve a digression to the theory of computation, span several textbooks, and be a painful (but enlightening) education, so we'll skip it for now. But if you're interested in compiler theory or get the itch to try creating your own compiler or interpreter, a good starting point is the tools described next, the very same tools used to develop many of the compilers and virtual machines in use today.

Lex and Yacc (or their GNU counterparts Flex and Bison) can completely automate the lexing and parsing steps, which would otherwise be extremely taxing. Semantic analysis is the step where human brain power must kick into overdrive, but tools like Memphis can help. Once you get past that phase, you'll find that the code-generation phase is a piece of cake. Rest assured that if you can successfully create your own compiler/interpreter for even a very small subset of a language, then a lot of things in the world will never appear the same. Proceed with extreme caution (grin).

The moral of the story: the compilation process is fairly mechanical and machine code is simply a binary translation of source code; there's no black magic involved, though it might seem so sometimes. Machine code can be transformed back into somewhat readable source code again with enough blood, sweat, and tears. You don't have to be a compiler guru to protect your source code, but it does provide the fundamental understanding for why protection techniques like obfuscation can be very effective for protecting your source code from hacking.

Encryption and Obfuscation

Most people understand how encryption works, but obfuscation might raise a few eyebrows. Both of these techniques are used for securing information, but each serves a slightly different purpose. Let's take a moment to review each of them.

Encryption is often used to protect information being stored or communicated as a message. The technique we're all familiar with is a simple cipher where one letter is exchanged for another. For example, consider a cipher where each letter becomes its successor in the alphabet. A message that says "hello" would then become "ifmmp". While simple ciphers might sometimes be tedious to solve by hand, they stand no chance against statistical analysis or even someone willing to spend enough time to work through it during a boring meeting or a plane ride. More advanced ciphers such as Blowfish, however, can be very secure because they encrypt the data in a way that's much more difficult to reverse.
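The successor cipher just described fits in a handful of lines. The sketch below is only an illustration: the class name ShiftCipher and the choice to leave everything except lowercase a-z untouched are assumptions made for this example.

```java
// Hypothetical helper illustrating the "each letter becomes its successor" cipher.
class ShiftCipher {
    // Shifts lowercase letters a-z by `shift` positions, wrapping around
    // the alphabet; every other character is copied unchanged.
    static String shift(String message, int shift) {
        StringBuilder out = new StringBuilder();
        for (char c : message.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                out.append((char) ('a' + Math.floorMod(c - 'a' + shift, 26)));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String secret = shift("hello", 1);
        System.out.println(secret);            // prints ifmmp
        System.out.println(shift(secret, -1)); // prints hello
    }
}
```

Because there are only 25 useful shifts, anyone can try them all, which is exactly why such ciphers offer so little protection.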

Another widely known encryption technique you should be aware of is RSA encryption. In short, it allows others to encrypt and send you information by using a public key (a pair of numbers) that you have made known to the entire world--but once the message is encrypted, it can only be decrypted by a private key that you keep secret. A nice overview of the RSA encryption process can be found here. And to think that mathematicians used to get laughed at for studying prime numbers all of those years.
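To make the public/private split concrete, here is a toy round trip using java.math.BigInteger. It is a sketch under loud assumptions: the primes 61 and 53, the public exponent 17, and the class name ToyRsa are chosen purely for illustration, and numbers this small provide no security whatsoever. Real RSA uses primes hundreds of digits long plus padding schemes that are omitted here.

```java
import java.math.BigInteger;

// Toy RSA with tiny, insecure numbers; for illustration only.
class ToyRsa {
    static final BigInteger P = BigInteger.valueOf(61);   // secret prime (assumed)
    static final BigInteger Q = BigInteger.valueOf(53);   // secret prime (assumed)
    static final BigInteger N = P.multiply(Q);            // public modulus
    static final BigInteger PHI =
        P.subtract(BigInteger.ONE).multiply(Q.subtract(BigInteger.ONE));
    static final BigInteger E = BigInteger.valueOf(17);   // public exponent (assumed)
    static final BigInteger D = E.modInverse(PHI);        // private exponent

    // Anyone who knows the public pair (N, E) can encrypt.
    static BigInteger encrypt(BigInteger message) {
        return message.modPow(E, N);
    }

    // Only the holder of the private exponent D can decrypt.
    static BigInteger decrypt(BigInteger ciphertext) {
        return ciphertext.modPow(D, N);
    }

    public static void main(String[] args) {
        BigInteger message = BigInteger.valueOf(65);      // must be smaller than N
        BigInteger ciphertext = encrypt(message);
        System.out.println("ciphertext: " + ciphertext);
        System.out.println("decrypted:  " + decrypt(ciphertext)); // 65 again
    }
}
```

The private exponent can only be computed by someone who knows the two primes, and recovering those from a large modulus is the hard problem that RSA relies on.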

If you haven't already, convince yourself that no encryption technique is bulletproof. Any encryption can be reversed given the right key(s). With that said, the guiding principle of cryptography is deterrence: to make the chance of obtaining the key by brute force or sophomoric techniques so minuscule that it might as well be impossible. This is usually accomplished by making the total number of possible combinations that must be tried by brute-force methods extremely large. Simple ciphers have their place in the world for fooling spam bots and in secret decoder rings, but industrial-strength techniques are desirable for any real application these days. While we're not diverting into an in-depth cryptography example here, you should check out MulleCipher, a very nice framework with complete source code and an example application that's easily adaptable for your projects.
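A quick back-of-the-envelope calculation shows how fast the number of combinations grows with key length. The sketch below is illustrative only: the class name KeySpace is made up, and the guessing rate of one billion keys per second is an arbitrary assumption, not a measurement of any real attacker.

```java
import java.math.BigInteger;

// Hypothetical helper for estimating brute-force effort.
class KeySpace {
    // Number of distinct keys of the given bit length: 2^bits.
    static BigInteger keyCount(int bits) {
        return BigInteger.ONE.shiftLeft(bits);
    }

    // Whole years needed to try every key at the given rate (rounded down).
    static BigInteger yearsToExhaust(int bits, long keysPerSecond) {
        BigInteger secondsPerYear = BigInteger.valueOf(365L * 24 * 60 * 60);
        return keyCount(bits)
            .divide(BigInteger.valueOf(keysPerSecond))
            .divide(secondsPerYear);
    }

    public static void main(String[] args) {
        long rate = 1_000_000_000L; // assumed: one billion guesses per second
        for (int bits : new int[] {40, 56, 128}) {
            System.out.println(bits + "-bit key: " + keyCount(bits) + " keys, about "
                + yearsToExhaust(bits, rate) + " years to try them all");
        }
    }
}
```

Each extra bit doubles the work, so a modest increase in key length turns a weekend project into something that outlasts the attacker.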


The mechanical process that can be followed to obtain source code from machine code is what leads us to obfuscation: convoluting the code so much that even if you have its source, you still probably won't be able to understand it. We'll look at some automated tools to accomplish obfuscation, but let's first consider a very simple example to nail down the concept. A typical "Hello World!" program in Java looks like the following:

public class HelloWorld {

      public static void main(String args[]) {
         System.out.println("Hello World!");
      }
}

So far, so good. Now, let's make things a little more interesting. Would you be able to determine what the following piece of code does without running it? I've simply added a couple of loops and conditionals.

public class HelloWorld {

      public static void main(String args[]) {
        double d1 = 0.0134654879927;
        double d2 = 0.0234987519084;

        for (int i1 = 0; i1 < 72; i1++) {
            d1 = d2 + 0.00000001020102;
        }

        for (int i1 = 0; i1 < 59; i1++) {
            d2 = d1 + 0.00000001120102;
        }

        if ((d1+d2) > 0.04699753441986 ) {
            System.out.println("Hello World!");
        } else if ((d1+d2) < 0.04699753441186) {
            System.out.println("Goodbye World!");
        }
        //This chain of alternatives could go on for a
        //long time...
      }
}


Generic variable names, some annoying loops, and a couple of conditionals sure can make a difference! For the cost of determining what this simple piece of code does, are you even willing to paste it in and run it? Would you be willing to pull out a calculator and do the arithmetic? What if you could only use your brain and no additional tools? Somewhere along the line, you'd reach a point where the effort wouldn't be worth the reward anymore and give up. Along these same lines, the vast majority of software pirates won't spend 500 hours reverse engineering and patching a simple $10 shareware application. A little deterrence will go a long way.
