MacDevCenter    
 Published on MacDevCenter (http://www.macdevcenter.com/)


Protect Your Source Code: Obfuscation 101

by Matthew Russell
04/08/2005

Do you remember those movies where a hacker would sit down at a terminal, type a bunch of jargon into a prompt, and within seconds the screen would flash "ACCESS GRANTED!" in large letters?

With a few magic keystrokes, the entire system would somehow just start transferring funds to foreign bank accounts, or spit out data to an external disk drive. Well, that might account for the money laundering and espionage back in the 1980s, but what about all of those illegal patches for software, illegal registration codes, and common software piracy issues today? This article takes a look at how you can protect your applications from attack by applying obfuscation techniques that convolute your code.

Overview of Compilation

Although it may not be obvious yet, understanding how compilers work is one of the linchpins of protecting your source code. We all know that the full, unobfuscated source code to a piece of software outlines every detail of its implementation, including things like how it knows if you have entered a valid registration number to officially purchase the product; how it determines if a valid password has been entered for a particular user; how it determines if you're playing a song on an authorized machine; and so on. A compiler simply takes that piece of source code, ensures that it meets the grammar rules of a particular language, and transforms it into the mysterious binary file that executes on your machine.

In all, there are four basic steps in the compilation process: lexical analysis, parsing, semantic analysis, and code generation. For each step, I'll give an analogy from the English language.

Figure: An overview of the compilation process.

It's important to understand how the first three steps protect us from creating possibly disastrous executable code in the code-generation step, and to understand the unique role of each step in this modular process.

A full explanation of compilers and language theory would involve a digression into the theory of computation, span several textbooks, and be a painful (but enlightening) education, so we'll skip it for now. But if you're interested in compiler theory or get the itch to try creating your own compiler or interpreter, check out dinosaur.compilertools.net. It is a good starting point for learning the very same tools used to develop many of the compilers and virtual machines in use today.

Lex and Yacc (or their GNU counterparts Flex and Bison) can completely automate the lexing and parsing steps, which would otherwise be extremely taxing. Semantic analysis is the step where human brain power must kick into overdrive, but tools like Memphis can help. Once you get past that phase, you'll find that the code-generation phase is cake. Rest assured that if you can successfully create your own compiler/interpreter for even a very small subset of a language, then a lot of things in the world will never appear the same. Proceed with extreme caution (grin).

The moral of the story: the compilation process is fairly mechanical and machine code is simply a binary translation of source code; there's no black magic involved, though it might seem so sometimes. Machine code can be transformed back into somewhat readable source code with enough blood, sweat, and tears. You don't have to be a compiler guru to protect your source code, but knowing how compilation works explains why protection techniques like obfuscation can be so effective at shielding your code from hacking.

Encryption and Obfuscation

Most people understand how encryption works, but obfuscation might raise a few eyebrows. Both techniques are used to secure information, but each serves a slightly different purpose. Let's take a moment to review each of them.

Encryption

Encryption is often used to protect information being stored or communicated as a message. The technique we're all familiar with is a simple cipher where one letter is exchanged for another. For example, consider a cipher where each letter becomes its successor in the alphabet. A message that says "hello" would then become "ifmmp". While simple ciphers might sometimes be tedious to solve by hand, they stand no chance against statistical analysis or even someone willing to spend enough time to work through them during a boring meeting or a plane ride. More advanced ciphers such as Blowfish, however, can be very secure because they encrypt the data in a way that's much more difficult to reverse.
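The successor cipher described above can be sketched in a few lines of Java. This is only an illustration: the class and method names are made up for this example, and it handles plain ASCII letters only.

```java
// Sketch of the "each letter becomes its successor" cipher.
// SuccessorCipher and shift() are illustrative names, not a library API.
class SuccessorCipher {

    // Shift each ASCII letter by `shift` positions, wrapping around the
    // alphabet (z -> a). Anything that is not a letter passes through.
    static String shift(String text, int shift) {
        int s = ((shift % 26) + 26) % 26; // normalize negative shifts
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                out.append((char) ('a' + (c - 'a' + s) % 26));
            } else if (c >= 'A' && c <= 'Z') {
                out.append((char) ('A' + (c - 'A' + s) % 26));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String secret = shift("hello", 1);
        System.out.println(secret);            // ifmmp
        System.out.println(shift(secret, -1)); // hello
    }
}
```

Because there are only 25 useful shifts, anyone can undo this cipher by simply trying them all, which is exactly why it offers no real protection.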

Another widely known encryption technique you should be aware of is RSA encryption. In short, it allows others to encrypt and send you information by using a public key that you have made known to the entire world (in practice a pair of numbers, a modulus and an exponent). Once the message is encrypted, it can be decrypted only with a private key that you keep secret. A nice overview of the RSA encryption process can be found here. And to think that mathematicians used to get laughed at for studying prime numbers all of those years.
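To make the public/private split concrete, here is a toy RSA round trip using java.math.BigInteger. It is a sketch only: the primes 61 and 53 and the exponent 17 are a common textbook choice rather than anything from this article, the class name is made up, and real RSA uses primes hundreds of digits long together with a padding scheme.

```java
import java.math.BigInteger;

// Toy RSA with tiny textbook numbers. Illustrative only; never use
// keys this small (or unpadded RSA) to protect real data.
class ToyRsa {
    static final BigInteger P = BigInteger.valueOf(61);   // secret prime
    static final BigInteger Q = BigInteger.valueOf(53);   // secret prime
    static final BigInteger N = P.multiply(Q);            // public modulus, 3233
    static final BigInteger PHI =
        P.subtract(BigInteger.ONE).multiply(Q.subtract(BigInteger.ONE)); // 3120
    static final BigInteger E = BigInteger.valueOf(17);   // public exponent
    static final BigInteger D = E.modInverse(PHI);        // private exponent, 2753

    // Anyone who knows (N, E) can encrypt.
    static BigInteger encrypt(BigInteger message) {
        return message.modPow(E, N);
    }

    // Only the holder of D can decrypt.
    static BigInteger decrypt(BigInteger cipher) {
        return cipher.modPow(D, N);
    }

    public static void main(String[] args) {
        BigInteger message = BigInteger.valueOf(65);
        BigInteger cipher = encrypt(message);
        System.out.println("cipher text: " + cipher);          // 2790
        System.out.println("decrypted:   " + decrypt(cipher)); // 65
    }
}
```

The private exponent can be derived from the public numbers only by factoring N back into its primes, which is easy for 3233 and impractical for properly sized keys.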

If you haven't already, convince yourself that no encryption technique is bulletproof. Any encryption can be reversed given the right key(s). With that said, the guiding principle of cryptography is deterrence: making the odds of obtaining the key by brute force or sophomoric techniques so minuscule that it might as well be impossible. This is usually accomplished by making the total number of possible combinations that must be tried by brute force an extremely large number. Simple ciphers have their place in the world for fooling spam bots and with secret decoder rings, but industrial techniques are desirable for any real application these days. While we won't digress into an in-depth cryptography example here, you should check out MulleCipher, a very nice framework with complete source code and an example application that's easily adaptable for your projects.
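A rough back-of-the-envelope calculation shows how quickly the number of combinations gets out of hand. The sketch below assumes an attacker who can test one billion keys per second; that rate, the key sizes, and the class name are assumptions picked for illustration, not measurements.

```java
import java.math.BigInteger;

// Rough estimate of brute-force effort for different key sizes.
// The guessing rate is an assumption chosen for illustration.
class KeySpace {

    // Number of distinct keys for a key of the given length in bits.
    static BigInteger keyCount(int bits) {
        return BigInteger.valueOf(2).pow(bits);
    }

    // Whole years needed to try every key at a fixed guessing rate.
    static BigInteger yearsToExhaust(BigInteger keys, long guessesPerSecond) {
        BigInteger secondsPerYear = BigInteger.valueOf(365L * 24 * 60 * 60);
        return keys.divide(BigInteger.valueOf(guessesPerSecond))
                   .divide(secondsPerYear);
    }

    public static void main(String[] args) {
        long rate = 1_000_000_000L; // assumed: one billion guesses per second
        System.out.println("successor-style cipher: 25 possible shifts");
        for (int bits : new int[] {40, 56, 128}) {
            BigInteger keys = keyCount(bits);
            System.out.println(bits + "-bit key: " + keys + " keys, about "
                + yearsToExhaust(keys, rate) + " whole years to try them all");
        }
    }
}
```

Each additional bit doubles the work, so at this assumed rate a 40-bit key falls in well under an hour while a 128-bit key is far beyond any practical effort.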

Obfuscation

The mechanical process that can be followed to obtain source code from machine code is what leads us to obfuscation: convoluting the code so much that even if you have its source, you still probably won't be able to understand it. We'll look at some automated tools to accomplish obfuscation, but let's first consider a very simple example to nail down the concept. A typical "Hello World!" program in Java looks like the following:

public class HelloWorld {

    public static void main(String args[]) {
        System.out.println("Hello World!");
    }
}

So far, so good. Now, let's make things a little more interesting. Would you be able to determine what the following piece of code does without running it? I've simply added a couple of loops and conditionals.

public class HelloWorld {

    public static void main(String args[]) {
        double d1 = 0.0134654879927;
        double d2 = 0.0234987519084;

        for (int i1 = 0; i1 < 72; i1++) {
            d1 = d2 + 0.00000001020102;
        }

        for (int i1 = 0; i1 < 59; i1++) {
            d2 = d1 + 0.00000001120102;
        }

        //System.out.println(d1+d2);

        if ((d1+d2) > 0.04699753441986) {
            System.out.println("Hello World!");
        }
        else if ((d1+d2) < 0.04699753441186) {
            System.out.println("Goodbye World!");
        }
        //This chain of alternatives could go on for a
        //long time...
    }
}

Generic variable names, some annoying loops, and a couple of conditionals sure can make a difference! Are you even willing to paste this simple piece of code in and run it to find out what it does? Would you be willing to pull out a calculator and do the arithmetic? What if you could use only your brain and no additional tools? Somewhere along the line, you'd reach a point where the reward wouldn't be worth the effort anymore and give up. Along these same lines, the vast majority of software pirates won't spend 500 hours reverse engineering and patching a simple $10 shareware application. A little deterrence will go a long way.
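For the curious, here is one way to untangle the example by hand. Each loop assigns the same value on every pass, so the loops collapse to single assignments and the initial value of d1 is never used. The class and method names below are made up for this walkthrough.

```java
// Hand-simplified equivalent of the obfuscated HelloWorld above.
// HelloWorldUnrolled, sum() and message() are illustrative names.
class HelloWorldUnrolled {

    static double sum() {
        double d2 = 0.0234987519084;
        // The first loop stores this same value into d1 on all 72 passes.
        double d1 = d2 + 0.00000001020102;
        // The second loop stores this same value into d2 on all 59 passes.
        d2 = d1 + 0.00000001120102;
        return d1 + d2; // roughly 0.04699753541986
    }

    static String message() {
        double s = sum();
        if (s > 0.04699753441986) {
            return "Hello World!";
        } else if (s < 0.04699753441186) {
            return "Goodbye World!";
        }
        return ""; // the original prints nothing in this case
    }

    public static void main(String[] args) {
        System.out.println(message()); // Hello World!
    }
}
```

The sum exceeds the first threshold by about one billionth, so the obfuscated program is still just "Hello World!" in disguise.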

Obfuscation Example

The University of Arizona has developed a very nice Java tool called Sandmark that is used to test and study software watermarking. As part of that study, it implements many well-known obfuscation algorithms and provides a GUI for a tool called Soot, an optimization tool that can also decompile bytecode. These two tools together allow you to take Java code, obfuscate it, and then decompile it to see the effect of the obfuscations. Let's get started with an example to put all of this together:

To start up Sandmark, navigate to the sandmark.jar file with Terminal and execute it by typing java -jar sandmark.jar. The toolbar up top expands with a button on the far right, and each tab has its own specialized help menu, which actually is pretty helpful. For instance, the jDecompile script you downloaded is an adaptation of a script from the "Decompile" tab that I tailored for OS X.

Figure: Sandmark has an easy-to-use GUI and a great help system.

To do a quick obfuscation, you can either choose a particular obfuscation algorithm from the "Obfuscate" tab or let Sandmark apply a variety of obfuscation algorithms from the "Quick Protect" tab.

For an overview of the algorithms available in Sandmark, check here. The only caveat is that Sandmark expects a jar file. (If you'd like an overview of creating and working with jar files before jumping into an example with Sandmark, check here.)

Save the following Java code to a file called IfElseDemo.java:

public class IfElseDemo {
    public static void main(String[] args) {

        int testscore = 76;
        char grade;

        if (testscore >= 90) {
            grade = 'A';
        } else if (testscore >= 80) {
            grade = 'B';
        } else if (testscore >= 70) {
            grade = 'C';
        } else if (testscore >= 60) {
            grade = 'D';
        } else {
            grade = 'F';
        }
        System.out.println("Grade = " + grade);
    }
}

Let's apply an obfuscation algorithm to this simple example and then decompile the obfuscated bytecode to see the difference.

If all goes well, a preview of the obfuscated source code opens up that is quite a bit harder to understand. If you have trouble with the decompiling portion, make sure your path is set correctly for the Terminal window in which you're running Sandmark.

Although this example doesn't unlock any of the secrets of the universe, it does illustrate how effective obfuscation can be for even a simple example. Now imagine applying various obfuscation techniques to thousands of lines of more complex code.

If you take a look at the algorithms Sandmark offers, you'll notice that there are scores of confusing possibilities. Refactoring inheritance hierarchies, introducing confusing arithmetic operations, and introducing buggy variations of existing code blocks that never get executed are just a few of them. Keep in mind that you might want to obfuscate just the sensitive portions of your code, because obfuscation can impose size and performance penalties. The penalties may or may not make a difference; it's a trade-off you have to measure and consider.
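To give a feel for two of these ideas, the sketch below applies them by hand to the grading logic from IfElseDemo. It is not Sandmark output, and the class and method names are invented for the example. It uses confusing arithmetic in place of the if/else chain, plus a buggy code block guarded by a condition that can never be true.

```java
// Hand-written illustration of two obfuscation ideas; not Sandmark output.
class IfElseDemoObfuscated {

    // The readable logic from IfElseDemo, kept here for comparison.
    static char grade(int testscore) {
        if (testscore >= 90) return 'A';
        if (testscore >= 80) return 'B';
        if (testscore >= 70) return 'C';
        if (testscore >= 60) return 'D';
        return 'F';
    }

    // The same behavior, made deliberately hard to read.
    static char g(int a) {
        // Opaque predicate: a*a + a equals a*(a+1), the product of two
        // consecutive integers, so its lowest bit is always 0.
        if (((a * a + a) & 1) != 0) {
            // Never executed: a deliberately wrong variant to mislead readers.
            return (char) (65 + a % 5);
        }
        // Confusing arithmetic: map the tens digit to an offset from 'A'.
        int c = 9 - a / 10;
        c = c < 0 ? 0 : c;
        return c > 3 ? (char) 70 : (char) (65 + c);
    }

    public static void main(String[] args) {
        System.out.println("Grade = " + g(76)); // Grade = C
    }
}
```

Even in this small case, a reader has to prove that the first branch is dead before the rest of the method can be trusted, and that extra reasoning is the cost obfuscation imposes on an attacker.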

Final Thoughts

In a world where everyone follows license agreements and no one wants to reverse engineer government secrets, obfuscation techniques wouldn't be of much use. Since we don't live in the Shire, however, security measures have their place and are just one of the many things that keep the world spinning. Hopefully, you now have a better feel for the compilation process and understand how obfuscation is a powerful tool you can use to protect your code from exploitation and hacking.

Matthew Russell is a computer scientist from middle Tennessee and serves Digital Reasoning Systems as the Director of Advanced Technology. Hacking and writing are two activities essential to his renaissance man regimen.



Copyright © 2009 O'Reilly Media, Inc.