 Published on MacDevCenter (http://www.macdevcenter.com/)


Build an eDoc Reader for Your iPod, Part 3

by Matthew Russell
01/07/2005

This installment finishes up the series on your eDoc reader by incorporating the functionality to extract the text from PDF documents. You'll accomplish this task by using an open source Java package called PDFBox and the not-so-well-documented Cocoa-Java bridge.

In order to empower you as a more capable developer, I'm writing this article a little differently than some others you've read. Instead of just handing you a picture-perfect project and explaining how it works, I'm going to lead you through the development process as I went through it and troubleshoot problems as they come up.

My intention is to give you more insight into some of the unfamiliar problems you may encounter when bridging different languages. When we're finished here, you'll have a deployable standalone application that makes reading your eDocs easy, and you'll be able to incorporate existing Java into your other projects with minimal hassle.

Getting Started

The ability to work with PDF files gives your application a uniform way of working with almost any text format you can imagine. For example, instead of removing the tags from HTML or saving RTF as text, you could just as easily print these as PDF files and then extract the text from the PDFs. But why use Java to do this?

In short, there's a nice Java package written that does the job, but there's not an Objective-C equivalent (if there were, it would probably be preferable to use it instead). Consequently, we have a couple of options here. Two of them are obvious: rewrite the Java package in Objective-C, or interface to the Java package. We'll choose the latter.

First and foremost, back up your existing work! The quickest way is to just compress it into a tarball. You can do this to your entire project directory in Terminal. Typing man tar gives you more information on tar if you need it. If you really want to be slick, however, you can use CVS from Terminal or Source Code Management (SCM) from Xcode.

With that said, go to PDFBox.org and download version 0.6.7. If there's a newer release, feel free to try it, but version 0.6.7 has been verified to work with the procedure outlined in this article. PDFBox is released under the BSD license, which is very non-restrictive in terms of use. Read it and understand what it will mean if you decide to distribute any work you produce with PDFBox. Once your download completes, extract the files and place them in a convenient location.
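If you'd like a starting point for the tarball backup, here is a minimal sketch. The function name backup_project and the timestamped archive name are illustrative choices, not anything the project requires, and the project path in the usage note is only an assumption about where your copy lives.

```shell
#!/bin/bash
# backup_project is a hypothetical helper: it archives a project directory
# into a timestamped .tar.gz stored next to that directory.
backup_project() {
    local project_dir="$1"
    local parent name stamp archive

    parent="$(cd "$(dirname "$project_dir")" && pwd)" || return 1
    name="$(basename "$project_dir")"
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="$parent/$name-backup-$stamp.tar.gz"

    # -c create, -z gzip, -f archive file; -C switches to the parent
    # directory first so the archive stores paths relative to it.
    tar -czf "$archive" -C "$parent" "$name" && echo "$archive"
}
```

For example, backup_project ~/PodReader (assuming that is where your project lives) prints the path of the archive it wrote, and tar -tzf on that archive lists its contents.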

Crossing the Cocoa-Java Bridge


Next, open your working project from part two of this series in Xcode and follow along very closely as we get the Java bridge working. Seemingly minute differences in some of the following steps can result in frustrating error messages that are far from obvious to fix. To incorporate Java classes into our project, we'll create another target that groups these files. This is handy, because the product of the target is a single JAR file that encapsulates things nicely. Do the following to add another target to your project:

Given the new JavaClasses target, we'll now add some classes to it. For now, just get the files added. We'll backstep and define their bodies in a moment.

In order to be official about informing the compiler about the Java method we'll be invoking from our Objective-C code, we define an Objective-C header file that contains this information. Let's add this file to the project.

What have we done so far? We defined a target called JavaClasses, specified some Java classes and added them to it, and then added an Objective-C header file to the PodReader target that informs the compiler that this Java magic isn't really magic at all--it's simply a matter of referencing some code from elsewhere. With that said, let's fill in these empty files.

The first body we'll define is for PDFBoxBridge.java. Download it here. If you look at your PDFBox package you downloaded from PDFBox.org, you'll notice that PDFBoxBridge.java is almost exactly the same as installDir/PDFBox-0.6.7a/src/org/pdfbox/ExtractText.java with just a couple of differences annotated in the header comments and inline. The idea was to change as little as possible and to clearly document any changes made to the file.

The Critical Cocoa-Java Link

The file PDFBoxWrapper.java offers a little more than first meets the eye. For now, notice that this class offers a public method that receives an NSArray of arguments (the PDF source file, among other possible things) and extracts the text from the PDF file using the PDFBoxBridge class from above.

Take note that we pass in an NSArray to extractText and have to first convert it to a Java array "by hand," since extractText is a Java method, and its signature expects a Java array. Still, we're able to "seamlessly" pass in an NSArray to PDFBoxWrapper by importing the Java NSArray class. Once we have the Java NSArray, we can easily manipulate it to get a pure Java array of String[]. The lesson to take from this is that we cannot seamlessly pass an NSArray (even if it's filled only with NSStrings) directly to a Java method that requires an array of String[].

/*  PDFBoxWrapper.java*/   
import com.apple.cocoa.foundation.NSArray;


public class PDFBoxWrapper {
    
    private PDFBoxBridge bridge;
    
    public PDFBoxWrapper() {
        bridge = new PDFBoxBridge();
    }
    
    private String[] convertNSArrayToJavaArray(NSArray array) {
        
        NSArray newArray = new NSArray(array);
        String[] javaArray = new String[newArray.count()];        
        newArray.getObjects(javaArray);
        
        return javaArray;
    }    
    
    public void extractText( NSArray args ) throws Exception {
        try{
            bridge.extractText(this.convertNSArrayToJavaArray(args));
        }
        catch (Exception e) {
            // include the exception itself so the cause shows up in the log
            System.out.println("Exception in extractText in PDFBoxWrapper: " + e);
        }
    }
    
}

At this point, the Java classes are in place, but how do we relate our Objective-C code to the Java code? The simple header file below takes care of this for us. It specifies the method signature and vouches that the class PDFBoxWrapper exists, offers the method extractText:, and is accessible.

/*JavaInterfaces.h*/
@interface PDFBoxWrapper : NSObject
{}

- (void)extractText:(NSArray*)array;
@end

We're now ready to replace the stub we left for handling PDF files back in the AppController.m source file, which we designed last time. You should change the first block of your if statement in copyIt: to be the following code block. We'll address the "prelude shell script" comment in detail in a moment.

[progressIndicator startAnimation:nil];
    if ([[fileName pathExtension] isEqualToString:@"pdf"]) {
        
        //load the vm
        id vm = NSJavaSetupVirtualMachine();

        //embedded jar files. Loading the PDFBox jar files here didn't work.
        //In short, the best fix was to add a prelude shell script
        NSMutableArray* jarFiles = [NSMutableArray arrayWithObject:
            [[NSBundle mainBundle] pathForResource:@"JavaClasses" ofType:@"jar"]];
        
        NSJavaClassesFromPath(jarFiles, nil, YES, nil);
        
        PDFBoxWrapper* wrapper = NSJavaObjectNamedInPath(@"PDFBoxWrapper", nil);
        
        NSString* filePrefix = 
            [[[sourceFile stringValue] 
                lastPathComponent] 
                stringByDeletingPathExtension];

        NSString* tempFile =
            [[@"/tmp/" stringByAppendingString:filePrefix] 
                       stringByAppendingString: @".podreaderTEMP"];
        
        NSArray* args = [[NSArray alloc] 
            initWithObjects:[sourceFile stringValue], 
            tempFile, 
            nil];
            
        [wrapper extractText:args];        
    
        //used alloc, so "args" has a retain count of 1. release it
        [args release];
    
        [chunker chunkIt:tempFile
                   toDir:[[destDir stringValue] stringByExpandingTildeInPath] 
           withSeparator:[separatorValue stringValue]];
        
    }
    else if ([[fileName pathExtension] isEqualToString:@"txt"]) {

This should be fairly straightforward once you've looked it over. We start up a Java Virtual Machine (JVM), specify that a JAR file called JavaClasses (remember that target you added?) should be in the application bundle, and tell the JVM to load the classes from that location. We proceed to use the class you've seen called PDFBoxWrapper to extract the text from the PDF document to a temporary file located in the /tmp directory. From there things are just like before: we take the temporary file and process it using the same text-handling routine as last time.

Compiling and Running the Application in Xcode

Now, you're very close to having compilable code within Xcode, but you're not quite there. Take a moment and ask yourself what's missing. If you need a hint, change the "Active Target" using the popup button in the upper left-hand corner of Xcode to "JavaClasses" and try to compile. Take note of the error messages you receive.


Change the active target from "PodReader" (currently set) to "JavaClasses"

A moment later: while in deep meditation it occurred to you that all the Java references in PDFBoxBridge.java are undefined at the moment, and they must be defined for successful compilation to occur. By inspecting PDFBoxBridge.java, you astutely observed that the JAR file PDFBox-0.6.7a.jar (distributed with PDFBox), in particular, must be included in the project. Even though references to log4j-1.2.8.jar are commented out, it turns out that PDFBox-0.6.7a.jar includes references to log4j-1.2.8.jar, so it must also be added to the project. The reason for adding these references is to resolve the compile time errors.

To add these files, control-click on the JavaClasses target in the left-hand pane of Xcode, and then choose Add -> Existing Files.... Navigate to where you extracted PDFBox and add the log4j-1.2.8.jar file from the /PDFBox-0.6.7a/external directory. When adding the file, check the box at the top to copy the file into the project and make sure only the JavaClasses target is checked. Repeat the process for the PDFBox-0.6.7a.jar file in the /PDFBox-0.6.7a/lib directory. In the left-hand pane, drag each of the files that appears at the very top to Resources to keep things nice and tidy. Note that in the JavaClasses target each of these JAR files appears under Frameworks & Libraries. If you plan on passing your work on to anyone else, you're required by the BSD license to add a copy of the PDFBox LICENSE file to your source at this time. Check your project directory in Finder or Terminal to make sure that these files were actually copied. At this point, you can successfully compile the JavaClasses target.


Add the JAR files packaged with PDFBox to your project

Change the Active Target in the upper left-hand corner of Xcode to PodReader and try to build. You should get an error or two. Being the good detective that you are, you see that in AppController.m there's an undefined reference to our PDFBoxWrapper class, so you add #import "JavaInterfaces.h" to the top of AppController.m, recompile, and all is well for compilation. Without that directive, the file JavaInterfaces.h doesn't exist, as far as the compiler is concerned.

Runtime Issues

Successful compilation! Are we done yet? There's one more thing we need to do to successfully run the project in Xcode, but don't take my word for it. Go ahead and run it to see for yourself.

To your dismay, you've noticed that the app fires up like a champ and you can still process text files normally, but PDF files don't work. You'll get an error message similar to the following: +[NSArray arrayWithObject:]: attempt to insert nil. Not exactly the most informative error message you've ever seen, but it can all be worked out.

If you look back to the new code you inserted in AppController.m, you see that there's one place where you send the message arrayWithObject: to NSMutableArray, but you're not passing nil to it; you're passing an NSString returned from a pathForResource:ofType: message, right? Well, the problem is that there's no JavaClasses.jar resource in your application bundle, so pathForResource:ofType: returns nil. If you check all of the subfolders under Targets in Xcode's left-hand pane, you won't see it there, and this verifies the problem. To fix this problem, simply drag JavaClasses.jar from above in the Products folder down into the Bundle Resources folder.

That's all done, but after another build and run, there's some exception spewage that contains the line "java.lang.NoClassDefFoundError" when you try to process a PDF file. Compilation works, so this must be related to the Java runtime. Specifically, the JVM cannot find the PDFBox JAR files referenced. Let this be an illustration of the difference between compile time and runtime. Since, at the moment, we only want to run the application in Xcode, you can specify the CLASSPATH to point to the JAR files by double-clicking PodReader under Executables in Xcode's left-hand pane to reveal the PodReader Info window. Choose the Arguments tab and set the CLASSPATH in the environment to point to your two PDFBox-related JAR files that Xcode copied into your project directory. Separate their paths with a colon.
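To make the colon-separated format concrete, here is a small sketch that builds a CLASSPATH value from a list of JAR paths and checks that each file exists. The function names join_classpath and check_jars are hypothetical helpers, and the project location in the usage note is an assumption; substitute the directory Xcode actually copied the files into.

```shell
#!/bin/bash
# join_classpath (hypothetical helper): join its arguments with ':' to form
# a value suitable for the CLASSPATH environment variable.
join_classpath() {
    local IFS=":"
    # "$*" joins all arguments using the first character of IFS
    echo "$*"
}

# check_jars (hypothetical helper): report any JAR path that does not exist.
# Returns 0 only if every file is present.
check_jars() {
    local jar status=0
    for jar in "$@"; do
        if [ ! -f "$jar" ]; then
            echo "missing: $jar" >&2
            status=1
        fi
    done
    return $status
}
```

With the JAR files in, say, ~/PodReader, running join_classpath ~/PodReader/log4j-1.2.8.jar ~/PodReader/PDFBox-0.6.7a.jar prints the string to paste into the CLASSPATH field of the Arguments tab.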


How to set runtime environment variables during development in Xcode

You can now successfully run your application. You'll get a couple of warnings from log4j, but those are PDFBox-related and we don't need to resolve them for our project. Depending on the type of PDF file you extract (like a white paper with a lot of formulas), you might get some characters that are garbage. If you're really set on removing those characters, you have three options: do it by hand, modify PDFBox to do it as it is extracting the text, or use Objective-C to do it once the text is extracted. This "garbage" can't necessarily be filtered out by PDFBox in general, because "garbage" is very context-dependent, especially with Unicode character sets.

Deploying the Application

If you only plan on using your PodReader from within the confines of Xcode, then you're all done and can quit for the day. If you plan on deploying the application so that you can place it on your desktop or in your Applications folder, there's still work to be done.

For a deployable application to be "OS X-like," it needs to be able to stand alone and not require a lot of manual intervention by the user. This requires a few more modifications to what we have working in Xcode. For starters, right-click on the blue project icon in the top of the left-hand pane, choose the Styles tab, and change the active build style to Deployment. This changes the linking for the executable, among other things. Clean and rebuild each of the targets under the Build menu, and verify that your project still runs.

Now go into your project's build directory using Terminal and launch your project outside of the protective veil of Xcode by typing open PodReader.app. Try to extract text from a PDF document and note the response you get: the Copy It button quickly loses focus and the progress indicator continues spinning. Even after you return from your haircut, it's still spinning.


How to change the build style from Development to Deployment

At this juncture, things are difficult to troubleshoot. But wait, there's Console.app, which keeps logs of application output and other interesting tidbits. In Terminal, type open /Applications/Utilities/Console.app and look to the bottom. It's the same exception spewage about java.lang.NoClassDefFoundError that we saw before. In Xcode, we could just specify the CLASSPATH in the project settings and be done with it. If you really wanted to be cheesy, you could distribute the PDFBox JAR files along with your application and require that all users of it set their CLASSPATHs to find them. If you're the only one using it, that would be fine for you to do as a one-time thing, but there's a better way. Keep reading!

As a great alternative to the Cheez Whiz approach, we'll embed these JAR files in your application bundle and have them be completely transparent to you and any other user. All you need to do to make them part of your bundle is to drag each of them to the Bundle Resources folder in the left-hand pane of Xcode, but that's only half of the solution. The JVM still needs a reference to them in the CLASSPATH. A few unsuccessful attempts you can suffer through to define your CLASSPATH include trying to load them the same way you load your JavaClasses.jar file in your AppController.m file, or by setting LSEnvironment values in your application's Info.plist that you can find in Xcode. A nice way to remedy the situation, however, is through what some call a prelude script.

Create a Prelude Script

In short, we're going to have the application loader execute a script to set the CLASSPATH environment variable to point to the application's embedded JAR files at runtime and then transfer control to the binary executable. This is actually pretty simple, but you'll need to understand a rough sketch of how an application loads and runs to fire that synapse that's flickering.

Before you proceed, skim over Bundles to understand how a bundled application works. With the anatomy of a bundle in mind, we can change the value of CFBundleExecutable in Info.plist (located under Resources in Xcode's left-hand pane) so that the loader executes a shell script instead of the binary. The script points the CLASSPATH to the JAR files located in PodReader.app/Contents/Resources and then turns control over to the PodReader binary residing in PodReader.app/Contents/MacOS that would normally have been executed. All in all, this takes care of our problem.

In Xcode, copy the JAR files related to PDFBox into the Bundle Resources folder if you haven't done so already. Next, open up Info.plist and change the value PodReader for the key CFBundleExecutable to Prelude. Now, the application loader will attempt to find Prelude and execute it. To generate and set correct permissions for the Prelude script with each build, do the following:

#For the shell script build phase
prelude_src="prelude.sh"
prelude_dst="Prelude"
x_dir="$TARGET_BUILD_DIR/$PRODUCT_NAME.app/Contents/MacOS"
/bin/cp "$SRCROOT/$prelude_src" "$x_dir/$prelude_dst" 
/bin/chmod +x "$x_dir/$prelude_dst"



Add and define a shell script build phase to handle the prelude script details

So far, you've told Xcode that at the very end of the build phase for the target PodReader, it should copy the file prelude.sh (currently undefined) to the PodReader.app/Contents/MacOS directory of the bundle and set its permissions to be executable. All that remains is to define prelude.sh. From the main menu, choose File -> New File -> New Empty File in Project and name the file "prelude.sh". Make sure both boxes are unchecked for the targets PodReader and JavaClasses. Finally, move prelude.sh down to the Other Sources folder, and define its body as:

#!/bin/bash
#find the directory that contains this script
#$0 is the path used to launch this script, like: /dir/dir/dir/file
#the % operator specifies to remove the final / and everything following it,
#   leaving, in this example,  /dir/dir/dir. 
#`pwd` does not accomplish this same effect.
here="${0%/*}"

#the bundle's executable
cmd=PodReader

#required jar files are bundled in the app, so append them to the CLASSPATH
export CLASSPATH="$CLASSPATH:$here/../Resources/log4j-1.2.8.jar"
export CLASSPATH="$CLASSPATH:$here/../Resources/PDFBox-0.6.7a.jar"

#the output of "echo" appears in Console for debugging purposes if you need it
#echo "classpath is $CLASSPATH"

#surrender control to the bundle's product now that CLASSPATH is set
#the quotes keep a bundle path that contains spaces in one piece, and
#"$@" passes along any arguments the script was launched with
#for more info on any bash commands type "man commandName" in Terminal
exec "$here/$cmd" "$@"
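If the % expansion in the script looks cryptic, the following sketch isolates it so you can experiment in Terminal. The function names script_dir and append_classpath are made up for illustration. The second helper is a small variation on the script's approach, not something the script itself does: it skips the separating colon when the existing value is empty.

```shell
#!/bin/bash
# script_dir (hypothetical helper): strip the final '/' and everything after
# it, which is how the prelude script derives the directory it lives in.
script_dir() {
    echo "${1%/*}"
}

# append_classpath (hypothetical helper): append an entry to an existing
# classpath string. ${current:+$current:} expands to "$current:" only when
# current is non-empty, so an empty value does not produce a leading ':'.
append_classpath() {
    local current="$1" entry="$2"
    echo "${current:+$current:}$entry"
}
```

Note that script_dir works on the string alone and never touches the filesystem, which is why the directory can contain spaces as long as the argument is quoted.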

Success at Last

At this point, clean all targets from the Build menu, and rebuild JavaClasses and then PodReader. You should now have a fully deployable application that runs via double-clicking or the open PodReader.app approach in Terminal. If you still experience troubles, use Terminal to go inside of the application bundle and ensure that Info.plist is set correctly, and that Prelude exists with correct permissions and resides in the same directory as the PodReader binary. Finally, use Console.app to check for exception spewage and to inspect the results from echo $CLASSPATH in prelude.sh.
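The manual checks described above can be rolled into one function, sketched below. The name check_bundle is hypothetical, and the plist test is a crude text search that assumes the bundle's Info.plist is stored as XML with the string value on the line after the key; treat it as a troubleshooting aid rather than a real plist parser.

```shell
#!/bin/bash
# check_bundle (hypothetical helper): sanity-check a built PodReader.app.
# Prints one line per problem found and returns non-zero if there were any.
check_bundle() {
    local app="$1" status=0 item
    local plist="$app/Contents/Info.plist"
    local macos="$app/Contents/MacOS"
    local resources="$app/Contents/Resources"

    # Info.plist must name Prelude as the bundle executable
    if [ ! -f "$plist" ]; then
        echo "missing: $plist"; status=1
    elif ! grep -A1 "CFBundleExecutable" "$plist" | grep -q "<string>Prelude</string>"; then
        echo "CFBundleExecutable is not set to Prelude"; status=1
    fi

    # the script and the real binary must sit side by side and be executable
    for item in Prelude PodReader; do
        if [ ! -x "$macos/$item" ]; then
            echo "missing or not executable: $macos/$item"; status=1
        fi
    done

    # the JAR files must be embedded in the bundle's Resources directory
    for item in JavaClasses.jar log4j-1.2.8.jar PDFBox-0.6.7a.jar; do
        if [ ! -f "$resources/$item" ]; then
            echo "missing: $resources/$item"; status=1
        fi
    done

    return $status
}
```

Run it from your project's build directory as check_bundle PodReader.app; no output means all of the checks passed.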

Join the Open Source Project

From here, there are literally hundreds of directions you can go if you want to keep spicing up this application. Some of the obvious choices are: drag-and-drop ability, interfacing to Project Gutenberg to browse/download books, handling other document formats, adding a sound like "ding!" to indicate that a long document is finished processing, reading an RSS or Atom feed to sync the latest news, and so on. Because of all of the potential, I've opened this project up under the GPL and continued some more work on it since this writing. If you want to contribute and continue learning more Cocoa, check out The PodReader Project page. Full source and working binaries are also located there.

Matthew Russell is a computer scientist from middle Tennessee and serves Digital Reasoning Systems as the Director of Advanced Technology. Hacking and writing are two activities essential to his renaissance man regimen.



Copyright © 2009 O'Reilly Media, Inc.