macdevcenter.com

Build an eDoc Reader for Your iPod, Part 3

by Matthew Russell
01/07/2005

This installment finishes up the series on your eDoc reader by incorporating the functionality to extract the text from PDF documents. You'll accomplish this task by using an open source Java package called PDFBox and the not-so-well-documented Cocoa-Java bridge.

In order to empower you as a more capable developer, I'm writing this article a little differently than some others you've read. Instead of just handing you a picture-perfect project and explaining how it works, I'm going to lead you through the development process as I went through it and troubleshoot problems as they come up.

My intention is to give you more insight into some of the unfamiliar problems you may encounter when bridging different languages. When we're finished here, you'll have a deployable standalone application that makes reading your eDocs easy, and you'll be able to incorporate existing Java into your other projects with minimal hassle.

Getting Started

The ability to work with PDF files gives your application a uniform way to handle almost any text format you can imagine. For example, instead of removing the tags from HTML or saving RTF as text, you could just as easily print these as PDF files and then extract the text from the PDFs. But why use Java to do this?

In short, there's a nice Java package written that does the job, but there's not an Objective-C equivalent (if there were, it would probably be preferable to use it instead). Consequently, we have a couple of options here. Two of them are obvious: rewrite the Java package in Objective-C, or interface to the Java package. We'll choose the latter.

First and foremost, back up your existing work! The quickest way is to compress your entire project directory into a tarball from Terminal; typing man tar gives you more information on tar if you need it. If you really want to be slick, however, you can use CVS from Terminal or Source Code Management (SCM) from Xcode.

With that said, go to PDFBox.org and download version 0.6.7. If there's a newer release, feel free to try it, but version 0.6.7 has been verified to work with the procedure outlined in this article. PDFBox is released under the BSD license, which is very non-restrictive in terms of use. Read it and understand what it will mean if you decide to distribute any work you produce with PDFBox. Once your download completes, extract the files and place them in a convenient location.

Crossing the Cocoa-Java Bridge

Next, open your working project from part two of this series in Xcode and follow along very closely as we get the Java bridge working. Seemingly minute differences in some of the following steps can result in frustrating error messages that are far from obvious to fix. To incorporate Java classes into our project, we'll create another target that groups these files. This is handy, because building the target produces a single JAR file that packages all of these files neatly. Do the following to add another target to your project:

  • Create a new Java target called JavaClasses:
    • Control-click on Targets in the left-hand pane.
    • Choose Add -> New Target.
    • Choose Package from the Java menu.
    • Name the target "JavaClasses".
    • Ensure that JavaClasses is added to the Project PodReader.

Given the new JavaClasses target, we'll now add some classes to it. For now, just get the files added. We'll backtrack and define their bodies in a moment.

  • Add a Java file called PDFBoxBridge.java to the JavaClasses target:
    • From the main menu: File -> New File -> (Pure Java) Java class.
    • Name the file "PDFBoxBridge.java".
    • Uncheck the target PodReader if it's checked.
    • Check the target JavaClasses if it's not checked.
    • Drag this file from the top down to Other Sources in the left-hand pane.
  • Add a Java file called PDFBoxWrapper.java to the JavaClasses target:
    • From the main menu: File -> New File -> (Pure Java) Java class.
    • Name the file "PDFBoxWrapper.java".
    • Uncheck the target PodReader if it's checked.
    • Check the target JavaClasses if it's not checked.
    • Drag this file from the top down to Other Sources in the left-hand pane.

To tell the compiler about the Java method we'll be invoking from our Objective-C code, we define an Objective-C header file that declares it. Let's add this file to the project.

  • Create a new empty file called JavaInterfaces.h:
    • From the main menu: File -> New File -> Empty File in Project.
    • Name the file "JavaInterfaces.h".
    • Check the target PodReader if it's not checked.
    • Uncheck the target JavaClasses if it's checked.
    • Drag this file from the top down to Other Sources in the left-hand pane.

What have we done so far? We defined a target called JavaClasses, specified some Java classes and added them to it, and then added an Objective-C header file to the PodReader target that informs the compiler that this Java magic isn't really magic at all--it's simply a matter of referencing some code from elsewhere. With that said, let's fill in these empty files.

The first body we'll define is for PDFBoxBridge.java. Download it here. If you look at the PDFBox package you downloaded from PDFBox.org, you'll notice that PDFBoxBridge.java is almost exactly the same as installDir/PDFBox-0.6.7a/src/org/pdfbox/ExtractText.java, with just a couple of differences annotated in the header comments and inline. The idea was to change as little as possible and to clearly document any changes made to the file.

The Critical Cocoa-Java Link

The file PDFBoxWrapper.java offers a little more than first meets the eye. For now, notice that this class offers a public method that receives an NSArray of arguments (the PDF source file, among other possible things) and extracts the text from the PDF file using the PDFBoxBridge class from above.

Take note that we pass an NSArray to extractText and first have to convert it to a Java array "by hand," since the bridge's extractText is a Java method whose signature expects a Java array. Still, we're able to "seamlessly" pass an NSArray to PDFBoxWrapper by importing the Java NSArray class. Once we have the Java NSArray, we can easily manipulate it to get a pure Java String[] array. The lesson to take from this is that we cannot seamlessly pass an NSArray (even if it's filled only with NSStrings) directly to a Java method that requires a String[].

/*  PDFBoxWrapper.java */
import com.apple.cocoa.foundation.NSArray;

// Thin wrapper that lets Objective-C hand an NSArray of arguments to PDFBoxBridge.
public class PDFBoxWrapper {

    private PDFBoxBridge bridge;

    public PDFBoxWrapper() {
        bridge = new PDFBoxBridge();
    }

    // Copies the contents of the bridged NSArray into a plain Java String[].
    private String[] convertNSArrayToJavaArray(NSArray array) {

        NSArray newArray = new NSArray(array);
        String[] javaArray = new String[newArray.count()];
        newArray.getObjects(javaArray);

        return javaArray;
    }

    // Called from Objective-C as extractText:. Exceptions are reported here
    // rather than propagated across the bridge.
    public void extractText( NSArray args ) throws Exception {
        try {
            bridge.extractText(this.convertNSArrayToJavaArray(args));
        }
        catch (Exception e) {
            System.out.println("Exception in extractText in PDFBoxWrapper: " + e);
        }
    }

}
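
If you'd like to experiment with the "by hand" conversion without Cocoa on the classpath, the following is a minimal sketch that uses a plain java.util.List as a stand-in for the bridged NSArray. The class name ArrayConversionSketch and the method toStringArray are hypothetical and are not part of the project; the sketch also uses current Java syntax, so it is meant to be run with a modern JDK rather than pasted into the JavaClasses target.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: a java.util.List plays the role of the bridged NSArray.
public class ArrayConversionSketch {

    // Copies each element into a String[] and rejects anything that is not a
    // String, the case a blind cast would only reveal as a runtime error.
    static String[] toStringArray(List<?> items) {
        String[] result = new String[items.size()];
        for (int i = 0; i < items.size(); i++) {
            Object item = items.get(i);
            if (!(item instanceof String)) {
                throw new IllegalArgumentException(
                    "Element " + i + " is not a String: " + item);
            }
            result[i] = (String) item;
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder paths, in the same order the wrapper receives them:
        // source PDF first, output text file second.
        List<Object> bridged = Arrays.<Object>asList("/path/to/in.pdf", "/tmp/out.txt");
        String[] javaArgs = toStringArray(bridged);
        System.out.println(javaArgs.length + " arguments: " + Arrays.toString(javaArgs));
    }
}
```

The explicit type check is a design choice of this sketch, not something the wrapper above does; it simply makes a non-string element fail with a readable message.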

At this point, the Java classes are in place, but how do we relate our Objective-C code to the Java code? The simple header file below takes care of this for us. It specifies the method signature and vouches that the class PDFBoxWrapper exists, offers the method extractText:, and is accessible.

/*JavaInterfaces.h*/
@interface PDFBoxWrapper : NSObject
{}

- (void)extractText:(NSArray*)array;
@end

We're now ready to replace the stub we left for handling PDF files back in the AppController.m source file, which we designed last time. You should change the first block of your if statement in copyIt: to be the following code block. We'll address the "prelude shell script" comment in detail in a moment.

[progressIndicator startAnimation:nil];
    if ([[fileName pathExtension] isEqualToString:@"pdf"]) {
        
        //load the vm
        id vm = NSJavaSetupVirtualMachine();

        //embedded jar files. Loading the PDFBox jar files here didn't work.
        //In short, the best fix was to add a prelude shell script
        NSMutableArray* jarFiles = [NSMutableArray arrayWithObject:
            [[NSBundle mainBundle] pathForResource:@"JavaClasses" ofType:@"jar"]];
        
        NSJavaClassesFromPath(jarFiles, nil, YES, nil);
        
        PDFBoxWrapper* wrapper = NSJavaObjectNamedInPath(@"PDFBoxWrapper", nil);
        
        NSString* filePrefix = 
            [[[sourceFile stringValue] 
                lastPathComponent] 
                stringByDeletingPathExtension];

        NSString* tempFile =
            [[@"/tmp/" stringByAppendingString:filePrefix] 
                       stringByAppendingString: @".podreaderTEMP"];
        
        NSArray* args = [[NSArray alloc] 
            initWithObjects:[sourceFile stringValue], 
            tempFile, 
            nil];
            
        [wrapper extractText:args];        
    
        //used alloc, so "args" has a retain count of 1. release it
        [args release];
    
        [chunker chunkIt:tempFile
                   toDir:[[destDir stringValue] stringByExpandingTildeInPath] 
           withSeparator:[separatorValue stringValue]];
        
    }
    else if ([[fileName pathExtension] isEqualToString:@"txt"]) {

This should be fairly straightforward once you've looked it over. We start up a Java Virtual Machine (JVM), specify that a JAR file called JavaClasses (remember that target you added?) should be in the application bundle, and tell the JVM to load the classes from that location. We proceed to use the class you've seen called PDFBoxWrapper to extract the text from the PDF document to a temporary file located in the /tmp directory. From there, things are just like before: we take the temporary file and process it using the same text-handling routine as last time.
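
To see which temporary path the code above ends up with, here is a minimal Java sketch of the same naming rule: take the last path component, drop the extension, and wrap it in /tmp/ and .podreaderTEMP. The class TempPathSketch and its method tempPathFor are hypothetical names used only for illustration, and the string handling only approximates the Cocoa path methods for simple POSIX-style paths.

```java
// Hypothetical illustration of the temp-file naming used in copyIt:.
public class TempPathSketch {

    // Approximates lastPathComponent + stringByDeletingPathExtension for
    // simple paths, then builds /tmp/<name>.podreaderTEMP.
    static String tempPathFor(String sourcePath) {
        String name = sourcePath.substring(sourcePath.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        String prefix = (dot > 0) ? name.substring(0, dot) : name;
        return "/tmp/" + prefix + ".podreaderTEMP";
    }

    public static void main(String[] args) {
        // Placeholder source path; any PDF path would do.
        System.out.println(tempPathFor("/Users/me/Documents/Manual.pdf"));
    }
}
```

One thing the sketch makes visible: two PDFs with the same base name in different folders map to the same temporary file, so each extraction would overwrite the previous one.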

Compiling and Running the Application in Xcode

Now, you're very close to having compilable code within Xcode, but you're not quite there. Take a moment and ask yourself what's missing. If you need a hint, change the "Active Target" using the popup button in the upper left-hand corner of Xcode to "JavaClasses" and try to compile. Take note of the error messages you receive.


Change the active target from "PodReader" (currently set) to "JavaClasses"

A moment later: while in deep meditation, it occurred to you that all the Java references in PDFBoxBridge.java are undefined at the moment, and they must be defined for successful compilation to occur. By inspecting PDFBoxBridge.java, you astutely observed that the JAR file PDFBox-0.6.7a.jar (distributed with PDFBox), in particular, must be included in the project. Even though references to log4j-1.2.8.jar are commented out, it turns out that PDFBox-0.6.7a.jar itself refers to classes in log4j-1.2.8.jar, so that file must also be added to the project. Adding both JAR files resolves the compile-time errors.

To add these files, control-click on the JavaClasses target in the left-hand pane of Xcode, and then choose Add -> Existing Files.... Navigate to where you extracted PDFBox and add the log4j-1.2.8.jar file from the /PDFBox-0.6.7a/external directory. When adding the file, check the box at the top to copy the file into the project and make sure only the JavaClasses target is checked. Repeat the process for the PDFBox-0.6.7a.jar file in the /PDFBox-0.6.7a/lib directory. In the left-hand pane, drag each of the files that appears at the very top to Resources to keep things nice and tidy. Note that in the JavaClasses target each of these JAR files appears under Frameworks & Libraries.

If you plan on passing your work on to anyone else, you're required by the BSD license to add a copy of the PDFBox LICENSE file to your source at this time. Check your project directory in Finder or Terminal to make sure that these files were actually copied. At this point, you can successfully compile the JavaClasses target.


Add the JAR files packaged with PDFBox to your project
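
If you later want to confirm that both JAR files are actually visible to a running JVM, a small probe along the following lines can help. This is a sketch under assumptions: ClasspathProbe and isLoadable are hypothetical names, org.pdfbox.ExtractText is inferred from the source path mentioned earlier, and org.apache.log4j.Logger is assumed to be present in the log4j JAR. It is written in current Java syntax for a modern JDK.

```java
// Hypothetical diagnostic: reports whether given classes can be found
// on the classpath the probe itself was started with.
public class ClasspathProbe {

    // Returns true if the named class can be located by this class's loader.
    // The class is looked up but not initialized.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class names; adjust them to match the JAR files you added.
        String[] names = { "org.pdfbox.ExtractText", "org.apache.log4j.Logger" };
        for (String name : names) {
            System.out.println(name + (isLoadable(name) ? " found" : " MISSING"));
        }
    }
}
```

Run it with both JAR files on the classpath; a MISSING line points at the JAR that the JVM cannot see.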

Change the Active Target in the upper left-hand corner of Xcode back to PodReader and try to build. You should get an error or two. Being the good detective that you are, you see that in AppController.m there's an undefined reference to our PDFBoxWrapper class, so you add #import "JavaInterfaces.h" to the top of AppController.m, recompile, and all is well for compilation. Without that directive, the file JavaInterfaces.h doesn't exist as far as the compiler is concerned.
