macdevcenter.com

Multiprocessor Work Sharing with Cocoa

by Drew McCormack
05/23/2003

Many hands make light work.

With many Apple systems sporting dual processors, and rumors rife that four-processor systems are not far away, it's becoming more important that Mac programmers consider how multiprocessing can be tapped in their applications. Sure, the user benefits from having a multiprocessor system anyway, because the Mac OS X kernel divides the running applications amongst the available processors, but the kernel can't share the work of a single-threaded, numerically intensive application between processors. If your application hasn't been written to take full advantage of multithreading, it will not run any faster on two processors than on one. If your code includes any number crunching, you probably should consider how it could be restructured to take full advantage of that second processor, which will otherwise lie dormant, twiddling its thumbs while processor uno works its behind off.

There are different reasons to consider multithreading your app. Probably the most common is simply to keep the user interface responsive while an expensive operation is performed. The expensive operation gets split off into a worker thread, allowing the main thread to continue responding to the user. Cocoa's Distributed Objects (DO) classes make this a relatively painless process, and third-party classes like ThreadWorker help to reduce the pain still further. But this coarse-grained multithreading isn't the be-all and end-all of parallel processing. Big calculations can be carved up into bite-sized pieces, and then you have entered the realm of fine-grained parallelism.

Fine-Grained Parallelism and OpenMP

Symmetric Multi-Processor (SMP) systems, like the dual-processor Power Macs, are actually not new. They have been used in scientific fields for some time, with some systems combining hundreds of processors in a single shared-memory machine. (Makes your shiny new G4 tower seem small, doesn't it?) Scientific programmers have had to find ways of getting the most out of such monsters, and a standard for programming shared-memory architectures has evolved: OpenMP. OpenMP is basically a set of compiler directives that facilitate both coarse- and fine-grained parallel programming, without requiring the programmer to get into the nitty-gritty of multithreading. The compiler or precompiler inserts the code to generate the multiple threads required to carry out the calculation in the manner requested by the programmer through the directives.

To give you some idea how this works, here is a simple example of fine-grained parallelism, where the iterations in a loop are divided amongst multiple threads, which can then be assigned to different processors by the kernel.

#define N 90

float c, a[N];

...

// a and c get initialized here

...

int i;

#pragma omp parallel for
for ( i = 0; i < N; ++i ) {
    a[i] *= c;
}

You can probably see why OpenMP is gaining popularity in scientific programming circles. By adding a single line of code, in the form of a pragma, this otherwise serial code will take full advantage of multiple processors when they're available. The pragma in question simply instructs the compiler that the iterations of the for loop can be performed independently (i.e., in any order). The compiler, not knowing anything about the relative length of each iteration, simply divides the iterations as evenly as possible between the available threads. On a 10-processor computer, each processor would perform nine of the 90 iterations. In this case, with each iteration of equal duration, the speed-up of the code should be considerable, though almost certainly less than the 10-times factor that you might have been hoping for.

Work Sharing with Cocoa

OpenMP is great for low-level scientific programming, and it is available to Mac OS X developers in the form of the VAST precompiler. Although it is not the best fit for Cocoa development, we can certainly learn a lot about multiprocessor work sharing from OpenMP, and in this article I am going to develop a few classes that mimic some of its features in any Cocoa program. I'll also present some code to test the work-sharing classes, and use it to demonstrate certain aspects of multiprocessing.

Schedulers that Schedule Schedulables

When you think about it a bit, what we want is not that difficult. We have an arbitrary task that we want to share between processors. Let us assume this task can be divided up into what we will call "work units." The only restriction on these work units is that they must be able to be performed independently of one another. This requirement aside, the work units could be just about anything. Maybe you need to process a number of images. You could make the processing of a single image equivalent to a work unit. Perhaps you have only one image, but you have a very expensive filter to apply to it. In this case, you may decide that the filtering of a single row of the image constitutes a work unit. Your work units don't even need to be homogeneous. That is, you could have work units that are completely unrelated making up your task, such as downloading some data, filtering an image, and performing a calculation.

Now that we know what work units are, we need something that will set up threads and allocate the work units to them. We need a class that will schedule the work units on the available threads: a Scheduler class. Scheduler is an abstract base class; concrete subclasses of Scheduler represent particular strategies for scheduling the work units on the threads. We will introduce two of these here: static scheduling and dynamic scheduling. Both forms of scheduling also arise in OpenMP.

It may seem hard to believe, but our design is practically complete. We have a Scheduler class, which must be able to set up threads and schedule work units to run on them. The Scheduler doesn't know anything about how the work units are actually performed, though, just as the compiler didn't know anything about the iterations of the for loop in the OpenMP example above. The class that "commissions" the work is the one that must actually perform the work units. This class could be anything from a controller class to an NSView subclass. It would thus be unwise to require that our commissioning class be a subclass of some abstract base class like WorkCommissioner. Instead, we will introduce a protocol to which the commissioning class must conform: Schedulable. An object implementing the Schedulable protocol can be scheduled by a Scheduler. Seems logical to me.

The Scheduler Class Interface

We have all of the players in our game. Now to give them an interface, so that they can talk to one another. The Scheduler class is actually the main player in an OO design pattern called "Strategy" (see Design Patterns by Erich Gamma et al.). The abstract Scheduler class acts as a placeholder for different scheduling strategies, which are implemented as descendants of Scheduler. In the Strategy design pattern, a method is defined for executing the strategy, which is then customized in the subclasses to achieve different scheduling policies. In this case, our method will be performScheduleForWorkUnits:.

In addition, Scheduler needs methods to initialize itself with a Schedulable object, and the number of threads it should create. A method to cancel a schedule once it has started could come in handy, too. All of these considerations lead to the following interface, with accessors omitted:

-(id)initForSchedulableObject:(id <Schedulable>)schedObj 
    andNumberOfThreads:(unsigned)numThreads;
-(id)initForSchedulableObject:(id <Schedulable>)schedObj;
-(void)dealloc;

-(void)performScheduleForWorkUnits:(NSSet *)workUnits;
-(void)cancelSchedule;

// Template method. Override in subclasses
-(NSSet *)_workUnitsToExecuteForRemainingUnits:(NSSet *)remainingUnits;

The only nontrivial method that I haven't yet mentioned is the last one. This is referred to in OO lingo as a "Template Method," and it is the second design pattern we have come across in the Scheduler class. (It is actually a nice example of how design patterns get combined in real-life development.) It is an abstract method that is intended to be overridden in concrete subclasses, effectively defining the scheduling strategy used. Typically, the performScheduleForWorkUnits: method would be abstract in the Strategy design pattern, but I found that much of what this method does, particularly setting up the threads, was the same for all subclasses of Scheduler. Employing the Template Method design pattern allows you to keep the common code in the base class, and delegate only the specialized code to the subclasses. Note also that the method name begins with an underscore, to indicate that it should be treated as a protected method and only called from the Scheduler class and its descendants.

We won't look at the implementation of Scheduler here, though you are certainly free to download it and read it yourself. It is worth considering, though, what happens in the Scheduler class when you call the performScheduleForWorkUnits: method. First, you pass to this method an NSSet object containing the work units you want performed. NSSet is a somewhat obscure Cocoa collection class that you don't see used very often, but in this case it is exactly what we need. It represents an unordered set of distinct objects. NSSet defines the sort of mathematical operations that you would expect from a set class, and these are more efficient for our purposes than, say, fudging the same operations with the NSArray class.

The Scheduler class sets up a number of threads, and then it calls the _workUnitsToExecuteForRemainingUnits: method for each thread in turn. It passes in an NSSet object that contains all of the work units that remain to be performed. In the Scheduler subclass, where this method has been implemented, a new NSSet of work units is created, based on what remains and the particular scheduling strategy that the subclass implements. This set contains the work units that should be scheduled to run in the next thread. These work units are removed from the set of remaining work units, and the Scheduler repeats this process until all of the units have been completed. Whenever a thread finishes its current set of work units, it gets some more, until there are no more units remaining.

To finish off the Scheduler class, making it a good Cocoa citizen, we add a number of delegate methods.

@interface Scheduler (SchedulerDelegateMethods)

// Sent in the main thread
-(void)schedulerWillBeginSchedule:(Scheduler *)sender;

// The following are sent in the worker thread
-(BOOL)scheduler:(Scheduler *)scheduler shouldBeginUnits:(NSSet *)units;
-(void)scheduler:(Scheduler *)scheduler didCompleteUnits:(NSSet *)units;

// Sent in the main thread
-(void)schedulerDidCancelSchedule:(Scheduler *)scheduler;
-(void)schedulerDidFinishSchedule:(Scheduler *)scheduler;

@end

These methods speak for themselves. The only aspect to be wary of is that some are sent in a worker thread, and some in the main thread, as indicated in the comments.

The Schedulable Protocol

Of course, a Scheduler has to have something to schedule, and this is defined in an object that conforms to the Schedulable protocol. The Scheduler knows nothing about the work units to be performed, other than the identifying object that it gets passed. The actual work that a work unit represents is defined by the Schedulable object.

The Schedulable protocol is actually very simple. Here it is:

@protocol Schedulable 
-(void)performWorkUnits:(NSSet *)workUnits forScheduler:(Scheduler *)scheduler;
@end

The method performWorkUnits:forScheduler: will get called by the Scheduler from a worker thread with an NSSet of work units to perform. This method must know what the work units represent, and how to perform them.

Note that in all of this we have made no assumption about what sort of objects the work units in the NSSets are. The only restriction is that they are objects, so that they can be added to an NSSet. The work units could be NSNumber objects representing row indexes, for example, or they could be NSImages that require processing. What the work units are is left entirely to the Schedulable object's discretion.

Scheduling Strategies

I mentioned above that the Scheduler class is the base class in the Strategy design pattern. We now need to have some concrete subclasses that define different scheduling strategies. We will implement two here: StaticScheduler and DynamicScheduler.

Each of these classes represents a type of scheduling used in OpenMP. Static scheduling is the simplest: you take the total number of work units to be completed and divide them as evenly as possible between the available threads. This division is called static because it takes no account of what happens at runtime.

Dynamic scheduling, as the name implies, does take account of runtime conditions. Individual work units are scheduled on each thread, and new work units are assigned whenever a thread finishes one. The work units are thus not necessarily evenly distributed between the threads upon completion. How many work units each thread performs will depend on whether it got exclusive access to its processor, or had to share it with another process. It will also depend on the relative lengths of the units: if they are not all the same size, threads performing shorter work units will complete more of them than threads stuck with long ones. This is not the case with static scheduling, where each thread performs the same number of units; if one thread gets smaller work units, it will simply idle while it waits for the others to finish.

How a scheduling policy divides its work between processors is closely related to the subject of load balancing. It is generally desirable to balance the workload as evenly as possible between the available processors; otherwise, some processors may be left waiting for others to finish and inefficiently idling away. In general, dynamic scheduling leads to better load balancing than static scheduling, because it divides the work based on execution time and processor availability, rather than assuming all work units are equal in length. But dynamic scheduling doesn't always lead to better load balancing, as we will see below, and it does incur extra overhead, because there is some synchronization of threads required each time a new work unit is retrieved for processing. This occurs a lot more often in dynamic scheduling than in static scheduling.

The Load-Balancing Test Utility

Now that we have a complete multiprocessor work sharing design, we can carry out some experiments to see how the different scheduling strategies work, and which is best under a given set of circumstances. For this purpose, I have written a small test application. If you haven't downloaded the source code yet, do so now. Note that you can run these calculations on a single- or multiprocessor machine.

[Screenshot: the load-balancing test utility.]

Open the project "Work Sharing" in Project Builder, and build and run the app. Press the big Run button to start the work-sharing simulation. After a short pause, you should see the progress indicators at the bottom show how much time was used by each processor to complete the work units, 100 in all. You should see that each processor spent the same amount of CPU time to process the work. The wall clock time for all units to complete is the time corresponding to the longest progress indicator at the end of the simulation.

The user interface of the test utility is pretty self-explanatory. You can change the number of processors, the number of work units, and the scheduler class. You also have some control over the length of the work units; you can choose how much variation there is between the unit lengths. The variation is given as a number between 0 and 1, with 0 representing no variation (i.e., all units are of equal length), and 1 representing maximum variation, with units ranging in length from 0 to 2.

Having set the variation in unit length, you can also choose how that variation is distributed over the work units. You can choose a "Random" distribution, for example, where the length of each work unit is chosen randomly in the range defined by the variation. But you can also choose a "Linear" distribution, which sets work unit lengths in a linearly increasing manner, with the first work units short and the last units long. The third option, "Quadratic," is similar to "Linear," but the distribution is quadratic, which basically means that there are fewer long units. The checkbox controls whether the random number generator is seeded when a Random distribution is used, which matters if you want to make a direct comparison between two different scheduling strategies. Select the checkbox whenever you wish to make such a comparison using the Random distribution.

Experiments in Scheduling

Here are some simple experiments that you can perform to gain insight into the various scheduling approaches.

  • Set the number of processors to 4, the number of work units to 100 with Random distribution and 0 variation, and the scheduler to StaticScheduler. Run. Now change the variation to 1. Run again. What do you notice? Now change the number of work units to 7 and re-run. Now what do you see, and what does it tell you about how the number of units relates to load balancing?

  • Repeat all of this for DynamicScheduler. What do you notice about how the progress indicators are updated, and how does this relate to the dynamic nature of the scheduler? Is DynamicScheduler any better than StaticScheduler in these cases?

  • Repeat these tests for each scheduler, but with the Linear work-unit length distribution. Which scheduler produces the best load balancing for 100 work units? What about for 7 work units?

  • Repeat these last tests for the Quadratic distribution, rather than the Linear distribution. How much shorter is the wall-clock time of the DynamicScheduler relative to the StaticScheduler, when 100 work units are used?

What You've Hopefully Learned

In your experiments you should have learned the following:

  • If the number of work units is small relative to the number of CPUs, load balancing suffers no matter which scheduler you use. More work units mean finer granularity, and better load balancing.

  • The DynamicScheduler is no better than the StaticScheduler when the work unit lengths are randomly selected, and it does suffer from thread-synchronization overhead, as witnessed by the constant updates of the progress indicators.

  • When there is a systematic variation in the work unit length, such as is the case for the Linear and Quadratic distributions, the StaticScheduler performs badly relative to the DynamicScheduler, which can compensate for the variation.

In general, the DynamicScheduler class will be the better choice, but if your work units are very short (making synchronization of threads an important issue), you may consider the StaticScheduler, which avoids this overhead.

Where to Go from Here

Feel free to use the scheduling classes in your own code. And if you are really interested, why not try writing other Scheduler subclasses? OpenMP defines a third type of scheduling called "guided," for example. Or maybe you could create a neural-net scheduler that tries to adapt the schedule according to what it knows about the work units that have already run, accounting for systematic change in the work-unit distribution and the availability of processors. The sky's the limit, but don't be surprised if you can't do much better than the two schedulers I have presented. The KISS ("Keep It Simple, Stupid") principle very likely applies here.

Drew McCormack works at the Free University in Amsterdam, and develops the Cocoa shareware Trade Strategist.

