In early February 2002, Apple announced an accelerated implementation of BLAST, a bioinformatics tool from the National Center for Biotechnology Information (NCBI). This software is used for protein and DNA searches in biomedical research and drug discovery. Co-developed with Genentech, this high-throughput version of BLAST takes advantage of AltiVec technology on the PowerPC G4 processor, running up to five times faster than the standard BLAST implementation.
Apple first introduced Power Mac G4 computer systems using AltiVec, a high-performance vector processing extension to the PowerPC architecture, in the fall of 1999. Architecturally, AltiVec adds 128-bit-wide vector execution units to the PowerPC architecture. Early versions of the G4 processor had a single AltiVec unit, while more recent versions have up to four units (simple, complex, floating, and permute). These vector units operate independently of the traditional integer and floating-point units.
For all its marketing hurrah, the concept of AltiVec is exceedingly simple: it processes data in multiples, working on a whole mouthful of data instead of tackling it one nibble at a time like the integer and floating-point units.
How big the mouthful is depends on the type of data. Within the AltiVec unit, data is held in 128-bit "short vector" registers, each capable of holding four single-precision (32-bit) floating-point words, four 32-bit integer words, eight 16-bit integer halfwords, or sixteen 8-bit integer bytes. AltiVec operates on all data elements in a vector register simultaneously, with every instruction. This is known as SIMD (single instruction, multiple data) parallel processing.
For example, in computations involving single-precision floating-point numbers, AltiVec can offer 4-way parallelism by simultaneously operating on four elements of a vector with each instruction.
Consider a case where we have 20,000 pieces of floating-point data to process. In traditional "scalar" processing (with the floating-point unit), we would operate on each piece of data separately. This would require 20,000 cycles to perform each operation. With AltiVec, we could operate on four pieces of data at a time, requiring just 5,000 cycles for each operation. Right off the bat, that gives us a factor of four reduction in processor cycles.
To take advantage of the AltiVec unit, 162 new vector operations were added to the PowerPC instruction set, including functions like add, multiply-add, subtract, absolute value, truncate, round, sum, multiply-sum, average, and exponent. Simple combinations of these functions allow for the fast and easy computation of operations such as the dot product.
All AltiVec instructions are fully pipelined with single-cycle throughput, and each can address up to three source vector registers and one destination vector register during execution. While the factor-of-four example given above is a good guideline for floating-point data, performance boosts range from less than a factor of four in operations involving lots of loads and stores and/or large data sets (like the floating-point and integer units, AltiVec still relies heavily on memory/cache access and bus performance) to much more than a factor of four in cases with multiple compound operations (like matrix multiplication).
One of the coolest aspects of AltiVec is that anyone can experiment with it quite easily; the C compiler included as part of Apple's Mac OS X Developer Tools has AltiVec support built right in. This compiler can be used to step through a few benchmarks that illustrate some of AltiVec's capabilities. We'll consider the following four examples: basic vector addition, scaling and translation, 2D rotation, and matrix multiplication. These examples are typical in math, science, and engineering computations, but are also applicable to things like image processing, encryption, and graphics manipulation. All cases will use floating-point data.
The first example is basic vector addition: we have two sets of data (x and y) that we want to add together to form a third set of data (z). The data sets x, y, and z are treated as vectors, each containing "n" elements of data:
x = (x1, x2, x3, x4, x5, ..., xn)
y = (y1, y2, y3, y4, y5, ..., yn)
z = (z1, z2, z3, z4, z5, ..., zn)
To implement the vector addition with traditional scalar code, we would do something like:
for (i=1; i<=n; i++) {
    z[i]=x[i]+y[i];
}
where x, y, and z are defined as "float." So, it will take n cycles to go through and add the n elements of data together. With AltiVec, we can use the vec_add instruction as follows:
for (i=1; i<=n/4; i++) {
    z[i]=vec_add(x[i],y[i]);
}
Here, x, y, and z are defined as "vector float." Note that the loop count is reduced by a factor of 4 (the upper limit is changed from n to n/4), since the vec_add function actually operates on 4 pieces of data at a time. Thus, the total number of loop cycles required to add the n elements of data together is just n/4.
When benchmarked on a 400MHz PowerBook G4 with n=1000, the scalar computation performed at 88 MFLOPS while the vector computation performed at 345 MFLOPS. This nets a factor of 3.9 increase in performance by using AltiVec. Part of this great performance is due to the fact that the vector length of 1,000 results in small vectors (about 4KB each) that fit into the processor's L1 cache.
This performance advantage will decrease as the vector length gets longer and data spills out into the L2 cache and RAM. Since the vector add operation actually consists of two loads, an add, and a store, three load/store operations are required for every add. This makes a vector add operation expensive for large amounts of data.

The next example combines scaling and translation. Mathematically, this operation is represented by:
y = ax + b
where x and y are vectors and a and b are constants. In this operation, x is scaled by a factor of a and then shifted by b. With scalar computations, this can be implemented as:
for (i=1; i<=n; i++) {
    y[i]=alpha*x[i]+beta;
}
where x, y, alpha, and beta are defined as floats. Using AltiVec's vector multiply-add instruction, the same operation can be written as:
for (i=1; i<=n/4; i++) {
    y[i]=vec_madd(alphaV,x[i],betaV);
}
Here, x and y are vector floats, as are alphaV and betaV:
alphaV = (alpha, alpha, alpha, alpha)
betaV = (beta, beta, beta, beta)
These four-element vectors were created from alpha and beta in order to use them in the vec_madd function. On the PowerBook G4, again with n=1000, the results are 198 MFLOPS for the scalar computation and 386 MFLOPS for the vector computation. Here, the performance gain from AltiVec is a factor of 1.9.
In this example, data points in a 2D x-y plane are mapped into a new u-v coordinate system through a rotation of axes by an angle θ. The equations for this transformation are:
u = x cos θ + y sin θ
v = –x sin θ + y cos θ
To carry out this mapping transformation for a collection of “n” (x,y) points, the following scalar computation can be used:
for (i=1; i<=n; i++) {
    u[i]=x[i]*c+y[i]*s;
    v[i]=-x[i]*s+y[i]*c;
}
where s = sin θ and c = cos θ. Using AltiVec's vector multiply-add function, the computation can be vectorized as follows:
for (i=1; i<=n/4; i++) {
    u[i]=vec_madd(x[i],cV,zeroV);
    u[i]=vec_madd(y[i],sV,u[i]);
    v[i]=vec_madd(x[i],msV,zeroV);
    v[i]=vec_madd(y[i],cV,v[i]);
}
As before, the following vector floats were created for this computation:
cV=(c, c, c, c)
sV=(s, s, s, s)
msV=(–s, –s, –s, –s)
zeroV=(0., 0., 0., 0.)
For n=1000, the two approaches give the following performance: 300 MFLOPS for the scalar computation, and 472 MFLOPS for the vector computation, a factor of 1.6 performance increase from AltiVec.
The last example involves the multiplication of n x n matrices A and B to form the n x n matrix C:

C = AB

In this multiplication, the ij element of matrix C is formed by the dot product of the ith row of A and the jth column of B:

c_ij = a_i1 b_1j + a_i2 b_2j + . . . + a_in b_nj
Using a scalar computation, the matrix multiplication can be structured as follows:
for (i=1; i<=n; i++) {
    for (j=1; j<=n; j++) {
        for (k=1; k<=n; k++) {
            c[i][j]=c[i][j]+a[i][k]*b[k][j];
        }
    }
}
This algorithm involves 2n^3 floating-point operations, carried out by a total of 6n^3 instructions (3 loads, an add, a multiply, and a store are required each time the computation is evaluated). The easiest way to implement this matrix multiplication using AltiVec is to vectorize the inner "k" loop to use a vector multiply-add function. Conceptually, this would look like:
for (i=1; i<=n; i++) {
    for (j=1; j<=n; j++) {
        for (k=1; k<=n/4; k++) {
            c[j+(i-1)*n]=vec_madd(a[k+(i-1)*n],b[j+(k-1)*n],c[j+(i-1)*n]);
        }
    }
}
Here, the 2D array indexing has been replaced by a single index [j+(i-1)n] that threads through the data in "vector" form. Using the multiply-add, the a_ik b_kj multiplication is carried out and added to c_ij, repeatedly, in vector form. AltiVec would march through this computation four elements at a time, until all of the a_ik b_kj products have been carried out.
In reality, the algorithm is not so simple, and neither is the contraction of indices: a few additional lines of code are required to stuff the multidimensional arrays into AltiVec vector functions. But the end result is quite good. On the PowerBook G4 with n=200, the scalar computation runs at 84 MFLOPS, while the AltiVec version runs at 384 MFLOPS, a factor of 4.6 improvement. An even better matrix multiplication algorithm available from Apple boosts that performance to a whopping 681 MFLOPS, an improvement close to a factor of 8.
This is a good example of how AltiVec can really pay off for complex computations. For a more detailed look at the AltiVec matrix multiplication code, see Apple’s Developer Web site.
One final note: each of these examples assumed that n was evenly divisible by four, but that is not a requirement. Arrays can be padded with dummy elements (rounding the array size up to the next largest multiple of 4), or a scalar "cleanup" loop can be used to pick up any leftovers. An example is shown below:
for (i=1; i<=n/4; i++) {
    z[i]=vec_add(x[i],y[i]);
}
m = n % 4;
for (i=n-m+1; i<=n; i++) {
    z[i]=x[i]+y[i];
}
Here, m is the remainder of n divided by 4: either 0, 1, 2, or 3. In cases where n is evenly divisible by 4, m=0 and the scalar cleanup loop will be bypassed.
This article took a brief look at AltiVec and illustrated some of the performance gains available with this technology. For specific applications, you need to take a close look at the code and see where AltiVec can be used. In many cases, using AltiVec will involve very minor code changes (more of a code tune-up), while other cases will require in-depth restructuring and rewriting to allow for vectorization. How well AltiVec works will depend on many factors, but if your code hinges on key computations that can be vectorized, AltiVec will surely make a difference.
Craig Hunter is an aerospace engineer at NASA Langley Research Center in Hampton, Virginia.
Copyright © 2009 O'Reilly Media, Inc.