SDSU CS 662 Theory of Parallel Algorithms
Speedup

San Diego State University -- This page last updated February 20, 1996
----------

Contents of Speedup Lecture

  1. Parallel Speedup
  2. References
  3. How much Speedup is Possible?
    1. Amdahl's Law - Serial Bottlenecks
    2. Principle of Unitary Speedup
    3. Superlinear Speedup in Theory
    4. Superlinear Speedup in Practice

Parallel Speedup
References


S. G. Akl, M. Cosnard, and A. G. Ferreira, Data-movement-intensive problems: two folk theorems in parallel computation revisited, Theoretical Computer Science 95 (1992) 323-337

G. S. Almasi and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 1989

comp.parallel newsgroup, January 1995

Cosnard and Trystram, Parallel Algorithms and Architectures, International Thomson Computer Press, 1995

How much Speedup is Possible?


Amdahl's Law - Serial Bottlenecks


Definitions
Ts(N) = time required by best sequential algorithm to solve a problem of size N
Tp(N) = time required by parallel algorithm using p processors to solve a problem of size N
Sp(N) = Ts(N)/Tp(N)
Sp(N) is the speedup achieved by the parallel algorithm



Amdahl's law

Assume
An operation takes one time unit
A fraction 0 < f < 1 of the operations must be done sequentially
Then Sp(N) <= 1/f

Proof:

We have
f*Ts(N) = number of operations that must be done sequentially
(1-f)*Ts(N) = number of operations that can be done in parallel

We get
Tp(N) = f*Ts(N) + (1-f)*Ts(N)/p
so
Sp(N)
= Ts(N)/(f*Ts(N) + (1-f)*Ts(N)/p)
= 1/(f + (1-f)/p)


But (1-f)/p > 0, so f + (1-f)/p > f


Thus Sp(N) <= 1/f

Amdahl's Law with 5% Sequential Operations

Maximum speedup obtainable on an algorithm if 5% of its operations must be done sequentially
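
The bound can be checked numerically. A minimal sketch, assuming the formula Sp(N) = 1/(f + (1-f)/p) derived above; the function name amdahl_speedup and the list of processor counts are illustrative choices, not part of the lecture.

```python
def amdahl_speedup(f: float, p: int) -> float:
    """Speedup with p processors when a fraction f of the operations is sequential."""
    return 1.0 / (f + (1.0 - f) / p)


if __name__ == "__main__":
    f = 0.05  # 5% of the operations must be done sequentially
    for p in (1, 2, 4, 16, 64, 256, 1024):
        print(f"p = {p:5d}   speedup = {amdahl_speedup(f, p):6.2f}")
    # The speedup approaches, but never reaches, 1/f
    print(f"limit 1/f = {1.0 / f:.0f}")
```

However many processors are added, the printed speedup stays below 1/f = 20.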

Amdahl's Law Expanded


In many algorithms the fraction of the operations that must be done sequentially is not a constant f, but a function, f(N), of the size of the input of the algorithm.

So the law states
Sp(N) <= 1/f(N)

Example - Sorting N Numbers

Assume
The only operations that must be done sequentially are the reading of the N numbers from disk

Total amount of work for sorting = Θ( N*lg(N) )

Sequential operations = Θ( N )

So
f(N) = Θ( N/{N*lg(N)} ) = Θ( 1/lg(N) )

Example - Multiplying Two N*N Matrices

Two N*N matrices have a total of 2*N*N elements

Assume
The only operations that must be done sequentially are the reading of the 2*N*N numbers from disk


The straightforward method of multiplying two matrices takes
Θ( N*N*N ) operations

Sequential operations = Θ( N*N )

So
f(N) = Θ( N*N/{N*N*N} ) = Θ( 1/N )
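
The two examples give bounds 1/f(N) that grow with N. A rough sketch, assuming every hidden constant in the Theta terms is exactly 1 (a simplification made only for illustration); the function names are hypothetical.

```python
import math


def sort_speedup_bound(n: int) -> float:
    """1/f(N) for sorting: total work N*lg(N), sequential work N."""
    total_work = n * math.log2(n)
    sequential_work = n
    return total_work / sequential_work  # equals lg(N)


def matmul_speedup_bound(n: int) -> float:
    """1/f(N) for N*N matrix multiplication: total work N^3, sequential work N^2."""
    total_work = n ** 3
    sequential_work = n ** 2
    return total_work / sequential_work  # equals N


if __name__ == "__main__":
    for n in (16, 1024, 65536):
        print(n, sort_speedup_bound(n), matmul_speedup_bound(n))
```

Under these assumptions the sorting bound grows only like lg(N), while the matrix bound grows like N.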




How realistic are these two examples?

Lee's Generalized Amdahl's Law

Let qk be the percentage of the program that can be executed with k processors.

Let t1 be the time to run the program sequentially.

We have:
and

So

Setting qk = 1/p we get
So
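
A minimal sketch, assuming the commonly stated form of the generalized law: the qk sum to 1 and the parallel time is t1 * (q1/1 + q2/2 + ... + qp/p), so the speedup is 1/(sum of qk/k). The function name lee_speedup is an illustrative choice. Under this assumption, setting qk = 1/p gives a speedup of p/H(p), where H(p) = 1 + 1/2 + ... + 1/p, which grows roughly like p/ln(p).

```python
import math


def lee_speedup(q: list[float]) -> float:
    """Speedup when q[k-1] is the fraction of the program run on exactly k processors.

    Assumes the fractions sum to 1 and that work run on k processors
    takes 1/k of its sequential time.
    """
    assert abs(sum(q) - 1.0) < 1e-9, "fractions must sum to 1"
    return 1.0 / sum(qk / k for k, qk in enumerate(q, start=1))


if __name__ == "__main__":
    for p in (4, 64, 1024):
        uniform = [1.0 / p] * p  # qk = 1/p for every k
        print(p, round(lee_speedup(uniform), 2), round(p / math.log(p), 2))
```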

Stone's table (1973)
Speed Up    Examples
a*p         Matrix computations
            Sorts, Linear recursions, polynomial evaluation
            Search for an element in a set

Principle of Unitary Speedup


Lemma 1:
If N processors can perform a computation in one step, then P processors can perform the same computation in ceiling(N/P) steps for 1 <= P <= N

proof:
Each of the N original processors performs one operation
Call that operation I[j] for j = 1, ..., N
Each of the P processors performs the operations of at most ceiling(N/P) original processors
So each of the P processors must perform at most ceiling(N/P) operations
Note the P'th processor may perform fewer operations

Corollary:
If P processors can perform a computation in one step, then one processor can perform the same computation in P steps
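
The simulation in Lemma 1 can be sketched as a block assignment of the N original operations to P processors. This is one possible assignment, assuming contiguous blocks; the name assign_operations is hypothetical.

```python
import math


def assign_operations(n: int, p: int) -> list[list[int]]:
    """Give operations 1..n to p processors in contiguous blocks of ceiling(n/p)."""
    block = math.ceil(n / p)
    return [
        list(range(i * block + 1, min((i + 1) * block, n) + 1))
        for i in range(p)
    ]


if __name__ == "__main__":
    blocks = assign_operations(10, 4)
    print(blocks)
    # Number of steps = the largest number of operations on any one processor
    print(max(len(b) for b in blocks), math.ceil(10 / 4))
```

With N = 10 and P = 4, the first three processors each get 3 operations and the last gets 1, so the simulation takes ceiling(10/4) = 3 steps.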

Folk Theorem 1. Unitary Speedup
For any problem of size N and any number of processors P we have Sp(N) <= P. That is, Ep(N) <= 1, where Ep(N) = Sp(N)/P is the efficiency

proof:
Ts(N) = time required by best sequential algorithm to solve a problem of size N
Tp(N) = time required by parallel algorithm using p processors to solve a problem of size N
Sp(N) = Ts(N)/Tp(N)
Using the corollary a single processor can perform the same operations as the P processors in P*Tp(N) time
So Ts(N) <= P*Tp(N)
Thus Sp(N) <= P*Tp(N) / Tp(N) = P
Folk Theorem 2. (Brent)
If an algorithm involving a total of N operations can be performed in time T on a PRAM with sufficiently many processors, then it can be performed in time T + (N - T)/P on a PRAM with P processors.
proof:
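Let Si be the number of operations performed at time step i on all of the original processors, i = 1, ..., T, so that S1 + S2 + ... + ST = N
Using P processors, the i'th step can be simulated in ceiling(Si /P) time
But ceiling(Si /P) <= (Si /P) + (P - 1)/P
Summing over the T steps, the total time is at most N/P + T*(P - 1)/P = T + (N - T)/P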
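
The bound in Brent's theorem can be checked on a concrete schedule. A small sketch, assuming the per-step operation counts Si are given as a list; the function names and the example counts (a tree sum of 16 numbers) are illustrative.

```python
import math


def simulated_time(ops_per_step: list[int], p: int) -> int:
    """Time to simulate every step with p processors: sum of ceiling(Si/p)."""
    return sum(math.ceil(s / p) for s in ops_per_step)


def brent_bound(ops_per_step: list[int], p: int) -> float:
    """The bound T + (N - T)/p, with T steps and N total operations."""
    t = len(ops_per_step)
    n = sum(ops_per_step)
    return t + (n - t) / p


if __name__ == "__main__":
    tree_sum = [8, 4, 2, 1]  # operations per step when adding 16 numbers pairwise
    for p in (1, 2, 3, 8):
        print(p, simulated_time(tree_sum, p), brent_bound(tree_sum, p))
```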

Uses of Brent's Theorem
It is possible to improve the cost of a parallel algorithm by using fewer processors

Example - Parallel Add

Adding N integers with P = N/2 processors

Assume that N is a power of 2
J = N/2
while J >= 1 do
	for K = 1 to J do in Parallel
		Processor K: A[K] = A[2K-1]+A[2K]
	end for
	J = J/2
end while


Time Complexity Θ( lg(N) )

Cost Θ( N*lg(N) )
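
The parallel add loop can be simulated sequentially. A sketch, assuming a synchronous PRAM step in which all processors read before any processor writes; the function name parallel_add and the returned step counter are illustrative additions.

```python
def parallel_add(values: list[int]) -> tuple[int, int]:
    """Simulate the pairwise parallel add; return (sum, number of parallel steps)."""
    n = len(values)
    assert n >= 2 and n & (n - 1) == 0, "N must be a power of 2"
    a = [None] + list(values)  # shift to 1-based indexing, as in the pseudocode
    steps = 0
    j = n // 2
    while j >= 1:
        # All J processors read their two inputs, then all write: one parallel step
        sums = [a[2 * k - 1] + a[2 * k] for k in range(1, j + 1)]
        a[1:j + 1] = sums
        steps += 1
        j //= 2
    return a[1], steps


if __name__ == "__main__":
    print(parallel_add([1, 2, 3, 4, 5, 6, 7, 8]))
```

For N = 8 the simulation takes lg(8) = 3 parallel steps with N/2 = 4 processors.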
Optimal Parallel Add

Let P = ceiling( N/lg(N) ) be the number of processors
Let Q be the smallest power of 2 with Q >= P, and let B[P+1], ..., B[Q] = 0

for I = 1 to P do in Parallel
	Processor I: 
		B[I] = 0;
		for K = 1 to lg(N) do
			if (I-1)*lg(N)+K <= N then
				B[I] = A[(I-1)*lg(N)+K] + B[I]
		end for
end for

J = Q/2

while J >= 1 do
	for K = 1 to J do in Parallel
		Processor K: B[K] = B[2K-1]+B[2K]
	end for
	J = J/2
end while


Time Complexity Θ(lg(N)) + Θ( lg[N/lg(N)] ) = Θ( lg(N) )


Cost Θ( lg(N) * N/lg(N) ) = Θ( N )
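
The two-phase algorithm can be simulated as follows. A sketch under stated assumptions: each processor sums a block of ceiling(lg(N)) consecutive elements, and the partial sums are padded with zeros to a power of 2 so that the pairwise phase works for any N >= 2. The function name and the padding are illustrative choices.

```python
import math


def optimal_parallel_add(values: list[int]) -> tuple[int, int, int]:
    """Return (sum, parallel steps, processors used)."""
    n = len(values)
    assert n >= 2
    chunk = math.ceil(math.log2(n))  # elements summed by each processor
    p = math.ceil(n / chunk)         # number of processors
    # Phase 1: every processor adds its own block sequentially (chunk steps)
    b = [sum(values[i * chunk:(i + 1) * chunk]) for i in range(p)]
    steps = chunk
    # Phase 2: pairwise parallel add over the partial sums, padded with zeros
    q = 1
    while q < p:
        q *= 2
    b += [0] * (q - p)
    j = q // 2
    while j >= 1:
        b[:j] = [b[2 * k] + b[2 * k + 1] for k in range(j)]
        steps += 1
        j //= 2
    return b[0], steps, p


if __name__ == "__main__":
    total, steps, procs = optimal_parallel_add(list(range(1, 17)))
    print(total, steps, procs, "cost =", steps * procs)
```

For N = 16 this uses 4 processors and 6 steps, a cost of 24, compared with 8 processors and 4 steps (cost 32) for the plain parallel add.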

Superlinear Speedup in Theory


Problem: Messy List
We are given P distinct integers I1, I2, ..., IP such that Ik <= P for all k.
Note Ik can be negative.
The integers are stored in an array A, such that A[K] = Ik
Modify A so that for 1 <= K <= P we have:
A[Ik] = Ik if and only if 1 <= Ik <= P
A[K] = Ik otherwise


Example

Let P = 4
                  A[1]    A[2]    A[3]    A[4]

Original values    4       1      -2       3

Modified values    1       1       3       4

Parallel Solution
for I = 1 to P do in Parallel
	Processor I: if A[I] >= 1 then A[A[I]] := A[I]
end for

Time required: 1 time unit

Sequential Solution

Theorem.
The problem Messy List cannot be solved in fewer than 2P - 1 time units using the RAM model.

proof:

At some point the solution will perform an operation like:
Read X; if X > 0 then A[X] = X    (1)
Consider the first time we perform such an operation
Since we overwrite A[X] in line (1) we need to save the old value of A[X] before we do line (1)
But X can be any index between 1 and P so we need to save P - 1 elements of A before we perform (1)
We also need to perform (1) P times.

Since the parallel solution takes 1 time unit, the speedup is at least 2P - 1, which is greater than P for P > 1
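
The need to save old values can be seen by simulation. A sketch, assuming a synchronous PRAM step (all reads happen before any write) and skipping values outside 1..P; both function names are hypothetical. The second function is a deliberately naive in-place sequential loop that saves nothing.

```python
def messy_list_parallel(a: list[int]) -> list[int]:
    """One synchronous parallel step: every processor reads, then every processor writes."""
    p = len(a)
    snapshot = list(a)  # values seen by the processors before any write
    result = list(a)
    for i in range(p):
        x = snapshot[i]
        if 1 <= x <= p:
            result[x - 1] = x  # A[X] = X, using 1-based X
    return result


def messy_list_naive_sequential(a: list[int]) -> list[int]:
    """In-place sequential loop with no saved copy; it can overwrite unread values."""
    a = list(a)
    for i in range(len(a)):
        x = a[i]
        if 1 <= x <= len(a):
            a[x - 1] = x
    return a


if __name__ == "__main__":
    original = [4, 1, -2, 3]
    print(messy_list_parallel(original))          # [1, 1, 3, 4]
    print(messy_list_naive_sequential(original))  # [1, 1, -2, 4]
```

On the example above, the naive loop overwrites A[4] = 3 before reading it, so the value 3 is never placed in A[3].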

Superlinear Speedup in Practice


Superlinear speedups are found in practice

All examples that I know of are due to the parallel code using additional resources (for example, more total cache) beyond those used by the sequential code

Cache effects are a common cause of superlinear speedups

The following example is from comp.parallel Jan. 1995

Newsgroups: comp.parallel
Subject: Help - explain superlinear speedup?
From: slater@nuc.berkeley.edu (Steve Slater)
I have a program which has superlinear speedup and I can't explain it. Does anyone have any ideas. Here is the summary.
I am using a code which passes messages using p4, on 4 Sparc 2's running SunOS 4.1.3. The code solves coupled matrix equations, much like a heat equation. The processors are each assigned a geometrical region like:
	------------------
	|        |       |
	|   A    |   B   |
	|________|_______|
	|        |       |
	|   C    |   D   |
	|________|_______|
Each job/process analyzes only one region of A through D.
What happens in the code (not really important to my problem though) is a matrix is solved for each A-D, then the boundary conditions are passed between each region (outgoing heat current = incoming for each neighbor) and the matrix equations are solved locally again. The process repeats until the solution converges.
With p4, I first run 4 processes (4 regions) on only 1 machine. The messages are passing through sockets. Then I run on 2 machines, each having 2 processes (2 regions), and finally on 4 machines, each having 1 process (region).
You would expect less than linear speedup since with only one machine, no messages are sent over the ethernet, they are just communicated via sockets. But I get very superlinear speedup like:
1 proc:556 sec 4 unique processes on 1 machine
2 proc:204 sec 2 processes on each of 2 machines
4 proc: 38 sec 1 process on each machine
There was NO memory swapping occurring during the entire execution time. I would periodically check with ps.
Does anyone have any thoughts?

Steve Slater
slater@nuc.berkeley.edu

From: Krste Asanovic <krste@icsi.berkeley.edu>
Subject: Re: Help - explain superlinear speedup?
There are two possible cache effects. The first is that each Sparc-2 only has 64KB of unified cache. If your data set + code fits into 64KB you'll see a marked improvement over the case when it doesn't.
The second is the limited TLB size. I don't have the Sparc-2 MMU numbers handy, but I think it supported 64 entries for 4KB pages, i.e. 256KB mapped simultaneously at most. If a single process's code + data fits into the TLB, you'll see a marked difference.
These differences are exaggerated if your code makes repeated sweeps over these data regions.
--
Krste Asanovic email: krste@icsi.berkeley.edu


Newsgroups: comp.parallel
From: David Bader <dbader@glue.umd.edu>
Subject: Re: Help - explain superlinear speedup?
Superlinear speedup is commonly attributable to caching effects. When you split the problem onto multiple processors, the subproblems are obviously a fraction of the original problem size. With the smaller problem size, you are most likely getting a higher cache hit rate, and the result, even after considering the communications time, is still better than the time on a single processor with more cache misses.

-david


Newsgroups: comp.parallel
From: mtaylor@easynet.com (Michael A. Taylor)
Subject: Re: Help - explain superlinear speedup?
You are reducing the number of process switches and also the cache flushing that occurs with each process switch. Therefore you are executing fewer instructions (less switches) and they execute faster (fewer cache faults).
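
The speedups implied by the timings in Steve Slater's post can be computed directly. A small sketch, taking the 1-machine run (4 processes on one Sparc 2) as the baseline; note that this baseline is not the best sequential algorithm used in the definition of Sp(N). The variable and function names are illustrative.

```python
# Seconds reported in the post, keyed by number of machines
timings = {1: 556.0, 2: 204.0, 4: 38.0}


def speedup_table(times: dict[int, float]) -> list[tuple[int, float, float]]:
    """Return (machines, speedup, efficiency) relative to the 1-machine run."""
    base = times[1]
    return [(m, base / t, base / t / m) for m, t in sorted(times.items())]


if __name__ == "__main__":
    for machines, speedup, efficiency in speedup_table(timings):
        print(f"{machines} machines: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")
```

The 4-machine run is about 14.6 times faster than the 1-machine run, well above the linear speedup of 4.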

----------