CS 662: Theory of Parallel Algorithms
Spring Semester 1995
Introduction
San Diego State University -- This page last updated January 29, 1995
- Introduction
- What is Parallelism in Computers?
- Common Terms for Parallelism
- Parallel Programming
- Parallel Machines
- Type and number of processors
- Processor interconnections
- Global Control
- Synchronous vs. Asynchronous Operation
- Parallel Algorithms-Models of Computation
- PRAM
- Metrics
Parallelism means a digital computer performing more than one task at the same
time
Examples
IO chips
- Most computers contain special circuits for IO devices which allow some
tasks to be performed in parallel
Pipelining of Instructions
- Some CPUs pipeline the execution of instructions
Multiple Arithmetic units (AU)
- Some CPUs contain multiple AUs so they can perform more than one arithmetic
operation at the same time
We are interested in parallelism involving multiple CPUs, not just the parallelism within a single CPU described above
Concurrent Processing
- A program is divided into multiple processes which are run on a single processor
- The processes are time sliced on the single processor
Distributed Processing
- A program is divided into multiple processes which are run on multiple distinct machines
- The machines are usually connected by a LAN
- The machines used are typically workstations running multiple programs
Parallel Processing
- A program is divided into multiple processes which are run on multiple
processors
- The processors normally:
- are in one machine
- execute one program at a time
- have high speed communications between them
Some Timing Results
Time to send 100 integers between 2 processes
Concurrent:  on Saturn              0.029 milliseconds
Distributed: Saturn & Rohan         20.0 milliseconds
Distributed: Saturn & East Coast    41.4 milliseconds
Intel Paragon Message Passing Time
             Old             New
Latency      100 microsec    40 microsec
Throughput   30 MByte/sec    200 MByte/sec
Issues in parallel programming not found in sequential programming
- Task decomposition, allocation and sequencing
- Breaking down the problem into smaller tasks (processes) that can be run in parallel
- Allocating the parallel tasks to different processors
- Sequencing the tasks in the proper order
- Using the processors efficiently
- Communication of interim results between processors
- The goal is to reduce the cost of communication between processors. Task
decomposition and allocation affect communication costs
- Synchronization of processes
- Some processes must wait at predetermined points for results from other
processes.
- Different machine architectures
Performance Issues
- Using more nodes should
- allow a job to run faster
- allow a larger job to run in the same time
- All nodes should have the same amount of work
- Avoid having nodes idle while others are computing
- Communication bottlenecks
- Nodes spend too much time passing messages
- Too many messages are traveling on the same path
- Serial bottlenecks
- Message passing is slower than computation
- Maximize computation per message
- Avoid making nodes wait for messages
Cost of sending a message (a worked example follows the table below):
- L = latency, the time to start a message
- Tr = transmission time per byte of information
- N = number of bytes in the message
- Time = time to send the message
- Time = L + N*Tr
Values for the Paragon
                     Old                 New
L (latency)          100 microseconds    40 microseconds
Throughput (1/Tr)    30 MBytes/sec       200 MBytes/sec
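A small C sketch of the cost model above using the Paragon figures; the throughput is converted to the per-byte time Tr, and the function name and the 4-byte integer size are illustrative assumptions rather than anything from the notes.

#include <stdio.h>

/* Message cost model: Time = L + N * Tr
   L = latency (seconds), Tr = transmission time per byte (seconds/byte) */
static double message_time(double latency, double bytes_per_sec, double nbytes)
{
    double tr = 1.0 / bytes_per_sec;              /* seconds per byte */
    return latency + nbytes * tr;
}

int main(void)
{
    double n = 100 * 4;                           /* 100 four-byte integers */
    double old_t = message_time(100e-6, 30e6, n); /* old Paragon: 100 us, 30 MB/s */
    double new_t = message_time(40e-6, 200e6, n); /* new Paragon: 40 us, 200 MB/s */
    printf("old: %.1f microseconds\n", old_t * 1e6);
    printf("new: %.1f microseconds\n", new_t * 1e6);
    return 0;
}

For a 400-byte message both times are dominated by the latency term, which is why the notes stress maximizing computation per message.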
Parameters used to describe or classify parallel computers:
- Type and number of processors
- Processor interconnections
- Synchronous vs. asynchronous operation
The Extremes
Massively parallel
- Computer systems with thousands of processors
- Parallel Supercomputers
- CM-5, Intel Paragon
Coarse-grained parallelism
- A few (~10) processors per system, usually high powered
- Starting to be common in Unix workstations
Parallel computers may be loosely divided into two groups:
- Shared Memory (or Multiprocessor)
- Message Passing (or Multicomputers)
Shared Memory or Multiprocessor
Individual processors have access to one or more common shared memory modules
Examples
- Alliant, Cray series, some Sun workstations
Features
- Easy to build and program
- Limited to a small number of processors
- 20 - 30 processors for a bus-based multiprocessor
Figure: bus-based multiprocessor
Figure: multiprocessor with a processor-memory interconnection network
Message Passing or Multicomputers
Individual processors have local memory.
Processors communicate via a communication network
Examples
- Connection Machine series (CM-2, CM-5), Intel Paragon, nCube, Transputers,
Cosmic Cube
Features
- Can scale to thousands of processors
Figure: mesh communication network
The Intel Paragon is a mesh machine
Global Control: SISD, SIMD, MIMD, MISD
SISD - Single Instruction Single Data
- Sequential Computer
MISD - Multiple Instruction Single Data
- Each processor can do different things to the same input
- Example: Detect shapes in an image.
- Each processor searches for a different shape in the input image
SIMD - Single Instruction Multiple Data
- Each processor does the same thing to different data
- Requires global synchronization mechanism
- Each processor knows its id number
- Not all shared memory computers are SIMD
- Example: Adding two arrays A and B
- For each K, in parallel do:
- Read A[K] and B[K]
- Write A[K] + B[K] into C[K]
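A minimal C sketch of this data-parallel addition; OpenMP's parallel for stands in for the SIMD processors, and the array size and contents are made up for illustration.

#include <stdio.h>

#define N 8

int main(void)
{
    int A[N], B[N], C[N];
    int k;

    for (k = 0; k < N; k++) {          /* set up some sample data */
        A[k] = k;
        B[k] = 10 * k;
    }

    /* Each iteration plays the role of one SIMD processor K:
       read A[K] and B[K], write A[K] + B[K] into C[K]. */
    #pragma omp parallel for
    for (k = 0; k < N; k++)
        C[k] = A[k] + B[k];

    for (k = 0; k < N; k++)
        printf("C[%d] = %d\n", k, C[k]);
    return 0;
}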
MIMD - Multiple Instruction Multiple Data
- Each processor can run different programs on different data
- MIMD can be shared memory or message passing
- Can simulate SIMD or SISD if there is a global synchronization mechanism
- Communication is the main issue
- Harder to program than SIMD
Synchronous vs. Asynchronous Operation
Does the computer have a common global clock to synchronize the operation of
the different processors?
Both shared memory and message passing computers can be synchronous
Parallel Random-Access Machine (PRAM)
Most commonly used model for expressing parallel algorithms
- A set of processors shares a common global memory
- Each processor may have local memory to store local results
- Model is MIMD, but algorithms tend to be SIMD
- Each active processor executes the same instruction
- Exclusive Read Exclusive Write (EREW) - no simultaneous access to a single shared-memory location
- Concurrent Read Exclusive Write (CREW) - simultaneous reads of a single shared-memory location are allowed
- Concurrent Read Concurrent Write (CRCW) - simultaneous reads and writes of a single shared-memory location are allowed
- Handling concurrent writes
- Common CRCW PRAM allows concurrent writes only when all processors are writing the same value
- Arbitrary CRCW PRAM allows an arbitrary processor to succeed at writing to the memory location
- Priority CRCW PRAM allows the processor with the minimum index to succeed
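A small sequential C simulation (names and setup ours) of how the Priority and Common rules resolve a set of simultaneous writes to one shared cell; the Arbitrary rule would simply pick any one of the writers.

#include <stdio.h>

/* One simulated step in which several PRAM processors try to write to the
   same shared cell.  attempts[i] is the value processor i wants to write;
   writing[i] is non-zero if processor i writes this step. */

/* Priority CRCW: the writing processor with the smallest index wins. */
static int priority_write(const int *attempts, const int *writing, int n, int old)
{
    int i;
    for (i = 0; i < n; i++)
        if (writing[i])
            return attempts[i];
    return old;                      /* nobody wrote: cell keeps its value */
}

/* Common CRCW: writes succeed only if all writers agree on the value. */
static int common_write(const int *attempts, const int *writing, int n, int old)
{
    int i, seen = 0, val = old;
    for (i = 0; i < n; i++) {
        if (!writing[i]) continue;
        if (!seen) { val = attempts[i]; seen = 1; }
        else if (attempts[i] != val) return old;   /* disagreement: illegal step */
    }
    return val;
}

int main(void)
{
    int attempts[4] = { 7, 7, 7, 7 };
    int writing[4]  = { 0, 1, 1, 0 };
    printf("priority: %d\n", priority_write(attempts, writing, 4, -1));
    printf("common:   %d\n", common_write(attempts, writing, 4, -1));
    return 0;
}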
Examples
read(X, Y) copy X from shared memory to local memory Y
write(U, V) copy U from local memory to shared memory V
Testing Primeness - CRCW
All processors writing concurrently must write the same value (the Common CRCW model)
Input:
- x an integer in shared memory
- k = processor id in local memory
Output:
- Result in shared memory
for k = 1 to n do in parallel      (one processor per k; take n = x)
write(1, Result)
read(x, test)
if (1 < k and k < test and k divides test) then write(0, Result)
end for
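A sequential C simulation of this Common-CRCW algorithm, with one loop iteration standing in for each processor that tests a nontrivial divisor; the function name is ours.

#include <stdio.h>

/* Sequential simulation of the Common-CRCW primality test:
   phase 1: every processor writes 1 to Result (all agree);
   phase 2: every processor k with k | x writes 0 (all agree). */
static int crcw_is_prime(int x)
{
    int result = 1;                       /* phase 1: concurrent write of 1 */
    int k;
    for (k = 2; k <= x - 1; k++)          /* one iteration per simulated processor */
        if (x % k == 0)
            result = 0;                   /* phase 2: concurrent write of 0 */
    return result;
}

int main(void)
{
    int x;
    for (x = 2; x <= 20; x++)
        printf("%2d %s prime\n", x, crcw_is_prime(x) ? "is" : "is not");
    return 0;
}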
Broadcasting - EREW
Input:
- D - an item in shared memory to be broadcast to all processors
- A[n] - an array in shared memory
- n - number of processors
- k = processor id
Result:
- All processors will have a copy of D in local memory
if k = 1 then
read(D, d)
write(d, A[1])
end if
for j = 0 to log(n) - 1 do
for k = 2^j + 1 to 2^(j+1) do in parallel
read(A[k - 2^j], d)
write(d, A[k])
end for
end for
Time complexity is Θ(log n)
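A sequential C simulation of the doubling broadcast, with the inner loop standing in for the processors that are active in round j (stride = 2^j); N and the broadcast value are made up for illustration.

#include <stdio.h>

#define N 8                        /* number of processors; a power of 2 */

int main(void)
{
    int A[N + 1];                  /* shared array, 1-indexed as in the notes */
    int D = 42;                    /* the item to broadcast */
    int k, stride;

    A[1] = D;                      /* processor 1 copies D into A[1] */

    /* log2(N) rounds; in round j the processors 2^j+1 .. 2^(j+1) each read a
       distinct cell written in an earlier round, so no concurrent access. */
    for (stride = 1; stride < N; stride *= 2)        /* stride = 2^j */
        for (k = stride + 1; k <= 2 * stride; k++)   /* these run in parallel */
            A[k] = A[k - stride];

    for (k = 1; k <= N; k++)
        printf("A[%d] = %d\n", k, A[k]);
    return 0;
}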
Some definitions:
- f(n) = O(g(n)) if there are constants c > 0 and n0 such that f(n) <= c*g(n) for all n >= n0
- f(n) = Ω(g(n)) if there are constants c > 0 and n0 such that f(n) >= c*g(n) for all n >= n0
- f(n) = Θ(g(n)) if f(n) is both O(g(n)) and Ω(g(n))
Strictly speaking, we should define O(g) as a set of functions
- O(g) = { f : there exist c > 0 and n0 such that f(n) <= c*g(n) for all n >= n0 }
and write
- f ∈ O(g)
instead of
- f = O(g)
Review of Complexity
Counting operations - Unit cost model
Functions versus orders of magnitude
Common orders of magnitude
Counting operations and communication costs
Upper bounds, lower bounds
Worst case, average case
Important Measures
T*(N) = time required by the best sequential algorithm to solve a problem of size N
Tp(N) = time required by the parallel algorithm using p
processors to solve a problem of size N
Sp(N) = T*(N) / Tp(N), the speedup of the parallel algorithm
Ep(N) = Sp(N) / p, the efficiency of the parallel algorithm
Cp(N) = p * Tp(N), the cost of the parallel algorithm
= the number of total operations done by the parallel algorithm
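A small C sketch that plugs made-up values of T*(N), Tp(N) and p into the measures above; the numbers are illustrative only.

#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only: T*(N) and Tp(N) would come from measurement
       or analysis of a real algorithm. */
    double t_seq = 100.0;          /* T*(N): best sequential time */
    double t_par = 15.0;           /* Tp(N): parallel time        */
    double p     = 8.0;            /* number of processors        */

    double speedup    = t_seq / t_par;          /* Sp(N) = T*(N) / Tp(N) */
    double efficiency = speedup / p;            /* Ep(N) = Sp(N) / p     */
    double cost       = p * t_par;              /* Cp(N) = p * Tp(N)     */

    printf("speedup    = %.2f\n", speedup);
    printf("efficiency = %.2f\n", efficiency);
    printf("cost       = %.2f\n", cost);
    return 0;
}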
Example of Measures
Adding N integers with P processors
A[1..N] is an array in shared memory holding the values to add
Algorithm 1: P = N/2 processors, N a power of 2
J = N/2
while J >= 1 do
for K = 1 to J do in Parallel
Processor K: A[K] = A[2K-1]+A[2K]
end for
J = J/2
end while
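A sequential C simulation of Algorithm 1; the array contents are made up, and each pass of the inner loop stands in for the parallel step of the J active processors.

#include <stdio.h>

#define N 16                       /* a power of 2 */

int main(void)
{
    int A[N + 1];                  /* 1-indexed shared array */
    int j, k;

    for (k = 1; k <= N; k++)       /* sample data: A[k] = k, sum = N(N+1)/2 */
        A[k] = k;

    /* Tree reduction: in each round the first J (virtual) processors add
       a pair of elements; log2(N) rounds in all. */
    for (j = N / 2; j >= 1; j /= 2)
        for (k = 1; k <= j; k++)              /* these run in parallel */
            A[k] = A[2 * k - 1] + A[2 * k];

    printf("sum = %d (expected %d)\n", A[1], N * (N + 1) / 2);
    return 0;
}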
Algorithm 2: P processors, where P divides N and P is a power of 2
for I = 1 to P do in Parallel
Processor I: A[I] = A[{(I-1)*N/P}+1] +
A[{(I-1)*N/P}+2] + ... +
A[(I)*N/P]
end for
(each processor accumulates its partial sum in local memory and then writes it to A[I], so no processor reads a cell after another processor has overwritten it)
J = P/2
while J >= 1 do
for K = 1 to J do in Parallel
Processor K: A[K] = A[2K-1]+A[2K]
end for
J = J/2
end while
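A sequential C simulation of Algorithm 2 under the same assumptions (P divides N, P a power of 2); each processor's partial sum is accumulated locally before being written back to A[I], as noted above.

#include <stdio.h>

#define N 16
#define P 4                        /* P divides N, P a power of 2 */

int main(void)
{
    int A[N + 1];                  /* 1-indexed shared array */
    int partial[P + 1];            /* each processor's local partial sum */
    int i, j, k;

    for (k = 1; k <= N; k++)       /* sample data: sum = N(N+1)/2 */
        A[k] = k;

    /* Phase 1: processor I sums its block of N/P elements locally,
       then writes the partial sum to A[I]. */
    for (i = 1; i <= P; i++) {     /* these run in parallel */
        int s = 0;
        for (k = (i - 1) * (N / P) + 1; k <= i * (N / P); k++)
            s += A[k];
        partial[i] = s;
    }
    for (i = 1; i <= P; i++)
        A[i] = partial[i];

    /* Phase 2: tree reduction over the P partial sums, as in Algorithm 1. */
    for (j = P / 2; j >= 1; j /= 2)
        for (k = 1; k <= j; k++)
            A[k] = A[2 * k - 1] + A[2 * k];

    printf("sum = %d (expected %d)\n", A[1], N * (N + 1) / 2);
    return 0;
}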