Parallel external memory


In computer science, a parallel external memory (PEM) model is a cache-aware, external-memory abstract machine.[1] It is the parallel-computing analogue of the single-processor external memory (EM) model and, in a similar way, the cache-aware analogue of the parallel random-access machine (PRAM). The PEM model consists of a number of processors, together with their respective private caches and a shared main memory.

Model

Definition

The PEM model[1] is a combination of the EM model and the PRAM model. It is a computation model consisting of [math]\displaystyle{ P }[/math] processors and a two-level memory hierarchy: a large external memory (main memory) of size [math]\displaystyle{ N }[/math] and [math]\displaystyle{ P }[/math] small internal memories (caches). The processors share the main memory, while each cache is exclusive to a single processor; a processor cannot access another processor's cache. Each cache has size [math]\displaystyle{ M }[/math] and is partitioned into blocks of size [math]\displaystyle{ B }[/math]. The processors can only perform operations on data that reside in their caches; data is transferred between the main memory and the caches in blocks of size [math]\displaystyle{ B }[/math].
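
For concreteness, the parameters of the model can be collected in a few lines of Python (the class and method names here are illustrative, not part of the model's definition):

from dataclasses import dataclass

@dataclass
class PEM:
    P: int  # number of processors, each with a private cache
    N: int  # size of the shared main memory (in elements)
    M: int  # size of each private cache (in elements)
    B: int  # block size for transfers between main memory and a cache

    def blocks_per_cache(self) -> int:
        # each cache holds M/B blocks at a time
        return self.M // self.B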

I/O complexity

The complexity measure of the PEM model is the I/O complexity,[1] which counts the number of parallel block transfers between the main memory and the caches. During a parallel block transfer, each processor can transfer one block. So if [math]\displaystyle{ P }[/math] processors each load a data block of size [math]\displaystyle{ B }[/math] from the main memory into their caches in parallel, this counts as an I/O complexity of [math]\displaystyle{ O(1) }[/math], not [math]\displaystyle{ O(P) }[/math]. A program in the PEM model should minimize the data transfer between main memory and caches and operate as much as possible on the data in the caches.
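
The following Python sketch (illustrative only, not from the source) makes this accounting concrete: a round in which every processor moves at most one block counts once, so the cost is driven by the processor that performs the most transfers.

def parallel_io_rounds(transfers_per_processor):
    # transfers_per_processor[i] = block transfers issued by processor i.
    # In the PEM cost model, rounds are the maximum over processors, not
    # the sum: P processors each loading one block costs O(1), not O(P).
    return max(transfers_per_processor, default=0)

print(parallel_io_rounds([1] * 8))    # 1: eight processors, one round
print(parallel_io_rounds([1, 1, 5]))  # 5: one straggler dominates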

Read/write conflicts

In the PEM model, there is no direct communication network between the P processors; they have to communicate indirectly over the main memory. If multiple processors try to access the same block in main memory concurrently, read/write conflicts[1] occur. As in the PRAM model, three different variations of this problem are considered:

  • Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
  • Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
  • Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.

The following two algorithms[1] resolve CREW and EREW conflicts when [math]\displaystyle{ P \leq B }[/math] processors write to the same block simultaneously. A first approach is to serialize the write operations: one processor after the other writes to the block, resulting in a total of [math]\displaystyle{ P }[/math] parallel block transfers. A second approach needs only [math]\displaystyle{ O(\log(P)) }[/math] parallel block transfers and one additional block per processor. The main idea is to schedule the write operations in a binary-tree fashion and gradually combine the data into a single block: in the first round, [math]\displaystyle{ P }[/math] processors combine their blocks into [math]\displaystyle{ P/2 }[/math] blocks; then [math]\displaystyle{ P/2 }[/math] processors combine those into [math]\displaystyle{ P/4 }[/math] blocks, and so on until all the data is combined in one block, as the sketch below illustrates.
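
A minimal sequential simulation of the binary-tree combining, written in Python (the helper is illustrative; each iteration of the loop corresponds to one parallel round):

def combine_blocks(blocks):
    # Merge P per-processor blocks into one; the block count halves each
    # round, so P blocks are combined after ceil(log2(P)) rounds.
    rounds = 0
    while len(blocks) > 1:
        # all pairs merge concurrently; this loop body is one round
        blocks = [blocks[i] + blocks[i + 1] if i + 1 < len(blocks) else blocks[i]
                  for i in range(0, len(blocks), 2)]
        rounds += 1
    return (blocks[0] if blocks else []), rounds

data, rounds = combine_blocks([[i] for i in range(8)])
print(data, rounds)  # [0, 1, 2, 3, 4, 5, 6, 7] 3, i.e. log2(8) rounds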

Comparison to other models

Model                                   Multi-core   Cache-aware
Random-access machine (RAM)             No           No
Parallel random-access machine (PRAM)   Yes          No
External memory (EM)                    No           Yes
Parallel external memory (PEM)          Yes          Yes

Examples

Multiway partitioning

Let [math]\displaystyle{ M=\{m_1,...,m_{d-1}\} }[/math] be a vector of d-1 pivots sorted in increasing order, and let A be an unordered set of N elements. A d-way partition[1] of A is a set [math]\displaystyle{ \Pi=\{A_1,...,A_d\} }[/math], where [math]\displaystyle{ \cup_{i=1}^d A_i = A }[/math] and [math]\displaystyle{ A_i\cap A_j=\emptyset }[/math] for [math]\displaystyle{ 1\leq i\lt j\leq d }[/math]. [math]\displaystyle{ A_i }[/math] is called the i-th bucket; every element of [math]\displaystyle{ A_i }[/math] is greater than [math]\displaystyle{ m_{i-1} }[/math] and smaller than [math]\displaystyle{ m_{i} }[/math]. In the following algorithm,[1] the input is partitioned into N/P-sized contiguous segments [math]\displaystyle{ S_1,...,S_P }[/math] in main memory, and processor i primarily works on the segment [math]\displaystyle{ S_i }[/math]. The multiway partitioning algorithm (PEM_DIST_SORT[1]) uses a PEM prefix sum algorithm,[1] which simulates an optimal PRAM prefix sum algorithm, to calculate the prefix sum with the optimal [math]\displaystyle{ O\left(\frac{N}{PB} + \log P\right) }[/math] I/O complexity.

// Compute parallelly a d-way partition on the data segments [math]\displaystyle{ S_i }[/math]
for each processor i in parallel do
    Read the vector of pivots M into the cache.
    Partition [math]\displaystyle{ S_i }[/math] into d buckets, and let the vector [math]\displaystyle{ M_i=\{j_1^i,...,j_d^i\} }[/math] record the number of items in each bucket.
end for

Run PEM prefix sum on the set of vectors [math]\displaystyle{ \{M_1,...,M_P\} }[/math] simultaneously.

// Use the prefix sum vector to compute the final partition
for each processor i in parallel do
    Write the elements of [math]\displaystyle{ S_i }[/math] into memory locations offset appropriately by [math]\displaystyle{ M_{i-1} }[/math] and [math]\displaystyle{ M_{i} }[/math].
end for

Using the prefix sums stored in [math]\displaystyle{ M_P }[/math], the last processor P calculates the vector B of bucket sizes and returns it.

If the vector of [math]\displaystyle{ d=O\left(\frac{M}{B}\right) }[/math] pivots M and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with an I/O complexity of [math]\displaystyle{ O\left(\frac{N}{PB} + \left\lceil \frac{d}{B} \right\rceil \log P + d\log B\right) }[/math]. The contents of the final buckets must also be located in contiguous memory.
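
Stripped of its parallelism, the partitioning step reduces to binary-searching each element against the pivot vector. The following Python sketch is a sequential illustration only; the PEM algorithm distributes this work over P processors and uses the prefix sum to compute each processor's output offsets:

from bisect import bisect_left

def multiway_partition(A, pivots):
    # bucket i receives the elements x with pivots[i-1] < x <= pivots[i];
    # placing ties exactly at a pivot is a convention choice
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in A:
        buckets[bisect_left(pivots, x)].append(x)
    return buckets

print(multiway_partition([5, 1, 9, 3, 7, 4], [4, 8]))
# [[1, 3, 4], [5, 7], [9]]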

Selection

The selection problem is about finding the k-th smallest item in an unordered list A of size N. The following code[1] makes use of PRAMSORT, a PRAM-optimal sorting algorithm that runs in [math]\displaystyle{ O(\log N) }[/math], and SELECT, a cache-optimal single-processor selection algorithm.

if [math]\displaystyle{ N \leq P }[/math] then 
    [math]\displaystyle{ \texttt{PRAMSORT}(A,P) }[/math]
    return [math]\displaystyle{ A[k] }[/math]
end if 

// Find median of each [math]\displaystyle{ S_i }[/math]
for each processor i in parallel do 
    [math]\displaystyle{ m_i = \texttt{SELECT}(S_i, \frac{N}{2P})  }[/math]
end for 

// Sort medians
[math]\displaystyle{ \texttt{PRAMSORT}(\lbrace m_1, \dots, m_P \rbrace, P) }[/math]

// Partition around median of medians
[math]\displaystyle{ t = \texttt{PEMPARTITION}(A, m_{P/2},P) }[/math]

if [math]\displaystyle{ k \leq t }[/math] then 
    return [math]\displaystyle{ \texttt{PEMSELECT}(A[1:t], P, k) }[/math]
else 
    return [math]\displaystyle{ \texttt{PEMSELECT}(A[t+1:N], P, k-t) }[/math]
end if

Under the assumption that the input is stored in contiguous memory, PEMSELECT has an I/O complexity of:

[math]\displaystyle{ O\left(\frac{N}{PB} + \log (PB) \cdot \log(\frac{N}{P})\right) }[/math]
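
A sequential toy version of this recursion fits in a few lines of Python; statistics.median_low and plain list filtering stand in for the parallel SELECT and PEMPARTITION calls, and distinct keys are assumed so each recursive call strictly shrinks:

from statistics import median_low

def pem_select(A, k, P=4):
    # return the k-th smallest element of A (1-indexed)
    if len(A) <= P:                       # base case: sort directly
        return sorted(A)[k - 1]
    m = -(-len(A) // P)                   # ceil(N / P): segment size
    segments = [A[i * m:(i + 1) * m] for i in range(P)]
    medians = sorted(median_low(s) for s in segments if s)
    pivot = medians[len(medians) // 2]    # median of the segment medians
    lower = [x for x in A if x <= pivot]  # the partition step
    t = len(lower)
    if k <= t:
        return pem_select(lower, k, P)
    return pem_select([x for x in A if x > pivot], k - t, P)

print(pem_select(list(range(100, 0, -1)), 7))  # prints 7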

Distribution sort

Distribution sort partitions an input list A of size N into d disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.

If [math]\displaystyle{ P = 1 }[/math] the task is delegated to a cache-optimal single-processor sorting algorithm.

Otherwise the following algorithm[1] is used:

// Sample [math]\displaystyle{ \tfrac{4N}{\sqrt{d}} }[/math] elements from A
for each processor i in parallel do
    if [math]\displaystyle{ M \lt  |S_i| }[/math] then
        [math]\displaystyle{ d = M/B }[/math]
        Load [math]\displaystyle{ S_i }[/math] in M-sized pages and sort pages individually
    else
        [math]\displaystyle{ d = |S_i| }[/math]
        Load and sort [math]\displaystyle{ S_i }[/math] as single page
    end if
    Pick every [math]\displaystyle{ \sqrt{d}/4 }[/math]'th element from each sorted memory page into contiguous vector [math]\displaystyle{ R^i }[/math] of samples
end for 

in parallel do
    Combine vectors [math]\displaystyle{ R^1 \dots R^P }[/math] into a single contiguous vector [math]\displaystyle{ \mathcal{R} }[/math]
    Make [math]\displaystyle{ \sqrt{d} }[/math] copies of [math]\displaystyle{ \mathcal{R} }[/math]: [math]\displaystyle{ \mathcal{R}_1 \dots \mathcal{R}_{\sqrt{d}} }[/math]
end do

// Find [math]\displaystyle{ \sqrt{d} }[/math] pivots [math]\displaystyle{ \mathcal{M}[j] }[/math]
for [math]\displaystyle{ j = 1 }[/math] to [math]\displaystyle{ \sqrt{d} }[/math] in parallel do
    [math]\displaystyle{ \mathcal{M}[j] = \texttt{PEMSELECT}(\mathcal{R}_j, \tfrac{P}{\sqrt{d}}, \tfrac{j \cdot 4N}{d}) }[/math]
end for

Pack pivots in contiguous array [math]\displaystyle{ \mathcal{M} }[/math]

// Partition A around pivots into buckets [math]\displaystyle{ \mathcal{B} }[/math]
[math]\displaystyle{ \mathcal{B} = \texttt{PEMMULTIPARTITION}(A[1:N],\mathcal{M},\sqrt{d},P) }[/math]

// Recursively sort buckets
for [math]\displaystyle{ j = 1 }[/math] to [math]\displaystyle{ \sqrt{d} + 1 }[/math] in parallel do
    recursively call [math]\displaystyle{ \texttt{PEMDISTSORT} }[/math] on bucket j of size [math]\displaystyle{ \mathcal{B}[j] }[/math]
    using [math]\displaystyle{ O \left( \left \lceil \tfrac{\mathcal{B}[j]}{N / P} \right \rceil \right) }[/math] processors responsible for elements in bucket j
end for

The I/O complexity of PEMDISTSORT is:

[math]\displaystyle{ O \left( \left \lceil \frac{N}{PB} \right \rceil \left ( \log_d P + \log_{M/B} \frac{N}{PB} \right ) + f(N,P,d) \cdot \log_d P \right) }[/math]

where

[math]\displaystyle{ f(N,P,d) = O \left ( \log \frac{PB}{\sqrt{d}} \log \frac{N}{P} + \left \lceil \frac{\sqrt{d}}{B} \log P + \sqrt{d} \log B \right \rceil \right ) }[/math]

If the number of processors is chosen such that [math]\displaystyle{ f(N,P,d) = O\left ( \left \lceil \tfrac{N}{PB} \right \rceil \right ) }[/math] and [math]\displaystyle{ M \lt B^{O(1)} }[/math], the I/O complexity is then:

[math]\displaystyle{ O \left ( \frac{N}{PB} \log_{M/B} \frac{N}{B} \right ) }[/math]
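
As a purely sequential illustration of this recursion (not the PEM algorithm itself), the following Python sketch samples the input, picks [math]\displaystyle{ \sqrt{d} }[/math] pivots from the samples at the ranks used above, partitions into buckets, and recurses; a fallback guards against degenerate pivots caused by heavily duplicated keys:

import math
from bisect import bisect_right

def dist_sort(A, d=16, cutoff=8):
    if len(A) <= cutoff:
        return sorted(A)          # stands in for the single-processor sort
    step = max(1, round(math.sqrt(d) / 4))
    samples = sorted(A[::step])   # roughly 4N / sqrt(d) samples
    k = int(math.sqrt(d))
    # pivot j sits at rank ~ j * |samples| / (k + 1) among the samples
    pivots = [samples[j * len(samples) // (k + 1)] for j in range(1, k + 1)]
    buckets = [[] for _ in range(k + 1)]
    for x in A:
        buckets[bisect_right(pivots, x)].append(x)
    if max(len(b) for b in buckets) < len(A):
        return [y for b in buckets for y in dist_sort(b, d, cutoff)]
    return sorted(A)              # degenerate pivots: fall back

print(dist_sort([9, 2, 7, 4, 8, 1, 3, 6, 5, 0] * 3))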

Other PEM algorithms

PEM Algorithm I/O complexity Constraints
Mergesort[1] [math]\displaystyle{ O\left(\frac{N}{PB} \log_{\frac{M}{B}} \frac{N}{B}\right) = \textrm{sort}_P(N) }[/math] [math]\displaystyle{ P \leq \frac{N}{B^2}, M = B^{O(1)} }[/math]
List ranking[2] [math]\displaystyle{ O \left ( \textrm{sort}_P(N) \right ) }[/math] [math]\displaystyle{ P \leq \frac{N/B^2}{\log B \cdot \log^{O(1)} N}, M = B^{O(1)} }[/math]
Euler tour[2] [math]\displaystyle{ O \left ( \textrm{sort}_P(N) \right ) }[/math] [math]\displaystyle{ P \leq \frac{N}{B^2}, M = B^{O(1)} }[/math]
Expression tree evaluation[2] [math]\displaystyle{ O \left ( \textrm{sort}_P(N) \right ) }[/math] [math]\displaystyle{ P \leq \frac{N}{B^2 \log B \cdot \log^{O(1)}N}, M = B^{O(1)} }[/math]
Finding a minimum spanning tree (MST)[2] [math]\displaystyle{ O \left(\textrm{sort}_P(|V|) + \textrm{sort}_P(|E|) \log \tfrac{|V|}{PB} \right) }[/math] [math]\displaystyle{ P \leq \frac{|V|+|E|}{B^2 \log B \cdot \log^{O(1)} N}, M = B^{O(1)} }[/math]

Here [math]\displaystyle{ \textrm{sort}_P(N) }[/math] denotes the I/O complexity of sorting N items with P processors in the PEM model.

References

  1. Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures (SPAA). New York: ACM Press. pp. 197–206. doi:10.1145/1378533.1378573. ISBN 9781595939739.
  2. Arge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms". 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE. pp. 1–11. doi:10.1109/ipdps.2010.5470440. ISBN 9781424464425.