The RAM model

Academic year: 2022

(1)

6. Realistic Computer Models

AEA 2021/2022

Based on

Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice - ch. 5

(2)

Content

Introduction

Memory Hierarchy Models

Fundamental Techniques

External Memory Data Structures

Cache-aware Optimization

Cache-Oblivious Algorithms and Data Structures

(3)

The RAM model of computation

The Random-access machine (RAM) model, or the "von Neumann model of computation": a computing device attached to a storage device.

▶ Instructions are executed one after another, with no concurrent operations

▶ Every instruction takes the same amount of time

▶ Unbounded amount of available memory

▶ Memory stores words of size O(log n) bits, n the input size

▶ Any desired memory location can be accessed in unit time

▶ For numerical and geometric algorithms, it is sometimes assumed that words can represent real numbers accurately, and that exact arithmetic on arbitrary real numbers can be done in constant time

(4)

Real Architecture

A hierarchy of storage devices with different access times and storage capacities

(5)

The RAM model

Advantages:

▶ it hides the "messy" details of computer architecture from the algorithm designer

▶ it encapsulates well the comparative performance of algorithms

▶ it captures the essential behavior of computers, while being simple to work with

▶ the performance guarantees are not architecture-specific (robust)

(6)

The RAM model

Disadvantages:

▶ it fails when the input data or intermediate data structures are too large to fit in internal memory; the dominant cost then becomes the time the algorithm spends waiting for data to be brought from the hard disk into internal memory

Figure: Predicted performance of the RAM model vs. its real performance

(7)

Content

Introduction

Memory Hierarchy Models

Fundamental Techniques

External Memory Data Structures

Cache-aware Optimization

Cache-Oblivious Algorithms and Data Structures

(8)

External Memory Model (EM)

▶ a central processing unit and two levels of memory hierarchy: the internal memory has a limited size of M words; the external memory can be accessed using I/Os that move B contiguous words between the two memories

▶ the measure of performance: the number of I/Os the algorithm performs

▶ disk parallelism: with D parallel disks, in one I/O D arbitrary blocks can be accessed in parallel

(9)

Parallel Disk Model

Only one block can be accessed per disk during an I/O, rather than allowing D arbitrary blocks to be accessed.

Can be extended to allow parallel processing: P parallel identical processors, each with M/P internal memory and equipped with D/P disks.

(10)

External Memory Model

Free to choose any two levels of the memory hierarchy as internal/external:

▶ cache-aware algorithms

▶ problems with extending the EM model to caches (limited associativity and automated replacement)

(11)

Ideal Cache Model

▶ a faster level of size M

▶ data transfers in chunks of B elements

▶ the memory is managed automatically by an optimal offline cache-replacement strategy, and the cache is fully associative

(12)

Cache-Oblivious Model

▶ In practice, B and M need to be tuned for optimal performance

▶ Ideally, we want a model that captures the essence of the memory hierarchy without knowing its specifics, and is efficient on all hierarchy levels simultaneously

Cache-Oblivious Model

▶ A two-level memory hierarchy, but the algorithm does not have any knowledge of the values M and B

▶ The guarantees on I/O-efficient algorithms hold on any machine with a multi-level memory hierarchy, and also on all levels of the memory hierarchy

▶ I/O-efficient algorithms perform well on different architectures without the need for any machine-specific optimization

(13)

Content

Introduction

Memory Hierarchy Models

Fundamental Techniques

External Memory Data Structures

Cache-aware Optimization

Cache-Oblivious Algorithms and Data Structures

(14)

Fundamental Techniques

Spatial locality: data close in address space to the currently accessed item is likely to be accessed soon.

Exploiting spatial locality: since data transfer in the EM/cache-oblivious model happens in blocks of elements, each accessed block should contain as much useful information as possible.

The likelihood of referencing a resource is higher if a resource near it has just been referenced.

▶ ex: graph clustering and partitioning techniques exploit "nearness"

(15)

Fundamental Techniques

Temporal locality: an instruction/data item issued/accessed currently is likely to be issued/accessed in the near future.

Exploiting temporal locality: using the data in the internal memory for as much useful work as possible, before it is written back to the external memory.

▶ ex: divide and conquer: data is divided into chunks small enough to fit into the internal memory

(16)

Fundamental Techniques

Batching: wait before issuing an operation until enough data needs to be processed s.t. the operation’s cost is worthwhile.

Batching the operations: when performing one operation is nearly as costly as performing multiple operations of the same kind, do lazy processing: first batch a large number of operations to be done and then perform them ”in parallel”.

▶ ex: external priority queue - lazy processing of decrease-key operations after collecting them in a batch

(17)

Sorting and Scanning

sum = 0;
for i = 1 to N do
    sum = sum + A[i];

The number of I/Os required for scanning N data items:

scan(N) = Θ(N/B).

How to obtain the O(N/B) upper bound for scanning? Bring B contiguous elements into internal memory using a single I/O; each of the B subsequent accesses is then a simple memory access, rather than an expensive disk I/O.

(18)

Sorting and Scanning

The I/O complexity of sorting n elements:

sort(n) = Θ((n/B) · log_{M/B}(n/B)).

Merge-sort: in the run formation phase, the input data is partitioned into sorted sequences ("runs"); in the merging phase, the runs are merged until one sorted run remains.

▶ The external memory sorting algorithm of Aggarwal & Vitter '88: the 1st phase produces sorted runs of M elements and the 2nd phase does M/B-way merges, leading to O((n/B) · log_{M/B}(n/B)) I/Os.

▶ In the cache-oblivious setting, funnel sort and lazy funnelsort lead to sorting algorithms with a similar I/O complexity.

(19)

External Merge-sort

5 GB of data (files on disk), 1 GB RAM

▶ divide into 5 files, 1 GB each; load each file in RAM, sort it, and write it back to disk (F1 sorted, ..., F5 sorted)

▶ load in RAM the first 150 MB from each sorted file Fi sorted → 250 MB free; do a 5-way merge → write the result to disk
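The two phases above can be sketched in a few lines, with Python lists standing in for disk files and `mem` for the internal-memory capacity in elements. The function name `external_merge_sort` is ours; a real implementation would stream runs from disk rather than hold them in lists.

```python
import heapq

def external_merge_sort(data, mem):
    """Two-phase external merge-sort sketch.

    Phase 1 (run formation): sort mem-sized chunks, 'writing' each sorted
    run back to disk. Phase 2 (merging): k-way merge of all runs; here a
    single merge pass suffices, as in the 5 GB / 1 GB example."""
    runs = [sorted(data[i:i + mem]) for i in range(0, len(data), mem)]
    return list(heapq.merge(*runs))

out = external_merge_sort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0], mem=3)
```

`heapq.merge` consumes only the head of each run at a time, mirroring the per-run read buffers of the slide's example.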

(20)

Sorting and Scanning

scan(n) < sort(n) ≪ n, for all practical values of B, M and n on large data sets.

Reading and writing data in sequential order, or sorting the data to obtain a requisite layout on the disk, is less expensive than accessing data at random.

(21)

Remove duplicates (1)

▶ Read one page into memory from the input file

for i = 1 to numPages(A) do
    p ← read the next page from A;
    using m − 1 pages as a read buffer, read pages i + 1, ..., numPages(A) and clean p from duplicates seen in these pages;
    write the remaining records of p to the end of the output (beginning of A)

▶ Θ(n²) I/O operations, n the number of pages in the input file

(22)

Remove duplicates (2)

▶ Read m − 1 pages into memory, use 1 page as a read buffer

for i = 1 to ⌈numPages(A)/(m − 1)⌉ do
    B ← read the next m − 1 pages from A;
    using 1 page as a read buffer, read pages i · (m − 1) + 1, ..., numPages(A) and clean B from duplicates seen in these pages;
    write the remaining records of B to the end of the output (beginning of A)

▶ Θ(n²/m) I/O operations
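Variant (2) can be simulated to see where the Θ(n²/m) count comes from: each group of m − 1 pages triggers one scan over the remaining pages. This is an illustrative sketch under our own conventions (pages are lists of records, a record survives only if no copy appears in a later page, and `remove_duplicates` is our name):

```python
import math

def remove_duplicates(pages, m):
    """Variant (2): load m-1 pages at a time, scan the rest with a 1-page
    read buffer, and count simulated I/Os. Returns (output records, I/Os)."""
    n = len(pages)
    out, ios = [], 0
    step = m - 1
    for i in range(0, n, step):
        group = [rec for p in pages[i:i + step] for rec in p]
        ios += len(pages[i:i + step])           # read up to m-1 pages
        seen_later = set()
        for j in range(i + step, n):            # scan remaining pages
            ios += 1
            seen_later.update(pages[j])
        kept, emitted = [], set()
        for rec in group:                       # clean the group in memory
            if rec not in seen_later and rec not in emitted:
                kept.append(rec)
                emitted.add(rec)
        out.extend(kept)
        ios += math.ceil(len(kept) / max(len(pages[0]), 1))  # write survivors
    return out, ios
```

With ⌈n/(m − 1)⌉ groups and O(n) page reads per group, the total is Θ(n²/m) I/Os, an m-fold improvement over variant (1).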

(23)

Content

Introduction

Memory Hierarchy Models Fundamental Techniques

External Memory Data Structures Cache-aware Optimization

Cache-Oblivious Algorithms and Data Structures

(24)

Stacks and Queues

Represent dynamic sets of elements and support deletion of elements in LIFO, respectively FIFO, order; implementing them using an array of length n and pointers gives one I/O per insert and delete in the worst case.

Stack: keep a buffer of 2B elements in the internal memory; at any time, it contains the k most recently added elements, k ≤ 2B.

▶ Insert: no I/Os, except when the buffer runs full (an I/O is used to write the B least recent elements to a block in external memory)

▶ Remove: no I/Os, except when the buffer is empty (an I/O is used to retrieve the block of B elements most recently written to external memory)

▶ Any sequence of B insert/delete operations needs at most one I/O → amortized O(1/B) I/Os per operation

(25)

Stacks and Queues

Queues: 2 buffers, a read buffer and a write buffer of size B, consisting of the least, respectively most, recently inserted elements.

▶ Insert: insert into the write buffer (when full, it is written to external memory).

▶ Remove: work on the read buffer; delete the least recent element without any I/O until the buffer is empty, in which case the appropriate external memory block is read into it.

Amortized complexity of 1/B I/Os per operation.
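The buffered stack is small enough to sketch directly. A plain list stands in for external memory, and an `ios` counter records block transfers; the class name `ExternalStack` is ours.

```python
class ExternalStack:
    """I/O-efficient stack sketch: a 2B-element in-memory buffer; spilling
    the B least recent elements keeps the B most recent ones resident."""
    def __init__(self, B):
        self.B = B
        self.buf = []        # at most 2B most recently added elements
        self.disk = []       # 'external memory': a list of B-element blocks
        self.ios = 0         # simulated block transfers

    def push(self, x):
        if len(self.buf) == 2 * self.B:          # buffer full:
            self.disk.append(self.buf[:self.B])  # write B least recent elements
            self.buf = self.buf[self.B:]
            self.ios += 1
        self.buf.append(x)

    def pop(self):
        if not self.buf:                         # buffer empty:
            self.buf = self.disk.pop()           # read back most recent block
            self.ios += 1
        return self.buf.pop()
```

Between two I/Os at least B pushes or pops must occur, which is exactly the amortized O(1/B) argument from the slide.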

(26)

Linked Lists

An efficient implementation of ordered lists of elements: sequential search, deletion and insertion in arbitrary locations of the list.

A naive pointer-based implementation performs one I/O every time a pointer is followed.

An I/O-efficient implementation of linked lists: keep the elements in blocks and maintain the invariant: there are more than (2/3)B elements in every pair of consecutive blocks.

▶ Insertion: one I/O, if the block is not full; otherwise, if any of its two neighbors has spare capacity, push an element to that block; otherwise, split the block into two equally sized blocks.

▶ Deletion: check if the operation results in violating the invariant; if so, merge the two violating blocks.

▶ O(1) I/O insert, delete, merge and split operations; O(i/B) I/Os to traverse i consecutive elements

(27)

B-tree

Storing binary trees arbitrarily in external memory: O(log₂ N) I/Os per query/update.

In external memory, a search tree like the B-tree is the basis for a wide range of efficient queries on sets. It is a generalization of balanced BSTs to a balanced tree of degree Θ(B).

▶ The n data items are stored in the Θ(n/B) leaves (in sorted order), each leaf storing Θ(B) elements.

▶ All leaves are on the same level and the tree has height O(log_B n).

(28)

B-tree

O(log_B n) I/O insert, delete and search operations.

▶ Searching: traverse down the tree from the root to the appropriate leaf in O(log_B n) I/Os.

One-dimensional range queries: in O(log_B n + T/B) I/Os, T the output size.
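The gap between O(log₂ n) and O(log_B n) is easy to make concrete. A quick back-of-the-envelope sketch (`btree_height` is our name, and the 256-keys-per-block figure is an assumed example, not from the slides):

```python
import math

def btree_height(n, B):
    """Approximate height of a degree-Theta(B) search tree over n items:
    O(log_B n) levels, hence that many I/Os per root-to-leaf search."""
    return math.ceil(math.log(n, B))

# One billion items with ~256 keys per block: 4 levels, i.e. ~4 I/Os per
# search, versus ~30 I/Os for an arbitrarily stored binary search tree.
btree = btree_height(10**9, 256)
binary = btree_height(10**9, 2)
```

The fan-out of Θ(B) is precisely what collapses the log base from 2 to B.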

(29)

B-tree

▶ Insertion

1. Search for the relevant leaf l.

2. If it is not full, insert the new element.

3. Otherwise, split l into leaves l′ and l″ of approx. the same size and insert the element in the relevant leaf; the split of l results in the insertion of a new routing element in the parent of l, and thus the need for a split may propagate up the tree; in the worst case, the height of the tree grows by one.

Complexity: O(log_B n) I/Os.

(30)

B-tree

▶ Deletion: search for the appropriate leaf and remove the element; if this results in too few elements in the leaf, fuse it with one of its siblings; similarly to splits during insertion, fuse operations may propagate up the tree and eventually result in the height of the tree decreasing by one.

Complexity: O(log_B n) I/Os.

(31)

Content

Introduction

Memory Hierarchy Models

Fundamental Techniques

External Memory Data Structures

Cache-aware Optimization

Cache-Oblivious Algorithms and Data Structures

(32)

Cache-aware Optimization

Caches are part of the memory hierarchy, between the processor registers and the main memory.

Cache miss: a required data item is likely not to be in the cache if the code does not respect the locality properties; several contiguous data words then have to be loaded from memory into the cache.

(33)

Cache-aware Optimization

Detecting Poor Cache Performance. Use profiling tools to analyze the performance bottlenecks. Ex:

▶ gprof - reports how much CPU time is spent in which program function

▶ Valgrind tool suite - the cache simulator cachegrind performs simulations of the L1 and L2 caches to determine the origins of cache misses

▶ graphical front-ends: kprof and kcachegrind

https://sourceware.org/binutils/docs/gprof/

(34)

Fundamental Cache-Aware Techniques

Analyze the reasons for frequent cache misses and identify the particular class of cache miss responsible for the problem:

▶ cold (compulsory) miss: an item is accessed for the first time

▶ capacity miss: an item has been in the cache before the current access, but has been evicted due to the cache's limited size

▶ conflict miss: an accessed item has been replaced because another one is mapped to its cache line

(35)

Fundamental Cache-Aware Techniques

Loop Interchange and Array Transpose

▶ Since data is fetched blockwise into the cache, it is essential to access contiguous data consecutively.

▶ If the data access does not respect the data layout, memory references are not performed on contiguous data → cache misses.

▶ Solution: access A[i][j] in accordance with the row-major layout (let the innermost loop iterate over the last index).
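The effect of the access order can be shown with a tiny transfer counter. A minimal sketch assuming a one-block cache (so each block change costs a load); `block_loads` and the address lists are our constructions:

```python
def block_loads(addresses, B):
    """Count block transfers for an access sequence, with an idealized
    single-block cache: a new load whenever the accessed block changes."""
    loads, current = 0, None
    for a in addresses:
        if a // B != current:
            current = a // B
            loads += 1
    return loads

N, B = 64, 8
# A is an N x N row-major matrix: A[i][j] lives at address i*N + j.
row_major = [i * N + j for i in range(N) for j in range(N)]   # good: stride 1
col_major = [i * N + j for j in range(N) for i in range(N)]   # bad: stride N
```

Traversing in layout order loads each block once (N²/B loads); the interchanged loop order changes block on every access (N² loads), a factor-B slowdown.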

(36)

Fundamental Cache-Aware Techniques

Loop fusion: combines two loops that are executed directly after one another (with the same iteration space) into one loop.

▶ this transformation is legal if there are no dependencies from the first loop to the second one

▶ it yields higher instruction-level parallelism, reduces the loop overhead, and may also improve data locality
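A minimal sketch of a legal fusion (the element-wise dependency runs only within an iteration, so the two passes can be combined; function names are ours):

```python
def two_passes(xs):
    a = [x * 2 for x in xs]      # first loop over the iteration space
    b = [y + 1 for y in a]       # second loop over the same space
    return b

def fused(xs):
    # b[i] depends only on a[i], computed in the same iteration, so one
    # pass suffices; each x is used while it is still cache-resident,
    # and the intermediate array a disappears entirely.
    return [x * 2 + 1 for x in xs]
```

The fused version reads each input element once instead of writing and re-reading the intermediate array.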

(37)

Fundamental Cache-Aware Techniques

Array merging: instead of declaring 2 arrays with the same dimension and type (double a[n], b[n]), combine them into one multidimensional array (double ab[n][2]) or into an array of structures holding an (a, b) pair - if the elements of a and b are typically accessed together.
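With the same block-transfer counter as before, merging makes each a[i]/b[i] pair fall into one block. This is a sketch under our own address-layout assumptions (`gap` models the distance between the two separate arrays):

```python
def block_loads(addresses, B):
    """Block transfers under an idealized single-block cache."""
    loads, current = 0, None
    for a in addresses:
        if a // B != current:
            current, loads = a // B, loads + 1
    return loads

def interleaved_addresses(n):
    # double ab[n][2]: a[i] at 2*i, b[i] at 2*i + 1 -> the pair is adjacent
    return [x for i in range(n) for x in (2 * i, 2 * i + 1)]

def separate_addresses(n, gap):
    # double a[n], b[n] placed `gap` words apart: a[i] at i, b[i] at gap + i
    return [x for i in range(n) for x in (i, gap + i)]
```

Accessing the pairs from the merged layout touches n·2/B blocks; with separate arrays every access jumps between two distant regions.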

(38)

Cache-aware Optimization

Array Padding

Inter-array padding inserts unused variables (pads) between two arrays in order to avoid cross interference.

Introducing pads modifies the cache-set mapping of the second array s.t. both arrays are then mapped to different parts of the cache.
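Cross interference needs a set-mapped cache to show up, so this sketch simulates a direct-mapped cache. The sizes (8-word lines, 64 sets, 512-element arrays laid out back to back) are our assumptions for illustration:

```python
def misses_direct_mapped(addresses, B, num_sets):
    """Misses in a direct-mapped cache: block b maps to set b % num_sets."""
    cache, misses = {}, 0
    for a in addresses:
        blk = a // B
        s = blk % num_sets
        if cache.get(s) != blk:   # conflict or first touch: load the block
            cache[s] = blk
            misses += 1
    return misses

B, SETS, N = 8, 64, 512
# a[] at address 0, b[] immediately after: a[i] and b[i] map to the SAME set,
# so the alternating accesses evict each other on every reference.
no_pad = [x for i in range(N) for x in (i, N + i)]
# inter-array pad of one cache line (B words) shifts b[] to different sets.
padded = [x for i in range(N) for x in (i, N + B + i)]
```

Without the pad every access thrashes its set; one line of padding reduces the misses to the compulsory ones.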

(39)

Cache-Aware Numerical Linear Algebra

Computational kernels in linear algebra that achieve a high cache performance:

▶ Basic Linear Algebra Subprograms (BLAS) - basic vector and matrix operations

▶ Linear Algebra Package (LAPACK) - solvers for linear equations, linear least-squares and eigenvalue problems, etc.

▶ Automatically Tuned Linear Algebra Software (ATLAS) - determines the hardware parameters during its installation and adapts its parameters accordingly to achieve high cache efficiency on a variety of platforms

(40)

Cache-Aware Numerical Linear Algebra

Large systems of linear equations Ax = b, with x, b vectors of length n and A ∈ R^{n×n} sparse (containing O(n) non-zero entries), cannot be solved by direct methods; iterative algorithms approximate the linear system solution: the basic splitting methods of Jacobi and Gauss-Seidel, successive overrelaxation, Krylov subspace and multigrid methods. The latter two are hard to optimize for cache data reuse due to global operations, respectively the traversal of a hierarchical data structure. To overcome these problems:

▶ reduce the number of iterations by performing more work per iteration to speed up convergence

▶ apply algebraic transformations to improve data reuse

▶ remove data dependencies: avoid global sums and inner products

(41)

Cache-Aware Numerical Linear Algebra

Loop blocking: improves data access, and therefore temporal locality, in loop nests.

It changes the way in which the elements of objects (A and the corresponding vector elements) are accessed:

▶ rather than iterating over one row after the other, the matrix is divided into small block matrices that fit into the cache;

▶ new inner loops that iterate within the blocks are introduced into the original loop nest.
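The payoff of blocking can be measured with a small cache simulator. This sketch uses an LRU cache as a stand-in for the ideal cache model's optimal replacement, and counts misses of a blocked vs. unblocked matrix-multiply loop nest; all names and sizes are our illustration choices:

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative LRU cache of nblocks lines of B words each."""
    def __init__(self, B, nblocks):
        self.B, self.nblocks = B, nblocks
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, addr):
        blk = addr // self.B
        if blk in self.lines:
            self.lines.move_to_end(blk)      # hit: refresh recency
        else:
            self.misses += 1                 # miss: load, evict LRU if full
            self.lines[blk] = True
            if len(self.lines) > self.nblocks:
                self.lines.popitem(last=False)

def matmul_misses(n, block, cache):
    """Replay the address stream of blocked matmul on row-major n x n
    matrices A, B, C (block = n reproduces the naive loop nest)."""
    A, Bm, C = 0, n * n, 2 * n * n           # base addresses
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        for k in range(kk, min(kk + block, n)):
                            cache.access(A + i * n + k)
                            cache.access(Bm + k * n + j)
                            cache.access(C + i * n + j)

naive = LRUCache(B=8, nblocks=64); matmul_misses(32, 32, naive)
blocked = LRUCache(B=8, nblocks=64); matmul_misses(32, 8, blocked)
```

With 8×8 blocks, all three active blocks fit in the simulated cache, so the blocked order incurs far fewer misses than the naive order, whose stride-n accesses to B keep evicting live lines.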

(42)

Cache-Aware Numerical Linear Algebra

Loop blocking

(43)

Content

Introduction

Memory Hierarchy Models

Fundamental Techniques

External Memory Data Structures

Cache-aware Optimization

Cache-Oblivious Algorithms and Data Structures

(44)

Cache-Oblivious Algorithms

The portability of the performance speedup of cache-aware optimization methods from one machine to another is often difficult - hence the interest in algorithms that do not require specific hardware parameters.

To derive cache-oblivious algorithms, use space-filling curves (bijective mappings from a line to a higher-dimensional space).

▶ When applied to objects with a regular structure (ex: structured or semi-structured grids), they often produce high-quality solutions (partitionings of these graphs with high locality).

(45)

Cache-Oblivious Algorithms

Example: a cache-oblivious matrix multiplication algorithm. A, B - n×n matrices stored in memory. Compute the matrix product C := AB.

for i = 1 to n do
    for j = 1 to n do
        C[i,j] ← 0.0;
        for k = 1 to n do
            C[i,j] ← C[i,j] + A[i,k] · B[k,j];

Algorithm 1: Naive matrix multiplication

A loop nest where two arrays of length n are accessed at the same time, one with stride 1, the other with stride n. By applying the loop blocking technique, cached entries of all matrices can be reused.

(46)

Cache-oblivious matrix multiplication algorithm

A cache-oblivious blocking of the main loop can be achieved by recursive block building.

How to guide this recursion by space-filling curves?

▶ a method based on the Peano curve increases both spatial and temporal locality
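Recursive block building can be sketched without any cache parameters. The slides use a ternary (Peano) subdivision; the sketch below shows the same recursive idea with the simpler binary subdivision, for n a power of 2 (`matmul` and `matmul_recursive` are our names, not the slides' algorithm):

```python
def matmul_recursive(A, B, C, ri, ci, rk, size):
    """Add the product of the size x size blocks of A (corner (ri, rk)) and
    B (corner (rk, ci)) into the block of C at (ri, ci). Halving the block
    size recursively yields blocks that fit in every cache level, whatever
    its size -- the cache-oblivious property."""
    if size <= 2:                              # base case: multiply directly
        for i in range(size):
            for j in range(size):
                for k in range(size):
                    C[ri + i][ci + j] += A[ri + i][rk + k] * B[rk + k][ci + j]
        return
    h = size // 2                              # 8 half-sized subproblems
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                matmul_recursive(A, B, C, ri + di, ci + dj, rk + dk, h)

def matmul(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    matmul_recursive(A, B, C, 0, 0, 0, n)
    return C
```

No block size appears anywhere in the code: the recursion itself generates blockings at every scale, which is exactly what makes the algorithm oblivious to M and B.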

(47)

Cache-oblivious matrix multiplication algorithm

The idea of cache-efficient computation of C := AB: the processing of matrix blocks.

▶ Each matrix is subdivided recursively into nx × ny block matrices, until all of them are small (some fraction of the cache size).

▶ Any block size nx × ny with nx, ny odd works.

▶ 9 recursive blocks; the recursion stops with submatrices of size 3×3.

(48)

Cache-oblivious matrix multiplication algorithm

Each 3×3 submatrix is stored in a Peano-like ordering:

    | a0 a5 a6 |   | b0 b5 b6 |   | c0 c5 c6 |
    | a1 a4 a7 | · | b1 b4 b7 | = | c1 c4 c7 |
    | a2 a3 a8 |   | b2 b3 b8 |   | c2 c3 c8 |

The multiplication of each block is done in the standard way (ex: c7 := a1·b6 + a4·b7 + a7·b8), i.e. cr = Σ_{(p,q)∈Ir} ap·bq, where Ir contains the three index pairs contributing to cr. After initializing each cr to 0, execute cr ← cr + ap·bq for all triples (r, p, q), in an arbitrary order.

To do this cache-efficiently, jumps in the indices r, p, q have to be avoided. It is possible to find an operation order where two consecutive triples differ by no more than 1 in each element, so that optimal spatial and very good temporal locality is obtained.

(49)

Cache-oblivious matrix multiplication algorithm

The analysis of this scheme for the 3×3 example in the ideal cache model, with cache size M:

▶ the spatial locality of the elements is at most a factor of 3 away from the theoretical optimum

▶ the number of cache line transfers satisfies T(n) = 27 · T(n/3), for n a power of 3; for blocks of size k×k, each block admits T(k) = 2 · ⌈k²/B⌉, B the size of a cache line

▶ this leads to the transfer of O(n³/√M) data items (or O(n³/(B·√M)) cache lines) into the cache - asymptotically optimal, and an improvement over the naive algorithm by a factor of M

(50)

Funnel sort

A cache-oblivious version of Mergesort. To sort a (contiguous) array of n elements:

1. split the input into n^{1/3} contiguous arrays, of size n^{2/3} each

2. sort the arrays recursively

3. merge the n^{1/3} sorted sequences using an n^{1/3}-merger.

A k-merger inputs k sorted sequences and merges them: it recursively merges sorted sequences which become progressively longer as the algorithm proceeds; it suspends work on a merging sub-problem when the merged output sequence becomes "long enough".

Invariant: Each invocation of a k-merger outputs the next k³ elements of the sorted sequence obtained by merging the k input sequences.

(51)

Funnel sort

A k-merger is built recursively out of √k-mergers:

▶ the k inputs are partitioned into √k sets of √k elements, which form the inputs to the √k √k-mergers L1, ..., L√k

▶ the outputs of these mergers are connected to the inputs of √k buffers (each a queue that can hold 2k^{3/2} elements)

▶ the outputs of the buffers are connected to the √k inputs of the √k-merger R.

The base case of the recursion is a k-merger with k = 2, which produces k³ = 8 elements.

(52)

Funnel sort

A k-merger operates recursively: to output k³ elements, the k-merger invokes R k^{3/2} times.

▶ before each invocation, the k-merger fills all buffers that are less than half full (contain fewer than k^{3/2} elements)

▶ to fill buffer i, the algorithm invokes the corresponding left merger Li once; since Li outputs k^{3/2} elements, the buffer contains at least k^{3/2} elements after Li finishes

(53)

Funnel sort

A careful implementation of cache-oblivious lazy funnelsort outperforms library implementations of quicksort on uniformly distributed data.

▶ for the largest instances that fit in RAM, it outperforms std::sort from the STL library (GCC 3.2) by 10-40%

(54)

Lazy Funnel sort

Modification: a buffer is filled only when it runs empty ⟹ define the merging algorithm in a k-merger in terms of the nodes of a tree of binary mergers with buffers on the edges.

A k-merger: a perfectly balanced binary tree with k leaves.

▶ each leaf contains a sorted input stream, and each internal node contains a standard binary merger

▶ the output of the root is the output stream of the entire k-merger

▶ each edge between two internal nodes contains a buffer, which is the output stream of the merger in the lower node and one of the two input streams of the merger in the upper node.

(55)

Lazy Funnel sort

A k-merger, including the buffers associated with its middle edges, is laid out in memory in contiguous locations. This statement holds recursively for the top tree and for the bottom trees of the k-merger.

In effect, a k-merger is the same as the van Emde Boas layout of a binary tree, except that edges are buffers and take up more than constant space.

Fill(v):
while v's output buffer is not full do
    if left input buffer empty then
        Fill(left child of v);
    if right input buffer empty then
        Fill(right child of v);
    perform one merge step;
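The lazy Fill discipline can be sketched as follows, assuming leaves hold whole sorted runs in memory and buffers are plain deques (`Node`, `fill` and `funnel_merge` are our names; real funnelsort sizes buffers as k^{3/2} and lays the tree out in van Emde Boas order, which this sketch omits):

```python
from collections import deque

class Node:
    """A node in the tree of binary mergers; leaves wrap one sorted run."""
    def __init__(self, cap, run=None, left=None, right=None):
        self.out = deque(run or [])     # output buffer on the edge above
        self.cap = cap                  # buffer capacity
        self.left, self.right = left, right
        self.done = run is not None     # leaves never refill

def fill(v):
    """Lazy Fill: merge until v's buffer is full or both inputs are spent;
    a child is refilled only when its buffer has run empty."""
    while len(v.out) < v.cap:
        for c in (v.left, v.right):
            if not c.out and not c.done:
                fill(c)
                if not c.out:
                    c.done = True       # subtree produced nothing: exhausted
        l, r = v.left, v.right
        if not l.out and not r.out:
            v.done = True
            break
        # one binary merge step: move the smaller head to the output buffer
        if not r.out or (l.out and l.out[0] <= r.out[0]):
            v.out.append(l.out.popleft())
        else:
            v.out.append(r.out.popleft())

def funnel_merge(runs, cap=4):
    """Build a balanced tree of binary mergers over sorted runs, then
    repeatedly Fill and drain the root."""
    nodes = [Node(cap, run=sorted(r)) for r in runs]
    while len(nodes) > 1:
        nodes = [Node(cap, left=nodes[i], right=nodes[i + 1])
                 if i + 1 < len(nodes) else nodes[i]
                 for i in range(0, len(nodes), 2)]
    root, out = nodes[0], []
    while True:
        fill(root)
        if not root.out:
            break
        out.extend(root.out)
        root.out.clear()
    return out
```

The point of the laziness is visible in `fill`: a child merger does no work at all until the buffer above it has actually been drained.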
