|  |  |
| --- | --- |
|  |  |

The Ministry of Education and Science of the Russian Federation

Lobachevsky State University of Nizhni Novgorod

Computing Mathematics and Cybernetics faculty

The competitiveness enhancement program   
of the Lobachevsky State University of Nizhni Novgorod   
among the world's research and education centers

Strategic initiative  
“Achieving leading positions in the field of supercomputer technology   
and high-performance computing”

**Introduction to GPU programming**

*Practice 7. Matrix multiplication*

Nizhni Novgorod

2014

**Author: S.I. Bastrakov**

# Objectives

The objective of this practice is to implement and analyze naive matrix multiplication algorithm, and implement optimized block algorithm using shared memory.

# Abstract

This practice considers naive and block matrix multiplication algorithms. We analyze both algorithms and explain how shared memory leads to significant speedup of block algorithm.

# BRIEF OVERVIEW

We consider matrix multiplication of two rectangular matrices, and suppose number of rows and columns of all matrices is a multiple of 16.

Implementation on CPU is very straightforward (not it is very far from optimal on CPU):

void mmult(int m, int n, int k, const float \* a, const float \* b,  
 float \* c)

{

for (int i = 0; i < m; ++i)

for (int j = 0; j < k; ++j)

{

c[i \* k + j] = 0;

for (int l = 0; l < n; ++l)

c[i \* k + j] += a[i \* n + l] \* b[l \* k + j];

}

}

This implementation is easily transformed into a naïve kernel and the code to launch it:

\_\_global\_\_ void mmult\_kernel(int m, int n, int k, const float \* a, const float \* b, float \* c)

{

int i = blockIdx.x \* blockDim.x + threadIdx.x;

int j = blockIdx.y \* blockDim.y + threadIdx.y;

float sum = 0;

for (int l = 0; l < n; ++l)

sum += a[i \* n + l] \* b[l \* k + j]; // sum += a(i, l) \* b(l, j)

c[i \* k + j] = sum; // c(i, j) = sum

}

void mmult\_gpu(int m, int n, int k, const float \* a, const float \* b, float \* c)

{

dim3 blocks(m / BLOCK\_SIZE, k / BLOCK\_SIZE);

dim3 threads(BLOCK\_SIZE, BLOCK\_SIZE);

mmult\_kernel<<<blocks, threads>>>(m, n, k, a, b, c);

}

Let us analyze memory access pattern of the naïve implementation. Each thread reads a row of matrix A and a column of matrix B, overall 2n numbers. Thread block reads overall 2n \* BLOCK\_SIZE2 numbers. To compute a block of matrix C sized BLOCK\_SIZE x BLOCK\_SIZE it is only required to read BLOCK\_SIZE rows of A, and BLOCK\_SIZE columns B, that is only 2n \* BLOCK\_SIZE numbers. This observation is a key to block algorithm.

In block algorithm each thread block computes a block of resulting matrix using shared memory. It is implemented in CUDA as follows:

\_\_global\_\_ void mmult\_kernel\_opt(int m, int n, int k, const float \* a, const float \* b, float \* c)

{

int i = blockIdx.x \* blockDim.x + threadIdx.x;

int j = blockIdx.y \* blockDim.y + threadIdx.y;

\_\_shared\_\_ float a\_block[BLOCK\_SIZE][BLOCK\_SIZE];

\_\_shared\_\_ float b\_block[BLOCK\_SIZE][BLOCK\_SIZE];

float sum = 0;

for (int p = 0; p < n / BLOCK\_SIZE; ++p) {

a\_block[threadIdx.x][threadIdx.y] = a[i \* n + p \* BLOCK\_SIZE + threadIdx.y];

b\_block[threadIdx.x][threadIdx.y] = b[(p \* BLOCK\_SIZE + threadIdx.x) \* k + j];

\_\_syncthreads();

for (int l = 0; l < BLOCK\_SIZE; ++l)

sum += a\_block[threadIdx.x][l] \* b\_block[l][threadIdx.y];

\_\_syncthreads();

}

c[i \* k + j] = sum; // c(i, j) = sum

}

void mmult\_gpu\_opt(int m, int n, int k, const float \* a, const float \* b, float \* c)

{

dim3 blocks(m / BLOCK\_SIZE, k / BLOCK\_SIZE);

dim3 threads(BLOCK\_SIZE, BLOCK\_SIZE);

mmult\_kernel\_opt<<<blocks, threads>>>(m, n, k, a, b, c);

}

Block algorithm is superior over naïve for matrices that are large enough to not fit global memory cache due to putting frequently used data to shared memory. This is one of the basic optimization approaches on GPU.

# FOR STUDENTS

Empirical evaluation of various performance optimization techniques is presented in [1]. Advanced optimization topics are covered in [2, 3].

# References

1. Farber R. CUDA Application Design and Development. – Morgan Kaufmann, 2011. – 336 p.
2. GPU Computing Gems Emerald Edition, ed. Wen-mei W. Hwu. – Morgan Kaufmann, 2011. – 886 p.
3. NVIDIA CUDA C Best Practices Guide [http://docs.nvidia.com/cuda/cuda-c-best-practices-guide#axzz3JRcPurfI]

# Individual work

1. Modify block algorithm so that each thread computes several elements instead of 1.
2. Empirically find optimal block size.
3. Implement Strassen matrix multiplication algorithm using the developed block algorithm implementation for small enough matrices. Compare performance of Strassen and block algorithms depending on matrix size.