Parallel and distributed computing, high-performance computing, and supercomputing technologies are regarded today as among the most important directions of scientific and technical development in many leading countries, including Russia. High-performance computing makes it possible to solve many fundamental and applied scientific and technical problems that require large-scale computations.

A purposeful effort is needed to develop a system of parallel and distributed computing education that prepares specialists for the realities of the future superparallel computing world. Only three components together – hardware, software and education – will create a stable basis for the development of the entire high-performance computing domain.

With these issues in mind, the educational community has initiated various activities to promote education in Parallel and Distributed Computing (PDC). A dedicated knowledge area called Parallel and Distributed Computing was included in Informatics Curricula 2013. Starting in 2010, the IEEE Computer Society Technical Committee on Parallel Processing (TCPP), with NSF support, developed a draft parallel and distributed computing curriculum. Also in 2010, the national Supercomputing Education project was started in Russia. The State University of Nizhni Novgorod (UNN) is one of the key participants in the project, and the UNN HPC Center was established at the university. The center currently conducts research in high-performance computing and supercomputing technologies and supports the educational process at UNN in these areas. Improving the quality of PDC education is among the center's priorities.

The main goal of the project is to update the Bachelor’s degree curriculum at the CMC so that it meets the requirements of the parallel computing world.

The NSF/TCPP Curriculum Initiative recommendations fully correspond to the plans for curriculum renewal and are used as the basis for completing this task.

To implement the project, the team will follow these provisions:

- Sections on parallel computing should be present in most courses of the modern curriculum.
- Materials included in the renewed courses should be created on a modular basis.
- Learning materials on parallel computing should follow a bottom-up (from simple to difficult) approach.
- The courses should be renewed in accordance with the recommendations of the ACM-IEEE Computer Science Curricula CS2013 Strawman Draft report and the recommendations of the Body of Knowledge of Supercomputing developed as part of the Russian national project Supercomputing Education.

The course is aimed at teaching classical data structures (stacks, queues, lists, trees, tables) and their use in algorithm development.

The goal of the course development is to teach parallel programming for systems with distributed memory (MPI technology is used as a basis).

The renewed course will provide additional information on modern high-performance systems that offer enormous computational power (over a petaflops). Students will be taught the basics of the multi-process organization of programs: the concept of a process and how it differs from a thread; process independence and communication; communication problems (overhead of data transfer, blocking while waiting for data, deadlocks); message passing methods (point-to-point and collective operations, asynchronous data transfer, evaluation of communication complexity); and the representation of a parallel program as a set of concurrently executing processes.

The basics of the MPI technology (processes and communicators, operations for sending and receiving messages, typical communication operations) will be taught to help students master the above concepts and use them in practice. The technology will be taught using examples of various computational tasks: matrix computation, graph processing, and Monte Carlo methods.
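As an illustration of how a Monte Carlo task maps onto the message passing model, the following minimal sketch estimates pi by splitting the samples among several simulated processes and summing their partial counts. It is plain sequential Python, not MPI code: the "ranks" run one after another, and the final `sum` only stands in for a collective reduction such as `MPI_Reduce`. The function names, the seed scheme and the sample counts are assumptions made for this example and are not taken from the course materials.

```python
import random


def local_hits(rank, samples_per_rank, seed=12345):
    """Count random points that fall inside the unit quarter circle for one simulated rank."""
    rng = random.Random(seed + rank)  # assumption: a separate generator per rank
    hits = 0
    for _ in range(samples_per_rank):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_ranks=4, samples_per_rank=50_000):
    """Combine the partial counts of all simulated ranks into one estimate of pi."""
    # In a real MPI program every rank would compute its own count concurrently.
    partial = [local_hits(rank, samples_per_rank) for rank in range(num_ranks)]
    # The sum plays the role of a collective reduction onto a single process.
    total_hits = sum(partial)
    return 4.0 * total_hits / (num_ranks * samples_per_rank)


pi_estimate = estimate_pi()
print(f"pi is approximately {pi_estimate:.4f}")
```

The structure shows why such tasks are convenient for teaching: the local work needs no communication at all, and the only data exchange is one small reduction at the end.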

In general, the added learning materials correspond to the recommendations of the TCPP Curriculum on parallel computing for distributed memory systems.

The course covers the basic concepts, methods and algorithms underlying the architecture and operation of operating systems. The subject includes models and algorithms used to implement various subsystems, application runtime environments, and the architecture of state-of-the-art operating systems. Special attention is given to the operation of multitasking systems and of multiprocess and multithreaded applications, which is critical for developing programs for multiprocessor and multicore systems.

This is the main course for students studying parallel computing. CMC has offered this course since 1995. It is constantly updated to cover new achievements in the field of parallel computing.

During this course students study examples and classifications of parallel systems, parallel computing performance factors, the computation and communication complexity of algorithms, advanced topics in OpenMP and MPI technologies, and methods for developing parallel algorithms and programs, using matrix computation, sorting, graph processing, optimization and other tasks as examples. A distinctive feature of the course is that it demonstrates how the efficiency of the developed parallel algorithms can be predicted and then confirmed by computational experiments.
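One simple way to compare a predicted efficiency with an experiment is sketched below. The prediction uses Amdahl's law, which is only one common model and is an assumption of this example (the course description does not say which models it uses); the timings passed to `measured_speedup` are likewise hypothetical values, not measurements.

```python
def amdahl_speedup(serial_fraction, processors):
    """Predicted speedup when a fixed fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def efficiency(speedup, processors):
    """Speedup per processor; 1.0 would mean ideal scaling."""
    return speedup / processors


def measured_speedup(t_serial, t_parallel):
    """Speedup observed in an experiment from two wall-clock times."""
    return t_serial / t_parallel


# Prediction for a program with 10% serial work on 8 processors.
predicted = amdahl_speedup(0.1, 8)
# Hypothetical timings in seconds, standing in for a real experiment.
observed = measured_speedup(20.0, 5.0)

print(f"predicted speedup {predicted:.2f}, efficiency {efficiency(predicted, 8):.2f}")
print(f"observed speedup  {observed:.2f}, efficiency {efficiency(observed, 8):.2f}")
```

A gap between the two rows is the interesting part of such an exercise: it usually points to costs the model leaves out, such as communication or load imbalance.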

The course includes an extended laboratory practicum. Upon completion of the course each student is to present an individual project.

The course will be renewed by adding sections on the development of parallel algorithms and programs for computing systems with a hierarchical structure (many nodes with distributed memory, where each node may be multiprocessor and each processor may be multicore). Developing parallel programs for these systems requires combining OpenMP and MPI technologies. Mastering this approach takes considerable effort and many computational experiments.

The course covers many TCPP Curriculum topics and may serve as the basis for advanced-level courses on parallel computing.

The main objective of the course is to study basic principles and acquire skills to develop programs that efficiently utilize Intel Xeon Phi.

The course includes the following topics:

- Exploration of Intel Xeon Phi architecture and the main mechanisms that influence application performance.
- Study of Intel Xeon Phi programming models and the corresponding system-level software. Mastering development, building and launching applications on Intel Xeon Phi.
- Study of principles and features of applying parallel programming technologies for development and optimization of computational applications for Xeon Phi, including SIMD, OpenMP and Cilk Plus.
- Mastering optimization and vectorization of computational loops, enhancing memory efficiency, load balancing.
- Getting acquainted with examples of successful optimization of applications that are initially not fully suited for efficient utilization of Intel Xeon Phi.

The course examines the construction and the performance analysis of deep neural networks using the Intel® neon™ Framework.

The following topics are covered:

- Introduction to deep learning.
- Multilayered fully-connected neural networks.
- Introduction to the Intel® neon™ Framework.
- Convolutional neural networks. Deep residual networks.
- Transfer learning of deep neural networks.
- Unsupervised learning: autoencoders, deconvolutional networks.
- Recurrent neural networks.
- Introduction to the Intel® nGraph™.

The course examines the practical application of deep learning to real-world computer vision problems that arise in the development of video surveillance systems.

The following topics are covered:

- Goals and tasks, course structure. The general scheme of solving computer vision problems using deep learning (from preparing data to assessing the quality). Overview of software tools used at each step.
- Image classification with a large number of categories using deep learning.
- Overview of the Intel Distribution of OpenVINO toolkit (Inference Engine, OpenCV, Open Model Zoo).
- Object detection in images using deep neural networks.
- Deep models for tracking objects in videos.
- Semantic segmentation of images using deep learning.
- Preparing synthetic data based on generative adversarial networks.

**Direct methods for solving systems of linear equations**

The course examines well-known numerical algorithms for solving systems of linear algebraic equations, as well as a range of issues related to parallelizing these algorithms on shared memory systems.

The course covers the following problems:

- Investigating the basic matrix operations and the problems of implementing them efficiently on systems with a memory hierarchy.
- Studying algorithms for solving systems of linear equations with both general matrices and matrices of special form (Gaussian elimination, Cholesky factorization, sweep and reduction methods).
- Considering general approaches to optimizing algorithms for memory access and to load balancing in parallelization.
- Discussing (in the laboratory works) examples of efficient implementation of the examined algorithms.
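As an example of a direct method for a matrix of special form, here is a minimal sketch of a solver for tridiagonal systems. It assumes that the "sweep method" named in the list refers to the tridiagonal (Thomas) algorithm; the function name and the argument layout are choices made for this illustration, and the code does no pivoting, so it is intended for well-conditioned systems such as diagonally dominant ones.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by a forward sweep and a back substitution.

    lower[i] is the sub-diagonal entry in row i (lower[0] is unused),
    diag[i] is the main diagonal entry in row i,
    upper[i] is the super-diagonal entry in row i (upper[n-1] is unused).
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal coefficients
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward sweep: eliminate the sub-diagonal row by row.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution: recover the unknowns from the last row upward.
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# 3x3 system with diagonal 2 and off-diagonals -1, built so that x = [1, 2, 3].
solution = solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0])
print(solution)
```

Both loops carry a dependency from one row to the next, which is why this algorithm is inherently sequential and why parallel alternatives such as reduction methods are studied alongside it.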

**Iterative methods for solving systems of linear equations**

The main objective of the course is to study basic iterative algorithms for solving linear systems and to gain experience in developing parallel numerical algorithms for efficient use on shared memory systems. It involves the following tasks:

- Mastering parallel programming on shared memory systems (OpenMP, TBB, Cilk Plus).
- Studying basic iterative methods for solving linear systems (Jacobi, Gauss-Seidel, SOR and direct iteration methods).
- Studying Krylov subspace iterative methods (conjugate gradient, biconjugate gradient and generalized minimum residual methods).
- Studying basic preconditioning algorithms (methods based on incomplete LU factorization).
- Studying general approaches to optimizing algorithms for memory access and to load balancing in parallelization.
- Insight into efficient implementations of studied algorithms (in the course of laboratory works).
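The simplest of the methods listed above, the Jacobi iteration, can be sketched in a few lines. This is a plain sequential version written for illustration only; the function name, the stopping rule (largest change between two iterations) and the default tolerance are assumptions of this example.

```python
def jacobi(a, b, tol=1e-10, max_iter=1000):
    """Jacobi iteration for a x = b, with a given as a list of rows.

    Returns the approximate solution and the number of iterations performed.
    """
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iter + 1):
        x_new = [0.0] * n
        for i in range(n):
            # Each component uses only values from the previous iteration.
            off_diagonal = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x_new[i] = (b[i] - off_diagonal) / a[i][i]
        change = max(abs(x_new[i] - x[i]) for i in range(n))
        x = x_new
        if change < tol:
            return x, iteration
    return x, max_iter


# Diagonally dominant test system built so that the exact solution is [1, 2, 3].
matrix = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
rhs = [6.0, 12.0, 14.0]
solution, iterations = jacobi(matrix, rhs)
print(solution, iterations)
```

Because every component of `x_new` depends only on the previous iterate, the loop over `i` has no dependencies between its iterations, which is what makes the method a natural first candidate for parallelization on shared memory systems.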

**Numerical methods for solving differential equations**

The main objective of this course is to study numerical methods for solving ordinary and partial differential equations and approaches to their parallelization for shared memory systems. The course involves the following problems:

- Studying numerical methods for solving systems of ordinary, stochastic and partial differential equations.
- Studying approaches to checking the correctness of experimental results and their convergence to theoretical predictions.
- Studying principles of parallel algorithm construction for solving differential equations.
- Studying functionality of libraries to solve auxiliary problems such as random number generation and Fourier transform (Intel MKL, FFTW).
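A convergence check of the kind mentioned in the list can be sketched on a problem with a known solution. The example below applies the explicit Euler method to y' = -y, y(0) = 1, and estimates the observed order of accuracy by halving the step; the choice of method, test equation and step counts is an assumption made for illustration, not a description of the course laboratory works.

```python
import math


def euler(f, y0, t_end, steps):
    """Integrate y' = f(t, y) from t = 0 to t_end with the explicit Euler method."""
    h = t_end / steps
    t, y = 0.0, y0
    for _ in range(steps):
        y += h * f(t, y)
        t += h
    return y


def observed_order(f, y0, t_end, exact, steps):
    """Estimate the order of accuracy from the errors at step h and step h/2."""
    err_coarse = abs(euler(f, y0, t_end, steps) - exact)
    err_fine = abs(euler(f, y0, t_end, 2 * steps) - exact)
    return math.log2(err_coarse / err_fine)


# Test problem y' = -y, y(0) = 1, whose exact solution at t = 1 is exp(-1).
order = observed_order(lambda t, y: -y, 1.0, 1.0, math.exp(-1.0), 100)
print(f"observed order of accuracy: {order:.3f}")
```

For a first-order method the error should roughly halve when the step is halved, so an observed order close to 1 agrees with the theory, while a clearly different value would suggest an implementation error.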

**Introduction to parallel programming**

The purpose of this course is to study mathematical models, methods and technologies of parallel programming for multicore and multiprocessor computers to an extent that ensures a successful start in the field of parallel programming. The proposed set of knowledge and skills forms a theoretical basis for methods of developing complex programs and includes such topics as the purposes and objectives of parallel data processing, construction principles for parallel computing systems, modeling and analysis of parallel computations, development principles for parallel programs and algorithms, systems for parallel programming, and parallel numerical algorithms for solving standard problems of computational mathematics.

**Parallel programming for multiprocessor distributed memory systems**

The training course covers parallel programming technology for developing high-performance implementations of time-consuming algorithms to be executed on parallel computing systems with a cluster architecture. The lectures cover the fundamental principles of the Message Passing Interface (MPI) technology, the structure of the MPI library, types of communication between processes, derived data types, virtual topologies, and blocking and nonblocking communications in point-to-point and collective modes. The laboratory practice reviews the phases of parallel software development: developing a serial implementation as a baseline for comparison, developing a parallel version, and analyzing it. The training is based on test problems that do not require knowledge from particular application domains beyond the information given in the course.

**Introduction to MPI**

The purpose of this course is to study MPI, one of the technologies of parallel programming. The course also covers mathematical models, methods and technologies of parallel programming for multicore and multiprocessor computers to an extent that ensures a successful start in the field of parallel programming.

**Parallel programming for shared memory systems**

The purpose of this course is to master the knowledge and skills required for a successful start of professional activity in the field of parallel programming on shared memory systems. The course covers both the theory of parallel computing with Intel Threading Building Blocks (TBB) and practical skills in TBB-based parallel programming.

**Introduction to GPU programming**

The main objective of the course is to introduce the basic principles of GPU programming and to develop practical skills in it. The course considers the following topics:

- Overview of GPU programming technologies.
- Study of GPU architecture.
- Getting familiar with the CUDA C programming language. Mastering the writing of kernels and device functions, workload distribution, and data exchange between host and device memory.
- Getting acquainted with examples of implementation and optimization of applications on GPU.
- Using optimized CUDA libraries: CUBLAS, CUFFT, CURAND.