DLI: Deep Learning Inference Benchmark

DLI is a benchmark for deep learning inference on various hardware. The main advantage of DLI over existing benchmarks is the availability of performance results for a large number of deep models inferred on Intel platforms (Intel CPUs, Intel Processor Graphics, Intel Movidius Neural Compute Stick). This makes it possible to assess the prospects for the practical application of models. The source code of the system is publicly available on GitHub, so third-party users can independently run experiments to analyze the inference performance of models of interest on their own hardware.

DLI supports inference using the following frameworks:

Intel® Distribution of OpenVINO™ Toolkit (C++ and Python APIs).
Intel® Optimization for Caffe (Python API).
Intel® Optimizations for TensorFlow (Python API).
TensorFlow Lite (C++ and Python APIs).
ONNX Runtime (C++ and Python APIs).
MXNet (Python Gluon API).
OpenCV DNN (C++ and Python APIs).
PyTorch (C++ and Python APIs).
Apache TVM (Python API).
Deep Graph Library (PyTorch-based).
Spektral (Python API).
RKNN (C++ API).
ncnn (Python API).
PaddlePaddle (Python API).
ExecuTorch (C++ API).

Benchmark Architecture

The software consists of several components.

ConfigMaker is a graphical application for generating configuration files for various components of the DLI benchmark. The application is self-contained and does not depend on the rest of the software components.
Deployment is a component that provides automatic deployment of infrastructure to the computational nodes using Docker technology. Information about the computational nodes is contained in the configuration file.
BenchmarkApp is a component responsible for collecting performance metrics for the inference of a set of models using various deep learning frameworks. Information about models and parameters for executing inference is contained in the component’s configuration file.
Inference is a component that implements the inference of deep neural networks using various frameworks. It is used by the BenchmarkApp component to infer neural networks with the specified parameters.
AccuracyChecker is a component that assesses the accuracy of models on public datasets. It is an add-on over a similar component of the OpenVINO toolkit.
RemoteController is a component that executes experiments remotely to determine inference performance and accuracy of deep models on the computational nodes.
Converters is an auxiliary component that contains various converters for convenient representation of performance and accuracy results. In particular, this component provides conversion of the output data to HTML-format for publishing the experimental results on the project web-page.

The main scenario involves the following actions.

Generating configuration files using ConfigMaker for the Deployment, RemoteController, BenchmarkApp and AccuracyChecker components.
Deploying test infrastructure to the computational nodes using the Deployment component. All data required for the experiments is stored on the FTP server. At this stage, the directory structure on the FTP server is prepared, the template Docker image is copied to the FTP server, the image is downloaded from the FTP server to each computational node and deployed locally, and the experiment configurations are copied to each node.
Launching experiments remotely on the computational nodes using RemoteController. RemoteController first launches AccuracyChecker, which determines the accuracy of the original models and then the accuracy of the models converted to the intermediate representation of the OpenVINO toolkit with FP32 and FP16 weight formats. Next, RemoteController launches BenchmarkApp to determine inference performance (the test order is the same as for AccuracyChecker). BenchmarkApp uses the Inference component to infer each deep model on the computational node. Each model is inferred using OpenVINO and, if DLI supports it for inference, the source framework used to train the model. After all calculations, RemoteController generates a file with the experiment results on the FTP server.
Converting the results table to HTML-format using Converters. This component converts tables with the results obtained by the BenchmarkApp and AccuracyChecker components into HTML-format for further publication on the official project web-page.

Test Datasets

Neural network performance is evaluated on images from the following datasets.

Image classification: ImageNet.
Object detection: MS COCO, PASCAL VOC.
Semantic segmentation: MS COCO, PASCAL VOC.
Instance segmentation: MS COCO.
Semantic segmentation of on-road scenes: Cityscapes.
Face detection: LFW, VGGFace2, Wider Face.
Other tasks: public images from the Internet.

Measurements

The software is verified using the Inference Engine component of the Intel Distribution of OpenVINO Toolkit. The OpenVINO toolkit provides two inference modes.

Latency mode. This mode involves creating and executing a single request to infer the model on the selected device. The next inference request is created after the previous one completes. During performance analysis, the number of requests is determined by the number of iterations of the test loop. Latency mode minimizes the inference time of a single request.
Throughput mode. This mode involves creating a set of requests to infer the neural network on the selected device. The requests may complete in arbitrary order. The number of request sets is determined by the number of iterations of the test loop. Throughput mode minimizes the inference time of the whole set of requests.

Inference Engine provides two programming interfaces.

Sync API is used to implement latency mode.
Async API is used to implement latency mode if a single request is created, and throughput mode, otherwise.

A single inference request corresponds to the feed forward of the neural network for a batch of images. Required test parameters:

batch size,
number of iterations (the number of times one request is inferred in the latency mode, or one set of requests in the throughput mode),
number of requests created in throughput mode.
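
As a rough illustration of how these three parameters shape the two test loops, here is a minimal Python sketch. It is not the DLI implementation: `infer` is a hypothetical stand-in for a framework call, and a thread pool is used only to mimic a set of concurrently running requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_latency_loop(infer, batch, iterations):
    """Run requests one after another and return each request's duration."""
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        infer(batch)  # the next request starts only after this one returns
        durations.append(time.perf_counter() - start)
    return durations


def run_throughput_loop(infer, batch, iterations, request_count):
    """Run `iterations` sets of concurrent requests and return the total time."""
    with ThreadPoolExecutor(max_workers=request_count) as pool:
        start = time.perf_counter()
        for _ in range(iterations):
            # One set of requests; they may complete in any order.
            futures = [pool.submit(infer, batch) for _ in range(request_count)]
            for future in futures:
                future.result()  # wait for the whole set before the next one
        return time.perf_counter() - start
```

The latency loop records a duration per request, while the throughput loop only records the total time, because individual requests in a set overlap.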

Inference can be executed in multi-threading mode. The number of threads is an inference parameter (by default, it equals the number of physical cores).

In the throughput mode, requests can be executed in parallel using streams. A stream is a group of physical threads. The number of streams is also a parameter.

Because the OpenVINO toolkit provides two inference modes, performance measurements are taken for each mode. When evaluating inference performance in the latency mode, requests are executed sequentially: the next request is inferred after the previous one completes. The duration of each request is measured. The standard deviation is calculated over the set of measured durations, and durations that lie more than three standard deviations from the mean inference time are discarded. The final set of times is used to calculate the performance metrics for the latency mode.

Latency* is a median of execution times.
Average time of a single pass* is the ratio of the total execution time of all iterations to the number of iterations.
Batch FPS is the ratio of the batch size to the latency.
FPS is the ratio of the total number of processed images to the total execution time.
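
The latency-mode metrics above can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the benchmark's own code. The function name is hypothetical, and the use of the population standard deviation and of the filtered durations as the total execution time are assumptions.

```python
import statistics


def latency_metrics(durations, batch_size):
    """Compute latency-mode metrics from per-request durations in seconds."""
    mean = statistics.mean(durations)
    std = statistics.pstdev(durations)  # assumption: population std deviation
    # Discard durations more than three standard deviations from the mean.
    kept = [t for t in durations if abs(t - mean) <= 3 * std]

    latency = statistics.median(kept)
    total_time = sum(kept)
    return {
        "latency": latency,
        "average_time": total_time / len(kept),
        "batch_fps": batch_size / latency,
        "fps": batch_size * len(kept) / total_time,
    }
```

For example, `latency_metrics([0.01] * 19 + [1.0], batch_size=4)` drops the single 1.0 s outlier and computes all metrics from the remaining 19 durations.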

For the throughput mode, performance metrics are provided below.

Average time of a single pass* is the ratio of the execution time of all request sets to the number of iterations of the test loop. It is the execution time of one set of simultaneously created requests on the device.
Batch FPS is the ratio of the product of the batch size and the number of iterations to the execution time of all requests.
FPS is the ratio of the total number of processed images to the total execution time.
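
A similar sketch for the throughput-mode metrics follows. The function name is hypothetical, and the total number of processed images is assumed to be the product of the batch size, the number of requests per set, and the number of iterations. The text above does not spell this out.

```python
def throughput_metrics(total_time, batch_size, iterations, request_count):
    """Compute throughput-mode metrics from the total execution time in seconds."""
    # Assumption: every request in every set processes one full batch.
    total_images = batch_size * request_count * iterations
    return {
        "average_time": total_time / iterations,
        "batch_fps": batch_size * iterations / total_time,
        "fps": total_images / total_time,
    }
```

With this reading, FPS equals Batch FPS multiplied by the number of requests in a set.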

Along with the OpenVINO toolkit, DLI supports inference using Intel Optimization for Caffe, Intel Optimization for TensorFlow and other inference frameworks. These frameworks support only one inference mode, similar to the latency mode of the OpenVINO toolkit. Therefore, the latency-mode performance metrics apply to these frameworks.

* Starting from a certain version of the software, these indicators are not published on the project web-page, since the Batch FPS indicator is more representative, and some indicators can be calculated from the Batch FPS. Since 2023 we publish FPS.

Experiment parameters

<precision> is a weights data format (FP16, FP32, INT8);
<dataset> is a test dataset;
<batch_size> is a batch size (number of images) for the network single pass;
<mode> is an inference mode (sync or async in old versions of the OpenVINO toolkit, latency or throughput in new versions);
<plugin> (<device>) is a plugin type (CPU, GPU, MYRIAD (Movidius));
[<async_req_count>] is a request count (only for the async mode in old versions of the OpenVINO toolkit or the throughput mode in new versions);
<iterations_num> is a number of test iterations;
[<threads_num>] is a number of threads (by default, it equals the number of physical cores);
[<streams_num>] is a number of streams (only for the async mode in old versions of the OpenVINO toolkit or the throughput mode in new versions).
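
To make the relationship between required and optional parameters concrete, here is a hypothetical Python container for one experiment. The class and its validation rules are illustrative assumptions and do not reflect the DLI configuration format. Field names mirror the placeholders above.

```python
from dataclasses import dataclass
from typing import Optional

SEQUENTIAL_MODES = {"sync", "latency"}
PARALLEL_MODES = {"async", "throughput"}


@dataclass
class ExperimentParameters:
    precision: str                         # FP16, FP32 or INT8
    dataset: str
    batch_size: int
    mode: str                              # sync/latency or async/throughput
    device: str                            # e.g. CPU, GPU, MYRIAD
    iterations_num: int
    async_req_count: Optional[int] = None  # async/throughput mode only
    threads_num: Optional[int] = None      # None: use the default
    streams_num: Optional[int] = None      # async/throughput mode only

    def __post_init__(self):
        if self.precision not in {"FP16", "FP32", "INT8"}:
            raise ValueError(f"unknown precision: {self.precision}")
        if self.mode not in SEQUENTIAL_MODES | PARALLEL_MODES:
            raise ValueError(f"unknown mode: {self.mode}")
        if self.mode in SEQUENTIAL_MODES and (
            self.async_req_count is not None or self.streams_num is not None
        ):
            raise ValueError("request and stream counts need async/throughput mode")
```

Rejecting request and stream counts in the sync or latency mode is one possible design choice. A real tool might simply ignore them instead.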

Results

Inference performance results are available on the project Wiki.

Accuracy results for public models are available on the project Wiki too.

Publications

Kustikova V., Vasilyev E., Khvatov A., Kumbrasiev P., Vikhrev I., Utkin K., Dudchenko A., Gladilov G. Intel Distribution of OpenVINO Toolkit: A Case Study of Semantic Segmentation // Lecture Notes in Computer Science. V. 11832. 2019. P. 11-23. [https://link.springer.com/chapter/10.1007/978-3-030-37334-4_2].
Kustikova V., Vasilyev E., Khvatov A., Kumbrasiev P., Rybkin R., Kogteva N. DLI: Deep Learning Inference Benchmark // Communications in Computer and Information Science. V. 1129. 2019. P. 542-553. [https://link.springer.com/chapter/10.1007/978-3-030-36592-9_44].
Sidorova A.K., Alibekov M.R., Makarov A.A., Vasiliev E.P., Kustikova V.D. Automation of collecting performance indicators for the inference of deep neural networks in Deep Learning Inference Benchmark // Mathematical modeling and supercomputer technologies. Proceedings of the XXI International Conference (N. Novgorod, November 22–26, 2021). – Nizhny Novgorod: Nizhny Novgorod State University Publishing House, 2021. – 423 p. [https://hpc-education.unn.ru/files/conference_hpc/2021/MMST2021_Proceedings.pdf]. (In Russian)
Alibekov M.R., Berezina N.E., Vasiliev E.P., Vikhrev I.B., Kamelina Yu.D., Kustikova V.D., Maslova Z.A., Mukhin I.S., Sidorova A.K., Suchkov V.N. Performance analysis methodology of deep neural networks inference on the example of an image classification problem // Numerical Methods and Programming. – 2024. – Vol. 25(2). – P. 127-141. – [https://num-meth.ru/index.php/journal/article/view/1332/1264]. (In Russian)
Mukhin I., Rodimkov Y., Vasiliev E., Volokitin V., Sidorova A., Kozinov E., Meyerov I., Kustikova V. Benchmarking Deep Learning Inference on RISC-V CPUs // Springer Lecture Notes in Computer Science. – 2025. – Vol. 15406. – P. 331-346. – [https://link.springer.com/chapter/10.1007/978-3-031-78459-0_24].
Sidorova A., Mukhin I., Kustikova V. Optimizing Deep Learning Inference on RISC-V CPUs within the OpenVINO Toolkit // Springer Communications in Computer and Information Science (CCIS). – 2025. – Vol. 2363. – P. 74–91. – [https://link.springer.com/chapter/10.1007/978-3-031-80457-1_6].