Practical Work No. 4. Automatic Neural Network Optimization with Apache TVM

1. Goals and objectives

The goal of this work is to study the programming interface for automatic optimization of neural networks with Apache TVM on x86-64 processors.

Achieving this goal involves the following tasks:

  1. Train the logistic regression and fully connected neural network architectures on the MNIST dataset on an x86 device. Save the models in Apache TVM format, and save the quality metrics and the dataset in NumPy format for later testing.
  2. Install LLVM and build Apache TVM with LLVM.
  3. Optimize the logistic regression model.
    1. Load the logistic regression model. Run it, verify correctness, and measure the inference time without optimization.
    2. Optimize the logistic regression model with AutoTVM, Auto-scheduler, and MetaScheduler.
    3. Analyze the optimization results for the logistic regression.
  4. Optimize the fully connected neural network.
    1. Load the fully connected neural network model. Run it, verify correctness, and measure the inference time without optimization.
    2. Optimize the fully connected neural network model with AutoTVM, Auto-scheduler, and MetaScheduler.
    3. Analyze the optimization results for the fully connected neural network.
  5. Optimize the convolutional neural network.
    1. Load the convolutional neural network model. Run it, verify correctness, and measure the inference time without optimization.
    2. Optimize the convolutional neural network model with AutoTVM, Auto-scheduler, and MetaScheduler.
    3. Analyze the optimization results for the convolutional neural network.

Useful links:

  • AutoTVM tutorials,
  • AutoTVM documentation,
  • Auto-scheduler tutorials,
  • Auto-scheduler documentation,
  • MetaScheduler documentation.

Note: Apache TVM is currently not fully ported to the RISC-V architecture, so a number of limitations prevent a full demonstration of its automatic optimization functionality on RISC-V devices. Below is an approximate list of the problems the authors encountered while preparing this practical work.

  • Critical errors occur while optimizing convolutional neural networks, which makes it impossible to optimize these architectures on RISC-V devices. For this reason, this practical work considers only fully connected neural networks. Optimization over RPC is also not covered in this work.
  • When running on RISC-V devices, opt_level=2 is used. Higher optimization levels cause model compilation errors.
  • The current Auto-scheduler implementation works with serious limitations, so in this work Auto-scheduler is used only to optimize the logistic regression.
  • Using MetaScheduler for optimization leads to critical errors related to the computation graph and to compilation errors.

2. Training deep learning models

The models are trained on the x86 architecture using the PyTorch library.

2.1 Installing dependencies for model training

2.1.1 Installing Apache TVM

LLVM must be installed and Apache TVM built from source, in the same way as in the previous practical work. The corresponding command sequence is given below.

sudo apt install clang-17 llvm-17*
git clone --recursive https://github.com/apache/tvm
cd tvm

mkdir build
cd build

cmake -DUSE_LLVM=ON ..
make

2.1.2 Setting up the Python environment

From here on we assume that Miniconda is installed on the x86 node. We create a virtual environment for preparing the test models. The models are trained with PyTorch on the MNIST dataset, so two packages are required: torch, which provides the functionality for training and testing neural networks, and torchvision, which contains helper functions, in particular for loading well-known datasets. An approximate command sequence for creating and configuring the environment is given below.

conda create -n torch_train python==3.10
conda activate torch_train

pip install numpy matplotlib torchmetrics
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu118
pip install notebook

2.2 Training the models

Before training, activate the virtual environment created in the previous step and set the path to Apache TVM.

conda activate torch_train
export PYTHONPATH=<PATH TO TVM>/python:${PYTHONPATH}

The training procedure is implemented in 04_train_model_x86.ipynb. PyTorch's facilities for training models were covered in more detail in the second practical work. Run this notebook. When it finishes, the architecture and weights of the trained neural networks are saved to files in the model/ directory, together with a file containing the accuracy of the models. The notebook also saves the test data (the images and their labels) to simplify loading them on RISC-V devices. This data is saved to the data/ directory.

3. Building and installing LLVM and Apache TVM

3.1. Building LLVM

Apache TVM must be built with LLVM. An LLVM version in the range 15 <= LLVM <= 17 is recommended.

3.1.1. Installing with the package manager

sudo apt install clang-17 llvm-17*

3.1.2. Building LLVM from source (version llvmorg-17.0.6)

To build LLVM from source, download the required LLVM version from the GitHub repository (this work uses llvmorg-17.0.6), then generate the makefiles with CMake and run the build. The corresponding command sequence is given below.

git clone https://github.com/llvm/llvm-project.git -b llvmorg-17.0.6
cd llvm-project

mkdir _build
cd _build

cmake -DCMAKE_BUILD_TYPE="Release" \
      -DLLVM_ENABLE_PROJECTS=clang \
      -DBUILD_SHARED_LIBS=True \
      -DLLVM_USE_SPLIT_DWARF=True \
      -DCMAKE_INSTALL_PREFIX="../../_install" ../llvm

make

Note: if LLVM was built from source, then before building Apache TVM you must add the LLVM path to the PATH environment variable and create the LLVM_CONFIG environment variable. An example is shown below.

PATH="<PATH TO LLVM>/_build/bin:$PATH"
export LLVM_CONFIG=<PATH TO LLVM>/_build/bin/llvm-config

3.2. Installing OpenBLAS

Next, install OpenBLAS using the package manager.

sudo apt-get install libopenblas-dev

3.3. Setting up the Python environment

For this practical work, create and configure a Python virtual environment as shown below:

python3 -m venv ~/tvm_cpu/
source ~/tvm_cpu/bin/activate

pip install scipy numpy matplotlib pandas
pip install cloudpickle traitlets typing-extensions psutil pybind11 decorator attrs 
pip install notebook

3.4. Building Apache TVM

To build Apache TVM we use the main branch of the GitHub repository, because critically important fixes 1 and 2 were merged recently. Building Apache TVM does not require the Python virtual environment created above.

git clone --recursive https://github.com/apache/tvm
cd tvm

mkdir build
cd build

cmake -DUSE_LLVM=ON -DUSE_BLAS=openblas ..
make

3.5. Activating the environment for the practical work

To activate the virtual environment for the tasks of this practical work, run the following commands:

source ~/tvm_cpu/bin/activate
export PYTHONPATH=<PATH TO TVM>/python:${PYTHONPATH}

4. Implementing helper functions

4.1. Importing packages

To use Apache TVM and the auxiliary libraries, import the required packages.

We also define a variable holding the data type of the tensor elements, float32, and set the CPU as the target device.

In [1]:
import os
from time import time

import matplotlib.pyplot as plt


import numpy as np
import tvm

from tvm import autotvm
from tvm import auto_scheduler
from tvm import meta_schedule as ms

from tvm import relay
from tvm.autotvm.tuner import XGBTuner
from tvm.contrib import graph_executor


dtype = 'float32'
dev = tvm.cpu()

global_trial = 96

4.2. Compilation target string

This step defines the compilation target string, target. In this work the neural networks are compiled on an x86-64 device. The selection logic below also recognizes RISC-V, and running on x86_64 is supported to simplify testing and debugging.

To determine the device architecture, create the default LLVM target object, tvm.target.Target('llvm'), and then choose the target string based on its mtriple attribute:

  • If the mtriple attribute is missing or contains the substring x86_64, the standard target string llvm is used.
  • If mtriple contains the substring riscv64, the target string llvm -jit=orcjit -mtriple=riscv64-unknown-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d is used.
  • Otherwise an exception is raised.

Note 1: if TVM was installed from PyPI, mtriple is empty.

Note 2: TVM supports various backends, such as llvm, opencl, cuda, and others. In this case, to generate machine code, TIR is translated into LLVM IR, and machine code is then generated from the LLVM IR. The parameters of the target string are briefly described below.

  • -jit=orcjit selects the ORC (On-Request Compilation) JIT compiler. TVM requires this flag when compiling on RISC-V.

  • -mtriple=riscv64-unknown-linux-gnu defines the target triple. It denotes a 64-bit RISC-V platform running Linux with an unspecified vendor.

  • -mcpu=generic-rv64 specifies the target processor type.

  • -mabi=lp64d defines the ABI (Application Binary Interface) in use. lp64d denotes an ABI in which long integers (long) and pointers are 64 bits wide and double-precision floating-point support (d) is enabled.

  • -mattr=+64bit,+m,+a,+f,+d sets the attributes of the target architecture.

    • +64bit - 64-bit architecture support.
    • +m - integer multiplication and division support.
    • +a - atomic operations support.
    • +f - single-precision floating-point support.
    • +d - double-precision floating-point support.
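
To see how the options listed above fit together, the following sketch splits the RISC-V target string into a dictionary. The helper name parse_target_options is hypothetical and is not part of Apache TVM; the sketch assumes that every option has the form -key=value, as in the string above.

```python
# Hypothetical helper (not a TVM API): split a target string into its options.
RISCV_TARGET = (
    "llvm -jit=orcjit -mtriple=riscv64-unknown-linux-gnu "
    "-mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d"
)


def parse_target_options(target: str) -> tuple[str, dict[str, str]]:
    """Return the backend kind and a {option: value} mapping."""
    kind, *flags = target.split()
    options = {}
    for flag in flags:
        # Every flag is assumed to look like "-key=value".
        key, _, value = flag.lstrip("-").partition("=")
        options[key] = value
    return kind, options


kind, options = parse_target_options(RISCV_TARGET)
print(kind)                         # backend name
print(options["mtriple"])           # target triple
print(options["mattr"].split(","))  # individual ISA attributes
```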

4.3. Graph optimization level

On RISC-V devices opt_level=2 is used; on the x86-64 architecture opt_level=3 is used.

In [2]:
def is_x86():
    if tvm.target.Target('llvm').attrs.get('mtriple') is None:
        return True
    return 'x86_64' in tvm.target.Target('llvm').attrs.get('mtriple')

def is_riscv():
    return 'riscv64' in tvm.target.Target('llvm').attrs.get('mtriple')
    

print(f"Device mtriple: {tvm.target.Target('llvm').attrs.get('mtriple')}")

if is_x86():
    target = tvm.target.Target('llvm')
    opt_level = 3
elif is_riscv():
    target = tvm.target.Target(
        'llvm -jit=orcjit -mtriple=riscv64-unknown-linux-gnu '
        '-mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d'
    )
    opt_level = 2
else:
    raise ValueError("Unsupported architecture")

print(f'{target = }')
Device mtriple: x86_64-pc-linux-gnu
target = llvm -keys=cpu -mtriple=x86_64-pc-linux-gnu

4.4. Helper functions

We implement the function load_model to load a model in TVM format and the function load_images_and_labels to load the images and labels of the MNIST dataset.

In [3]:
def load_model(mod_file, params_file):
    with open(mod_file, "r") as fo:
        mod = fo.read()
        
    mod = tvm.ir.load_json(mod)

    with open(params_file, "rb") as fo:
        params = relay.load_param_dict(fo.read())
    
    return mod, params

def load_images_and_labels(images_path, labels_path):
    images = np.load(images_path)
    labels = np.load(labels_path)
    
    return images, labels

Next, we implement the function timeit_inference to measure the inference time and the function get_accuracy to evaluate the quality of the solution.

  1. Function timeit_inference. The inference time is measured on the MNIST dataset, with inference run separately for each image. The function returns the predictions (the index of the class with the highest confidence) and the execution times.
  2. Function get_accuracy. The quality of the solution is evaluated over the whole MNIST dataset by comparing the predictions with the labels. Accuracy is the ratio of the number of images whose predicted class matches the labeled class to the total number of images in the dataset.
In [4]:
def timeit_inference(mod, lib, images):
    input_name = mod['main'].params[0].name_hint
    input_shape = mod['main'].params[0].type_annotation.shape
    input_shape = [int(s) for s in input_shape]

    dev = tvm.cpu()
    module = graph_executor.GraphModule(lib["default"](dev))
    
    predict = []
    times = []
    for i in range(len(images)):
        img = np.array(images[i:i+1], dtype=np.float32).reshape(input_shape)
        module.set_input(input_name, img)

        ts = time()
        module.run()
        tf = time()
        times.append((tf - ts) * 1000)
        
        output = module.get_output(0).numpy()
        predict.append(np.argmax(output))
        
    return np.array(predict), np.array(times)
    
def get_accuracy(labels, predict):
    return np.mean(labels == predict)
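
As a quick check of the accuracy definition, the sketch below applies the same argmax and comparison steps to a few made-up score vectors. The values are illustrative only and do not come from MNIST.

```python
import numpy as np

# Toy data (made up for illustration): per-class scores for four images.
scores = np.array([
    [0.1, 0.7, 0.2],  # highest score at class 1
    [0.8, 0.1, 0.1],  # highest score at class 0
    [0.3, 0.3, 0.4],  # highest score at class 2
    [0.6, 0.3, 0.1],  # highest score at class 0
])
labels = np.array([1, 0, 2, 1])

predict = np.argmax(scores, axis=1)    # predicted class per image
accuracy = np.mean(labels == predict)  # fraction of matches, as in get_accuracy
print(predict, accuracy)               # three of four predictions match: 0.75
```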

At this point, load the images, the labels, and the accuracy of the neural networks obtained after training on the x86-64 system. In the rest of the work, the accuracy of each network should be compared against these loaded values.

In [5]:
images, labels = load_images_and_labels('data/test_images.npy', 'data/test_labels.npy')

metric = np.load('model/metric.npy', allow_pickle=True).item()
print(metric)
{'logreg': array(0.9264, dtype=float32), 'fcnn': array(0.9804, dtype=float32), 'cnn': array(0.985, dtype=float32)}
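
The load call above relies on how NumPy stores a Python dictionary. The sketch below shows the round trip under the assumption that the training notebook, which is not shown here, writes the dictionary with np.save. The accuracy values are made up.

```python
import os
import tempfile

import numpy as np

# Made-up accuracy values, for illustration only.
metric = {"logreg": 0.92, "fcnn": 0.98}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metric.npy")
    # np.save wraps the dict in a 0-dimensional object array and pickles it.
    np.save(path, metric)
    # Object arrays can only be read back with allow_pickle=True;
    # .item() unwraps the 0-dimensional array into the original dict.
    loaded = np.load(path, allow_pickle=True).item()

print(loaded)
```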

5. General information on automatic layer optimization methods in Apache TVM

The automatic optimization methods in Apache TVM share a similar interface. First, tasks are extracted, where a task is a layer or a subgraph of the neural network. Each task is then optimized. The optimization results are logged either to a file or to a separate directory.

The key parameter of the optimization process is the number of optimization iterations per task. Choosing this parameter is nontrivial:

  • If the value is too small, an efficient solution may not be found.
  • If the value is too large, the optimization takes a long time.
  • The optimal number of iterations depends on the neural network, the target device, and the optimization method.

At each iteration, the candidate implementation is measured several times. The measurement parameters are set through autotvm.measure_option (with its runner, autotvm.LocalRunner), auto_scheduler.LocalRunner, and ms.runner.LocalRunner, which provide an interface for specifying the number of performance measurements:

  • number - the number of code runs averaged within a single measurement.
  • repeat - the number of measurements. In total, (1 + number x repeat) runs are performed, where the first run is a warm-up and is discarded.
  • enable_cpu_cache_flush flushes the CPU cache between consecutive measurements for a more accurate latency estimate.

Thus, the larger the value of number x repeat, the more accurate the execution time estimate of the schedules, but the longer the automatic optimization takes.

Note: the pseudocode below shows how Apache TVM measures execution time.

for r in range(repeat):
    time_start = now()
    for n in range(number):
        func_name()
    time_end = now()
    total_times.append((time_end - time_start) / number)
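
The pseudocode can be turned into a small runnable sketch in plain Python. This is an illustration of the scheme only, not TVM's actual implementation; the names measure and workload are hypothetical, and the workload is a stand-in for a compiled kernel.

```python
from time import perf_counter


def measure(func, number=3, repeat=1):
    """Return one averaged time (in seconds) per repeat."""
    func()  # warm-up run, not included in the results
    results = []
    for _ in range(repeat):
        start = perf_counter()
        for _ in range(number):
            func()
        end = perf_counter()
        results.append((end - start) / number)
    return results


counter = {"calls": 0}


def workload():
    """Stand-in for a compiled kernel; it also counts its invocations."""
    counter["calls"] += 1
    sum(range(1000))


times = measure(workload, number=3, repeat=2)
print(len(times), counter["calls"])  # 2 timings from 1 + 3 * 2 = 7 calls
```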

6. Running and optimizing the logistic regression model

6.1. Compiling and running the model

First, load the logistic regression model.

In [6]:
default_logreg_time, autotvm_logreg_time, autoscheduler_logreg_time, ms_logreg_time = 0, 0, 0, 0

mod, params = load_model('model/logreg.json', 'model/logreg.params')
print(mod['main'])
fn (%input0: Tensor[(1, 784), float32] /* span=aten::linear_0.input0:0:0 */, %aten::linear_0.weight: Tensor[(10, 784), float32] /* span=aten::linear_0.weight:0:0 */, %aten::linear_0.bias: Tensor[(10), float32] /* span=aten::linear_0.bias:0:0 */) {
  %0 = nn.dense(%input0, %aten::linear_0.weight, units=None) /* span=aten::linear_0:0:0 */;
  nn.bias_add(%0, %aten::linear_0.bias, axis=-1) /* span=aten::linear_0:0:0 */
}

The next step is to compile the model without layer optimization.

In [7]:
with tvm.transform.PassContext(opt_level=opt_level):
    lib = relay.build(mod, target=target, params=params)
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.

After compilation, run inference and measure its time with the timeit_inference function, evaluate the quality of the logistic regression with get_accuracy, and compare the resulting classification accuracy with the loaded value obtained on x86-64.

In [8]:
default_logreg_predict, default_logreg_times = timeit_inference(mod, lib, images)

default_logreg_accuracy = get_accuracy(labels, default_logreg_predict)
assert np.allclose(metric['logreg'], default_logreg_accuracy, rtol=1e-5)

default_logreg_time = np.median(default_logreg_times)
print(f'Median run time of the unoptimized model: {default_logreg_time:.4f} ms')
Median run time of the unoptimized model: 0.0107 ms

6.2. Using AutoTVM

We define the function get_autotvm_task, which extracts the tasks and prints information about them (the task number and task.workload). It uses autotvm.task.extract_from_program, passing the model, the target, and the trained model parameters. Two kinds of tasks appear here: fully connected layers without weight transformation and with a weight transformation that improves memory access.

For both x86 and RISC-V the tasks are named dense_*.x86. Apache TVM currently has no schedule implementations for RISC-V. Because Apache TVM relies on LLVM during compilation and the architectures are similar, the schedules developed for x86 platforms are successfully reused on RISC-V devices.

In [9]:
def get_autotvm_task(
    mod: tvm.ir.module.IRModule, 
    target: tvm.target.target.Target, 
    params: tvm.ir.container.Map
) -> list[tvm.autotvm.task.task.Task]:
    """
    Parameters:
        mod: IRModule of the model.
        target: Compilation target.
        params: Neural network weights.
    
    Returns:
        List of tasks.
    """
    print("Extracting tasks\n")

    tasks = autotvm.task.extract_from_program(
        mod, target=target, params=params,
    )

    for idx, task in enumerate(tasks):
        print(f"Task number: {idx}\nTask info: {task.workload}\n")
        
    return tasks

Call the get_autotvm_task function to extract the AutoTVM tasks from the computation graph.

In [10]:
tasks = get_autotvm_task(mod, target, params)
Extracting tasks

Task number: 0
Task info: ('dense_nopack.x86', ('TENSOR', (1, 784), 'float32'), ('TENSOR', (10, 784), 'float32'), None, 'float32')

Task number: 1
Task info: ('dense_pack.x86', ('TENSOR', (1, 784), 'float32'), ('TENSOR', (10, 784), 'float32'), None, 'float32')

After the tasks are extracted, each of them is optimized. For this we implement the function tune_autotvm, which sets the optimization parameters and starts the optimization.

First, define the parameters for measuring the execution time of each schedule using autotvm.measure_option and autotvm.LocalRunner. Then define a cost model for each task. The cost model that estimates the layer execution time is gradient-boosted trees implemented with XGBoost. Apache TVM provides an interface to several optimization methods; here we create an XGBTuner object for each task.

Each task is tuned for min(n_trial, len(task.config_space)) trials, where n_trial is the requested number of trials and len(task.config_space) is the number of distinct configurations of the schedule for the given tensor expression.
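
The sketch below illustrates how a small configuration space can cap the number of trials. It rests on an assumption that is not taken from this document: the schedule exposes one two-way split knob for each dimension of the dense workload (1, 10, and 784). The knob names are hypothetical labels. Under that assumption, the number of choices per knob equals the number of divisors of the dimension.

```python
def count_divisors(n: int) -> int:
    """Number of ways to split extent n into (outer, inner) with outer * inner == n."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)


# Assumed knobs: one two-way split per dimension of the dense workload
# (batch = 1, output units = 10, input features = 784).
extents = {"tile_y": 1, "tile_x": 10, "tile_k": 784}

space_size = 1
for extent in extents.values():
    space_size *= count_divisors(extent)

n_trial = 96
n = min(n_trial, space_size)
print(space_size, n)  # 1 * 4 * 15 = 60 configurations, so 60 trials
```

With these assumed knobs the space is smaller than the requested 96 trials, so the tuner would stop once every configuration has been measured.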

Once all parameters are defined, start the optimization by calling tuner_obj.tune with the number of trials for the task, the measure_option object, and the log file name, which is supplied through autotvm.callback.log_to_file(log_file).

In [11]:
def tune_autotvm(
    tasks: list[tvm.autotvm.task.task.Task], 
    n_trial: int, 
    log_file: str
):
    """
    Parameters:
        tasks: List of tasks.
        n_trial: Number of trials for each task.
        log_file: File for logging the optimization results.
    """    
    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(repeat=1, number=3, enable_cpu_cache_flush=True),
    )

    for i, task in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
        tuner_obj = XGBTuner(task)

        n = min(n_trial, len(task.config_space))

        tuner_obj.tune(
            n_trial=n,
            measure_option=measure_option,
            callbacks=[
                # The bar total is n_trial, so a task with a smaller config space stops below it.
                autotvm.callback.progress_bar(n_trial, prefix=prefix),
                autotvm.callback.log_to_file(log_file),
            ],
        )

To run the AutoTVM optimization, define the log file log_file for the optimization results, set the number of trials, and call the tune_autotvm function.

In [12]:
os.makedirs('autotvm/', exist_ok=True)
log_file = 'autotvm/autotvm_logreg.log'
n_trial = global_trial

tune_autotvm(tasks, n_trial, log_file)
[Task  1/ 2]  Current/Best:    0.64/  16.04 GFLOPS | Progress: (60/96) | 26.38 s Done.
[Task  2/ 2]  Current/Best:    3.41/  14.86 GFLOPS | Progress: (96/96) | 34.58 s Done.

Before using the optimized model, compile it with the optimization history saved in log_file.

In [13]:
with autotvm.apply_history_best(log_file):
    with tvm.transform.PassContext(opt_level=opt_level):
        lib = relay.build(mod, target=target, params=params)

Now measure the inference time with timeit_inference, check the quality of the optimized model with get_accuracy, and compare the classification accuracy with the reference value obtained after training the model.

In [14]:
autotvm_logreg_predict, autotvm_logreg_times = timeit_inference(mod, lib, images)

autotvm_logreg_accuracy = get_accuracy(labels, autotvm_logreg_predict)
assert np.allclose(metric['logreg'], autotvm_logreg_accuracy, rtol=1e-5)

autotvm_logreg_time = np.median(autotvm_logreg_times)
print(f'Median run time after layer optimization with AutoTVM: {autotvm_logreg_time:.4f} ms')
Median run time after layer optimization with AutoTVM: 0.0033 ms
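
The effect of the tuning can be expressed as a speedup ratio. The sketch below uses the two median latencies printed in this section; the variable names are hypothetical, and the numbers describe this particular run only.

```python
# Median latencies in milliseconds, copied from the outputs in this section.
default_ms = 0.0107
autotvm_ms = 0.0033

speedup = default_ms / autotvm_ms
print(f"Speedup from AutoTVM: {speedup:.2f}x")
```

Both values are only a few microseconds, so timer overhead may account for a noticeable share of each measurement. The ratio is best read as indicative.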

6.3. Using Auto-scheduler

We define the function get_auto_scheduler_task, which extracts the tasks and prints information about them (the task number and task.desc).

As with AutoTVM, the tasks are extracted first, using auto_scheduler.extract_tasks with the model, the inference target, and the trained model parameters. Auto-scheduler also lets you set the graph optimization level through the opt_level parameter. This value must match the graph optimization level used when compiling the model. Note that here the computation graph consists of a single layer, so no layer fusion takes place. The method also returns task_weights, which gives the weight of each subgraph. The default weight is $1$. If there are $N$ identical subgraphs, they are represented by a single task with weight $N$.
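
The weighting rule can be illustrated without TVM. In the sketch below the subgraph names are hypothetical and the grouping is done with a plain counter; it mirrors the rule described above rather than the internals of extract_tasks.

```python
from collections import Counter

# Hypothetical subgraph keys from a network with repeated layers.
subgraphs = [
    "dense_784x128",
    "dense_128x128",
    "dense_128x128",
    "dense_128x128",
    "dense_128x10",
]

counts = Counter(subgraphs)
tasks = list(counts)                       # one task per distinct subgraph
task_weights = [counts[t] for t in tasks]  # weight = number of occurrences

for task, weight in zip(tasks, task_weights):
    print(task, weight)
```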

In [15]:
def get_auto_scheduler_task(
    mod: tvm.ir.module.IRModule, 
    target: tvm.target.target.Target, 
    params: tvm.ir.container.Map,
    opt_level: int
) -> tuple[list[tvm.auto_scheduler.search_task.SearchTask], list[int]]:
    """
    Parameters:
        mod: IRModule of the model.
        target: Compilation target.
        params: Neural network weights.
        opt_level: Optimization level of the computation graph.
    
    Returns:
        List of tasks and list of task weights.
    """
    tasks, task_weights = auto_scheduler.extract_tasks(mod, target=target, params=params, opt_level=opt_level)

    for idx, task in enumerate(tasks):
        print(f"Task number: {idx}\nTask info: {task.desc}\n")
        
    return tasks, task_weights

Extract the tasks for Auto-scheduler by calling the get_auto_scheduler_task function.

In [16]:
tasks, task_weights = get_auto_scheduler_task(mod, target, params, opt_level)
Task number: 0
Task info: vm_mod_fused_nn_dense_add

Next, we implement the function tune_auto_scheduler, which sets the optimization parameters and runs the automatic tuning of the neural network.

Here the optimization requires creating an auto_scheduler.TaskScheduler object that describes the tasks and defining the optimization parameters with auto_scheduler.TuningOptions. After that, call the tune method of the auto_scheduler.TaskScheduler object.

Notes:

  1. The optimization parameters include num_measures_per_round. It sets the number of annotated sketch configurations whose time is measured before the results database is updated. After the update, the cost model is retrained and a new iteration of the evolutionary algorithm starts to generate new sketches.
  2. The num_measure_trials parameter in Auto-scheduler sets the total number of measurements across all subgraphs.
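
A short calculation shows how the two parameters relate for the values used in this notebook (global_trial = 96, one extracted task, num_measures_per_round = 8). It is a sketch that assumes every round measures a full batch of candidates.

```python
import math

num_tasks = 1               # the logistic regression yields a single task
n_trial_per_task = 96       # value of global_trial in this notebook
num_measures_per_round = 8

num_measure_trials = n_trial_per_task * num_tasks  # total budget for all tasks
rounds = math.ceil(num_measure_trials / num_measures_per_round)
print(num_measure_trials, rounds)  # 96 measurements in 12 rounds
```
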
In [17]:
def tune_auto_scheduler(
    tasks: list[tvm.auto_scheduler.search_task.SearchTask], 
    task_weights: list[int], 
    log_file: str, 
    n_trials: int
):
    """
    Parameters:
        tasks: List of tasks.
        task_weights: List of task weights.
        log_file: File for logging the optimization results.
        n_trials: Total number of measurement trials across all tasks.
    """
    tuner = auto_scheduler.TaskScheduler(tasks, task_weights, strategy='round-robin')
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=n_trials,
        num_measures_per_round=8,
        runner=auto_scheduler.LocalRunner(repeat=1, number=3, enable_cpu_cache_flush=True),
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
        verbose=1,
    )

    tuner.tune(tune_option)

Now the Auto-scheduler optimization can be started. Define the log file log_file for the optimization results, set the total number of trials to N * len(tasks), where N is the number of trials per task, and start the optimization by calling the tune_auto_scheduler function.

In [18]:
os.makedirs('auto_schedule/', exist_ok=True)
log_file = 'auto_schedule/auto-schedule_logreg.log'
n_trial_per_task = global_trial

tune_auto_scheduler(tasks, task_weights, log_file, n_trial_per_task * len(tasks))
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                     vm_mod_fused_nn_dense_add |            - |              - |      0 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: - ms	Trials: 0	Used time : 0 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Generate Sketches		#s: 5
Sample Initial Population	#s: 905	fail_ct: 694	Time elapsed: 0.81
GA Iter: 0	Max score: 0.9990	Min score: 0.9849	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 1.0000	Min score: 0.9986	#Pop: 16	#M+: 1383	#M-: 71
EvolutionarySearch		#s: 16	Time elapsed: 3.28
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
....E.E.E.E.E***
Time elapsed for measurement: 2.38 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.03 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                     vm_mod_fused_nn_dense_add |        0.008 |           1.88 |      8 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.008 ms	Trials: 8	Used time : 7 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 953	fail_ct: 701	Time elapsed: 0.76
GA Iter: 0	Max score: 0.9981	Min score: 0.9893	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9998	Min score: 0.9980	#Pop: 16	#M+: 1380	#M-: 78
EvolutionarySearch		#s: 16	Time elapsed: 3.38
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
...E..E.E.E.E***
Time elapsed for measurement: 2.44 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.03 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                     vm_mod_fused_nn_dense_add |        0.008 |           2.00 |     16 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.008 ms	Trials: 16	Used time : 13 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 933	fail_ct: 666	Time elapsed: 0.75
GA Iter: 0	Max score: 0.9854	Min score: 0.9375	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9854	Min score: 0.9375	#Pop: 16	#M+: 1376	#M-: 69
EvolutionarySearch		#s: 16	Time elapsed: 3.50
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.45 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                     vm_mod_fused_nn_dense_add |        0.008 |           2.00 |     24 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.008 ms	Trials: 24	Used time : 20 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 936	fail_ct: 666	Time elapsed: 0.76
GA Iter: 0	Max score: 1.0220	Min score: 0.9424	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 1.0220	Min score: 0.9872	#Pop: 16	#M+: 1384	#M-: 76
EvolutionarySearch		#s: 16	Time elapsed: 3.51
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.50 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                     vm_mod_fused_nn_dense_add |        0.008 |           2.00 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.008 ms	Trials: 32	Used time : 27 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 915	fail_ct: 673	Time elapsed: 0.76
GA Iter: 0	Max score: 0.9778	Min score: 0.9376	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9993	Min score: 0.9681	#Pop: 16	#M+: 1376	#M-: 76
EvolutionarySearch		#s: 16	Time elapsed: 3.52
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.45 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                     vm_mod_fused_nn_dense_add |        0.007 |           2.10 |     40 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.007 ms	Trials: 40	Used time : 34 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 933	fail_ct: 713	Time elapsed: 0.76
GA Iter: 0	Max score: 0.9421	Min score: 0.8828	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9439	Min score: 0.9176	#Pop: 16	#M+: 1386	#M-: 82
EvolutionarySearch		#s: 16	Time elapsed: 3.52
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.43 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                     vm_mod_fused_nn_dense_add |        0.007 |           2.10 |     48 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.007 ms	Trials: 48	Used time : 40 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 925	fail_ct: 700	Time elapsed: 0.76
GA Iter: 0	Max score: 0.9296	Min score: 0.8383	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9296	Min score: 0.9032	#Pop: 16	#M+: 1383	#M-: 77
EvolutionarySearch		#s: 16	Time elapsed: 3.49
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.48 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                     vm_mod_fused_nn_dense_add |        0.007 |           2.10 |     56 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.007 ms	Trials: 56	Used time : 47 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 951	fail_ct: 693	Time elapsed: 0.76
GA Iter: 0	Max score: 0.9323	Min score: 0.8621	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9323	Min score: 0.8974	#Pop: 16	#M+: 1383	#M-: 76
EvolutionarySearch		#s: 16	Time elapsed: 3.50
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.43 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                     vm_mod_fused_nn_dense_add |        0.005 |           3.18 |     64 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.005 ms	Trials: 64	Used time : 54 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 927	fail_ct: 680	Time elapsed: 0.75
GA Iter: 0	Max score: 0.6080	Min score: 0.5780	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9661	Min score: 0.8379	#Pop: 16	#M+: 1378	#M-: 74
EvolutionarySearch		#s: 16	Time elapsed: 3.46
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.38 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                     vm_mod_fused_nn_dense_add |        0.005 |           3.18 |     72 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.005 ms	Trials: 72	Used time : 60 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 922	fail_ct: 696	Time elapsed: 0.75
GA Iter: 0	Max score: 0.6066	Min score: 0.5840	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9249	Min score: 0.7680	#Pop: 16	#M+: 1385	#M-: 74
EvolutionarySearch		#s: 16	Time elapsed: 3.45
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.54 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                     vm_mod_fused_nn_dense_add |        0.005 |           3.22 |     80 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.005 ms	Trials: 80	Used time : 67 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 942	fail_ct: 669	Time elapsed: 0.78
GA Iter: 0	Max score: 0.6339	Min score: 0.5862	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9170	Min score: 0.7797	#Pop: 16	#M+: 1374	#M-: 65
EvolutionarySearch		#s: 16	Time elapsed: 3.48
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.46 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |                                     vm_mod_fused_nn_dense_add |        0.005 |           3.22 |     88 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.005 ms	Trials: 88	Used time : 74 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 918	fail_ct: 734	Time elapsed: 0.77
GA Iter: 0	Max score: 0.6298	Min score: 0.6084	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9234	Min score: 0.7981	#Pop: 16	#M+: 1387	#M-: 61
EvolutionarySearch		#s: 16	Time elapsed: 3.48
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.53 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
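
The marker line printed in each [ Measure ] block of the log above (for example `...E..E.E.E.E***`) can be decoded programmatically. The sketch below assumes the convention that Auto-scheduler's local builder and runner appear to follow: one `.` per build attempt, one `*` per run attempt, and an `E` (error) or `T` (timeout) attached to the attempt printed just before it. This convention is our reading and has not been checked against every TVM version; the function name `decode_measure_markers` is hypothetical and not part of TVM.

```python
def decode_measure_markers(line: str) -> dict[str, int]:
    """Count build/run attempts and failures in an Auto-scheduler marker line.

    Assumed convention: '.' = build attempt, '*' = run attempt,
    'E' / 'T' = error / timeout of the attempt printed just before it.
    """
    counts = {
        "build_attempts": 0,
        "build_failures": 0,
        "run_attempts": 0,
        "run_failures": 0,
    }
    stage = None  # stage of the most recent attempt: "build" or "run"
    for ch in line.strip():
        if ch == ".":
            stage = "build"
            counts["build_attempts"] += 1
        elif ch == "*":
            stage = "run"
            counts["run_attempts"] += 1
        elif ch in "ET" and stage is not None:
            counts[f"{stage}_failures"] += 1
    return counts


first_round = decode_measure_markers("...E..E.E.E.E***")
later_round = decode_measure_markers("........********")
print(first_round)
print(later_round)
```

Under this reading, the first marker line corresponds to 8 build attempts of which 5 failed, so only 3 candidates were run, while the later rounds built and ran all 8 candidates.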

Once tuning completes, the model must be compiled with the tuning history applied.

In [19]:
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(
        opt_level=opt_level, config={"relay.backend.use_auto_scheduler": True},
    ):
        lib = relay.build(mod, target=target, params=params)

The compiled model can then be timed with the timeit_inference function and evaluated with get_accuracy. Correctness is verified by comparing the resulting accuracy with the reference value.

In [20]:
autoscheduler_logreg_predict, autoscheduler_logreg_times = timeit_inference(mod, lib, images)

autoscheduler_logreg_accuracy = get_accuracy(labels, autoscheduler_logreg_predict)
assert np.allclose(metric['logreg'], autoscheduler_logreg_accuracy, rtol=1e-5)

autoscheduler_logreg_time = np.median(autoscheduler_logreg_times)
print(f'Median inference time after layer optimization with Auto-scheduler: {autoscheduler_logreg_time:.4f} ms')
Median inference time after layer optimization with Auto-scheduler: 0.0062 ms
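
One reason to report the median rather than the mean is that a single disturbed run (for instance, one interrupted by background activity) can shift the mean noticeably while leaving the median almost unchanged. The illustration below uses the standard-library `statistics` module as a stand-in for NumPy; the timings are made up for the example and are not measurements from this work.

```python
import statistics

# Hypothetical per-run latencies in ms; the last value imitates a run
# disturbed by background activity.
times_ms = [0.0061, 0.0062, 0.0062, 0.0063, 0.0500]

mean_ms = statistics.mean(times_ms)
median_ms = statistics.median(times_ms)

print(f"mean:   {mean_ms:.4f} ms")
print(f"median: {median_ms:.4f} ms")
```

Here the outlier more than doubles the mean, while the median stays at the typical value.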

6.4. Using MetaScheduler¶

MetaScheduler requires the number of cores to be specified in the target string, for example -num-cores 4. This parameter can be set to the number of physical cores on the device. We update the source code accordingly.

In [21]:
print(f"Device mtriple: {tvm.target.Target('llvm').attrs.get('mtriple')}")

if is_x86():
    target = tvm.target.Target('llvm -num-cores 6')
elif is_riscv():
    target = tvm.target.Target(
        'llvm -jit=orcjit -mtriple=riscv64-unknown-linux-gnu '
        '-mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d -num-cores 4'
    )
else:
    raise ValueError("Unsupported architecture")

print(f'{target = }')
Device mtriple: x86_64-pc-linux-gnu
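
The core count in the target string does not have to be hard-coded. The sketch below is one possible way to derive it: it assumes the optional third-party package `psutil` may be installed to report physical cores and otherwise falls back to `os.cpu_count()`, which reports logical cores and may therefore overestimate on CPUs with simultaneous multithreading. The helper names `detect_core_count` and `make_llvm_target_string` are hypothetical and not part of TVM.

```python
import os


def detect_core_count() -> int:
    """Return the number of physical cores if psutil is installed,
    otherwise the number of logical cores reported by the OS."""
    try:
        import psutil  # optional third-party dependency (assumption)

        physical = psutil.cpu_count(logical=False)
        if physical:
            return physical
    except ImportError:
        pass
    return os.cpu_count() or 1


def make_llvm_target_string(num_cores: int, extra: str = "") -> str:
    """Build an LLVM target string with an explicit -num-cores option."""
    if num_cores < 1:
        raise ValueError("num_cores must be positive")
    parts = ["llvm", extra.strip(), f"-num-cores {num_cores}"]
    return " ".join(p for p in parts if p)


target_str = make_llvm_target_string(detect_core_count())
print(target_str)
```

The resulting string could then be passed to tvm.target.Target in place of the literal used in the cell above.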

We define the get_ms_task function to extract tasks and print information about them (the task index and task.task_name). As with the previous optimization methods, tasks are extracted with ms.relay_integration.extract_tasks and ms.relay_integration.extracted_tasks_to_tune_contexts.

In [22]:
def get_ms_task(
    mod: tvm.ir.module.IRModule,
    target: tvm.target.target.Target,
    params: tvm.ir.container.Map,
    opt_level: int,
    work_dir: str
) -> tuple[list[tvm.meta_schedule.tune_context.TuneContext], list[float]]:
    """
    Parameters:
        mod: The IRModule.
        target: Compilation target.
        params: Neural network weights.
        opt_level: Optimization level for the computation graph.
        work_dir: Directory for logging tuning results.

    Returns:
        The list of tasks and the list of task weights.
    """
    extracted_tasks = ms.relay_integration.extract_tasks(
        mod, target=target, params=params, opt_level=opt_level,
    )

    tasks, task_weights = ms.relay_integration.extracted_tasks_to_tune_contexts(
        extracted_tasks, work_dir
    )

    for idx, task in enumerate(tasks):
        print(f"Task index: {idx}\nTask info: {task.task_name}\n")

    return tasks, task_weights

We call get_ms_task, first defining the work_dir directory for logging tuning results.

In [23]:
work_dir = "meta_schedule_logreg"

if is_x86():
    tasks, task_weights = get_ms_task(mod, target, params, opt_level, work_dir)
2024-11-06 18:02:07 [INFO] Logging directory: meta_schedule_logreg/logs
Task index: 0
Task info: fused_nn_dense_add

By analogy with the other methods considered, we implement the tune_ms function to automatically tune the network's inference parameters. The function calls ms.tune.tune_tasks, which takes the set of tasks, their weights, and the tuning parameters.

Note: to set the number of runs used when evaluating a candidate sketch, an ms.runner.LocalRunner object configured with ms.runner.config.EvaluatorConfig is passed to ms.tune.tune_tasks.

In [24]:
def tune_ms(
    tasks: list[tvm.meta_schedule.tune_context.TuneContext],
    task_weights: list[float],
    work_dir: str,
    n_trials: int
):
    """
    Parameters:
        tasks: List of tasks.
        task_weights: List of task weights.
        work_dir: Directory for logging tuning results.
        n_trials: Total number of trials across all tasks (passed as max_trials_global).
    """

    # Create the logging directory if it does not exist yet.
    os.makedirs(work_dir, exist_ok=True)

    ms.tune.tune_tasks(
        tasks=tasks,
        task_weights=task_weights,
        work_dir=work_dir,
        max_trials_global=n_trials,
        num_trials_per_iter=8,
        builder=ms.builder.LocalBuilder(),
        runner=ms.runner.LocalRunner(
            evaluator_config=ms.runner.config.EvaluatorConfig(
                repeat=1, number=3, enable_cpu_cache_flush=True
            )
        ),
    )
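
The roles of repeat and number in the evaluator configuration can be illustrated without TVM. The sketch below reflects our understanding of the timing scheme: each of the `repeat` measurements is the average over `number` back-to-back calls. The initial warm-up call and the name `time_evaluator_sketch` are assumptions made for the illustration, and cache flushing (enable_cpu_cache_flush) is not modelled.

```python
import time
from typing import Callable


def time_evaluator_sketch(
    func: Callable[[], None], number: int, repeat: int
) -> list[float]:
    """Return `repeat` measurements in seconds; each one is the mean
    duration of `number` consecutive calls to `func`."""
    func()  # warm-up call (assumption), excluded from the measurements
    results = []
    for _ in range(repeat):
        start = time.perf_counter()
        for _ in range(number):
            func()
        results.append((time.perf_counter() - start) / number)
    return results


call_log = []  # records every invocation of the workload


def workload() -> None:
    call_log.append(1)
    sum(range(1000))


measurements = time_evaluator_sketch(workload, number=3, repeat=1)
print(f"measurements: {len(measurements)}, workload calls: {len(call_log)}")
```

With repeat=1 and number=3, as in the configuration above, each candidate yields a single latency value averaged over three runs.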

Next, we launch MetaScheduler tuning by calling tune_ms, setting the total number of trials to N * len(tasks), where N is the number of trials per task (n_trial_per_task in the cell below).
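
The trial budget translates into a number of measurement rounds as follows: the global budget is the per-task count multiplied by the number of tasks, and each round sends num_trials_per_iter candidates (8 in tune_ms) to the builder and runner. The sketch below only does this arithmetic; the value 64 is a hypothetical stand-in for global_trial, and the helper name `tuning_rounds` is not part of TVM. The result is an upper bound, since the search may stop earlier.

```python
import math


def tuning_rounds(
    n_trial_per_task: int, n_tasks: int, num_trials_per_iter: int = 8
) -> tuple[int, int]:
    """Return (global trial budget, maximum number of measurement rounds)."""
    max_trials_global = n_trial_per_task * n_tasks
    rounds = math.ceil(max_trials_global / num_trials_per_iter)
    return max_trials_global, rounds


# 64 trials per task is an illustrative value, not the one used in this work.
budget, rounds = tuning_rounds(n_trial_per_task=64, n_tasks=1)
print(f"budget: {budget} trials, at most {rounds} rounds")
```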

In [25]:
n_trial_per_task = global_trial

if is_x86():
    tune_ms(tasks, task_weights, work_dir, n_trial_per_task * len(tasks))
2024-11-06 18:02:15 [INFO] LocalBuilder: max_workers = 12
2024-11-06 18:02:16 [INFO] LocalRunner: max_workers = 1
2024-11-06 18:02:17 [INFO] [task_scheduler.cc:159] Initializing Task #0: "fused_nn_dense_add"
2024-11-06 18:02:17 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |            N/A |          N/A |                   N/A |      0 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 0
Total latency (us): 0

2024-11-06 18:02:17 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add"
2024-11-06 18:02:18 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:02:20 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:02:22 [DEBUG] XGB iter   0: tr-p-rmse: 0.599053	tr-a-peak@32: 0.810757	tr-rmse: 0.332563	tr-rmse: 0.332563
2024-11-06 18:02:22 [DEBUG] XGB iter  25: tr-p-rmse: 0.047430	tr-a-peak@32: 1.000000	tr-rmse: 0.368312	tr-rmse: 0.368312
2024-11-06 18:02:22 [DEBUG] XGB iter  50: tr-p-rmse: 0.047438	tr-a-peak@32: 1.000000	tr-rmse: 0.368301	tr-rmse: 0.368301
2024-11-06 18:02:22 [DEBUG] XGB stopped. Best iteration: [18] tr-p-rmse:0.04720	tr-a-peak@32:1.00000	tr-rmse:0.36865	tr-rmse:0.36865 
2024-11-06 18:02:22 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add"
2024-11-06 18:02:22 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        14.3007 |       1.0971 |                1.0971 |      8 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 8
Total latency (us): 1.09715

2024-11-06 18:02:22 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add"
2024-11-06 18:02:24 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:02:25 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:02:28 [DEBUG] XGB validation: p-rmse: 0.238527	a-peak@32: 0.947573
2024-11-06 18:02:28 [DEBUG] XGB iter   0: tr-p-rmse: 0.616919	tr-a-peak@32: 0.735797	tr-rmse: 0.295998	tr-rmse: 0.295998
2024-11-06 18:02:28 [DEBUG] XGB iter  25: tr-p-rmse: 0.063694	tr-a-peak@32: 1.000000	tr-rmse: 0.322048	tr-rmse: 0.322048
2024-11-06 18:02:28 [DEBUG] XGB iter  50: tr-p-rmse: 0.061487	tr-a-peak@32: 1.000000	tr-rmse: 0.322399	tr-rmse: 0.322399
2024-11-06 18:02:28 [DEBUG] XGB iter  75: tr-p-rmse: 0.061487	tr-a-peak@32: 1.000000	tr-rmse: 0.322399	tr-rmse: 0.322399
2024-11-06 18:02:28 [DEBUG] XGB stopped. Best iteration: [39] tr-p-rmse:0.06149	tr-a-peak@32:1.00000	tr-rmse:0.32240	tr-rmse:0.32240 
2024-11-06 18:02:28 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add"
2024-11-06 18:02:28 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        18.3318 |       0.8559 |                0.8559 |     16 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 16
Total latency (us): 0.85589

2024-11-06 18:02:28 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add"
2024-11-06 18:02:29 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:02:31 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:02:33 [DEBUG] XGB validation: p-rmse: 0.143498	a-peak@32: 0.985518
2024-11-06 18:02:33 [DEBUG] XGB iter   0: tr-p-rmse: 0.558818	tr-a-peak@32: 0.863130	tr-rmse: 0.303502	tr-rmse: 0.303502
2024-11-06 18:02:33 [DEBUG] XGB iter  25: tr-p-rmse: 0.044666	tr-a-peak@32: 1.000000	tr-rmse: 0.339051	tr-rmse: 0.339051
2024-11-06 18:02:33 [DEBUG] XGB iter  50: tr-p-rmse: 0.044602	tr-a-peak@32: 1.000000	tr-rmse: 0.339225	tr-rmse: 0.339225
2024-11-06 18:02:33 [DEBUG] XGB iter  75: tr-p-rmse: 0.044602	tr-a-peak@32: 1.000000	tr-rmse: 0.339225	tr-rmse: 0.339225
2024-11-06 18:02:33 [DEBUG] XGB stopped. Best iteration: [36] tr-p-rmse:0.04460	tr-a-peak@32:1.00000	tr-rmse:0.33922	tr-rmse:0.33922 
2024-11-06 18:02:33 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add"
2024-11-06 18:02:33 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        18.3318 |       0.8559 |                0.8559 |     24 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 24
Total latency (us): 0.85589

2024-11-06 18:02:33 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add"
2024-11-06 18:02:35 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:02:37 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:02:39 [DEBUG] XGB validation: p-rmse: 0.075066	a-peak@32: 1.000000
2024-11-06 18:02:39 [DEBUG] XGB iter   0: tr-p-rmse: 0.552131	tr-a-peak@32: 0.832467	tr-rmse: 0.305779	tr-rmse: 0.305779
2024-11-06 18:02:39 [DEBUG] XGB iter  25: tr-p-rmse: 0.035881	tr-a-peak@32: 1.000000	tr-rmse: 0.339852	tr-rmse: 0.339852
2024-11-06 18:02:39 [DEBUG] XGB iter  50: tr-p-rmse: 0.035777	tr-a-peak@32: 1.000000	tr-rmse: 0.340078	tr-rmse: 0.340078
2024-11-06 18:02:39 [DEBUG] XGB iter  75: tr-p-rmse: 0.035777	tr-a-peak@32: 1.000000	tr-rmse: 0.340078	tr-rmse: 0.340078
2024-11-06 18:02:39 [DEBUG] XGB stopped. Best iteration: [36] tr-p-rmse:0.03578	tr-a-peak@32:1.00000	tr-rmse:0.34008	tr-rmse:0.34008 
2024-11-06 18:02:39 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add"
2024-11-06 18:02:39 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        18.3318 |       0.8559 |                0.8559 |     32 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 32
Total latency (us): 0.85589

2024-11-06 18:02:39 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add"
2024-11-06 18:02:40 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:02:42 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:02:44 [DEBUG] XGB validation: p-rmse: 0.190006	a-peak@32: 0.911841
2024-11-06 18:02:44 [DEBUG] XGB iter   0: tr-p-rmse: 0.540138	tr-a-peak@32: 0.804785	tr-rmse: 0.294483	tr-rmse: 0.294483
2024-11-06 18:02:44 [DEBUG] XGB iter  25: tr-p-rmse: 0.045466	tr-a-peak@32: 1.000000	tr-rmse: 0.331657	tr-rmse: 0.331657
2024-11-06 18:02:44 [DEBUG] XGB iter  50: tr-p-rmse: 0.045316	tr-a-peak@32: 1.000000	tr-rmse: 0.332103	tr-rmse: 0.332103
2024-11-06 18:02:44 [DEBUG] XGB iter  75: tr-p-rmse: 0.045316	tr-a-peak@32: 1.000000	tr-rmse: 0.332103	tr-rmse: 0.332103
2024-11-06 18:02:44 [DEBUG] XGB stopped. Best iteration: [35] tr-p-rmse:0.04532	tr-a-peak@32:1.00000	tr-rmse:0.33210	tr-rmse:0.33210 
2024-11-06 18:02:44 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add"
2024-11-06 18:02:44 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        18.3318 |       0.8559 |                0.8559 |     40 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 40
Total latency (us): 0.85589

2024-11-06 18:02:44 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add"
2024-11-06 18:02:46 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:02:47 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:02:49 [DEBUG] XGB validation: p-rmse: 0.130328	a-peak@32: 1.000000
2024-11-06 18:02:49 [DEBUG] XGB iter   0: tr-p-rmse: 0.552521	tr-a-peak@32: 0.779963	tr-rmse: 0.294587	tr-rmse: 0.294587
2024-11-06 18:02:50 [DEBUG] XGB iter  25: tr-p-rmse: 0.041127	tr-a-peak@32: 1.000000	tr-rmse: 0.329823	tr-rmse: 0.329823
2024-11-06 18:02:50 [DEBUG] XGB iter  50: tr-p-rmse: 0.041115	tr-a-peak@32: 1.000000	tr-rmse: 0.329842	tr-rmse: 0.329842
2024-11-06 18:02:50 [DEBUG] XGB iter  75: tr-p-rmse: 0.041115	tr-a-peak@32: 1.000000	tr-rmse: 0.329842	tr-rmse: 0.329842
2024-11-06 18:02:50 [DEBUG] XGB stopped. Best iteration: [33] tr-p-rmse:0.04111	tr-a-peak@32:1.00000	tr-rmse:0.32984	tr-rmse:0.32984 
2024-11-06 18:02:50 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add"
2024-11-06 18:02:50 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        18.3318 |       0.8559 |                0.8559 |     48 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 48
Total latency (us): 0.85589

2024-11-06 18:02:50 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add"
2024-11-06 18:02:51 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:02:53 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:02:56 [DEBUG] XGB validation: p-rmse: 0.088329	a-peak@32: 1.000000
2024-11-06 18:02:56 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add"
2024-11-06 18:02:56 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        18.3318 |       0.8559 |                0.8559 |     56 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 56
Total latency (us): 0.85589


2024-11-06 18:02:56 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add"
2024-11-06 18:02:57 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:02:59 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:03:01 [DEBUG] XGB validation: p-rmse: 0.203209	a-peak@32: 1.000000
2024-11-06 18:03:01 [DEBUG] XGB iter   0: tr-p-rmse: 0.547631	tr-a-peak@32: 0.705767	tr-rmse: 0.273523	tr-rmse: 0.273523
2024-11-06 18:03:01 [DEBUG] XGB iter  25: tr-p-rmse: 0.039272	tr-a-peak@32: 1.000000	tr-rmse: 0.312109	tr-rmse: 0.312109
2024-11-06 18:03:01 [DEBUG] XGB iter  50: tr-p-rmse: 0.039263	tr-a-peak@32: 1.000000	tr-rmse: 0.312125	tr-rmse: 0.312125
2024-11-06 18:03:01 [DEBUG] XGB iter  75: tr-p-rmse: 0.039263	tr-a-peak@32: 1.000000	tr-rmse: 0.312125	tr-rmse: 0.312125
2024-11-06 18:03:01 [DEBUG] XGB stopped. Best iteration: [31] tr-p-rmse:0.03926	tr-a-peak@32:1.00000	tr-rmse:0.31212	tr-rmse:0.31212 
2024-11-06 18:03:01 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add"
2024-11-06 18:03:01 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        18.3318 |       0.8559 |                0.8559 |     64 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 64
Total latency (us): 0.85589

2024-11-06 18:03:01 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add"
2024-11-06 18:03:03 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:03:04 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:03:06 [DEBUG] XGB validation: p-rmse: 0.057320	a-peak@32: 1.000000
2024-11-06 18:03:06 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add"
2024-11-06 18:03:06 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        18.3318 |       0.8559 |                0.8559 |     72 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 72
Total latency (us): 0.85589


2024-11-06 18:03:06 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add"
2024-11-06 18:03:08 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:03:10 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:03:12 [DEBUG] XGB validation: p-rmse: 0.123057	a-peak@32: 1.000000
2024-11-06 18:03:12 [DEBUG] XGB iter   0: tr-p-rmse: 0.530334	tr-a-peak@32: 1.000000	tr-rmse: 0.274288	tr-rmse: 0.274288
2024-11-06 18:03:12 [DEBUG] XGB iter  25: tr-p-rmse: 0.038213	tr-a-peak@32: 1.000000	tr-rmse: 0.312104	tr-rmse: 0.312104
2024-11-06 18:03:12 [DEBUG] XGB iter  50: tr-p-rmse: 0.038206	tr-a-peak@32: 1.000000	tr-rmse: 0.312117	tr-rmse: 0.312117
2024-11-06 18:03:12 [DEBUG] XGB iter  75: tr-p-rmse: 0.038206	tr-a-peak@32: 1.000000	tr-rmse: 0.312117	tr-rmse: 0.312117
2024-11-06 18:03:12 [DEBUG] XGB stopped. Best iteration: [31] tr-p-rmse:0.03821	tr-a-peak@32:1.00000	tr-rmse:0.31212	tr-rmse:0.31212 
2024-11-06 18:03:12 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add"
2024-11-06 18:03:12 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        18.3318 |       0.8559 |                0.8559 |     80 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 80
Total latency (us): 0.85589

2024-11-06 18:03:12 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add"
2024-11-06 18:03:14 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:03:15 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:03:17 [DEBUG] XGB validation: p-rmse: 0.073607	a-peak@32: 0.950096
2024-11-06 18:03:17 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add"
2024-11-06 18:03:17 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        18.3318 |       0.8559 |                0.8559 |     88 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 88
Total latency (us): 0.85589


2024-11-06 18:03:17 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add"
2024-11-06 18:03:19 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:03:21 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:03:23 [DEBUG] XGB validation: p-rmse: 0.036131	a-peak@32: 1.000000
2024-11-06 18:03:23 [DEBUG] XGB iter   0: tr-p-rmse: 0.536892	tr-a-peak@32: 1.000000	tr-rmse: 0.264214	tr-rmse: 0.264214
2024-11-06 18:03:23 [DEBUG] XGB iter  25: tr-p-rmse: 0.035004	tr-a-peak@32: 1.000000	tr-rmse: 0.298392	tr-rmse: 0.298392
2024-11-06 18:03:23 [DEBUG] XGB iter  50: tr-p-rmse: 0.035002	tr-a-peak@32: 1.000000	tr-rmse: 0.298396	tr-rmse: 0.298396
2024-11-06 18:03:23 [DEBUG] XGB iter  75: tr-p-rmse: 0.035002	tr-a-peak@32: 1.000000	tr-rmse: 0.298396	tr-rmse: 0.298396
2024-11-06 18:03:23 [DEBUG] XGB stopped. Best iteration: [27] tr-p-rmse:0.03500	tr-a-peak@32:1.00000	tr-rmse:0.29839	tr-rmse:0.29839 
2024-11-06 18:03:23 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add"
2024-11-06 18:03:23 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        18.3318 |       0.8559 |                0.8559 |     96 |      
------------------------------------------------------------------------------------------------------------------
Total trials: 96
Total latency (us): 0.85589


2024-11-06 18:03:23 [INFO] [task_scheduler.cc:260] Task #0 has finished. Remaining task(s): 0
2024-11-06 18:03:23 [DEBUG] [task_scheduler.cc:318] 
 ID |               Name |  FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
------------------------------------------------------------------------------------------------------------------
  0 | fused_nn_dense_add | 15690 |      1 |        18.3318 |       0.8559 |                0.8559 |     96 |    Y 
------------------------------------------------------------------------------------------------------------------
Total trials: 96
Total latency (us): 0.85589
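
The Speed (GFLOPS) column in the table above can be reproduced from the FLOP and Latency columns. The sketch below also shows a counting rule that happens to match the reported FLOP value for a 784-to-10 dense layer with bias. The helper names `dense_flop` and `gflops` are hypothetical, and the counting rule is an observation from the numbers in the log, not a statement about how TVM computes them.

```python
def dense_flop(in_features, out_features, bias=True, relu=False):
    """Estimate FLOP for one sample passing through a dense layer (assumed rule)."""
    flop = 2 * in_features * out_features  # one multiply and one add per weight
    if bias:
        flop += out_features               # one add per output element
    if relu:
        flop += out_features               # one max per output element
    return flop


def gflops(flop, latency_us):
    """Throughput in GFLOPS from a FLOP count and a latency in microseconds."""
    return flop / (latency_us * 1e-6) / 1e9


logreg_flop = dense_flop(784, 10)      # matches the 15690 in the FLOP column
speed = gflops(logreg_flop, 0.8559)    # close to the 18.33 in the Speed column
print(logreg_flop, round(speed, 2))
```

The small difference from the logged 18.3318 comes from using the latency rounded to four digits.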

After tuning, the neural network can be compiled with the schedules found during tuning by using the MetaScheduler interface ms.relay_integration.compile_relay.

In [26]:
if is_x86():
    
    database = ms.database.JSONDatabase(
        f"{work_dir}/database_workload.json",
        f"{work_dir}/database_tuning_record.json",
        allow_missing=False
    )

    lib = ms.relay_integration.compile_relay(
        database, mod, target, params,
        opt_level=opt_level,
    )

Finally, we measure inference time with the timeit_inference function, evaluate model quality with get_accuracy, and verify that the optimized model works correctly by comparing the resulting accuracy with the reference value.
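
The measure-and-verify pattern described here can be sketched without TVM. The helpers `median_time_ms` and `accuracy` below are hypothetical stand-ins, not the notebook's timeit_inference and get_accuracy, and the labels, predictions and reference accuracy are made-up illustration data.

```python
import math
import statistics
import time


def median_time_ms(fn, repeat=100):
    """Run fn `repeat` times and return the median wall-clock time in milliseconds."""
    times = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1e3)
    return statistics.median(times)


def accuracy(labels, predictions):
    """Fraction of predictions that equal the corresponding label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Illustration data only: 4 of 5 predictions are correct.
labels = [3, 1, 4, 1, 5]
predictions = [3, 1, 4, 0, 5]
reference_accuracy = 0.8

acc = accuracy(labels, predictions)
# Tuning must not change the predictions, so accuracy has to match the reference.
assert math.isclose(acc, reference_accuracy, rel_tol=1e-5)

elapsed = median_time_ms(lambda: sum(range(1000)))
print(f"accuracy={acc:.2f}, median time={elapsed:.4f} ms")
```

The median is less sensitive than the mean to occasional slow runs caused by other processes on the machine.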

In [27]:
if is_x86():
    
    ms_logreg_predict, ms_logreg_times = timeit_inference(mod, lib, images)

    ms_logreg_accuracy = get_accuracy(labels, ms_logreg_predict)
    assert np.allclose(metric['logreg'], ms_logreg_accuracy, rtol=1e-5)

    ms_logreg_time = np.median(ms_logreg_times)
    print(f'Median inference time after layer optimization with MetaScheduler: {ms_logreg_time:.4f} ms')
Median inference time after layer optimization with MetaScheduler: 0.0036 ms

6.5. Analysis of the results¶

To analyze the results of optimizing the neural network with the different methods, we plot the median inference time for each of them.

In [28]:
fig, ax = plt.subplots()

name = ['No layer\noptimization', 'AutoTVM', 'Auto-scheduler', 'MetaScheduler']
times = [default_logreg_time, autotvm_logreg_time, autoscheduler_logreg_time, ms_logreg_time]
bar_colors = ['tab:blue', 'tab:red', 'tab:green', 'tab:orange']

bars = ax.bar(name, times, label=name, color=bar_colors)

# Annotate each bar with its median time (the values are in milliseconds).
# The label of the unoptimized bar is placed at half of the bar height.
for bar, n, t in zip(bars, name, times):
    h = bar.get_height()
    if n == 'No layer\noptimization': h = h / 2
    if h != 0:
        ax.text(
            bar.get_x() + bar.get_width() / 2,
            h,
            f'{round(t, 4)} ms',
            ha='center',
            va='bottom',
            fontsize=15,
        )

ax.xaxis.label.set_size(40)
ax.set_title('Median inference\ntime (ms)', fontsize=18)
plt.grid()
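[Figure: bar chart of the median inference time of the logistic regression model without layer optimization and after AutoTVM, Auto-scheduler and MetaScheduler]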

Conclusion: layer optimization significantly reduces the inference time of the network.

7. Running and optimizing the fully connected neural network¶

7.1. Compiling and running the model¶

First, load the fully connected neural network model. Note that this model has more layers, so the intermediate and final results of the methods will differ, for example the number and weights of the extracted tasks and the tuning results.

In [29]:
default_fcnn_time, autotvm_fcnn_time, ms_fcnn_time = 0, 0, 0

mod, params = load_model('model/fcnn.json', 'model/fcnn.params')
print(mod['main'])
fn (%input0: Tensor[(1, 784), float32] /* span=aten::linear_0.input0:0:0 */, %aten::linear_0.weight: Tensor[(300, 784), float32] /* span=aten::linear_0.weight:0:0 */, %aten::linear_0.bias: Tensor[(300), float32] /* span=aten::linear_0.bias:0:0 */, %aten::linear_1.weight: Tensor[(300, 300), float32] /* span=aten::linear_1.weight:0:0 */, %aten::linear_1.bias: Tensor[(300), float32] /* span=aten::linear_1.bias:0:0 */, %aten::linear_2.weight: Tensor[(300, 300), float32] /* span=aten::linear_2.weight:0:0 */, %aten::linear_2.bias: Tensor[(300), float32] /* span=aten::linear_2.bias:0:0 */, %aten::linear_3.weight: Tensor[(10, 300), float32] /* span=aten::linear_3.weight:0:0 */, %aten::linear_3.bias: Tensor[(10), float32] /* span=aten::linear_3.bias:0:0 */) {
  %0 = nn.dense(%input0, %aten::linear_0.weight, units=None) /* span=aten::linear_0:0:0 */;
  %1 = nn.bias_add(%0, %aten::linear_0.bias, axis=-1) /* span=aten::linear_0:0:0 */;
  %2 = nn.relu(%1) /* span=aten::relu_0:0:0 */;
  %3 = nn.dense(%2, %aten::linear_1.weight, units=None) /* span=aten::linear_1:0:0 */;
  %4 = nn.bias_add(%3, %aten::linear_1.bias, axis=-1) /* span=aten::linear_1:0:0 */;
  %5 = nn.relu(%4) /* span=aten::relu_1:0:0 */;
  %6 = nn.dense(%5, %aten::linear_2.weight, units=None) /* span=aten::linear_2:0:0 */;
  %7 = nn.bias_add(%6, %aten::linear_2.bias, axis=-1) /* span=aten::linear_2:0:0 */;
  %8 = nn.relu(%7) /* span=aten::relu_2:0:0 */;
  %9 = nn.dense(%8, %aten::linear_3.weight, units=None) /* span=aten::linear_3:0:0 */;
  nn.bias_add(%9, %aten::linear_3.bias, axis=-1) /* span=aten::linear_3:0:0 */
}

The next step is to compile the model without layer optimization.

In [30]:
with tvm.transform.PassContext(opt_level=opt_level):
    lib = relay.build(mod, target=target, params=params)

After compilation, run inference and measure its execution time with the timeit_inference function developed earlier. Then check the quality of the loaded fully connected network with get_accuracy and compare the resulting classification accuracy with the loaded value, which was obtained on x86-64.

In [31]:
default_fcnn_predict, default_fcnn_times = timeit_inference(mod, lib, images)

default_fcnn_accuracy = get_accuracy(labels, default_fcnn_predict)
assert np.allclose(metric['fcnn'], default_fcnn_accuracy, rtol=1e-5)

default_fcnn_time = np.median(default_fcnn_times)
print(f'Median inference time of the unoptimized model: {default_fcnn_time:.4f} ms')
Median inference time of the unoptimized model: 0.0401 ms

7.2. Using the capabilities of AutoTVM¶

Call the get_autotvm_task function developed earlier to extract the AutoTVM tasks from the computation graph.

With 4 layers, one might expect 8 tasks here, but there are only 6: 3 with data transformation (dense_pack) and 3 without it (dense_nopack). Two of the layers (linear_1 and linear_2) have identical parameters, so their tasks do not need to be duplicated. The other layer optimization methods behave the same way.
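
The count of 6 tasks can be reproduced with a short sketch that removes duplicate layer shapes before pairing them with the two dense implementations. This only illustrates the counting argument under the shapes shown in the model printout above; it is not how AutoTVM extracts tasks internally.

```python
# (input shape, weight shape) of each dense layer, taken from the model printout.
dense_layers = [
    ((1, 784), (300, 784)),  # linear_0
    ((1, 300), (300, 300)),  # linear_1
    ((1, 300), (300, 300)),  # linear_2, same shapes as linear_1
    ((1, 300), (10, 300)),   # linear_3
]
implementations = ["dense_nopack.x86", "dense_pack.x86"]

# dict.fromkeys drops repeated shapes while keeping the original order.
unique_workloads = list(dict.fromkeys(dense_layers))

tasks = [
    (impl, data_shape, weight_shape)
    for data_shape, weight_shape in unique_workloads
    for impl in implementations
]

print(len(dense_layers) * len(implementations))  # 8 tasks without deduplication
print(len(tasks))                                # 6 tasks after deduplication
```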

In [32]:
tasks = get_autotvm_task(mod, target, params)
Extracting tasks

Task number: 0
Task info: ('dense_nopack.x86', ('TENSOR', (1, 784), 'float32'), ('TENSOR', (300, 784), 'float32'), None, 'float32')

Task number: 1
Task info: ('dense_pack.x86', ('TENSOR', (1, 784), 'float32'), ('TENSOR', (300, 784), 'float32'), None, 'float32')

Task number: 2
Task info: ('dense_nopack.x86', ('TENSOR', (1, 300), 'float32'), ('TENSOR', (300, 300), 'float32'), None, 'float32')

Task number: 3
Task info: ('dense_pack.x86', ('TENSOR', (1, 300), 'float32'), ('TENSOR', (300, 300), 'float32'), None, 'float32')

Task number: 4
Task info: ('dense_nopack.x86', ('TENSOR', (1, 300), 'float32'), ('TENSOR', (10, 300), 'float32'), None, 'float32')

Task number: 5
Task info: ('dense_pack.x86', ('TENSOR', (1, 300), 'float32'), ('TENSOR', (10, 300), 'float32'), None, 'float32')

To run tuning with AutoTVM, define the log_file file for logging the tuning results, set the number of trials, and then call the tune_autotvm function developed earlier.

In [33]:
log_file = 'autotvm/autotvm_fcnn.log'
n_trial = global_trial

tune_autotvm(tasks, n_trial, log_file)
[Task  2/ 6]  Current/Best:   19.99/  30.68 GFLOPS | Progress: (60/96) | 30.37 s Done.
[Task  2/ 6]  Current/Best:   25.21/  31.85 GFLOPS | Progress: (96/96) | 47.14 s Done.
[Task  3/ 6]  Current/Best:    1.83/  22.75 GFLOPS | Progress: (96/96) | 42.73 s Done.
[Task  4/ 6]  Current/Best:   11.52/  22.01 GFLOPS | Progress: (96/96) | 41.10 s Done.
[Task  5/ 6]  Current/Best:    0.89/  10.69 GFLOPS | Progress: (72/96) | 24.93 s Done.
[Task  6/ 6]  Current/Best:    4.44/  13.13 GFLOPS | Progress: (96/96) | 31.13 s Done.

Before the optimized model can be used, it must be compiled with the tuning history that was saved to log_file.

In [34]:
with autotvm.apply_history_best(log_file):
    with tvm.transform.PassContext(opt_level=opt_level):
        lib = relay.build(mod, target=target, params=params)

At this point we can measure execution time with timeit_inference, check the quality of the optimized model with get_accuracy, and compare the classification accuracy with the reference value obtained when the model was trained.

In [35]:
autotvm_fcnn_predict, autotvm_fcnn_times = timeit_inference(mod, lib, images)

autotvm_fcnn_accuracy = get_accuracy(labels, autotvm_fcnn_predict)
assert np.allclose(metric['fcnn'], autotvm_fcnn_accuracy, rtol=1e-5)

autotvm_fcnn_time = np.median(autotvm_fcnn_times)
print(f'Median inference time after layer optimization with AutoTVM: {autotvm_fcnn_time:.4f} ms')
Median inference time after layer optimization with AutoTVM: 0.0243 ms
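
As a quick worked example, the two medians printed above for the fully connected network (0.0401 ms before tuning and 0.0243 ms after AutoTVM) can be turned into a speedup factor. The variable names are chosen for this sketch only, and the numbers apply to this particular run.

```python
default_ms = 0.0401  # median time of the unoptimized model, from the output above
autotvm_ms = 0.0243  # median time after AutoTVM tuning, from the output above

speedup = default_ms / autotvm_ms        # how many times faster the tuned model is
reduction = 1 - autotvm_ms / default_ms  # fraction of the inference time removed

print(f"speedup: {speedup:.2f}x, time reduced by {reduction:.0%}")
```

For this run the tuned model is roughly 1.65 times faster than the unoptimized one.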

7.3. Applying MetaScheduler¶

Call the get_ms_task function developed earlier, after defining the work_dir directory for logging the tuning results.

In this case the target string already contains the number of threads, so it does not need to be modified.

In [36]:
if is_x86():
    work_dir = "meta_schedule_fcnn"

    tasks, task_weights = get_ms_task(mod, target, params, opt_level, work_dir)
2024-11-06 18:07:51 [INFO] Logging directory: meta_schedule_fcnn/logs
Task number: 0
Task info: fused_nn_dense_add_nn_relu

Task number: 1
Task info: fused_nn_dense_add_nn_relu_1

Task number: 2
Task info: fused_nn_dense_add

Next, run tuning with MetaScheduler by calling tune_ms, setting the total number of trials to N * len(tasks), where N is the number of trials per task.
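
The sketch below works through the trial budget and the quantity the tuning log reports as total latency, which is the sum of each task's latency multiplied by its weight. The value 96 for the per-task trial count is an assumption (it is the trial count seen in the earlier runs). The latencies are copied from one intermediate step of the tuning log that follows.

```python
trials_per_task = 96  # assumed value of global_trial

# (task name, weight, best latency in microseconds) from an intermediate log step.
# Weight 2 means the same workload occurs twice in the network.
tasks = [
    ("fused_nn_dense_add_nn_relu", 1, 6.8956),
    ("fused_nn_dense_add_nn_relu_1", 2, 4.9909),
    ("fused_nn_dense_add", 1, 3.3456),
]

total_trials = trials_per_task * len(tasks)
total_latency_us = sum(weight * latency for _, weight, latency in tasks)

print(total_trials)                # 288
print(round(total_latency_us, 2))  # about 20.22, as in "Total latency (us)"
```

Because the budget is shared by all tasks, the task scheduler is free to spend more trials on some tasks than on others, which is visible in the Trials column of the log.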

In [37]:
n_trial_per_task = global_trial

if is_x86():
    tune_ms(tasks, task_weights, work_dir, n_trial_per_task * len(tasks))
2024-11-06 18:07:51 [INFO] LocalBuilder: max_workers = 12
2024-11-06 18:07:51 [INFO] LocalRunner: max_workers = 1
2024-11-06 18:07:52 [INFO] [task_scheduler.cc:159] Initializing Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:07:52 [INFO] [task_scheduler.cc:159] Initializing Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:07:52 [INFO] [task_scheduler.cc:159] Initializing Task #2: "fused_nn_dense_add"
2024-11-06 18:07:52 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |            N/A |          N/A |                   N/A |      0 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |            N/A |          N/A |                   N/A |      0 |      
  2 |           fused_nn_dense_add |   6010 |      1 |            N/A |          N/A |                   N/A |      0 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 0
Total latency (us): 0

2024-11-06 18:07:52 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:07:54 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:07:56 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:07:57 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:07:59 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:08:01 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:08:03 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #2: "fused_nn_dense_add"
2024-11-06 18:08:04 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:08:06 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:08:08 [DEBUG] XGB iter   0: tr-p-rmse: 0.351465	tr-a-peak@32: 0.991026	tr-rmse: 0.366215	tr-rmse: 0.366215
2024-11-06 18:08:08 [DEBUG] XGB iter  25: tr-p-rmse: 0.040920	tr-a-peak@32: 0.997009	tr-rmse: 0.405606	tr-rmse: 0.405606
2024-11-06 18:08:08 [DEBUG] XGB iter  50: tr-p-rmse: 0.040911	tr-a-peak@32: 0.997009	tr-rmse: 0.405618	tr-rmse: 0.405618
2024-11-06 18:08:08 [DEBUG] XGB iter  75: tr-p-rmse: 0.040911	tr-a-peak@32: 0.997009	tr-rmse: 0.405618	tr-rmse: 0.405618
2024-11-06 18:08:08 [DEBUG] XGB stopped. Best iteration: [34] tr-p-rmse:0.04091	tr-a-peak@32:0.99701	tr-rmse:0.40562	tr-rmse:0.40562 
2024-11-06 18:08:08 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:08:08 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        37.8599 |      12.4406 |               12.4406 |      8 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |            N/A |          N/A |                   N/A |      0 |      
  2 |           fused_nn_dense_add |   6010 |      1 |            N/A |          N/A |                   N/A |      0 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 8
Total latency (us): 12.4406


2024-11-06 18:08:08 [DEBUG] XGB iter   0: tr-p-rmse: 0.307720	tr-a-peak@32: 1.000000	tr-rmse: 0.334265	tr-rmse: 0.334265
2024-11-06 18:08:09 [DEBUG] XGB iter  25: tr-p-rmse: 0.033375	tr-a-peak@32: 1.000000	tr-rmse: 0.386839	tr-rmse: 0.386839
2024-11-06 18:08:09 [DEBUG] XGB iter  50: tr-p-rmse: 0.033347	tr-a-peak@32: 1.000000	tr-rmse: 0.386890	tr-rmse: 0.386890
2024-11-06 18:08:09 [DEBUG] XGB iter  75: tr-p-rmse: 0.033347	tr-a-peak@32: 1.000000	tr-rmse: 0.386890	tr-rmse: 0.386890
2024-11-06 18:08:09 [DEBUG] XGB stopped. Best iteration: [34] tr-p-rmse:0.03335	tr-a-peak@32:1.00000	tr-rmse:0.38689	tr-rmse:0.38689 
2024-11-06 18:08:09 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:08:09 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        37.8599 |      12.4406 |               12.4406 |      8 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        36.1861 |       4.9909 |                9.9817 |      8 |      
  2 |           fused_nn_dense_add |   6010 |      1 |            N/A |          N/A |                   N/A |      0 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 16
Total latency (us): 22.4224

2024-11-06 18:08:09 [DEBUG] XGB iter   0: tr-p-rmse: 0.333048	tr-a-peak@32: 0.997774	tr-rmse: 0.385003	tr-rmse: 0.385003
2024-11-06 18:08:09 [DEBUG] XGB iter  25: tr-p-rmse: 0.028149	tr-a-peak@32: 1.000000	tr-rmse: 0.435107	tr-rmse: 0.435107
2024-11-06 18:08:09 [DEBUG] XGB iter  50: tr-p-rmse: 0.028147	tr-a-peak@32: 1.000000	tr-rmse: 0.435111	tr-rmse: 0.435111
2024-11-06 18:08:09 [DEBUG] XGB iter  75: tr-p-rmse: 0.028147	tr-a-peak@32: 1.000000	tr-rmse: 0.435111	tr-rmse: 0.435111
2024-11-06 18:08:09 [DEBUG] XGB stopped. Best iteration: [28] tr-p-rmse:0.02815	tr-a-peak@32:1.00000	tr-rmse:0.43511	tr-rmse:0.43511 
2024-11-06 18:08:09 [INFO] [task_scheduler.cc:237] [Updated] Task #2: "fused_nn_dense_add"
2024-11-06 18:08:09 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        37.8599 |      12.4406 |               12.4406 |      8 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        36.1861 |       4.9909 |                9.9817 |      8 |      
  2 |           fused_nn_dense_add |   6010 |      1 |         1.7964 |       3.3456 |                3.3456 |      8 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 24
Total latency (us): 25.768

2024-11-06 18:08:09 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:08:10 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:08:12 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:08:14 [DEBUG] XGB validation: p-rmse: 0.610446	a-peak@32: 0.814860
2024-11-06 18:08:14 [DEBUG] XGB iter   0: tr-p-rmse: 0.386757	tr-a-peak@32: 0.968726	tr-rmse: 0.359414	tr-rmse: 0.359414
2024-11-06 18:08:14 [DEBUG] XGB iter  25: tr-p-rmse: 0.045522	tr-a-peak@32: 1.000000	tr-rmse: 0.410471	tr-rmse: 0.410471
2024-11-06 18:08:14 [DEBUG] XGB iter  50: tr-p-rmse: 0.045518	tr-a-peak@32: 1.000000	tr-rmse: 0.410478	tr-rmse: 0.410478
2024-11-06 18:08:14 [DEBUG] XGB iter  75: tr-p-rmse: 0.045518	tr-a-peak@32: 1.000000	tr-rmse: 0.410478	tr-rmse: 0.410478
2024-11-06 18:08:14 [DEBUG] XGB stopped. Best iteration: [29] tr-p-rmse:0.04552	tr-a-peak@32:1.00000	tr-rmse:0.41048	tr-rmse:0.41048 
2024-11-06 18:08:14 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:08:14 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        68.3047 |       6.8956 |                6.8956 |     16 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        36.1861 |       4.9909 |                9.9817 |      8 |      
  2 |           fused_nn_dense_add |   6010 |      1 |         1.7964 |       3.3456 |                3.3456 |      8 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 32
Total latency (us): 20.2229


2024-11-06 18:08:14 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:08:16 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:08:18 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:08:20 [DEBUG] XGB validation: p-rmse: 0.115511	a-peak@32: 1.000000
2024-11-06 18:08:20 [DEBUG] XGB iter   0: tr-p-rmse: 0.379134	tr-a-peak@32: 0.968726	tr-rmse: 0.337476	tr-rmse: 0.337476
2024-11-06 18:08:20 [DEBUG] XGB iter  25: tr-p-rmse: 0.046934	tr-a-peak@32: 1.000000	tr-rmse: 0.392350	tr-rmse: 0.392350
2024-11-06 18:08:21 [DEBUG] XGB iter  50: tr-p-rmse: 0.046930	tr-a-peak@32: 1.000000	tr-rmse: 0.392357	tr-rmse: 0.392357
2024-11-06 18:08:21 [DEBUG] XGB iter  75: tr-p-rmse: 0.046930	tr-a-peak@32: 1.000000	tr-rmse: 0.392357	tr-rmse: 0.392357
2024-11-06 18:08:21 [DEBUG] XGB stopped. Best iteration: [29] tr-p-rmse:0.04693	tr-a-peak@32:1.00000	tr-rmse:0.39236	tr-rmse:0.39236 
2024-11-06 18:08:21 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:08:21 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        68.3047 |       6.8956 |                6.8956 |     16 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        36.1861 |       4.9909 |                9.9817 |     16 |      
  2 |           fused_nn_dense_add |   6010 |      1 |         1.7964 |       3.3456 |                3.3456 |      8 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 40
Total latency (us): 20.2229


2024-11-06 18:08:21 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:08:22 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:08:24 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:08:26 [DEBUG] XGB validation: p-rmse: 0.254907	a-peak@32: 0.979337
2024-11-06 18:08:26 [DEBUG] XGB iter   0: tr-p-rmse: 0.383945	tr-a-peak@32: 0.968726	tr-rmse: 0.335318	tr-rmse: 0.335318
2024-11-06 18:08:26 [DEBUG] XGB iter  25: tr-p-rmse: 0.046007	tr-a-peak@32: 1.000000	tr-rmse: 0.387972	tr-rmse: 0.387972
2024-11-06 18:08:26 [DEBUG] XGB iter  50: tr-p-rmse: 0.045991	tr-a-peak@32: 1.000000	tr-rmse: 0.388001	tr-rmse: 0.388001
2024-11-06 18:08:26 [DEBUG] XGB iter  75: tr-p-rmse: 0.045991	tr-a-peak@32: 1.000000	tr-rmse: 0.388001	tr-rmse: 0.388001
2024-11-06 18:08:26 [DEBUG] XGB stopped. Best iteration: [32] tr-p-rmse:0.04599	tr-a-peak@32:1.00000	tr-rmse:0.38800	tr-rmse:0.38800 
2024-11-06 18:08:26 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:08:26 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        68.3047 |       6.8956 |                6.8956 |     16 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        36.1861 |       4.9909 |                9.9817 |     24 |      
  2 |           fused_nn_dense_add |   6010 |      1 |         1.7964 |       3.3456 |                3.3456 |      8 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 48
Total latency (us): 20.2229

2024-11-06 18:08:26 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:08:27 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:08:29 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:08:31 [DEBUG] XGB validation: p-rmse: 0.149560	a-peak@32: 0.985629
2024-11-06 18:08:31 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:08:32 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        68.3047 |       6.8956 |                6.8956 |     24 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        36.1861 |       4.9909 |                9.9817 |     24 |      
  2 |           fused_nn_dense_add |   6010 |      1 |         1.7964 |       3.3456 |                3.3456 |      8 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 56
Total latency (us): 20.2229


2024-11-06 18:08:32 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #2: "fused_nn_dense_add"
2024-11-06 18:08:33 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:08:35 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:08:37 [DEBUG] XGB validation: p-rmse: 3.034833	a-peak@32: 0.684020
2024-11-06 18:08:37 [DEBUG] XGB iter   0: tr-p-rmse: 0.564327	tr-a-peak@32: 0.861572	tr-rmse: 0.289912	tr-rmse: 0.289912
2024-11-06 18:08:37 [DEBUG] XGB iter  25: tr-p-rmse: 0.064114	tr-a-peak@32: 1.000000	tr-rmse: 0.336602	tr-rmse: 0.336602
2024-11-06 18:08:37 [DEBUG] XGB iter  50: tr-p-rmse: 0.064079	tr-a-peak@32: 1.000000	tr-rmse: 0.336649	tr-rmse: 0.336649
2024-11-06 18:08:37 [DEBUG] XGB iter  75: tr-p-rmse: 0.064079	tr-a-peak@32: 1.000000	tr-rmse: 0.336649	tr-rmse: 0.336649
2024-11-06 18:08:37 [DEBUG] XGB stopped. Best iteration: [33] tr-p-rmse:0.06408	tr-a-peak@32:1.00000	tr-rmse:0.33665	tr-rmse:0.33665 
2024-11-06 18:08:37 [INFO] [task_scheduler.cc:237] [Updated] Task #2: "fused_nn_dense_add"
2024-11-06 18:08:37 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        68.3047 |       6.8956 |                6.8956 |     24 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        36.1861 |       4.9909 |                9.9817 |     24 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 64
Total latency (us): 17.2641


2024-11-06 18:08:37 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:08:39 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:08:40 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:08:42 [DEBUG] XGB validation: p-rmse: 0.239612	a-peak@32: 1.000000
2024-11-06 18:08:42 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:08:42 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        68.3047 |       6.8956 |                6.8956 |     24 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        37.8558 |       4.7707 |                9.5415 |     32 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 72
Total latency (us): 16.8238

2024-11-06 18:08:42 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:08:44 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:08:46 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:08:48 [DEBUG] XGB validation: p-rmse: 0.117211	a-peak@32: 1.000000
2024-11-06 18:08:48 [DEBUG] XGB iter   0: tr-p-rmse: 0.515154	tr-a-peak@32: 0.859875	tr-rmse: 0.298764	tr-rmse: 0.298764
2024-11-06 18:08:48 [DEBUG] XGB iter  25: tr-p-rmse: 0.059394	tr-a-peak@32: 0.999572	tr-rmse: 0.347265	tr-rmse: 0.347265
2024-11-06 18:08:48 [DEBUG] XGB iter  50: tr-p-rmse: 0.059326	tr-a-peak@32: 0.999572	tr-rmse: 0.347359	tr-rmse: 0.347359
2024-11-06 18:08:48 [DEBUG] XGB iter  75: tr-p-rmse: 0.059326	tr-a-peak@32: 0.999572	tr-rmse: 0.347359	tr-rmse: 0.347359
2024-11-06 18:08:48 [DEBUG] XGB stopped. Best iteration: [37] tr-p-rmse:0.05933	tr-a-peak@32:0.99957	tr-rmse:0.34736	tr-rmse:0.34736 
2024-11-06 18:08:48 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:08:48 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        68.3047 |       6.8956 |                6.8956 |     24 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        37.8558 |       4.7707 |                9.5415 |     40 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 80
Total latency (us): 16.8238


2024-11-06 18:08:48 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:08:50 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:08:51 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:08:53 [DEBUG] XGB validation: p-rmse: 0.129843	a-peak@32: 1.000000
2024-11-06 18:08:53 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:08:53 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        68.3047 |       6.8956 |                6.8956 |     32 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        37.8558 |       4.7707 |                9.5415 |     40 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 88
Total latency (us): 16.8238

2024-11-06 18:08:53 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:08:55 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:08:57 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:08:59 [DEBUG] XGB validation: p-rmse: 0.124433	a-peak@32: 1.000000
2024-11-06 18:08:59 [DEBUG] XGB iter   0: tr-p-rmse: 0.504882	tr-a-peak@32: 0.861572	tr-rmse: 0.291846	tr-rmse: 0.291846
2024-11-06 18:08:59 [DEBUG] XGB iter  25: tr-p-rmse: 0.055972	tr-a-peak@32: 0.999786	tr-rmse: 0.342515	tr-rmse: 0.342515
2024-11-06 18:08:59 [DEBUG] XGB iter  50: tr-p-rmse: 0.055959	tr-a-peak@32: 0.999786	tr-rmse: 0.342533	tr-rmse: 0.342533
2024-11-06 18:08:59 [DEBUG] XGB iter  75: tr-p-rmse: 0.055959	tr-a-peak@32: 0.999786	tr-rmse: 0.342533	tr-rmse: 0.342533
2024-11-06 18:08:59 [DEBUG] XGB stopped. Best iteration: [34] tr-p-rmse:0.05596	tr-a-peak@32:0.99979	tr-rmse:0.34253	tr-rmse:0.34253 
2024-11-06 18:08:59 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:08:59 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        68.3047 |       6.8956 |                6.8956 |     40 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        37.8558 |       4.7707 |                9.5415 |     40 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 96
Total latency (us): 16.8238

2024-11-06 18:08:59 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:09:00 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:09:02 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:09:05 [DEBUG] XGB validation: p-rmse: 0.104376	a-peak@32: 1.000000
2024-11-06 18:09:05 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:09:05 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        68.3047 |       6.8956 |                6.8956 |     40 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        37.8558 |       4.7707 |                9.5415 |     48 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 104
Total latency (us): 16.8238

2024-11-06 18:09:05 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:09:08 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:09:16 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:09:18 [DEBUG] XGB validation: p-rmse: 0.182499	a-peak@32: 0.992085
2024-11-06 18:09:18 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:09:18 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        68.3047 |       6.8956 |                6.8956 |     40 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        37.8558 |       4.7707 |                9.5415 |     56 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 112
Total latency (us): 16.8238


2024-11-06 18:09:18 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:09:22 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:09:23 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:09:26 [DEBUG] XGB validation: p-rmse: 0.115199	a-peak@32: 0.998358
2024-11-06 18:09:26 [DEBUG] XGB iter   0: tr-p-rmse: 0.466295	tr-a-peak@32: 0.861331	tr-rmse: 0.324392	tr-rmse: 0.324392
2024-11-06 18:09:26 [DEBUG] XGB iter  25: tr-p-rmse: 0.054985	tr-a-peak@32: 1.000000	tr-rmse: 0.368779	tr-rmse: 0.368779
2024-11-06 18:09:26 [DEBUG] XGB iter  50: tr-p-rmse: 0.054976	tr-a-peak@32: 1.000000	tr-rmse: 0.368793	tr-rmse: 0.368793
2024-11-06 18:09:26 [DEBUG] XGB iter  75: tr-p-rmse: 0.054976	tr-a-peak@32: 1.000000	tr-rmse: 0.368793	tr-rmse: 0.368793
2024-11-06 18:09:26 [DEBUG] XGB stopped. Best iteration: [33] tr-p-rmse:0.05498	tr-a-peak@32:1.00000	tr-rmse:0.36879	tr-rmse:0.36879 
2024-11-06 18:09:26 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:09:26 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        69.7673 |       6.7510 |                6.7510 |     48 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        37.8558 |       4.7707 |                9.5415 |     56 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 120
Total latency (us): 16.6793


2024-11-06 18:09:26 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:09:29 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:09:31 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:09:33 [DEBUG] XGB validation: p-rmse: 0.055297	a-peak@32: 0.986657
2024-11-06 18:09:33 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:09:33 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        69.7673 |       6.7510 |                6.7510 |     48 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        39.2438 |       4.6020 |                9.2040 |     64 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 128
Total latency (us): 16.3418


2024-11-06 18:09:33 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:09:37 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:09:39 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:09:41 [DEBUG] XGB validation: p-rmse: 0.095030	a-peak@32: 0.996017
2024-11-06 18:09:41 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:09:41 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        69.7673 |       6.7510 |                6.7510 |     48 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        39.2438 |       4.6020 |                9.2040 |     72 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 136
Total latency (us): 16.3418

2024-11-06 18:09:41 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:09:44 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:09:46 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:09:48 [DEBUG] XGB validation: p-rmse: 0.121401	a-peak@32: 0.954176
2024-11-06 18:09:48 [DEBUG] XGB iter   0: tr-p-rmse: 0.428651	tr-a-peak@32: 0.861572	tr-rmse: 0.349399	tr-rmse: 0.349399
2024-11-06 18:09:48 [DEBUG] XGB iter  25: tr-p-rmse: 0.052485	tr-a-peak@32: 1.000000	tr-rmse: 0.389504	tr-rmse: 0.389504
2024-11-06 18:09:48 [DEBUG] XGB iter  50: tr-p-rmse: 0.052480	tr-a-peak@32: 1.000000	tr-rmse: 0.389515	tr-rmse: 0.389515
2024-11-06 18:09:48 [DEBUG] XGB iter  75: tr-p-rmse: 0.052480	tr-a-peak@32: 1.000000	tr-rmse: 0.389515	tr-rmse: 0.389515
2024-11-06 18:09:48 [DEBUG] XGB stopped. Best iteration: [29] tr-p-rmse:0.05248	tr-a-peak@32:1.00000	tr-rmse:0.38951	tr-rmse:0.38951 
2024-11-06 18:09:48 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:09:48 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        72.1406 |       6.5289 |                6.5289 |     56 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        39.2438 |       4.6020 |                9.2040 |     72 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 144
Total latency (us): 16.1197


2024-11-06 18:09:48 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:09:52 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:09:54 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:09:56 [DEBUG] XGB validation: p-rmse: 0.128095	a-peak@32: 0.918648
2024-11-06 18:09:56 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:09:56 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        72.1406 |       6.5289 |                6.5289 |     56 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.2211 |       4.3812 |                8.7625 |     80 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 152
Total latency (us): 15.6782

2024-11-06 18:09:56 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:10:00 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:10:01 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:10:04 [DEBUG] XGB validation: p-rmse: 0.144623	a-peak@32: 1.000000
2024-11-06 18:10:04 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:10:04 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        72.1406 |       6.5289 |                6.5289 |     64 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.2211 |       4.3812 |                8.7625 |     80 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 160
Total latency (us): 15.6782

2024-11-06 18:10:04 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:10:07 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:10:09 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:10:11 [DEBUG] XGB validation: p-rmse: 0.205111	a-peak@32: 1.000000
2024-11-06 18:10:11 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:10:11 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        72.1406 |       6.5289 |                6.5289 |     64 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.2211 |       4.3812 |                8.7625 |     88 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 168
Total latency (us): 15.6782


2024-11-06 18:10:11 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:10:15 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:10:16 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:10:19 [DEBUG] XGB validation: p-rmse: 0.095756	a-peak@32: 0.990469
2024-11-06 18:10:19 [DEBUG] XGB iter   0: tr-p-rmse: 0.400105	tr-a-peak@32: 0.861572	tr-rmse: 0.353001	tr-rmse: 0.353001
2024-11-06 18:10:19 [DEBUG] XGB iter  25: tr-p-rmse: 0.052631	tr-a-peak@32: 0.997970	tr-rmse: 0.395233	tr-rmse: 0.395233
2024-11-06 18:10:19 [DEBUG] XGB iter  50: tr-p-rmse: 0.052628	tr-a-peak@32: 0.997970	tr-rmse: 0.395241	tr-rmse: 0.395241
2024-11-06 18:10:19 [DEBUG] XGB iter  75: tr-p-rmse: 0.052628	tr-a-peak@32: 0.997970	tr-rmse: 0.395241	tr-rmse: 0.395241
2024-11-06 18:10:19 [DEBUG] XGB stopped. Best iteration: [29] tr-p-rmse:0.05263	tr-a-peak@32:0.99797	tr-rmse:0.39524	tr-rmse:0.39524 
2024-11-06 18:10:19 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:10:19 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.0748 |       6.4455 |                6.4455 |     72 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.2211 |       4.3812 |                8.7625 |     88 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 176
Total latency (us): 15.5947

2024-11-06 18:10:19 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:10:22 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:10:24 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:10:26 [DEBUG] XGB validation: p-rmse: 0.079819	a-peak@32: 0.998800
2024-11-06 18:10:26 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"

2024-11-06 18:10:26 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.0748 |       6.4455 |                6.4455 |     72 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |     96 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 184
Total latency (us): 15.5244

2024-11-06 18:10:26 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:10:30 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:10:32 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:10:34 [DEBUG] XGB validation: p-rmse: 0.165400	a-peak@32: 0.995687
2024-11-06 18:10:34 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"

2024-11-06 18:10:34 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.0748 |       6.4455 |                6.4455 |     72 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    104 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 192
Total latency (us): 15.5244

2024-11-06 18:10:34 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:10:38 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:10:39 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:10:42 [DEBUG] XGB validation: p-rmse: 0.038273	a-peak@32: 0.999526
2024-11-06 18:10:42 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:10:42 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |     80 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    104 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 200
Total latency (us): 15.4949


2024-11-06 18:10:42 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:10:45 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:10:47 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:10:49 [DEBUG] XGB validation: p-rmse: 0.049444	a-peak@32: 1.000000
2024-11-06 18:10:49 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:10:49 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |     80 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    112 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 208
Total latency (us): 15.4949


2024-11-06 18:10:49 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:10:53 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:10:54 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:10:57 [DEBUG] XGB validation: p-rmse: 0.174322	a-peak@32: 0.973278
2024-11-06 18:10:57 [DEBUG] XGB iter   0: tr-p-rmse: 0.366311	tr-a-peak@32: 0.861155	tr-rmse: 0.375179	tr-rmse: 0.375179
2024-11-06 18:10:57 [DEBUG] XGB iter  25: tr-p-rmse: 0.055858	tr-a-peak@32: 0.998764	tr-rmse: 0.412615	tr-rmse: 0.412615
2024-11-06 18:10:57 [DEBUG] XGB iter  50: tr-p-rmse: 0.055856	tr-a-peak@32: 0.998764	tr-rmse: 0.412620	tr-rmse: 0.412620
2024-11-06 18:10:57 [DEBUG] XGB iter  75: tr-p-rmse: 0.055856	tr-a-peak@32: 0.998764	tr-rmse: 0.412620	tr-rmse: 0.412620
2024-11-06 18:10:57 [DEBUG] XGB stopped. Best iteration: [27] tr-p-rmse:0.05586	tr-a-peak@32:0.99876	tr-rmse:0.41262	tr-rmse:0.41262 
2024-11-06 18:10:57 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"

2024-11-06 18:10:57 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |     88 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    112 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 216
Total latency (us): 15.4949

2024-11-06 18:10:57 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:11:00 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:11:03 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:11:06 [DEBUG] XGB validation: p-rmse: 0.131449	a-peak@32: 0.963567
2024-11-06 18:11:06 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"

2024-11-06 18:11:06 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |     88 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    120 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 224
Total latency (us): 15.4949

2024-11-06 18:11:06 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:11:09 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:11:11 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:11:13 [DEBUG] XGB validation: p-rmse: 0.032458	a-peak@32: 0.999632
2024-11-06 18:11:13 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"

2024-11-06 18:11:13 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |     96 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    120 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 232
Total latency (us): 15.4949

2024-11-06 18:11:13 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:11:17 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:11:19 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:11:20 [DEBUG] XGB validation: p-rmse: 0.171026	a-peak@32: 0.999270
2024-11-06 18:11:20 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:11:21 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |     96 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    128 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 240
Total latency (us): 15.4949


2024-11-06 18:11:21 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:11:24 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:11:26 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:11:28 [DEBUG] XGB validation: p-rmse: 0.121831	a-peak@32: 1.000000
2024-11-06 18:11:28 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:11:28 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |     96 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    136 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 248
Total latency (us): 15.4949


2024-11-06 18:11:28 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:11:32 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:11:33 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:11:36 [DEBUG] XGB validation: p-rmse: 0.037096	a-peak@32: 0.986682
2024-11-06 18:11:36 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"

2024-11-06 18:11:36 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |    104 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    136 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 256
Total latency (us): 15.4949

2024-11-06 18:11:36 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:11:40 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:11:42 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:11:44 [DEBUG] XGB validation: p-rmse: 0.127470	a-peak@32: 0.969532
2024-11-06 18:11:44 [DEBUG] XGB iter   0: tr-p-rmse: 0.335610	tr-a-peak@32: 0.858004	tr-rmse: 0.392716	tr-rmse: 0.392716
2024-11-06 18:11:44 [DEBUG] XGB iter  25: tr-p-rmse: 0.061349	tr-a-peak@32: 0.999599	tr-rmse: 0.426219	tr-rmse: 0.426219
2024-11-06 18:11:44 [DEBUG] XGB iter  50: tr-p-rmse: 0.061347	tr-a-peak@32: 0.999599	tr-rmse: 0.426224	tr-rmse: 0.426224
2024-11-06 18:11:44 [DEBUG] XGB iter  75: tr-p-rmse: 0.061347	tr-a-peak@32: 0.999599	tr-rmse: 0.426224	tr-rmse: 0.426224
2024-11-06 18:11:44 [DEBUG] XGB stopped. Best iteration: [34] tr-p-rmse:0.06135	tr-a-peak@32:0.99960	tr-rmse:0.42622	tr-rmse:0.42622 
2024-11-06 18:11:44 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:11:44 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |    104 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    144 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 264
Total latency (us): 15.4949


2024-11-06 18:11:44 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:11:48 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:11:50 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:11:52 [DEBUG] XGB validation: p-rmse: 0.028990	a-peak@32: 0.999826
2024-11-06 18:11:52 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:11:52 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |    112 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    144 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 272
Total latency (us): 15.4949


2024-11-06 18:11:52 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_dense_add_nn_relu_1"
2024-11-06 18:11:55 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:11:57 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:12:00 [DEBUG] XGB validation: p-rmse: 0.029809	a-peak@32: 1.000000
2024-11-06 18:12:00 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_dense_add_nn_relu_1"

2024-11-06 18:12:00 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |    112 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    152 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 280
Total latency (us): 15.4949

2024-11-06 18:12:00 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_nn_dense_add_nn_relu"
2024-11-06 18:12:03 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:12:05 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:12:07 [DEBUG] XGB validation: p-rmse: 0.069612	a-peak@32: 0.975988
2024-11-06 18:12:07 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_nn_dense_add_nn_relu"

2024-11-06 18:12:07 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |    120 |      
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    152 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 288
Total latency (us): 15.4949

2024-11-06 18:12:07 [INFO] [task_scheduler.cc:260] Task #0 has finished. Remaining task(s): 2
2024-11-06 18:12:07 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |    120 |    Y 
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    152 |      
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 288
Total latency (us): 15.4949


2024-11-06 18:12:07 [INFO] [task_scheduler.cc:260] Task #1 has finished. Remaining task(s): 1

2024-11-06 18:12:07 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |    120 |    Y 
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    152 |    Y 
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |      
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 288
Total latency (us): 15.4949

2024-11-06 18:12:07 [INFO] [task_scheduler.cc:260] Task #2 has finished. Remaining task(s): 0
2024-11-06 18:12:07 [DEBUG] [task_scheduler.cc:318] 
 ID |                         Name |   FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-----------------------------------------------------------------------------------------------------------------------------
  0 |   fused_nn_dense_add_nn_relu | 471000 |      1 |        73.4099 |       6.4160 |                6.4160 |    120 |    Y 
  1 | fused_nn_dense_add_nn_relu_1 | 180600 |      2 |        41.5548 |       4.3461 |                8.6921 |    152 |    Y 
  2 |           fused_nn_dense_add |   6010 |      1 |        15.5394 |       0.3868 |                0.3868 |     16 |    Y 
-----------------------------------------------------------------------------------------------------------------------------
Total trials: 288
Total latency (us): 15.4949
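
The columns of the final task scheduler table are related by simple arithmetic, which is useful when reading such logs. The sketch below recomputes the speed, the weighted latency, and the total latency from the FLOP, weight, and latency columns. The values are copied from the table above; the helper names `gflops` and `total_weighted_latency` are illustrative and are not part of TVM. Small differences in the last digit are expected because the table shows rounded latencies.

```python
# Rows copied from the final task scheduler table:
# (name, FLOP, weight, latency in microseconds)
TASKS = [
    ("fused_nn_dense_add_nn_relu", 471000, 1, 6.4160),
    ("fused_nn_dense_add_nn_relu_1", 180600, 2, 4.3461),
    ("fused_nn_dense_add", 6010, 1, 0.3868),
]


def gflops(flop, latency_us):
    """Convert FLOP per microsecond to GFLOP per second (1 FLOP/us = 1e-3 GFLOP/s)."""
    return flop / latency_us / 1e3


def total_weighted_latency(tasks):
    """Sum of weight * latency over all tasks, in microseconds."""
    return sum(weight * latency for _, _, weight, latency in tasks)


for name, flop, weight, latency in TASKS:
    print(f"{name}: {gflops(flop, latency):.2f} GFLOPS, "
          f"weighted latency {weight * latency:.4f} us")

print(f"Total latency (us): {total_weighted_latency(TASKS):.4f}")
```

The weight is the number of times a task occurs in the network, so the total is an estimate of the time spent in the tuned kernels during one inference.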


After tuning, the neural network can be compiled with the tuned schedules applied, using the MetaScheduler function ms.relay_integration.compile_relay and the database of tuning records.

In [38]:
if is_x86():
    
    database = ms.database.JSONDatabase(
        f"{work_dir}/database_workload.json",
        f"{work_dir}/database_tuning_record.json",
        allow_missing=False
    )

    lib = ms.relay_integration.compile_relay(
        database, mod, target, params,
        opt_level=opt_level,
    )

Finally, we measure inference time with the timeit_inference function, compute the model's accuracy with get_accuracy, and verify that the optimized model is correct by comparing that accuracy with the reference value.

In [39]:
if is_x86():
    ms_fcnn_predict, ms_fcnn_times = timeit_inference(mod, lib, images)

    ms_fcnn_accuracy = get_accuracy(labels, ms_fcnn_predict)
    assert np.allclose(metric['fcnn'], ms_fcnn_accuracy, rtol=1e-5)

    ms_fcnn_time = np.median(ms_fcnn_times)
    print(f'Median inference time after layer optimization with MetaScheduler: {ms_fcnn_time:.4f} ms')
Median inference time after layer optimization with MetaScheduler: 0.0238 ms
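
The correctness check above relies on `np.allclose` with `rtol=1e-5`. As a rough illustration of how strict that is, the sketch below reimplements the documented NumPy test `|a - b| <= atol + rtol * |b|` for scalars in plain Python. The function name `allclose_scalar` and the accuracy values are hypothetical and serve only as an example.

```python
def allclose_scalar(a, b, rtol=1e-5, atol=1e-8):
    """Scalar version of the comparison documented for np.allclose:
    |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)


reference = 0.9750  # hypothetical stored accuracy, not a value from this work

# Identical accuracy passes the check.
print(allclose_scalar(reference, 0.9750))

# A difference of 1e-4 (one image out of 10,000) does not pass.
print(allclose_scalar(reference, 0.9751))
```

With a test set of 10,000 images (the size of the standard MNIST test split; the number of images actually used is not shown in this section), one extra misclassification shifts accuracy by 1e-4, so this tolerance effectively requires the optimized model to classify the same number of images correctly as the reference.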

7.4. Analysis of results

To compare the optimization methods, we plot the median inference time of the network for each of them.

In [40]:
fig, ax = plt.subplots()

baseline_label = 'No layer\noptimization'
name = [baseline_label, 'AutoTVM', 'MetaScheduler']
times = [default_fcnn_time, autotvm_fcnn_time, ms_fcnn_time]

bars = ax.bar(name, times, label=name, color=bar_colors)

# Annotate each bar with its median time; bars with zero height
# (methods that were not run) are left without a label.
for bar, n, t in zip(bars, name, times):
    h = bar.get_height()
    # Place the baseline label at mid-height of its bar.
    if n == baseline_label: h = h / 2
    if h != 0:
        ax.text(
            bar.get_x() + bar.get_width() / 2,
            h,
            f'{round(t, 4)} ms',
            ha='center',
            va='bottom',
            fontsize=15,
        )

ax.xaxis.label.set_size(40)
# The plotted values are medians in milliseconds, as printed in the cells above.
ax.set_title('Median inference\ntime (ms)', fontsize=18)
plt.grid()
[Figure: bar chart of the median inference time of the fully connected network without layer optimization, with AutoTVM, and with MetaScheduler.]

Conclusion: optimization substantially reduces the network's inference time.
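
To state this conclusion quantitatively, the medians can be turned into speedup factors. The sketch below is one possible helper; the name `speedup` and the example times are hypothetical and are not measurements from this work. It returns `None` for a time of 0, which this notebook uses as the value for a method that was not run.

```python
def speedup(baseline_ms, optimized_ms):
    """Return baseline / optimized, or None if either time is missing (0 or negative)."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        return None
    return baseline_ms / optimized_ms


# Hypothetical median times in milliseconds, for illustration only.
example = {"baseline": 0.050, "tuned": 0.025, "skipped": 0}

for label in ("tuned", "skipped"):
    factor = speedup(example["baseline"], example[label])
    if factor is None:
        print(f"{label}: not measured")
    else:
        print(f"{label}: {factor:.2f}x faster than baseline")
```

In the notebook, the same helper could be applied to `default_fcnn_time`, `autotvm_fcnn_time`, and `ms_fcnn_time`.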

8. Running and optimizing a convolutional neural network

8.1. Compiling and running the model

First, load the convolutional neural network model.

In [41]:
default_cnn_time, autotvm_cnn_time, autoscheduler_cnn_time, ms_cnn_time = 0, 0, 0, 0

mod, params = load_model('model/cnn.json', 'model/cnn.params')
print(mod['main'])
fn (%input0: Tensor[(1, 1, 28, 28), float32] /* span=aten::_convolution_0.input0:0:0 */, %aten::_convolution_0.weight: Tensor[(64, 1, 3, 3), float32] /* span=aten::_convolution_0.weight:0:0 */, %aten::_convolution_0.bias: Tensor[(64), float32] /* span=aten::_convolution_0.bias:0:0 */, %aten::linear_0.weight: Tensor[(10, 12544), float32] /* span=aten::linear_0.weight:0:0 */, %aten::linear_0.bias: Tensor[(10), float32] /* span=aten::linear_0.bias:0:0 */) {
  %0 = nn.conv2d(%input0, %aten::_convolution_0.weight, padding=[1, 1, 1, 1], channels=64, kernel_size=[3, 3]) /* span=aten::_convolution_0:0:0 */;
  %1 = nn.bias_add(%0, %aten::_convolution_0.bias) /* span=aten::_convolution_0:0:0 */;
  %2 = nn.relu(%1) /* span=aten::relu_0:0:0 */;
  %3 = nn.max_pool2d(%2, pool_size=[2, 2], strides=[2, 2], padding=[0, 0, 0, 0]) /* span=aten::max_pool2d_0:0:0 */;
  %4 = reshape(%3, newshape=[1, -1]) /* span=aten::view_0:0:0 */;
  %5 = nn.dense(%4, %aten::linear_0.weight, units=None) /* span=aten::linear_0:0:0 */;
  nn.bias_add(%5, %aten::linear_0.bias, axis=-1) /* span=aten::linear_0:0:0 */
}
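
The weight shape `(10, 12544)` of the dense layer follows from the shapes of the preceding layers. The sketch below traces the spatial size through the convolution and the pooling layer using the standard output-size formula; the parameters are read from the printed Relay function, while the helper name `window_out` is illustrative.

```python
def window_out(size, window, stride, pad_before=0, pad_after=0):
    """Output length of a sliding window (convolution or pooling) along one axis."""
    return (size + pad_before + pad_after - window) // stride + 1


channels = 64   # channels=64 in nn.conv2d
h = w = 28      # input tensor is (1, 1, 28, 28)

# nn.conv2d: kernel 3x3, padding 1 on every side, stride 1 (no strides shown in the IR)
h, w = window_out(h, 3, 1, 1, 1), window_out(w, 3, 1, 1, 1)
print("after conv2d:", (channels, h, w))

# nn.max_pool2d: pool 2x2, stride 2, no padding
h, w = window_out(h, 2, 2), window_out(w, 2, 2)
print("after max_pool2d:", (channels, h, w))

# reshape to [1, -1] flattens channels * height * width
flat = channels * h * w
print("flattened features:", flat)  # 12544
```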

The next step is to compile the model without layer optimization.

In [42]:
with tvm.transform.PassContext(opt_level=opt_level):
    lib = relay.build(mod, target=target, params=params)

After compilation, we run inference and measure its time with the timeit_inference function defined earlier. We then check the quality of the loaded convolutional neural network with get_accuracy and compare the resulting classification accuracy with the stored value obtained on x86-64.

In [43]:
default_cnn_predict, default_cnn_times = timeit_inference(mod, lib, images)

default_cnn_accuracy = get_accuracy(labels, default_cnn_predict)
assert np.allclose(metric['cnn'], default_cnn_accuracy, rtol=1e-5)

default_cnn_time = np.median(default_cnn_times)
print(f'Median inference time of the unoptimized model: {default_cnn_time:.4f} ms')
Median inference time of the unoptimized model: 0.0644 ms

8.2. Using AutoTVM

Call the get_autotvm_task function developed earlier to extract AutoTVM tasks from the computation graph.

Here a task for the convolutional layer is added to the tasks for the fully connected layer.

In [44]:
if is_x86():
    tasks = get_autotvm_task(mod, target, params)
Extracting tasks

Task number: 0
Task info: ('conv2d_NCHWc.x86', ('TENSOR', (1, 1, 28, 28), 'float32'), ('TENSOR', (64, 1, 3, 3), 'float32'), (1, 1), (1, 1, 1, 1), (1, 1), 'NCHW', 'NCHW', 'float32')

Task number: 1
Task info: ('dense_nopack.x86', ('TENSOR', (1, 12544), 'float32'), ('TENSOR', (10, 12544), 'float32'), None, 'float32')

Task number: 2
Task info: ('dense_pack.x86', ('TENSOR', (1, 12544), 'float32'), ('TENSOR', (10, 12544), 'float32'), None, 'float32')
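
The tensor shapes in the task descriptions are enough to estimate how much arithmetic each task performs, which helps when reading the GFLOPS figures reported during tuning. A rough sketch, assuming the common convention of two floating-point operations per multiply-accumulate; TVM's own FLOP counter may count differently (for example, by including fused operations), and the function names are hypothetical.

```python
def conv2d_flops(batch, in_ch, out_ch, out_h, out_w, k_h, k_w):
    """2 FLOPs (multiply + add) per multiply-accumulate of a direct convolution."""
    return 2 * batch * out_ch * out_h * out_w * in_ch * k_h * k_w


def dense_flops(batch, in_features, out_features):
    """2 FLOPs per multiply-accumulate of a matrix-vector product."""
    return 2 * batch * in_features * out_features


# Task 0: input (1, 1, 28, 28), kernel (64, 1, 3, 3); stride 1 and padding 1
# keep the 28x28 spatial size. Tasks 1 and 2: (1, 12544) times (10, 12544).
conv = conv2d_flops(1, 1, 64, 28, 28, 3, 3)
dense = dense_flops(1, 12544, 10)

print(conv, dense)  # 903168 250880
```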

To start tuning with AutoTVM, define the log_file that will store the tuning results, set the number of tuning trials, and then call the tune_autotvm function developed earlier.

In [45]:
log_file = 'autotvm/autotvm_cnn.log'
n_trial = global_trial

if is_x86():
    tune_autotvm(tasks, n_trial, log_file)
[Task  1/ 3]  Current/Best:   71.28/  80.80 GFLOPS | Progress: (96/96) | 35.12 s Done.
[Task  3/ 3]  Current/Best:    9.54/  16.49 GFLOPS | Progress: (60/96) | 60.29 s Done.
[Task  3/ 3]  Current/Best:   12.25/  16.49 GFLOPS | Progress: (96/96) | 85.73 s Done.

Before the optimized model can be used, it must be compiled with the tuning history that was saved to log_file.

In [46]:
if is_x86():
    with autotvm.apply_history_best(log_file):
        with tvm.transform.PassContext(opt_level=opt_level):
            lib = relay.build(mod, target=target, params=params)

At this point, measure the execution time with timeit_inference, check the quality of the optimized model with get_accuracy, and compare the classification accuracy with the reference value obtained after training the model.

In [47]:
if is_x86():
    autotvm_cnn_predict, autotvm_cnn_times = timeit_inference(mod, lib, images)

    autotvm_cnn_accuracy = get_accuracy(labels, autotvm_cnn_predict)
    assert np.allclose(metric['cnn'], autotvm_cnn_accuracy, rtol=1e-5)

    autotvm_cnn_time = np.median(autotvm_cnn_times)
    print(f'Median inference time after tuning the layers with AutoTVM: {autotvm_cnn_time:.4f} ms')
Median inference time after tuning the layers with AutoTVM: 0.0508 ms
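
The two median times printed so far can be turned into a speedup figure. A minimal sketch with the values copied from the cell outputs above; the variable names `default_ms` and `autotvm_ms` are hypothetical and chosen so that they do not shadow the notebook variables.

```python
default_ms = 0.0644  # median time of the unoptimized model, ms
autotvm_ms = 0.0508  # median time after AutoTVM tuning, ms

speedup = default_ms / autotvm_ms
reduction = 100 * (1 - autotvm_ms / default_ms)

print(f'Speedup: {speedup:.2f}x, time reduced by {reduction:.1f}%')
# Speedup: 1.27x, time reduced by 21.1%
```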

8.3. Using Auto-scheduler

Call the get_auto_scheduler_task function developed earlier to extract Auto-scheduler tasks from the computation graph.

In [48]:
if is_x86():
    tasks, task_weights = get_auto_scheduler_task(mod, target, params, opt_level)
Task number: 0
Task info: vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu

Task number: 1
Task info: vm_mod_fused_nn_dense_add

Task number: 2
Task info: vm_mod_fused_nn_max_pool2d
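
Besides the tasks themselves, the extraction step returns task_weights. By default the Auto-scheduler task scheduler tries to reduce the estimated end-to-end latency, computed as the sum of per-task latencies multiplied by these weights, where a weight is the number of times the task occurs in the network. A minimal sketch of that objective; the latencies are illustrative rounded values rather than measurements, and the weights of 1 are an assumption that each fused subgraph occurs once in this model.

```python
def estimated_total_latency(latencies_ms, weights):
    """Weighted sum of per-task latencies: the quantity the task scheduler reduces."""
    return sum(lat * w for lat, w in zip(latencies_ms, weights))


# Hypothetical per-task latencies (ms) for conv2d+add+relu, dense+add, max_pool2d.
latencies_ms = [0.014, 0.022, 0.004]
weights = [1, 1, 1]  # assumption: every fused subgraph appears once in the model

total = estimated_total_latency(latencies_ms, weights)
print(f'Estimated total latency: {total:.3f} ms')
```

With this objective, tuning effort spent on the slowest or most frequently repeated task reduces the total the most.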

To start tuning with Auto-scheduler, define the log_file that will store the tuning results, set the number of tuning trials, and then call the tune_auto_scheduler function developed earlier.

In [49]:
os.makedirs('auto_schedule/', exist_ok=True)
log_file = 'auto_schedule/auto-schedule_cnn.log'
n_trial_per_task = global_trial

if is_x86():
    tune_auto_scheduler(tasks, task_weights, log_file, n_trial_per_task * len(tasks))
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |            - |              - |      0 |
|    1 |                                     vm_mod_fused_nn_dense_add |            - |              - |      0 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |            - |              - |      0 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: - ms	Trials: 0	Used time : 0 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Generate Sketches		#s: 3
Sample Initial Population	#s: 1785	fail_ct: 0	Time elapsed: 3.63
GA Iter: 0	Max score: 0.9987	Min score: 0.9908	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9996	Min score: 0.9986	#Pop: 16	#M+: 1385	#M-: 36
EvolutionarySearch		#s: 16	Time elapsed: 17.56
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 3.67 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.014 |          71.28 |      8 |
|    1 |                                     vm_mod_fused_nn_dense_add |            - |              - |      0 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |            - |              - |      0 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: - ms	Trials: 8	Used time : 25 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Generate Sketches		#s: 5
Sample Initial Population	#s: 989	fail_ct: 626	Time elapsed: 0.75
GA Iter: 0	Max score: 0.9995	Min score: 0.9803	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9995	Min score: 0.9982	#Pop: 16	#M+: 1373	#M-: 69
EvolutionarySearch		#s: 16	Time elapsed: 3.35
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
.....E.E.E.E****
Time elapsed for measurement: 2.41 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.014 |          71.28 |      8 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.022 |          11.17 |      8 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |            - |              - |      0 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: - ms	Trials: 16	Used time : 32 s	Next ID: 2	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Generate Sketches		#s: 1
Sample Iter: 5	#Pop: 4	#Target: 50	fail_ct: 10236	Time elapsed: 2.63
#Target has been reduced to 25 due to too many failures or duplications
Sample Iter: 10	#Pop: 4	#Target: 25	fail_ct: 20476	Time elapsed: 5.26
#Target has been reduced to 12 due to too many failures or duplications
Sample Iter: 15	#Pop: 4	#Target: 12	fail_ct: 30716	Time elapsed: 7.88
#Target has been reduced to 6 due to too many failures or duplications
Sample Iter: 20	#Pop: 4	#Target: 6	fail_ct: 40956	Time elapsed: 10.51
#Target has been reduced to 3 due to too many failures or duplications
Sample Initial Population	#s: 4	fail_ct: 43004	Time elapsed: 11.04
GA Iter: 0	Max score: 0.8313	Min score: 0.4427	#Pop: 4	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9996	Min score: 0.9047	#Pop: 16	#M+: 354	#M-: 6816
EvolutionarySearch		#s: 16	Time elapsed: 1.93
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.67 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.014 |          71.28 |      8 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.022 |          11.17 |      8 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          12.12 |      8 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.041 ms	Trials: 24	Used time : 47 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1796	fail_ct: 0	Time elapsed: 3.66
GA Iter: 0	Max score: 0.9998	Min score: 0.9923	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 1.0000	Min score: 0.9991	#Pop: 16	#M+: 1383	#M-: 40
EvolutionarySearch		#s: 16	Time elapsed: 17.65
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 3.24 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.014 |          71.28 |     16 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.022 |          11.17 |      8 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          12.12 |      8 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.041 ms	Trials: 32	Used time : 72 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 993	fail_ct: 613	Time elapsed: 0.75
GA Iter: 0	Max score: 0.9746	Min score: 0.8838	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 1.6343	Min score: 1.3069	#Pop: 16	#M+: 1374	#M-: 78
EvolutionarySearch		#s: 16	Time elapsed: 3.36
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.50 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.014 |          71.28 |     16 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     16 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          12.12 |      8 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.039 ms	Trials: 40	Used time : 79 s	Next ID: 2	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 4	fail_ct: 2044	Time elapsed: 0.54
GA Iter: 0	Max score: 0.5743	Min score: 0.5743	#Pop: 2	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9231	Min score: 0.2326	#Pop: 12	#M+: 341	#M-: 7000
EvolutionarySearch		#s: 12	Time elapsed: 2.07
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.69 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.014 |          71.28 |     16 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     16 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     16 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.038 ms	Trials: 48	Used time : 84 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1760	fail_ct: 0	Time elapsed: 3.65
GA Iter: 0	Max score: 0.9608	Min score: 0.8305	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 1.1595	Min score: 1.0566	#Pop: 16	#M+: 1385	#M-: 36
EvolutionarySearch		#s: 16	Time elapsed: 17.91
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.91 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.014 |          71.28 |     24 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     16 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     16 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.038 ms	Trials: 56	Used time : 108 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1011	fail_ct: 589	Time elapsed: 0.75
GA Iter: 0	Max score: 0.9508	Min score: 0.8637	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9968	Min score: 0.9424	#Pop: 16	#M+: 1390	#M-: 74
EvolutionarySearch		#s: 16	Time elapsed: 3.59
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.45 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.014 |          71.28 |     24 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     24 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     16 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.038 ms	Trials: 64	Used time : 115 s	Next ID: 2	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 4	fail_ct: 2044	Time elapsed: 0.55
GA Iter: 0	Max score: N/A	Min score: N/A	#Pop: 0	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.5409	Min score: 0.3118	#Pop: 4	#M+: 350	#M-: 6927
EvolutionarySearch		#s: 4	Time elapsed: 2.12
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 4 programs to measure:
....****
Time elapsed for measurement: 1.72 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.014 |          71.28 |     24 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     24 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     24 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.038 ms	Trials: 68	Used time : 120 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1790	fail_ct: 0	Time elapsed: 3.82
GA Iter: 0	Max score: 0.8568	Min score: 0.7690	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9531	Min score: 0.8350	#Pop: 16	#M+: 1378	#M-: 44
EvolutionarySearch		#s: 16	Time elapsed: 18.60
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.54 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.012 |          82.61 |     32 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     24 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     24 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.037 ms	Trials: 76	Used time : 145 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1015	fail_ct: 605	Time elapsed: 0.77
GA Iter: 0	Max score: 0.9435	Min score: 0.8916	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9504	Min score: 0.9210	#Pop: 16	#M+: 1388	#M-: 69
EvolutionarySearch		#s: 16	Time elapsed: 3.59
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.58 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.012 |          82.61 |     32 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     32 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     24 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.037 ms	Trials: 84	Used time : 152 s	Next ID: 2	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 4	fail_ct: 2044	Time elapsed: 0.57
GA Iter: 0	Max score: N/A	Min score: N/A	#Pop: 0	#M+: 0	#M-: 0
GA Iter: 4	Max score: N/A	Min score: N/A	#Pop: 0	#M+: 356	#M-: 6909
EvolutionarySearch		#s: 0	Time elapsed: 2.05
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 0 programs to measure:
Time elapsed for measurement: 0.00 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.00 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.012 |          82.61 |     32 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     32 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.037 ms	Trials: 84	Used time : 154 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1778	fail_ct: 0	Time elapsed: 3.64
GA Iter: 0	Max score: 0.8297	Min score: 0.6632	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.8921	Min score: 0.8259	#Pop: 16	#M+: 1397	#M-: 45
EvolutionarySearch		#s: 16	Time elapsed: 17.75
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.58 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.011 |          95.29 |     40 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     32 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.035 ms	Trials: 92	Used time : 179 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1045	fail_ct: 592	Time elapsed: 0.75
GA Iter: 0	Max score: 0.9027	Min score: 0.8365	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9494	Min score: 0.9037	#Pop: 16	#M+: 1385	#M-: 73
EvolutionarySearch		#s: 16	Time elapsed: 3.41
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.48 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.04 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.011 |          95.29 |     40 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     40 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.035 ms	Trials: 100	Used time : 185 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1799	fail_ct: 0	Time elapsed: 3.69
GA Iter: 0	Max score: 0.7539	Min score: 0.6363	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.8545	Min score: 0.7825	#Pop: 16	#M+: 1381	#M-: 46
EvolutionarySearch		#s: 16	Time elapsed: 17.88
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.76 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.06 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.011 |          95.29 |     48 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     40 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.035 ms	Trials: 108	Used time : 210 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1017	fail_ct: 608	Time elapsed: 0.76
GA Iter: 0	Max score: 0.8832	Min score: 0.8387	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9849	Min score: 0.8709	#Pop: 16	#M+: 1377	#M-: 69
EvolutionarySearch		#s: 16	Time elapsed: 3.48
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.49 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.011 |          95.29 |     48 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     48 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.035 ms	Trials: 116	Used time : 217 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1793	fail_ct: 0	Time elapsed: 3.68
GA Iter: 0	Max score: 0.7573	Min score: 0.6526	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9031	Min score: 0.7777	#Pop: 16	#M+: 1382	#M-: 46
EvolutionarySearch		#s: 16	Time elapsed: 18.27
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.61 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.06 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          98.69 |     56 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.15 |     48 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.035 ms	Trials: 124	Used time : 241 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1014	fail_ct: 609	Time elapsed: 0.84
GA Iter: 0	Max score: 0.8629	Min score: 0.8298	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9486	Min score: 0.8564	#Pop: 16	#M+: 1374	#M-: 72
EvolutionarySearch		#s: 16	Time elapsed: 3.47
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.51 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          98.69 |     56 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.19 |     56 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.034 ms	Trials: 132	Used time : 248 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1776	fail_ct: 0	Time elapsed: 3.69
GA Iter: 0	Max score: 0.7908	Min score: 0.6810	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9305	Min score: 0.8259	#Pop: 16	#M+: 1382	#M-: 47
EvolutionarySearch		#s: 16	Time elapsed: 18.05
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.63 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.06 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          98.69 |     64 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.19 |     56 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.034 ms	Trials: 140	Used time : 273 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1005	fail_ct: 632	Time elapsed: 0.76
GA Iter: 0	Max score: 0.8697	Min score: 0.8214	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9633	Min score: 0.8459	#Pop: 16	#M+: 1377	#M-: 70
EvolutionarySearch		#s: 16	Time elapsed: 3.47
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.45 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          98.69 |     64 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.19 |     64 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.034 ms	Trials: 148	Used time : 279 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1794	fail_ct: 0	Time elapsed: 3.67
GA Iter: 0	Max score: 0.6588	Min score: 0.6052	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9298	Min score: 0.7892	#Pop: 16	#M+: 1394	#M-: 43
EvolutionarySearch		#s: 16	Time elapsed: 17.99
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.63 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.06 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          98.69 |     72 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.19 |     64 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.034 ms	Trials: 156	Used time : 304 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 974	fail_ct: 638	Time elapsed: 0.74
GA Iter: 0	Max score: 0.8673	Min score: 0.8455	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9544	Min score: 0.8627	#Pop: 16	#M+: 1382	#M-: 77
EvolutionarySearch		#s: 16	Time elapsed: 3.46
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.43 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          98.69 |     72 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.19 |     72 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.034 ms	Trials: 164	Used time : 310 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1799	fail_ct: 1	Time elapsed: 3.68
GA Iter: 0	Max score: 0.7651	Min score: 0.6197	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9655	Min score: 0.9023	#Pop: 16	#M+: 1372	#M-: 48
EvolutionarySearch		#s: 16	Time elapsed: 18.05
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.50 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.06 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          98.69 |     80 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.19 |     72 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.034 ms	Trials: 172	Used time : 335 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1000	fail_ct: 622	Time elapsed: 0.75
GA Iter: 0	Max score: 0.8641	Min score: 0.8334	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.8957	Min score: 0.8601	#Pop: 16	#M+: 1377	#M-: 65
EvolutionarySearch		#s: 16	Time elapsed: 3.49
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.48 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          98.69 |     80 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.19 |     80 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.034 ms	Trials: 180	Used time : 342 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1781	fail_ct: 0	Time elapsed: 3.66
GA Iter: 0	Max score: 0.7655	Min score: 0.6635	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9821	Min score: 0.8913	#Pop: 16	#M+: 1391	#M-: 45
EvolutionarySearch		#s: 16	Time elapsed: 18.14
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.62 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.06 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.26 |     88 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.19 |     80 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.034 ms	Trials: 188	Used time : 366 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1012	fail_ct: 566	Time elapsed: 0.74
GA Iter: 0	Max score: 0.9146	Min score: 0.8536	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9146	Min score: 0.8863	#Pop: 16	#M+: 1388	#M-: 76
EvolutionarySearch		#s: 16	Time elapsed: 3.49
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.44 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.26 |     88 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.19 |     88 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.034 ms	Trials: 196	Used time : 373 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1788	fail_ct: 0	Time elapsed: 3.70
GA Iter: 0	Max score: 0.7399	Min score: 0.6505	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9578	Min score: 0.8916	#Pop: 16	#M+: 1397	#M-: 45
EvolutionarySearch		#s: 16	Time elapsed: 18.10
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.49 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.06 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.26 |     96 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.021 |          12.19 |     88 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.034 ms	Trials: 204	Used time : 397 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1008	fail_ct: 628	Time elapsed: 0.76
GA Iter: 0	Max score: 0.9234	Min score: 0.8479	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9948	Min score: 0.9072	#Pop: 16	#M+: 1376	#M-: 69
EvolutionarySearch		#s: 16	Time elapsed: 3.52
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.44 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.26 |     96 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.019 |          13.31 |     96 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.033 ms	Trials: 212	Used time : 404 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1777	fail_ct: 0	Time elapsed: 3.66
GA Iter: 0	Max score: 0.7967	Min score: 0.6456	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9699	Min score: 0.8964	#Pop: 16	#M+: 1384	#M-: 44
EvolutionarySearch		#s: 16	Time elapsed: 18.16
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.48 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.06 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.26 |    104 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.019 |          13.31 |     96 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.033 ms	Trials: 220	Used time : 428 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1012	fail_ct: 605	Time elapsed: 0.77
GA Iter: 0	Max score: 0.8123	Min score: 0.7859	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.8887	Min score: 0.8289	#Pop: 16	#M+: 1390	#M-: 70
EvolutionarySearch		#s: 16	Time elapsed: 3.51
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.46 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.06 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.26 |    104 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.019 |          13.40 |    104 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.033 ms	Trials: 228	Used time : 435 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1770	fail_ct: 0	Time elapsed: 3.63
GA Iter: 0	Max score: 0.6901	Min score: 0.6150	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9919	Min score: 0.8839	#Pop: 16	#M+: 1390	#M-: 36
EvolutionarySearch		#s: 16	Time elapsed: 18.25
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.56 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.06 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.26 |    112 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.019 |          13.40 |    104 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.033 ms	Trials: 236	Used time : 460 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1024	fail_ct: 631	Time elapsed: 0.76
GA Iter: 0	Max score: 0.9062	Min score: 0.8772	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9845	Min score: 0.9154	#Pop: 16	#M+: 1387	#M-: 70
EvolutionarySearch		#s: 16	Time elapsed: 3.55
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.51 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.26 |    112 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.019 |          13.40 |    112 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.033 ms	Trials: 244	Used time : 467 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1766	fail_ct: 0	Time elapsed: 3.61
GA Iter: 0	Max score: 0.7807	Min score: 0.6345	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9656	Min score: 0.8971	#Pop: 16	#M+: 1391	#M-: 39
EvolutionarySearch		#s: 16	Time elapsed: 18.19
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.63 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.07 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.26 |    120 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.019 |          13.40 |    112 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.033 ms	Trials: 252	Used time : 491 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1017	fail_ct: 616	Time elapsed: 0.75
GA Iter: 0	Max score: 0.8916	Min score: 0.8486	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9631	Min score: 0.9069	#Pop: 16	#M+: 1392	#M-: 69
EvolutionarySearch		#s: 16	Time elapsed: 3.55
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.47 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.05 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.26 |    120 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.018 |          13.62 |    120 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.032 ms	Trials: 260	Used time : 498 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1789	fail_ct: 0	Time elapsed: 3.65
GA Iter: 0	Max score: 0.7186	Min score: 0.6269	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9358	Min score: 0.8948	#Pop: 16	#M+: 1384	#M-: 35
EvolutionarySearch		#s: 16	Time elapsed: 18.34
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.57 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.07 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.37 |    128 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.018 |          13.62 |    120 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.032 ms	Trials: 268	Used time : 523 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1003	fail_ct: 607	Time elapsed: 0.75
GA Iter: 0	Max score: 0.9033	Min score: 0.8858	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9796	Min score: 0.9338	#Pop: 16	#M+: 1384	#M-: 67
EvolutionarySearch		#s: 16	Time elapsed: 3.57
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.60 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.06 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.37 |    128 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.018 |          13.81 |    128 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.032 ms	Trials: 276	Used time : 530 s	Next ID: 0	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1788	fail_ct: 0	Time elapsed: 3.66
GA Iter: 0	Max score: 0.7131	Min score: 0.6271	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 1.0194	Min score: 0.8859	#Pop: 16	#M+: 1384	#M-: 40
EvolutionarySearch		#s: 16	Time elapsed: 18.37
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.57 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.07 s
----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  |                       Task Description                        | Latency (ms) | Speed (GFLOPS) | Trials |
-----------------------------------------------------------------------------------------------------------------
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.37 |    136 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.018 |          13.81 |    128 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
-----------------------------------------------------------------------------------------------------------------
Estimated total latency: 0.032 ms	Trials: 284	Used time : 555 s	Next ID: 1	
----------------------------------------------------------------------
------------------------------  [ Search ]
----------------------------------------------------------------------
Sample Initial Population	#s: 1021	fail_ct: 622	Time elapsed: 0.76
GA Iter: 0	Max score: 0.9536	Min score: 0.8689	#Pop: 16	#M+: 0	#M-: 0
GA Iter: 4	Max score: 0.9951	Min score: 0.9339	#Pop: 16	#M+: 1377	#M-: 63
EvolutionarySearch		#s: 16	Time elapsed: 3.59
----------------------------------------------------------------------
------------------------------  [ Measure ]
----------------------------------------------------------------------
Get 8 programs to measure:
........********
Time elapsed for measurement: 2.52 s
----------------------------------------------------------------------
------------------------------  [ Train cost model ]
----------------------------------------------------------------------
Time elapsed for training: 0.06 s
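
The Task Scheduler table printed after each round can be cross-checked by hand. Assuming every task occurs once in the network (weight 1; this is an assumption, since the log does not print task weights), the estimated total latency should be close to the sum of the per-task latencies. The sketch below parses the rows of the last table above; the helper name `parse_task_rows` is hypothetical and is not part of TVM.

```python
# Rows copied from the last Task Scheduler table in the log above.
TABLE = """\
|    0 |              vm_mod_fused_nn_contrib_conv2d_NCHWc_add_nn_relu |        0.010 |          99.37 |    136 |
|    1 |                                     vm_mod_fused_nn_dense_add |        0.018 |          13.81 |    128 |
|    2 |                                    vm_mod_fused_nn_max_pool2d |        0.004 |          13.49 |     32 |
"""


def parse_task_rows(table):
    """Parse '| id | name | latency | gflops | trials |' rows (hypothetical helper)."""
    rows = []
    for line in table.strip().splitlines():
        # Drop the outer pipes, then split the remaining cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        task_id, name, latency_ms, gflops, trials = cells
        rows.append({
            "id": int(task_id),
            "name": name,
            "latency_ms": float(latency_ms),
            "gflops": float(gflops),
            "trials": int(trials),
        })
    return rows


rows = parse_task_rows(TABLE)

# With all weights assumed equal to 1, the total is a plain sum.
total_latency_ms = sum(row["latency_ms"] for row in rows)
print(f"Sum of per-task latencies: {total_latency_ms:.3f} ms")  # prints 0.032 ms
```

The result agrees with the `Estimated total latency: 0.032 ms` line of the log. The table values are rounded to three decimals, so in general the two numbers may differ in the last digit.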

Before the optimized model can be used, it must be compiled with the optimization history that was saved to the file log_file.

In [50]:
if is_x86():
    with auto_scheduler.ApplyHistoryBest(log_file):
        with tvm.transform.PassContext(
            opt_level=opt_level, config={"relay.backend.use_auto_scheduler": True},
        ):
            lib = relay.build(mod, target=target, params=params)

At this point we can measure the execution time with the timeit_inference function, check the quality of the optimized model with the get_accuracy function, and compare the classification accuracy with the reference value obtained after training the model.

In [51]:
if is_x86():
    autoscheduler_cnn_predict, autoscheduler_cnn_times = timeit_inference(mod, lib, images)

    autoscheduler_cnn_accuracy = get_accuracy(labels, autoscheduler_cnn_predict)
    assert np.allclose(metric['cnn'], autoscheduler_cnn_accuracy, rtol=1e-5)

    autoscheduler_cnn_time = np.median(autoscheduler_cnn_times)
    print(f'Median run time after optimizing the layers with Auto-scheduler: {autoscheduler_cnn_time:.4f} ms')
Median run time after optimizing the layers with Auto-scheduler: 0.0548 ms
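
The helpers timeit_inference and get_accuracy are defined earlier in this work and are not repeated here. As a rough, self-contained illustration of what the check in the cell above does, the sketch below uses a hypothetical stand-in `accuracy_sketch` and synthetic data. It is not the notebook's implementation, and all numbers in it are made up for the example.

```python
import numpy as np


def accuracy_sketch(labels, scores):
    """Share of samples whose arg-max class matches the label (hypothetical stand-in)."""
    return float(np.mean(np.argmax(scores, axis=1) == labels))


# Synthetic data: 4 samples, 3 classes.
labels = np.array([0, 2, 1, 1])
scores = np.array([
    [0.9, 0.05, 0.05],
    [0.1, 0.2, 0.7],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],  # predicted class 0, true class 1
])

reference_accuracy = 0.75  # plays the role of metric['cnn']
accuracy = accuracy_sketch(labels, scores)

# Same kind of check as in the cell above: tuning must not change the accuracy.
assert np.allclose(reference_accuracy, accuracy, rtol=1e-5)

# The median is less sensitive to a single slow run than the mean.
times_ms = np.array([0.055, 0.054, 0.056, 0.210, 0.055])
median_ms = np.median(times_ms)
mean_ms = np.mean(times_ms)
print(f"median: {median_ms:.4f} ms, mean: {mean_ms:.4f} ms")
```

In this synthetic example one outlier pulls the mean well above the typical run time while the median stays at the typical value, which is one reason to report the median.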

8.4. Applying MetaScheduler

We call the get_ms_task function developed earlier, having first defined the work_dir directory in which the optimization results are logged.

In this case the target (compilation) string already specifies the number of threads, so it does not need to be modified.

In [52]:
if is_x86():
    work_dir = "meta_schedule_cnn"

    tasks, task_weights = get_ms_task(mod, target, params, opt_level, work_dir)
2024-11-06 18:25:08 [INFO] Logging directory: meta_schedule_cnn/logs
Task number: 0
Task info: fused_layout_transform

Task number: 1
Task info: fused_nn_contrib_conv2d_NCHWc_add_nn_relu

Task number: 2
Task info: fused_nn_max_pool2d

Task number: 3
Task info: fused_layout_transform_reshape

Task number: 4
Task info: fused_nn_dense_add

Next, we run the optimization with MetaScheduler by calling the tune_ms function, setting the total number of trials to N * len(tasks), where N is the number of trials per task (n_trial_per_task in the code).
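
The budget arithmetic can be sketched as follows. The task names are taken from the output of get_ms_task above; the value 64 for the per-task number of trials is purely hypothetical, since the actual value of global_trial is set elsewhere in this work.

```python
# Task names as reported by get_ms_task above.
task_names = [
    "fused_layout_transform",
    "fused_nn_contrib_conv2d_NCHWc_add_nn_relu",
    "fused_nn_max_pool2d",
    "fused_layout_transform_reshape",
    "fused_nn_dense_add",
]

# Hypothetical per-task budget; the notebook takes it from global_trial.
n_trial_per_task = 64

# Total number of trials passed to the tuning call.
total_trials = n_trial_per_task * len(task_names)
print(total_trials)  # prints 320
```

Note that this is a total for all tasks together. How the trials are actually distributed is decided by the task scheduler; in the log of this run, for example, the two layout-transform tasks are measured with only 2 samples each in the first round.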

In [53]:
n_trial_per_task = global_trial

if is_x86():
    tune_ms(tasks, task_weights, work_dir, n_trial_per_task * len(tasks))
2024-11-06 18:25:08 [INFO] LocalBuilder: max_workers = 12
2024-11-06 18:25:09 [INFO] LocalRunner: max_workers = 1
2024-11-06 18:25:09 [INFO] [task_scheduler.cc:159] Initializing Task #0: "fused_layout_transform"
2024-11-06 18:25:09 [INFO] [task_scheduler.cc:159] Initializing Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:25:09 [INFO] [task_scheduler.cc:159] Initializing Task #2: "fused_nn_max_pool2d"
2024-11-06 18:25:09 [INFO] [task_scheduler.cc:159] Initializing Task #3: "fused_layout_transform_reshape"
2024-11-06 18:25:09 [INFO] [task_scheduler.cc:159] Initializing Task #4: "fused_nn_dense_add"
[18:25:09] /home/yury/project/tensor_compilers/TVM/tvms/tvm18/src/meta_schedule/schedule_rule/apply_custom_rule.cc:56: Warning: Unknown schedule rule "meta_schedule.pool_max" for target keys "["cpu"]". Checked PackedFuncs:
  meta_schedule.cpu.meta_schedule.pool_max
2024-11-06 18:25:09 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |            N/A |          N/A |                   N/A |      0 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |            N/A |          N/A |                   N/A |      0 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |            N/A |          N/A |                   N/A |      0 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |            N/A |          N/A |                   N/A |      0 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |            N/A |          N/A |                   N/A |      0 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 0
Total latency (us): 0

2024-11-06 18:25:09 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_layout_transform"
2024-11-06 18:25:10 [INFO] [task_scheduler.cc:193] Sending 2 sample(s) to builder
2024-11-06 18:25:10 [INFO] [task_scheduler.cc:195] Sending 2 sample(s) to runner
2024-11-06 18:25:11 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:25:17 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:25:21 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:25:23 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #2: "fused_nn_max_pool2d"
2024-11-06 18:25:24 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:25:26 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:25:28 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #3: "fused_layout_transform_reshape"
2024-11-06 18:25:29 [INFO] [task_scheduler.cc:193] Sending 2 sample(s) to builder
2024-11-06 18:25:29 [INFO] [task_scheduler.cc:195] Sending 2 sample(s) to runner
2024-11-06 18:25:30 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:25:32 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:25:33 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:25:36 [DEBUG] XGB iter   0: tr-p-rmse: 0.428956	tr-a-peak@32: 1.000000	tr-rmse: 0.428984	tr-rmse: 0.428984
2024-11-06 18:25:36 [DEBUG] XGB iter  25: tr-p-rmse: 0.013121	tr-a-peak@32: 1.000000	tr-rmse: 0.013146	tr-rmse: 0.013146
2024-11-06 18:25:36 [DEBUG] XGB iter  50: tr-p-rmse: 0.005225	tr-a-peak@32: 1.000000	tr-rmse: 0.005226	tr-rmse: 0.005226
2024-11-06 18:25:36 [DEBUG] XGB iter  75: tr-p-rmse: 0.005215	tr-a-peak@32: 1.000000	tr-rmse: 0.005215	tr-rmse: 0.005215
2024-11-06 18:25:36 [DEBUG] XGB iter 100: tr-p-rmse: 0.005215	tr-a-peak@32: 1.000000	tr-rmse: 0.005215	tr-rmse: 0.005215
2024-11-06 18:25:36 [DEBUG] XGB stopped. Best iteration: [63] tr-p-rmse:0.00522	tr-a-peak@32:1.00000	tr-rmse:0.00522	tr-rmse:0.00522 
2024-11-06 18:25:36 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_layout_transform"
2024-11-06 18:25:36 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |            N/A |          N/A |                   N/A |      0 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |            N/A |          N/A |                   N/A |      0 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |            N/A |          N/A |                   N/A |      0 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |            N/A |          N/A |                   N/A |      0 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 2
Total latency (us): 3.19523

2024-11-06 18:25:36 [DEBUG] XGB iter   0: tr-p-rmse: 0.603430	tr-a-peak@32: 1.000000	tr-rmse: 0.379929	tr-rmse: 0.379929
2024-11-06 18:25:36 [DEBUG] XGB iter  25: tr-p-rmse: 0.035632	tr-a-peak@32: 1.000000	tr-rmse: 0.388200	tr-rmse: 0.388200
2024-11-06 18:25:36 [DEBUG] XGB iter  50: tr-p-rmse: 0.035772	tr-a-peak@32: 1.000000	tr-rmse: 0.388074	tr-rmse: 0.388074
2024-11-06 18:25:36 [DEBUG] XGB stopped. Best iteration: [22] tr-p-rmse:0.03496	tr-a-peak@32:1.00000	tr-rmse:0.38890	tr-rmse:0.38890 
2024-11-06 18:25:36 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:25:36 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |      8 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |            N/A |          N/A |                   N/A |      0 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |            N/A |          N/A |                   N/A |      0 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |            N/A |          N/A |                   N/A |      0 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 10
Total latency (us): 16.9321

2024-11-06 18:25:36 [DEBUG] XGB iter   0: tr-p-rmse: 0.483316	tr-a-peak@32: 1.000000	tr-rmse: 0.392427	tr-rmse: 0.392427
2024-11-06 18:25:36 [DEBUG] XGB iter  25: tr-p-rmse: 0.034418	tr-a-peak@32: 1.000000	tr-rmse: 0.414158	tr-rmse: 0.414158
2024-11-06 18:25:36 [DEBUG] XGB iter  50: tr-p-rmse: 0.034458	tr-a-peak@32: 1.000000	tr-rmse: 0.414098	tr-rmse: 0.414098
2024-11-06 18:25:36 [DEBUG] XGB stopped. Best iteration: [21] tr-p-rmse:0.03411	tr-a-peak@32:1.00000	tr-rmse:0.41467	tr-rmse:0.41467 
2024-11-06 18:25:36 [INFO] [task_scheduler.cc:237] [Updated] Task #2: "fused_nn_max_pool2d"
2024-11-06 18:25:36 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |      8 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |         7.1234 |       7.0438 |                7.0438 |      8 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |            N/A |          N/A |                   N/A |      0 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |            N/A |          N/A |                   N/A |      0 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 18
Total latency (us): 23.9759

2024-11-06 18:25:36 [INFO] [task_scheduler.cc:237] [Updated] Task #3: "fused_layout_transform_reshape"
2024-11-06 18:25:36 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |      8 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |         7.1234 |       7.0438 |                7.0438 |      8 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |            N/A |          N/A |                   N/A |      0 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 20
Total latency (us): 31.1984

2024-11-06 18:25:36 [DEBUG] XGB iter   0: tr-p-rmse: 0.480004	tr-a-peak@32: 1.000000	tr-rmse: 0.408802	tr-rmse: 0.408802
2024-11-06 18:25:36 [DEBUG] XGB iter  25: tr-p-rmse: 0.032664	tr-a-peak@32: 1.000000	tr-rmse: 0.436189	tr-rmse: 0.436189
2024-11-06 18:25:36 [DEBUG] XGB iter  50: tr-p-rmse: 0.032666	tr-a-peak@32: 1.000000	tr-rmse: 0.436185	tr-rmse: 0.436185
2024-11-06 18:25:36 [DEBUG] XGB stopped. Best iteration: [17] tr-p-rmse:0.03239	tr-a-peak@32:1.00000	tr-rmse:0.43671	tr-rmse:0.43671 
2024-11-06 18:25:36 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:25:36 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |      8 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |         7.1234 |       7.0438 |                7.0438 |      8 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        22.7236 |      11.0410 |               11.0410 |      8 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 28
Total latency (us): 42.2394

2024-11-06 18:25:36 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:25:42 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:25:45 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:25:47 [DEBUG] XGB validation: p-rmse: 0.350801	a-peak@32: 0.775240
2024-11-06 18:25:47 [DEBUG] XGB iter   0: tr-p-rmse: 0.493188	tr-a-peak@32: 1.000000	tr-rmse: 0.382766	tr-rmse: 0.382766
2024-11-06 18:25:47 [DEBUG] XGB iter  25: tr-p-rmse: 0.053644	tr-a-peak@32: 1.000000	tr-rmse: 0.426199	tr-rmse: 0.426199
2024-11-06 18:25:48 [DEBUG] XGB iter  50: tr-p-rmse: 0.053647	tr-a-peak@32: 1.000000	tr-rmse: 0.426196	tr-rmse: 0.426196
2024-11-06 18:25:48 [DEBUG] XGB stopped. Best iteration: [17] tr-p-rmse:0.05316	tr-a-peak@32:1.00000	tr-rmse:0.42675	tr-rmse:0.42675 
2024-11-06 18:25:48 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:25:48 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |     16 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |         7.1234 |       7.0438 |                7.0438 |      8 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        22.7236 |      11.0410 |               11.0410 |      8 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 36
Total latency (us): 42.2394

2024-11-06 18:25:48 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:25:49 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:25:51 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:25:53 [DEBUG] XGB validation: p-rmse: 0.225171	a-peak@32: 0.951184
2024-11-06 18:25:53 [DEBUG] XGB iter   0: tr-p-rmse: 0.496843	tr-a-peak@32: 1.000000	tr-rmse: 0.385603	tr-rmse: 0.385603
2024-11-06 18:25:53 [DEBUG] XGB iter  25: tr-p-rmse: 0.048279	tr-a-peak@32: 1.000000	tr-rmse: 0.431890	tr-rmse: 0.431890
2024-11-06 18:25:53 [DEBUG] XGB iter  50: tr-p-rmse: 0.048280	tr-a-peak@32: 1.000000	tr-rmse: 0.431889	tr-rmse: 0.431889
2024-11-06 18:25:53 [DEBUG] XGB stopped. Best iteration: [17] tr-p-rmse:0.04810	tr-a-peak@32:1.00000	tr-rmse:0.43212	tr-rmse:0.43212 
2024-11-06 18:25:53 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"

2024-11-06 18:25:53 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |     16 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |         7.1234 |       7.0438 |                7.0438 |      8 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        22.7236 |      11.0410 |               11.0410 |     16 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 44
Total latency (us): 42.2394

2024-11-06 18:25:53 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #3: "fused_layout_transform_reshape"
2024-11-06 18:25:54 [INFO] [task_scheduler.cc:193] Sending 0 sample(s) to builder
2024-11-06 18:25:54 [INFO] [task_scheduler.cc:195] Sending 0 sample(s) to runner
2024-11-06 18:25:54 [INFO] [task_scheduler.cc:237] [Updated] Task #3: "fused_layout_transform_reshape"
2024-11-06 18:25:54 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |     16 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |         7.1234 |       7.0438 |                7.0438 |      8 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        22.7236 |      11.0410 |               11.0410 |     16 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 44
Total latency (us): 42.2394


2024-11-06 18:25:54 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #2: "fused_nn_max_pool2d"
2024-11-06 18:25:55 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:25:57 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:25:59 [DEBUG] XGB validation: p-rmse: 0.077958	a-peak@32: 1.000000
2024-11-06 18:25:59 [INFO] [task_scheduler.cc:237] [Updated] Task #2: "fused_nn_max_pool2d"
2024-11-06 18:25:59 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |     16 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |         7.1234 |       7.0438 |                7.0438 |     16 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        22.7236 |      11.0410 |               11.0410 |     16 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 52
Total latency (us): 42.2394


2024-11-06 18:25:59 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:26:06 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:26:08 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:26:10 [DEBUG] XGB validation: p-rmse: 0.193769	a-peak@32: 1.000000
2024-11-06 18:26:10 [DEBUG] XGB iter   0: tr-p-rmse: 0.485757	tr-a-peak@32: 1.000000	tr-rmse: 0.376904	tr-rmse: 0.376904
2024-11-06 18:26:10 [DEBUG] XGB iter  25: tr-p-rmse: 0.058399	tr-a-peak@32: 1.000000	tr-rmse: 0.427946	tr-rmse: 0.427946
2024-11-06 18:26:10 [DEBUG] XGB iter  50: tr-p-rmse: 0.058400	tr-a-peak@32: 1.000000	tr-rmse: 0.427944	tr-rmse: 0.427944
2024-11-06 18:26:10 [DEBUG] XGB stopped. Best iteration: [17] tr-p-rmse:0.05822	tr-a-peak@32:1.00000	tr-rmse:0.42820	tr-rmse:0.42820 
2024-11-06 18:26:10 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:26:10 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |     24 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |         7.1234 |       7.0438 |                7.0438 |     16 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        22.7236 |      11.0410 |               11.0410 |     16 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 60
Total latency (us): 42.2394


2024-11-06 18:26:10 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:26:11 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:26:13 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:26:15 [DEBUG] XGB validation: p-rmse: 0.162904	a-peak@32: 0.994705
2024-11-06 18:26:15 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:26:15 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |     24 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |         7.1234 |       7.0438 |                7.0438 |     16 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        22.7236 |      11.0410 |               11.0410 |     24 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 68
Total latency (us): 42.2394


2024-11-06 18:26:15 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:26:21 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:26:25 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:26:27 [DEBUG] XGB validation: p-rmse: 0.143762	a-peak@32: 0.981613
2024-11-06 18:26:27 [DEBUG] XGB iter   0: tr-p-rmse: 0.481905	tr-a-peak@32: 1.000000	tr-rmse: 0.375591	tr-rmse: 0.375591
2024-11-06 18:26:27 [DEBUG] XGB iter  25: tr-p-rmse: 0.053558	tr-a-peak@32: 0.999831	tr-rmse: 0.428806	tr-rmse: 0.428806
2024-11-06 18:26:27 [DEBUG] XGB iter  50: tr-p-rmse: 0.053559	tr-a-peak@32: 0.999831	tr-rmse: 0.428805	tr-rmse: 0.428805
2024-11-06 18:26:27 [DEBUG] XGB stopped. Best iteration: [18] tr-p-rmse:0.05348	tr-a-peak@32:0.99983	tr-rmse:0.42893	tr-rmse:0.42893 
2024-11-06 18:26:27 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:26:27 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |     32 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |         7.1234 |       7.0438 |                7.0438 |     16 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        22.7236 |      11.0410 |               11.0410 |     24 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 76
Total latency (us): 42.2394


2024-11-06 18:26:27 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:26:29 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:26:30 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:26:33 [DEBUG] XGB validation: p-rmse: 0.145088	a-peak@32: 0.977988
2024-11-06 18:26:33 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:26:33 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |     32 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |         7.1234 |       7.0438 |                7.0438 |     16 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        24.4622 |      10.2562 |               10.2562 |     32 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 84
Total latency (us): 41.4546


2024-11-06 18:26:33 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #3: "fused_layout_transform_reshape"
2024-11-06 18:26:33 [INFO] [task_scheduler.cc:193] Sending 0 sample(s) to builder
2024-11-06 18:26:33 [INFO] [task_scheduler.cc:195] Sending 0 sample(s) to runner
2024-11-06 18:26:33 [INFO] [task_scheduler.cc:237] [Updated] Task #3: "fused_layout_transform_reshape"
2024-11-06 18:26:33 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |     32 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |         7.1234 |       7.0438 |                7.0438 |     16 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        24.4622 |      10.2562 |               10.2562 |     32 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 84
Total latency (us): 41.4546


2024-11-06 18:26:33 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #2: "fused_nn_max_pool2d"
2024-11-06 18:26:34 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:26:37 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:26:39 [DEBUG] XGB validation: p-rmse: 0.656155	a-peak@32: 0.666110
2024-11-06 18:26:39 [DEBUG] XGB iter   0: tr-p-rmse: 0.479655	tr-a-peak@32: 0.994792	tr-rmse: 0.343173	tr-rmse: 0.343173
2024-11-06 18:26:39 [DEBUG] XGB iter  25: tr-p-rmse: 0.051710	tr-a-peak@32: 0.999831	tr-rmse: 0.406589	tr-rmse: 0.406589
2024-11-06 18:26:39 [DEBUG] XGB iter  50: tr-p-rmse: 0.051711	tr-a-peak@32: 0.999831	tr-rmse: 0.406587	tr-rmse: 0.406587
2024-11-06 18:26:39 [DEBUG] XGB stopped. Best iteration: [18] tr-p-rmse:0.05157	tr-a-peak@32:0.99983	tr-rmse:0.40679	tr-rmse:0.40679 
2024-11-06 18:26:39 [INFO] [task_scheduler.cc:237] [Updated] Task #2: "fused_nn_max_pool2d"
2024-11-06 18:26:39 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |     32 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        24.4622 |      10.2562 |               10.2562 |     32 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 92
Total latency (us): 38.131


2024-11-06 18:26:39 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:26:45 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:26:48 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:26:50 [DEBUG] XGB validation: p-rmse: 0.161438	a-peak@32: 0.928364
2024-11-06 18:26:50 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:26:50 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |     40 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        24.4622 |      10.2562 |               10.2562 |     32 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 100
Total latency (us): 38.131


2024-11-06 18:26:50 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_layout_transform"
2024-11-06 18:26:51 [INFO] [task_scheduler.cc:193] Sending 0 sample(s) to builder
2024-11-06 18:26:51 [INFO] [task_scheduler.cc:195] Sending 0 sample(s) to runner
2024-11-06 18:26:51 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_layout_transform"
2024-11-06 18:26:51 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        73.0532 |      13.7368 |               13.7368 |     40 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        24.4622 |      10.2562 |               10.2562 |     32 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 100
Total latency (us): 38.131


2024-11-06 18:26:51 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:27:04 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:27:06 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:27:08 [DEBUG] XGB validation: p-rmse: 0.263920	a-peak@32: 0.993032
2024-11-06 18:27:08 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"

2024-11-06 18:27:08 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        77.6145 |      12.9295 |               12.9295 |     48 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        24.4622 |      10.2562 |               10.2562 |     32 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 108
Total latency (us): 37.3237

2024-11-06 18:27:08 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:27:12 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:27:24 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:27:26 [DEBUG] XGB validation: p-rmse: 0.176584	a-peak@32: 1.000000
2024-11-06 18:27:26 [DEBUG] XGB iter   0: tr-p-rmse: 0.492402	tr-a-peak@32: 0.998163	tr-rmse: 0.367762	tr-rmse: 0.367762
2024-11-06 18:27:26 [DEBUG] XGB iter  25: tr-p-rmse: 0.049030	tr-a-peak@32: 0.999831	tr-rmse: 0.430966	tr-rmse: 0.430966
2024-11-06 18:27:26 [DEBUG] XGB iter  50: tr-p-rmse: 0.049030	tr-a-peak@32: 0.999831	tr-rmse: 0.430965	tr-rmse: 0.430965
2024-11-06 18:27:26 [DEBUG] XGB stopped. Best iteration: [18] tr-p-rmse:0.04894	tr-a-peak@32:0.99983	tr-rmse:0.43111	tr-rmse:0.43111 
2024-11-06 18:27:26 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:27:26 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        77.6145 |      12.9295 |               12.9295 |     48 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        27.6800 |       9.0639 |                9.0639 |     40 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 116
Total latency (us): 36.1314


2024-11-06 18:27:26 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #3: "fused_layout_transform_reshape"
2024-11-06 18:27:28 [INFO] [task_scheduler.cc:193] Sending 0 sample(s) to builder
2024-11-06 18:27:28 [INFO] [task_scheduler.cc:195] Sending 0 sample(s) to runner
2024-11-06 18:27:28 [INFO] [task_scheduler.cc:237] [Updated] Task #3: "fused_layout_transform_reshape"
2024-11-06 18:27:28 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        77.6145 |      12.9295 |               12.9295 |     48 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        27.6800 |       9.0639 |                9.0639 |     40 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 116
Total latency (us): 36.1314


2024-11-06 18:27:28 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:27:41 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:27:43 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:27:45 [DEBUG] XGB validation: p-rmse: 0.295029	a-peak@32: 0.865659
2024-11-06 18:27:45 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:27:45 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        78.9262 |      12.7147 |               12.7147 |     56 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        27.6800 |       9.0639 |                9.0639 |     40 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 124
Total latency (us): 35.9165

2024-11-06 18:27:45 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:27:49 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:27:50 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:27:52 [DEBUG] XGB validation: p-rmse: 0.047716	a-peak@32: 1.000000
2024-11-06 18:27:52 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:27:52 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        78.9262 |      12.7147 |               12.7147 |     56 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        27.6800 |       9.0639 |                9.0639 |     48 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 132
Total latency (us): 35.9165

2024-11-06 18:27:52 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:28:05 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:28:07 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:28:09 [DEBUG] XGB validation: p-rmse: 0.379888	a-peak@32: 0.995790
2024-11-06 18:28:09 [DEBUG] XGB iter   0: tr-p-rmse: 0.458952	tr-a-peak@32: 0.995886	tr-rmse: 0.405198	tr-rmse: 0.405198
2024-11-06 18:28:09 [DEBUG] XGB iter  25: tr-p-rmse: 0.089273	tr-a-peak@32: 0.999562	tr-rmse: 0.477386	tr-rmse: 0.477386
2024-11-06 18:28:09 [DEBUG] XGB iter  50: tr-p-rmse: 0.089273	tr-a-peak@32: 0.999562	tr-rmse: 0.477385	tr-rmse: 0.477385
2024-11-06 18:28:10 [DEBUG] XGB stopped. Best iteration: [19] tr-p-rmse:0.08919	tr-a-peak@32:0.99956	tr-rmse:0.47753	tr-rmse:0.47753 
2024-11-06 18:28:10 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:28:10 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        78.9262 |      12.7147 |               12.7147 |     64 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        27.6800 |       9.0639 |                9.0639 |     48 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 140
Total latency (us): 35.9165

2024-11-06 18:28:10 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #3: "fused_layout_transform_reshape"
2024-11-06 18:28:11 [INFO] [task_scheduler.cc:193] Sending 0 sample(s) to builder
2024-11-06 18:28:11 [INFO] [task_scheduler.cc:195] Sending 0 sample(s) to runner
2024-11-06 18:28:11 [INFO] [task_scheduler.cc:237] [Updated] Task #3: "fused_layout_transform_reshape"
2024-11-06 18:28:11 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        78.9262 |      12.7147 |               12.7147 |     64 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        27.6800 |       9.0639 |                9.0639 |     48 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 140
Total latency (us): 35.9165

2024-11-06 18:28:11 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:28:15 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:28:27 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:28:29 [DEBUG] XGB validation: p-rmse: 0.151383	a-peak@32: 0.975659
2024-11-06 18:28:29 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:28:29 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        78.9262 |      12.7147 |               12.7147 |     64 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2317 |       8.5828 |                8.5828 |     56 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 148
Total latency (us): 35.4354

2024-11-06 18:28:29 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:28:42 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:28:44 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:28:46 [DEBUG] XGB validation: p-rmse: 0.282471	a-peak@32: 0.996347
2024-11-06 18:28:46 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:28:46 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |     72 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2317 |       8.5828 |                8.5828 |     56 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 156
Total latency (us): 34.9648

2024-11-06 18:28:46 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_layout_transform"
2024-11-06 18:28:48 [INFO] [task_scheduler.cc:193] Sending 0 sample(s) to builder
2024-11-06 18:28:48 [INFO] [task_scheduler.cc:195] Sending 0 sample(s) to runner
2024-11-06 18:28:48 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_layout_transform"
2024-11-06 18:28:48 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |     72 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |      
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2317 |       8.5828 |                8.5828 |     56 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 156
Total latency (us): 34.9648

2024-11-06 18:28:48 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #3: "fused_layout_transform_reshape"
2024-11-06 18:28:50 [INFO] [task_scheduler.cc:260] Task #3 has finished. Remaining task(s): 4
2024-11-06 18:28:50 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |     72 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2317 |       8.5828 |                8.5828 |     56 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 156
Total latency (us): 34.9648

2024-11-06 18:28:50 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:29:03 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:29:05 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:29:07 [DEBUG] XGB validation: p-rmse: 0.503644	a-peak@32: 0.826634
2024-11-06 18:29:07 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:29:07 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |     80 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2317 |       8.5828 |                8.5828 |     56 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 164
Total latency (us): 34.9648

2024-11-06 18:29:07 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:29:10 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:29:14 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:29:16 [DEBUG] XGB validation: p-rmse: 0.085098	a-peak@32: 0.992122
2024-11-06 18:29:16 [DEBUG] XGB iter   0: tr-p-rmse: 0.450609	tr-a-peak@32: 0.988297	tr-rmse: 0.418118	tr-rmse: 0.418118
2024-11-06 18:29:16 [DEBUG] XGB iter  25: tr-p-rmse: 0.114067	tr-a-peak@32: 1.000000	tr-rmse: 0.493538	tr-rmse: 0.493538
2024-11-06 18:29:16 [DEBUG] XGB iter  50: tr-p-rmse: 0.114067	tr-a-peak@32: 1.000000	tr-rmse: 0.493538	tr-rmse: 0.493538
2024-11-06 18:29:16 [DEBUG] XGB stopped. Best iteration: [16] tr-p-rmse:0.11400	tr-a-peak@32:1.00000	tr-rmse:0.49367	tr-rmse:0.49367 
2024-11-06 18:29:16 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:29:16 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |     80 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2317 |       8.5828 |                8.5828 |     64 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 172
Total latency (us): 34.9648

2024-11-06 18:29:16 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:29:29 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:29:31 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:29:33 [DEBUG] XGB validation: p-rmse: 0.347296	a-peak@32: 1.000000
2024-11-06 18:29:33 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:29:33 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |     88 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        13.4876 |       3.7202 |                3.7202 |     24 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2317 |       8.5828 |                8.5828 |     64 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 180
Total latency (us): 34.9648

2024-11-06 18:29:33 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #2: "fused_nn_max_pool2d"
2024-11-06 18:29:36 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:29:38 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:29:40 [DEBUG] XGB validation: p-rmse: 0.077047	a-peak@32: 0.987611
2024-11-06 18:29:40 [INFO] [task_scheduler.cc:237] [Updated] Task #2: "fused_nn_max_pool2d"
2024-11-06 18:29:40 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |     88 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     32 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2317 |       8.5828 |                8.5828 |     64 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 188
Total latency (us): 34.7811

2024-11-06 18:29:40 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #2: "fused_nn_max_pool2d"
2024-11-06 18:29:43 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:29:45 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:29:48 [DEBUG] XGB validation: p-rmse: 0.044324	a-peak@32: 1.000000
2024-11-06 18:29:48 [INFO] [task_scheduler.cc:237] [Updated] Task #2: "fused_nn_max_pool2d"
2024-11-06 18:29:48 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |     88 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     40 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2317 |       8.5828 |                8.5828 |     64 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 196
Total latency (us): 34.7811

2024-11-06 18:29:48 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:30:01 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:30:03 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:30:05 [DEBUG] XGB validation: p-rmse: 0.372697	a-peak@32: 0.836836
2024-11-06 18:30:05 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:30:05 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |     96 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     40 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2317 |       8.5828 |                8.5828 |     64 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 204
Total latency (us): 34.7811

2024-11-06 18:30:05 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:30:08 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:30:10 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:30:12 [DEBUG] XGB validation: p-rmse: 0.039160	a-peak@32: 0.999396
2024-11-06 18:30:12 [DEBUG] XGB iter   0: tr-p-rmse: 0.449608	tr-a-peak@32: 0.964817	tr-rmse: 0.432020	tr-rmse: 0.432020
2024-11-06 18:30:12 [DEBUG] XGB iter  25: tr-p-rmse: 0.132792	tr-a-peak@32: 0.999831	tr-rmse: 0.503330	tr-rmse: 0.503330
2024-11-06 18:30:12 [DEBUG] XGB iter  50: tr-p-rmse: 0.132792	tr-a-peak@32: 0.999831	tr-rmse: 0.503330	tr-rmse: 0.503330
2024-11-06 18:30:12 [DEBUG] XGB stopped. Best iteration: [18] tr-p-rmse:0.13278	tr-a-peak@32:0.99983	tr-rmse:0.50335	tr-rmse:0.50335 
2024-11-06 18:30:12 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:30:12 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |     96 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     40 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     72 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 212
Total latency (us): 34.7675

2024-11-06 18:30:12 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_layout_transform"
2024-11-06 18:30:13 [INFO] [task_scheduler.cc:193] Sending 0 sample(s) to builder
2024-11-06 18:30:13 [INFO] [task_scheduler.cc:195] Sending 0 sample(s) to runner
2024-11-06 18:30:13 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_layout_transform"
2024-11-06 18:30:13 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |     96 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     40 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     72 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 212
Total latency (us): 34.7675

2024-11-06 18:30:13 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:30:26 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:30:28 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:30:30 [DEBUG] XGB validation: p-rmse: 0.341009	a-peak@32: 0.836577
2024-11-06 18:30:30 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:30:30 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    104 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     40 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     72 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 220
Total latency (us): 34.7675

2024-11-06 18:30:30 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #2: "fused_nn_max_pool2d"
2024-11-06 18:30:33 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:30:35 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:30:37 [DEBUG] XGB validation: p-rmse: 0.030518	a-peak@32: 0.988129
2024-11-06 18:30:37 [INFO] [task_scheduler.cc:237] [Updated] Task #2: "fused_nn_max_pool2d"
2024-11-06 18:30:37 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    104 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     72 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 228
Total latency (us): 34.7675

2024-11-06 18:30:37 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:30:41 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:30:43 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:30:45 [DEBUG] XGB validation: p-rmse: 0.090661	a-peak@32: 0.950780
2024-11-06 18:30:45 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:30:45 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    104 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     80 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 236
Total latency (us): 34.7675

2024-11-06 18:30:45 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:30:58 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:31:00 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:31:02 [DEBUG] XGB validation: p-rmse: 0.418442	a-peak@32: 0.817949
2024-11-06 18:31:02 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:31:02 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    112 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     80 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 244
Total latency (us): 34.7675

2024-11-06 18:31:02 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:31:16 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:31:18 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:31:20 [DEBUG] XGB validation: p-rmse: 0.424621	a-peak@32: 0.780059
2024-11-06 18:31:20 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:31:20 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    120 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     80 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 252
Total latency (us): 34.7675

2024-11-06 18:31:20 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:31:23 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:31:25 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:31:27 [DEBUG] XGB validation: p-rmse: 0.124876	a-peak@32: 0.993320
2024-11-06 18:31:27 [DEBUG] XGB iter   0: tr-p-rmse: 0.447523	tr-a-peak@32: 0.991634	tr-rmse: 0.436950	tr-rmse: 0.436950
2024-11-06 18:31:27 [DEBUG] XGB iter  25: tr-p-rmse: 0.154782	tr-a-peak@32: 1.000000	tr-rmse: 0.511629	tr-rmse: 0.511629
2024-11-06 18:31:27 [DEBUG] XGB iter  50: tr-p-rmse: 0.154782	tr-a-peak@32: 1.000000	tr-rmse: 0.511629	tr-rmse: 0.511629
2024-11-06 18:31:27 [DEBUG] XGB stopped. Best iteration: [17] tr-p-rmse:0.15475	tr-a-peak@32:1.00000	tr-rmse:0.51168	tr-rmse:0.51168 
2024-11-06 18:31:27 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:31:27 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    120 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     88 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 260
Total latency (us): 34.7675

2024-11-06 18:31:27 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:31:40 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:31:42 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:31:44 [DEBUG] XGB validation: p-rmse: 0.281354	a-peak@32: 0.977430
2024-11-06 18:31:44 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:31:44 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    128 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     88 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 268
Total latency (us): 34.7675

2024-11-06 18:31:44 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_layout_transform"
2024-11-06 18:31:46 [INFO] [task_scheduler.cc:193] Sending 0 sample(s) to builder
2024-11-06 18:31:46 [INFO] [task_scheduler.cc:195] Sending 0 sample(s) to runner
2024-11-06 18:31:46 [INFO] [task_scheduler.cc:237] [Updated] Task #0: "fused_layout_transform"
2024-11-06 18:31:46 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    128 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     88 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 268
Total latency (us): 34.7675

2024-11-06 18:31:46 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:31:49 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:31:51 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:31:53 [DEBUG] XGB validation: p-rmse: 0.511726	a-peak@32: 1.000000
2024-11-06 18:31:53 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:31:53 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    128 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     96 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 276
Total latency (us): 34.7675

2024-11-06 18:31:53 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:32:06 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:32:08 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:32:11 [DEBUG] XGB validation: p-rmse: 0.277912	a-peak@32: 0.959712
2024-11-06 18:32:11 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:32:11 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    136 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     96 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 284
Total latency (us): 34.7675

2024-11-06 18:32:11 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:32:24 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:32:26 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:32:28 [DEBUG] XGB validation: p-rmse: 0.148024	a-peak@32: 0.900631
2024-11-06 18:32:28 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:32:28 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    144 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |     96 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 292
Total latency (us): 34.7675

2024-11-06 18:32:28 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:32:32 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:32:33 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:32:35 [DEBUG] XGB validation: p-rmse: 0.534080	a-peak@32: 0.547863
2024-11-06 18:32:35 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:32:35 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    144 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    104 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 300
Total latency (us): 34.7675

2024-11-06 18:32:35 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:32:48 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:32:50 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:32:53 [DEBUG] XGB validation: p-rmse: 0.287411	a-peak@32: 0.953610
2024-11-06 18:32:53 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:32:53 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    152 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    104 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 308
Total latency (us): 34.7675

2024-11-06 18:32:53 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:32:56 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:32:58 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:33:00 [DEBUG] XGB validation: p-rmse: 0.375471	a-peak@32: 0.706074
2024-11-06 18:33:00 [DEBUG] XGB iter   0: tr-p-rmse: 0.462738	tr-a-peak@32: 0.912965	tr-rmse: 0.438690	tr-rmse: 0.438690
2024-11-06 18:33:00 [DEBUG] XGB iter  25: tr-p-rmse: 0.139792	tr-a-peak@32: 1.000000	tr-rmse: 0.512912	tr-rmse: 0.512912
2024-11-06 18:33:00 [DEBUG] XGB iter  50: tr-p-rmse: 0.139792	tr-a-peak@32: 1.000000	tr-rmse: 0.512912	tr-rmse: 0.512912
2024-11-06 18:33:00 [DEBUG] XGB stopped. Best iteration: [17] tr-p-rmse:0.13978	tr-a-peak@32:1.00000	tr-rmse:0.51293	tr-rmse:0.51293 
2024-11-06 18:33:00 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:33:00 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |        81.9600 |      12.2440 |               12.2440 |    152 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    112 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 316
Total latency (us): 34.7675


2024-11-06 18:33:00 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:33:13 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:33:15 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:33:18 [DEBUG] XGB validation: p-rmse: 0.205324	a-peak@32: 0.866268
2024-11-06 18:33:18 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:33:18 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    160 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    112 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 324
Total latency (us): 32.3861

2024-11-06 18:33:18 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:33:31 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:33:33 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:33:35 [DEBUG] XGB validation: p-rmse: 0.424072	a-peak@32: 0.884443
2024-11-06 18:33:35 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:33:35 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    168 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    112 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 332
Total latency (us): 32.3861


2024-11-06 18:33:35 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:33:48 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:33:51 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:33:53 [DEBUG] XGB validation: p-rmse: 0.470635	a-peak@32: 0.918360
2024-11-06 18:33:53 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:33:53 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    176 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    112 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 340
Total latency (us): 32.3861


2024-11-06 18:33:53 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:34:06 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:34:08 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:34:10 [DEBUG] XGB validation: p-rmse: 0.555645	a-peak@32: 0.955161
2024-11-06 18:34:10 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:34:10 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |      
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    184 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    112 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 348
Total latency (us): 32.3861

2024-11-06 18:34:10 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #0: "fused_layout_transform"
2024-11-06 18:34:12 [INFO] [task_scheduler.cc:260] Task #0 has finished. Remaining task(s): 3
2024-11-06 18:34:12 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    184 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    112 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 348
Total latency (us): 32.3861


2024-11-06 18:34:12 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:34:15 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:34:17 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:34:19 [DEBUG] XGB validation: p-rmse: 0.104416	a-peak@32: 0.993149
2024-11-06 18:34:19 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:34:19 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    184 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     48 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    120 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 356
Total latency (us): 32.3861

2024-11-06 18:34:19 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #2: "fused_nn_max_pool2d"
2024-11-06 18:34:22 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:34:25 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:34:27 [DEBUG] XGB validation: p-rmse: 0.024494	a-peak@32: 1.000000
2024-11-06 18:34:27 [INFO] [task_scheduler.cc:237] [Updated] Task #2: "fused_nn_max_pool2d"
2024-11-06 18:34:27 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    184 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    120 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 364
Total latency (us): 32.3861

2024-11-06 18:34:27 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:34:31 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:34:32 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:34:34 [DEBUG] XGB validation: p-rmse: 0.206462	a-peak@32: 0.999547
2024-11-06 18:34:34 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:34:34 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    184 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    128 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 372
Total latency (us): 32.3861

2024-11-06 18:34:34 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:34:38 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:34:40 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:34:42 [DEBUG] XGB validation: p-rmse: 0.206657	a-peak@32: 0.999329
2024-11-06 18:34:42 [DEBUG] XGB iter   0: tr-p-rmse: 0.465364	tr-a-peak@32: 0.972804	tr-rmse: 0.378529	tr-rmse: 0.378529
2024-11-06 18:34:42 [DEBUG] XGB iter  25: tr-p-rmse: 0.105192	tr-a-peak@32: 1.000000	tr-rmse: 0.459046	tr-rmse: 0.459046
2024-11-06 18:34:42 [DEBUG] XGB iter  50: tr-p-rmse: 0.105192	tr-a-peak@32: 1.000000	tr-rmse: 0.459046	tr-rmse: 0.459046
2024-11-06 18:34:42 [DEBUG] XGB stopped. Best iteration: [22] tr-p-rmse:0.10519	tr-a-peak@32:1.00000	tr-rmse:0.45904	tr-rmse:0.45904 
2024-11-06 18:34:42 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:34:42 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    184 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    136 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 380
Total latency (us): 32.3861

2024-11-06 18:34:42 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #2: "fused_nn_max_pool2d"
2024-11-06 18:34:45 [INFO] [task_scheduler.cc:193] Sending 0 sample(s) to builder
2024-11-06 18:34:45 [INFO] [task_scheduler.cc:195] Sending 0 sample(s) to runner
2024-11-06 18:34:45 [INFO] [task_scheduler.cc:237] [Updated] Task #2: "fused_nn_max_pool2d"
2024-11-06 18:34:45 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    184 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    136 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 380
Total latency (us): 32.3861

2024-11-06 18:34:45 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:34:48 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:34:50 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:34:52 [DEBUG] XGB validation: p-rmse: 0.097164	a-peak@32: 0.925082
2024-11-06 18:34:52 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:34:52 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    184 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    144 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 388
Total latency (us): 32.3861


2024-11-06 18:34:52 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:34:56 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:34:57 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:34:59 [DEBUG] XGB validation: p-rmse: 0.112814	a-peak@32: 0.951644
2024-11-06 18:34:59 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:34:59 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    184 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    152 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 396
Total latency (us): 32.3861


2024-11-06 18:34:59 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:35:03 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:35:15 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:35:17 [DEBUG] XGB validation: p-rmse: 0.058394	a-peak@32: 0.974092
2024-11-06 18:35:17 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:35:17 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    184 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    160 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 404
Total latency (us): 32.3861


2024-11-06 18:35:17 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #2: "fused_nn_max_pool2d"
2024-11-06 18:35:20 [INFO] [task_scheduler.cc:193] Sending 0 sample(s) to builder
2024-11-06 18:35:20 [INFO] [task_scheduler.cc:195] Sending 0 sample(s) to runner
2024-11-06 18:35:20 [INFO] [task_scheduler.cc:237] [Updated] Task #2: "fused_nn_max_pool2d"
2024-11-06 18:35:20 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       101.7493 |       9.8627 |                9.8627 |    184 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    160 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 404
Total latency (us): 32.3861

2024-11-06 18:35:20 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:35:33 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:35:35 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:35:38 [DEBUG] XGB validation: p-rmse: 0.210817	a-peak@32: 0.970991
2024-11-06 18:35:38 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:35:38 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       108.2094 |       9.2739 |                9.2739 |    192 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    160 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 412
Total latency (us): 31.7973

2024-11-06 18:35:38 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:35:51 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:35:53 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:35:55 [DEBUG] XGB validation: p-rmse: 0.132945	a-peak@32: 0.889768
2024-11-06 18:35:55 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:35:55 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       111.4731 |       9.0023 |                9.0023 |    200 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    160 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 420
Total latency (us): 31.5258


2024-11-06 18:35:55 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:36:09 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:36:10 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:36:13 [DEBUG] XGB validation: p-rmse: 0.233211	a-peak@32: 0.925215
2024-11-06 18:36:13 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:36:13 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       111.4731 |       9.0023 |                9.0023 |    208 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    160 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 428
Total latency (us): 31.5258


2024-11-06 18:36:13 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:36:16 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:36:18 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:36:20 [DEBUG] XGB validation: p-rmse: 0.071691	a-peak@32: 0.960906
2024-11-06 18:36:20 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:36:20 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       111.4731 |       9.0023 |                9.0023 |    208 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    168 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 436
Total latency (us): 31.5258


2024-11-06 18:36:20 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:36:33 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:36:35 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:36:37 [DEBUG] XGB validation: p-rmse: 0.065844	a-peak@32: 0.975867
2024-11-06 18:36:37 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:36:37 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       111.4731 |       9.0023 |                9.0023 |    216 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    168 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 444
Total latency (us): 31.5258

2024-11-06 18:36:37 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:36:41 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:36:42 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:36:44 [DEBUG] XGB validation: p-rmse: 0.051169	a-peak@32: 0.950947
2024-11-06 18:36:44 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:36:44 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       111.4731 |       9.0023 |                9.0023 |    216 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    176 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 452
Total latency (us): 31.5258

2024-11-06 18:36:44 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #2: "fused_nn_max_pool2d"
2024-11-06 18:36:47 [INFO] [task_scheduler.cc:193] Sending 0 sample(s) to builder
2024-11-06 18:36:47 [INFO] [task_scheduler.cc:195] Sending 0 sample(s) to runner
2024-11-06 18:36:47 [INFO] [task_scheduler.cc:237] [Updated] Task #2: "fused_nn_max_pool2d"
2024-11-06 18:36:47 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       111.4731 |       9.0023 |                9.0023 |    216 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    176 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 452
Total latency (us): 31.5258

2024-11-06 18:36:47 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:36:51 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:36:53 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:36:55 [DEBUG] XGB validation: p-rmse: 0.089216	a-peak@32: 0.829930
2024-11-06 18:36:55 [DEBUG] XGB iter   0: tr-p-rmse: 0.449478	tr-a-peak@32: 0.969132	tr-rmse: 0.372439	tr-rmse: 0.372439
2024-11-06 18:36:55 [DEBUG] XGB iter  25: tr-p-rmse: 0.088929	tr-a-peak@32: 0.999831	tr-rmse: 0.455742	tr-rmse: 0.455742
2024-11-06 18:36:55 [DEBUG] XGB iter  50: tr-p-rmse: 0.088929	tr-a-peak@32: 0.999831	tr-rmse: 0.455743	tr-rmse: 0.455743
2024-11-06 18:36:55 [DEBUG] XGB iter  75: tr-p-rmse: 0.088929	tr-a-peak@32: 0.999831	tr-rmse: 0.455743	tr-rmse: 0.455743
2024-11-06 18:36:55 [DEBUG] XGB stopped. Best iteration: [25] tr-p-rmse:0.08893	tr-a-peak@32:0.99983	tr-rmse:0.45574	tr-rmse:0.45574 
2024-11-06 18:36:55 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:36:55 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       111.4731 |       9.0023 |                9.0023 |    216 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    184 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 460
Total latency (us): 31.5258

2024-11-06 18:36:55 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:36:58 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:37:00 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:37:02 [DEBUG] XGB validation: p-rmse: 0.137296	a-peak@32: 0.996112
2024-11-06 18:37:02 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:37:02 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       111.4731 |       9.0023 |                9.0023 |    216 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    192 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 468
Total latency (us): 31.5258


2024-11-06 18:37:02 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #4: "fused_nn_dense_add"
2024-11-06 18:37:06 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:37:08 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:37:10 [DEBUG] XGB validation: p-rmse: 0.131698	a-peak@32: 0.985965
2024-11-06 18:37:10 [INFO] [task_scheduler.cc:237] [Updated] Task #4: "fused_nn_dense_add"
2024-11-06 18:37:10 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       111.4731 |       9.0023 |                9.0023 |    216 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    200 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 476
Total latency (us): 31.5258

2024-11-06 18:37:10 [INFO] [task_scheduler.cc:180] TaskScheduler picks Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:37:23 [INFO] [task_scheduler.cc:193] Sending 8 sample(s) to builder
2024-11-06 18:37:25 [INFO] [task_scheduler.cc:195] Sending 8 sample(s) to runner
2024-11-06 18:37:28 [DEBUG] XGB validation: p-rmse: 0.071621	a-peak@32: 0.996863
2024-11-06 18:37:28 [INFO] [task_scheduler.cc:237] [Updated] Task #1: "fused_nn_contrib_conv2d_NCHWc_add_nn_relu"
2024-11-06 18:37:28 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       112.0600 |       8.9552 |                8.9552 |    224 |      
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    200 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 484
Total latency (us): 31.4787

2024-11-06 18:37:28 [INFO] [task_scheduler.cc:260] Task #1 has finished. Remaining task(s): 2
2024-11-06 18:37:28 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       112.0600 |       8.9552 |                8.9552 |    224 |    Y 
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |      
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    200 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 484
Total latency (us): 31.4787


2024-11-06 18:37:28 [INFO] [task_scheduler.cc:260] Task #2 has finished. Remaining task(s): 1
2024-11-06 18:37:28 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       112.0600 |       8.9552 |                8.9552 |    224 |    Y 
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |    Y 
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    200 |      
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 484
Total latency (us): 31.4787

2024-11-06 18:37:28 [INFO] [task_scheduler.cc:260] Task #4 has finished. Remaining task(s): 0
2024-11-06 18:37:28 [DEBUG] [task_scheduler.cc:318] 
 ID |                                      Name |    FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
-------------------------------------------------------------------------------------------------------------------------------------------
  0 |                    fused_layout_transform |       1 |      1 |         0.0003 |       3.1952 |                3.1952 |      2 |    Y 
  1 | fused_nn_contrib_conv2d_NCHWc_add_nn_relu | 1003520 |      1 |       112.0600 |       8.9552 |                8.9552 |    224 |    Y 
  2 |                       fused_nn_max_pool2d |   50176 |      1 |        14.1879 |       3.5365 |                3.5365 |     56 |    Y 
  3 |            fused_layout_transform_reshape |       1 |      1 |         0.0001 |       7.2225 |                7.2225 |      2 |    Y 
  4 |                        fused_nn_dense_add |  250890 |      1 |        29.2783 |       8.5691 |                8.5691 |    200 |    Y 
-------------------------------------------------------------------------------------------------------------------------------------------
Total trials: 484
Total latency (us): 31.4787
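
The columns of the final summary table are related by simple arithmetic, which the sketch below reproduces. It assumes that Speed is FLOP divided by latency and that the total latency is the sum of weight times latency over all tasks; the helper names `gflops` and `total_weighted_latency_us` are illustrative and are not part of TVM. Small differences in the last digit come from the rounding of the values printed in the log.

```python
# Rows copied from the final summary table in the log above:
# (task name, FLOP, weight, latency in microseconds, GFLOPS reported in the log)
TASKS = [
    ("fused_layout_transform", 1, 1, 3.1952, 0.0003),
    ("fused_nn_contrib_conv2d_NCHWc_add_nn_relu", 1003520, 1, 8.9552, 112.0600),
    ("fused_nn_max_pool2d", 50176, 1, 3.5365, 14.1879),
    ("fused_layout_transform_reshape", 1, 1, 7.2225, 0.0001),
    ("fused_nn_dense_add", 250890, 1, 8.5691, 29.2783),
]


def gflops(flop, latency_us):
    """Throughput in GFLOPS for `flop` operations done in `latency_us` microseconds."""
    return flop / (latency_us * 1e-6) / 1e9


def total_weighted_latency_us(tasks):
    """Sum of weight * latency over all tasks, in microseconds."""
    return sum(weight * latency for _, _, weight, latency, _ in tasks)


for name, flop, weight, latency, reported in TASKS:
    print(f"{name}: {gflops(flop, latency):.4f} GFLOPS (log: {reported})")

print(f"Total latency: {total_weighted_latency_us(TASKS):.4f} us (log: 31.4787)")
```

Read this way, the table shows that the convolution and dense tasks account for most of the arithmetic, while the two layout transforms contribute latency with almost no FLOP.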

After tuning, the neural network can be compiled with the schedules that were found, using the MetaScheduler interface ms.relay_integration.compile_relay.

In [54]:
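if is_x86():
    # Load the workloads and tuning records stored in work_dir.
    database = ms.database.JSONDatabase(
        f"{work_dir}/database_workload.json",
        f"{work_dir}/database_tuning_record.json",
        allow_missing=False,
    )

    # Compile the Relay module using the tuning records from the database.
    lib = ms.relay_integration.compile_relay(
        database, mod, target, params,
        opt_level=opt_level,
    )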

Finally, we measure the inference time with the timeit_inference function, compute the model's accuracy with get_accuracy, and verify that the optimized model is correct by comparing the resulting accuracy with the reference value.

In [55]:
if is_x86():
    ms_cnn_predict, ms_cnn_times = timeit_inference(mod, lib, images)

    # The optimized model must reproduce the reference accuracy.
    ms_cnn_accuracy = get_accuracy(labels, ms_cnn_predict)
    assert np.allclose(metric['cnn'], ms_cnn_accuracy, rtol=1e-5)

    ms_cnn_time = np.median(ms_cnn_times)
    print(f'Median inference time after layer optimization with MetaScheduler: {ms_cnn_time:.4f} ms')
Median inference time after layer optimization with MetaScheduler: 0.0415 ms
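
The cell above relies on two ideas: a tolerance-based comparison and the median as a summary of repeated timings. The sketch below illustrates both using only the standard library. The function `is_close` is a hypothetical scalar stand-in that follows the documented formula of `np.allclose`, |a - b| <= atol + rtol * |b|, and all accuracy and timing values are invented for illustration; they are not taken from the notebook.

```python
import statistics


def is_close(reference, value, rtol=1e-5, atol=1e-8):
    """Scalar form of the np.allclose test: |reference - value| <= atol + rtol * |value|."""
    return abs(reference - value) <= atol + rtol * abs(value)


# Hypothetical accuracies, not taken from the notebook.
reference_accuracy = 0.9912
print(is_close(reference_accuracy, 0.9912 + 5e-6))  # difference below the tolerance
print(is_close(reference_accuracy, 0.9913))         # a 0.0001 difference is rejected

# Hypothetical per-run timings in ms; the last run was disturbed by the system.
times_ms = [0.041, 0.042, 0.040, 0.043, 0.350]
median_ms = statistics.median(times_ms)
mean_ms = statistics.mean(times_ms)
print(f"median = {median_ms:.4f} ms, mean = {mean_ms:.4f} ms")
```

With rtol=1e-5 the check effectively requires the optimized model to match the reference accuracy almost exactly, and the median stays near the typical run even when a single measurement is an outlier, which is one reason to prefer it over the mean for timing.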

8.5. Analysis of results

To analyze the results of optimizing the neural network with the different methods, we plot the median execution time.

In [56]:
fig, ax = plt.subplots()

name = ['No layer\noptimization', 'AutoTVM', 'Auto-scheduler', 'MetaScheduler']
times = [default_cnn_time, autotvm_cnn_time, autoscheduler_cnn_time, ms_cnn_time]

bars = ax.bar(name, times, label=name, color=bar_colors)
# The plotted values are medians in milliseconds, matching the printed timings.
ax.set_title('Median execution\ntime (ms)', fontsize=18)

# Label each bar with its value; for the baseline bar the label goes at mid-height.
for bar, n, t in zip(bars, name, times):
    h = bar.get_height()
    if n == 'No layer\noptimization':
        h = h / 2
    if h != 0:
        ax.text(
            bar.get_x() + bar.get_width() / 2,
            h,
            f'{round(t, 4)} ms',
            ha='center',
            va='bottom',
            fontsize=15,
        )

ax.xaxis.label.set_size(40)
plt.grid()
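[Figure: bar chart comparing the median execution time without layer optimization and after tuning with AutoTVM, Auto-scheduler, and MetaScheduler]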

Conclusion: automatic optimization substantially reduces the network's inference time compared with the build without layer optimization.
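
The size of the effect can be expressed as a speedup ratio, the baseline time divided by the optimized time. The sketch below is a minimal illustration with a hypothetical helper `speedups`. Only the MetaScheduler value (0.0415 ms) comes from the output above; the other numbers are placeholders standing in for default_cnn_time, autotvm_cnn_time and autoscheduler_cnn_time and should be replaced with the values measured in the notebook.

```python
def speedups(baseline_ms, optimized_ms):
    """Return baseline / optimized for every method; values above 1 mean faster."""
    return {name: baseline_ms / t for name, t in optimized_ms.items()}


# Placeholder timings in ms (assumptions), except MetaScheduler,
# which is the median printed earlier in this section.
baseline_ms = 0.4150
optimized_ms = {
    "AutoTVM": 0.0830,         # placeholder
    "Auto-scheduler": 0.0500,  # placeholder
    "MetaScheduler": 0.0415,   # from the notebook output
}

for method, ratio in speedups(baseline_ms, optimized_ms).items():
    print(f"{method}: {ratio:.1f}x faster than the baseline")
```

Reporting the ratio next to the absolute times makes the comparison between methods independent of the unit shown on the chart.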