Info

This document is based on registry.pfcomputing.internal/mncore-sdk/mncore-sdk-minimal:0.3.


How to Use MN-Core with MLSDK

In this document, we explain how to use MN-Core through MLSDK (Machine Learning Software Development Kit).

What is MLSDK?

MLSDK is a software development kit that includes a compiler, a runtime software stack, and documentation for using MN-Core from PyTorch. Although its name says "Machine Learning," it can also be used for applications outside the machine learning field, as long as they are written in PyTorch.

Prerequisites

This document assumes the following:

  • You are a user of
    • PFCP (Preferred Computing Platform)
  • You have basic knowledge of the following:
    • PyTorch
    • Kubernetes

Notes on Backward Compatibility

MLSDK is currently unstable, and future updates may not be backward compatible; breaking changes are likely to be introduced.

If you want to keep using the same version of MLSDK, pin the container image tag accordingly.

MLSDK Images

MLSDK is distributed as container images in the Amazon ECR repository registry.pfcomputing.internal/mncore-sdk. As of Jan 2025, we provide two types of images: registry.pfcomputing.internal/mncore-sdk/mncore-sdk-minimal and registry.pfcomputing.internal/mncore-sdk/mncore-sdk-full. The former is a minimal image that contains only the libraries, executables, and scripts required to run MN-Core 1/2. The latter is a larger image built on top of the minimal image, adding standard development tools (e.g., gcc, binutils, make, cmake, ...) and development environments (e.g., JupyterLab).

We provide two tag series for the minimal and full images. One is "release", a stable version of MLSDK published roughly once every three months. The other is "snapshot", which delivers the latest development branch (which may not be mature) to users for trial purposes. Release tags follow the vYY.MM format (e.g., v24.12 for a Dec 2024 release), and snapshot tags follow the snapshot_YYYYMMDD_githash format (e.g., snapshot_20241201_0123456789abcdef). In addition to these tags, each of which immutably corresponds to a unique image, we also provide the latest and snapshot_latest tags so that users can obtain the newest image in each series.
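For example, assuming you have a container runtime and pull access to the internal registry (inside PFCP, you would normally reference the image in your pod spec instead), pinning a specific release could look like this:

$ docker pull registry.pfcomputing.internal/mncore-sdk/mncore-sdk-minimal:v24.12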

Launching an MLSDK Container

If you are new to PFCP and want to launch a Kubernetes pod (a collection of running containers) with access to MN-Core 1/2 devices, the tutorial in the PFCP documentation is a good starting point.

Once a pod is properly configured, you can use the command gpfn3-smi list (see the following example) to display the MN-Core 2 devices connected to the pod. The number 0 is the device index of the MN-Core 2 device within the pod, and the string starting with mnc2 is the device ID within the pod. If nothing is displayed, there may be an issue with the pod configuration; double-check the configuration and try again.

$ gpfn3-smi list
0: mnc2p28s0

How to Use the MLSDK API

From here, we explain, with several examples, how to run existing applications on MN-Core devices using the Python API (MLSDK API) included in MLSDK. You can also view the API reference.

First, we start with an example of running a pure function on MN-Core. Then, we demonstrate how to run inference with a trained model. Finally, we introduce an example of model training.

Notes on Running Examples

All examples shown in this document can be found in the MLSDK directory, where this README.md resides. Any command that starts with $ can be run as is.

Notes on Environment Variable Settings

You need to set several environment variables to use the MLSDK API. To set them, we recommend using the shell scripts /opt/pfn/pfcomp/codegen/build/codegen_preloads.sh and /opt/pfn/pfcomp/codegen/build/codegen_pythonpath.sh. We advise writing a shell script that sources these two files and then executes a given Python command, like the following:

Notes on Python venv Use

If you create a Python virtual environment with venv, do not forget to use the --system-site-packages option, which allows the virtual environment to access the packages installed in the system site-packages.
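For example, creating the virtual environment like this (the directory name .venv is arbitrary) keeps access to the system-provided packages:

$ python3 -m venv --system-site-packages .venv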

#! /bin/bash

set -eux -o pipefail

source /opt/pfn/pfcomp/codegen/build/codegen_preloads.sh
source /opt/pfn/pfcomp/codegen/build/codegen_pythonpath.sh

exec "$@"

In this tutorial, we refer to this shell script as exec_with_env.sh.

The MLSDK directory can be found at /opt/pfn/pfcomp/codegen/MLSDK/ in the image. Move to this directory before following the examples below.
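For example, from a shell inside the container:

$ cd /opt/pfn/pfcomp/codegen/MLSDK/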

Example: Running a Pure Function on MN-Core

First, let's go through an example of running a pure function (that is, a function with no state) on MN-Core.

Save the following code as add.py:

import torch
from mlsdk import CacheOptions, Context, MNDevice, storage


def run_add():
    device = MNDevice("mncore2:auto")
    context = Context(device)
    Context.switch_context(context)

    def add(input: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        x = input["x"]
        y = input["y"]
        return {"out": x + y}

    sample = {"x": torch.randn(3, 4), "y": torch.randn(3, 4)}

    compiled_add = context.compile(
        add,
        sample,
        storage.path("/tmp/add_two_tensors"),
        options={"float_dtype": "float"},
        cache_options=CacheOptions("/tmp/add_two_tensors_cache"),
    )
    result = compiled_add({"x": torch.ones(3, 4), "y": torch.ones(3, 4)})
    result_on_cpu = result["out"].cpu()
    print(f"{result_on_cpu=}")
    assert torch.allclose(result_on_cpu, torch.ones(3, 4) * 2)


if __name__ == "__main__":
    run_add()

Then type the following command to run it. You should see the result of the computation performed on MN-Core 2.

$ ./examples/exec_with_env.sh python3 examples/add.py
...
result_on_cpu=tensor([[2., 2., 2., 2.],
        [2., 2., 2., 2.],
        [2., 2., 2., 2.]])

The MLSDK API enables the use of MN-Core through a concept called "context." A context is an object used to control and manage the MN-Core device. You can create a context like the following:

device = mlsdk.MNDevice("mncore2:auto")
context = Context(device)

In the code above, "mncore2:auto" specifies that the context will use a free MN-Core 2 device among all available devices. You can also specify something like "mncore2:0" to indicate explicitly which device to use. All available device indices can be listed with gpfn3-smi list.
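For example, to select device index 0 (as reported by gpfn3-smi list) explicitly:

device = mlsdk.MNDevice("mncore2:0")
context = Context(device)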

To run a program on MN-Core, you need to compile a Python function into a function that runs on the MN-Core device, using the compile method of the context. It takes three mandatory arguments and some optional keyword arguments, and returns a compiled function.

  1. add: the user's function to be compiled. It is expected to have the type Callable[[dict[str, torch.Tensor]], dict[str, torch.Tensor]].
  2. sample: an example input of type dict[str, torch.Tensor] that tells the compiler the dtypes and shapes of the inputs.
  3. storage.path("/tmp/add_two_tensors"): the directory path where all compilation artifacts, including text logs and binary outputs, are stored.

The third argument is also referred to as the "codegen directory" (i.e., "codegen_dir"). Files generated in the codegen directory are so useful for debugging and profiling that we provide the "codegen-dashboard" tool to visualize them. The details of codegen-dashboard are described in a later section.

The optional keyword arguments affect the behavior of the compiler.

  • options: generally called "compile options". It is expected to have the type dict[str, str].
    • float_dtype: specifies which float dtype (double, float, mixed, or half) to use by default. mixed stands for Automatic Mixed Precision (AMP) and is the default setting of MLSDK.
  • cache_options: similar to options, but specifically controls how previous compilation results are reused.

The returned value is a callable object (CompiledFunction) with a signature similar to that of the input function (add). The difference is that types originally declared as torch.Tensor are handled as TensorLike (an alias for Union[torch.Tensor, TensorProxy]). A TensorProxy represents a torch.Tensor whose actual data may reside on the host or on the device, as managed by the context. The corresponding tensor becomes accessible from the host by calling .cpu() on the TensorProxy, though note that synchronization may occur.

Note that the returned function always expects inputs of the same shape. In other words, the arguments passed to the function must always have the same keys, and the torch.Tensor value corresponding to each key must always have the same shape. Tensors with equal shapes but different values are of course acceptable.
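For instance, with the compiled_add function from add.py (a sketch; the exact error raised on a shape mismatch is not described here):

# OK: same keys and shapes as the compile-time sample, different values
compiled_add({"x": torch.randn(3, 4), "y": torch.randn(3, 4)})

# Not supported: a shape different from the sample used at compile time
# compiled_add({"x": torch.randn(5, 4), "y": torch.randn(5, 4)})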

Cache and reuse previous compilation results

When the cache_options kwarg is specified, the exported ONNX graph of the input function and the compilation results are stored in the specified path (/tmp/add_two_tensors_cache). This caching feature lets you port and reuse compiled functions. For example, you can compile a function with the cache_options argument, copy or move the cache directory to a different place, and then reuse the pre-compiled function by passing a cache_options argument that points to the ported cache directory.
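As a rough sketch (the paths below are illustrative, and add, sample, and context are defined as in add.py), reusing a cache directory that was copied to another location could look like this:

from mlsdk import CacheOptions, storage

# Assume the original cache at /tmp/add_two_tensors_cache was copied to
# /path/to/ported_cache (an illustrative location).
compiled_add = context.compile(
    add,
    sample,
    storage.path("/tmp/add_two_tensors_reuse"),
    cache_options=CacheOptions("/path/to/ported_cache"),
)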

In addition to caching, MLSDK provides a more direct way to reuse a compiled function: the load_codegen_dir method. This method allows you to load an executable function directly from a codegen directory that was created in a previous compile call, completely bypassing the compilation step. For instance, after compiling the add function, you could load and run it in a separate process like this:

# In a different script or after the initial compilation
from mlsdk import Context, MNDevice, storage
import torch

device = MNDevice("mncore2:auto")
context = Context(device)
Context.switch_context(context)

# Load the previously compiled function
loaded_add = context.load_codegen_dir(storage.path("/tmp/add_two_tensors"))

# Now you can use it directly
result = loaded_add({"x": torch.ones(3, 4), "y": torch.ones(3, 4)})
result_on_cpu = result["out"].cpu()
print(f"Result from loaded function: {result_on_cpu=}")
assert torch.allclose(result_on_cpu, torch.ones(3, 4) * 2)

You can run the above code after the initial compilation to see that it produces the same result without needing to recompile.

$ ./examples/exec_with_env.sh python3 examples/load_codegen_dir.py
...
result_on_cpu=tensor([[2., 2., 2., 2.],
        [2., 2., 2., 2.],
        [2., 2., 2., 2.]])

Note that this method loads the artifacts without validation. It is the user's responsibility to ensure that the contents of the codegen directory are valid and compatible with the execution environment.

Example: Running Inference with a Trained Model

Next, let's go through an example of running inference with a trained model. Save the following code as infer.py.

import torch
from mlsdk import Context, MNDevice, set_tensor_name_in_module, storage


def run_infer():
    device = MNDevice("mncore2:auto")
    context = Context(device)
    Context.switch_context(context)

    model = torch.nn.Linear(4, 4)
    model.eval()
    set_tensor_name_in_module(model, "model")
    for p in model.parameters():
        context.register_param(p)
    for b in model.buffers():
        context.register_buffer(b)

    def infer(input: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        x = input["x"]
        y = model(x)
        return {"out": y}

    sample = {"x": torch.randn(4, 4)}

    compiled_infer = context.compile(
        infer,
        sample,
        storage.path("/tmp/infer"),
        options={"float_dtype": "mixed"},
    )
    result = compiled_infer({"x": torch.ones(4, 4)})
    result_on_cpu = result["out"].cpu()
    print(result_on_cpu)


if __name__ == "__main__":
    run_infer()

Then, running ./examples/exec_with_env.sh python3 examples/infer.py will perform inference with the model defined above (model = torch.nn.Linear(4, 4)). Let's try it.

$ ./examples/exec_with_env.sh python3 examples/infer.py
...
# Because the model weight is initialized randomly, you will see a different result.
tensor([[-0.4468, -0.2559,  0.0939,  0.8574],
        [-0.4468, -0.2559,  0.0939,  0.8574],
        [-0.4468, -0.2559,  0.0939,  0.8574],
        [-0.4468, -0.2559,  0.0939,  0.8574]])

In this example, we introduce new APIs in addition to the ones used in add.py.

  • set_tensor_name_in_module(model, "model")
  • context.register_param(p)
  • context.register_buffer(b)

The first one is relatively simple. set_tensor_name_in_module sets names for all tensors in a given torch.nn.Module. You can retrieve the assigned name with the get_tensor_name function.
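As a sketch (we assume here that get_tensor_name is importable from mlsdk like set_tensor_name_in_module and takes the named tensor as its argument; both are assumptions):

import torch
from mlsdk import get_tensor_name, set_tensor_name_in_module

model = torch.nn.Linear(4, 4)
set_tensor_name_in_module(model, "model")
# Assumed usage: retrieve the name assigned to a parameter tensor.
print(get_tensor_name(model.weight))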

context.register_param and context.register_buffer register a given tensor under its assigned name. After registration, you can synchronize registered tensors between the PyTorch side and the MN-Core side by calling synchronize. The usage of synchronize is explained in the next section.

The options={"float_dtype": "mixed"} keyword argument instructs the compiler to apply AMP. For example, the GEMM derived from torch.nn.Linear is executed in half precision, while some operations that require high-precision calculation are executed in float precision. Since MLSDK is primarily designed for machine learning workloads, this mixed behavior is the default setting.

Example: Training a Model

Finally, let's go through an example of training a model. Save the following code as train.py.

import torch
from mlsdk import (
    Context,
    MNCoreSGD,
    MNDevice,
    set_buffer_name_in_optimizer,
    set_tensor_name_in_module,
    storage,
)


def run_train():
    class Model(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(10, 1)

        def forward(self, *, x):
            return {"y": self.linear(x)}

    device = MNDevice("mncore2:auto")
    context = Context(device)
    Context.switch_context(context)

    sample = {"x": torch.ones(1, 10)}
    model = Model()
    model.linear.weight.data.fill_(1.0)
    model.linear.bias.data.fill_(1.0)
    model.train()

    set_tensor_name_in_module(model, "model")
    for p in model.parameters():
        context.register_param(p)
    for b in model.buffers():
        context.register_buffer(b)

    optimizer = MNCoreSGD(model.parameters(), 0.1, 0.9, 0.0)
    set_buffer_name_in_optimizer(optimizer, "optim0")
    context.register_optimizer_buffers(optimizer)

    def f(inp):
        optimizer.zero_grad()
        loss = torch.relu(model(**inp)["y"]).sum()
        loss.backward()  # type: ignore[no-untyped-call]
        optimizer.step()
        return {"loss": loss}

    compiled_f = context.compile(
        f,
        sample,
        storage.path("/tmp/train"),
    )

    print(f"Original {model.linear.bias=}")
    for _ in range(10):
        compiled_f(sample)
    context.synchronize()
    print(f"Optimized {model.linear.bias=}")


if __name__ == "__main__":
    run_train()

Let's run this example. You can see that the model parameters are updated by the optimizer.

In this example, we use a new API, synchronize(). It synchronizes the tensors in MN-Core DRAM with the corresponding PyTorch tensors. After calling it, you can inspect the optimized parameters in your program.

$ ./examples/exec_with_env.sh python3 examples/train.py
...
Original model.linear.bias=Parameter containing:
tensor([0.9000], requires_grad=True)
Optimized model.linear.bias=Parameter containing:
tensor([-2.0413], requires_grad=True)

Processing Flow for Compiled Functions on MN-Core

When you compile and execute functions on MN-Core using MLSDK, an efficient combination of asynchronous and synchronous processing is employed internally. Here, we'll delve into the processing flow of how a compiled function operates.

Asynchronous Event Queue

When you invoke a compiled function, events for data input, actual computation execution, and retrieval of output results are each inserted into an internal queue.

These events strictly adhere to their invocation order and are processed sequentially on the MN-Core device. This allows multiple operations to be efficiently queued, maximizing the utilization of the device's resources.

Synchronous Processing During Output Retrieval

Typically, a compiled function call is asynchronous, returning control to the caller immediately.

However, if you need to use the output results on the host (CPU) side, for example by calling .cpu(), the system waits until the corresponding event has completed. This blocks the program's execution until the necessary data has finished computing on MN-Core and has been transferred to host memory.
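As a sketch using the compiled_add function from the earlier add.py example, the behavior described above looks like this:

# Each call enqueues events for input transfer, execution, and output
# retrieval, then returns to the caller without waiting for the device.
result1 = compiled_add({"x": torch.ones(3, 4), "y": torch.ones(3, 4)})
result2 = compiled_add({"x": torch.randn(3, 4), "y": torch.randn(3, 4)})

# Calling .cpu() blocks until the corresponding events have completed on
# MN-Core and the output has been transferred back to host memory.
print(result1["out"].cpu())
print(result2["out"].cpu())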

Advanced Data Transfer and Memory Management

In the previous discussions, data transfer for inputs to compiled functions and retrieval of their results happened implicitly.

MLSDK provides APIs to explicitly control asynchronous data transfer, enabling more advanced performance optimization and flexible memory management.

TensorProxy and Explicit Data Transfer

A TensorProxy is an object that represents a torch.Tensor located either in the MN-Core device's memory or in host memory. Tensors used as inputs to compiled functions or registered via context.register_param (and similar methods) are internally handled as TensorProxy objects.

TensorProxy provides the load_from method to asynchronously load data from a torch.Tensor on host memory.

TensorProxy.load_from(tensor: torch.Tensor, clone: bool = True) -> None
  • tensor: A torch.Tensor object residing in host memory. The data from this tensor will be transferred to the device memory pointed to by the TensorProxy.

  • clone: Defaults to True.

    • True : The tensor's data is copied, and this copy is placed into the transfer queue. This makes it safe to modify the original tensor after the load_from call.
    • False: A direct reference to the tensor's data is placed into the queue. This can reduce copy overhead, but modifying the original tensor before the transfer completes may lead to undefined behavior.

Obtaining Input Proxies with CompiledFunction.allocate_input_proxy()

The CompiledFunction.allocate_input_proxy method explicitly allocates DRAM buffers on the MN-Core device based on the expected input shapes and types of the compiled function, returning a dictionary of corresponding TensorProxy objects.

Obtaining TensorProxy for Registered Values with context.get_registered_value_proxy()

For PyTorch tensors registered via context.register_param or context.register_buffer, it's also possible to explicitly obtain the TensorProxy that points to their device memory and use load_from to transfer data.

The context.get_registered_value_proxy method takes a torch.Tensor object previously registered with register_param or register_buffer as an argument, and returns its corresponding TensorProxy.

Example: TensorProxy and Explicit Data Transfer

Save the following code as explicit_data_transfer_api.py:

import torch
from mlsdk import Context, MNDevice, set_tensor_name_in_module, storage

def run_explicit_data_transfer():
    device = MNDevice("mncore2:auto")
    context = Context(device)
    Context.switch_context(context)

    model = torch.nn.Linear(4, 4)
    model.eval()
    set_tensor_name_in_module(model, "model")
    for p in model.parameters():
        context.register_param(p)
    for b in model.buffers():
        context.register_buffer(b)

    def infer(input: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        x = input["x"]
        y = model(x)
        return {"out": y}

    sample = {"x": torch.randn(4, 4)}

    compiled_infer = context.compile(
        infer,
        sample,
        storage.path("/tmp/proxy_transfer"),
    )

    input_proxies_allocated = compiled_infer.allocate_input_proxy()
    input_proxies_allocated["x"].load_from(torch.ones(4, 4), clone=False)

    for model_param in model.parameters():
        model_param_proxy = context.get_registered_value_proxy(model_param)
        model_param_proxy.load_from(model_param, clone=False)

    result = compiled_infer(input_proxies_allocated)
    result_on_cpu = result["out"].cpu()
    print(result_on_cpu)

if __name__ == "__main__":
    run_explicit_data_transfer()

Then run it:

$ ./examples/exec_with_env.sh python3 examples/explicit_data_transfer_api.py

# Because the model weight is initialized randomly, you will see a different result.
tensor([[-0.4468, -0.2559,  0.0939,  0.8574],
        [-0.4468, -0.2559,  0.0939,  0.8574],
        [-0.4468, -0.2559,  0.0939,  0.8574],
        [-0.4468, -0.2559,  0.0939,  0.8574]])

How to Port Your Workload to MN-Core 2

Running Your Workload Using pfvm:cpu

Before running your workload on MN-Core 2, we highly recommend running it on pfvm:cpu first. With this backend, you can ensure that your code can be converted to ONNX accurately. For more details, please check Backends other than mncore2:auto.

We, the MN-Core compiler team, believe that users can go through this process without specific knowledge of MN-Core. If the function passed to compile only contains static shapes, the process should be successful.

Note on Static Shape Requirements

When compiling functions for execution on the MN-Core 2 backend, it is important to note that the functions passed to the compiler must contain only static shapes. Static shapes refer to tensor dimensions that are fixed at compile time, with no dynamic constructs such as if statements or loops. In particular, the following restrictions apply to the functions:

  • Tensors must have fixed dimensions at compile-time.
  • The shape of each tensor must be known and constant throughout the function.
  • Control flow constructs such as if statements, loops, and recursion are not supported in functions.

If dynamic constructs in the function rely on runtime information or user input to determine loop counts or other control flow decisions, the compiled function will always follow the same decisions made at trace time based on the sample input, regardless of any later changes to the user input or runtime context.
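For example, the following sketch illustrates this tracing behavior (the condition depends on runtime values, so it is resolved only once, while compile traces the function with the sample input):

import torch

def f(input: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    x = input["x"]
    # Evaluated only once at trace time; the compiled function always
    # follows whichever branch was taken for the sample input.
    if float(x.sum()) > 0:
        return {"out": x * 2}
    return {"out": x + 1}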

Replacing pfvm:cpu with emu2 or mncore2:auto

After running your workload on pfvm:cpu, let's compile your code for MN-Core 2. If you have MN-Core 2 in your pod, you can use mncore2:auto. However, the emulator backend of MN-Core 2, emu2, is sufficient for checking if your code can run on MN-Core 2.

During this process, you may encounter several compiler errors. When your code fails to compile, one option is to adjust the input size. For example, adjusting the batch size or the image size can be effective. Also, it is recommended to use a batch size that is a power of 2.

Practical Examples Using MLSDK API

Here are some examples using the MLSDK API that are included in the MLSDK container image.

MNIST

The script for training MNIST is available at mnist.py.

$ ./examples/exec_with_env.sh python3 examples/mnist.py
...
epoch 8, loss 0.14200307160075182
epoch 9, iter    0, loss 0.3466796875
epoch 9, iter  100, loss 0.12501551845286152
epoch 9, iter  200, loss 0.11559456260643193
epoch 9, iter  300, loss 0.12726582522408106
epoch 9, iter  400, loss 0.12854511630802673
epoch 9, iter  500, loss 0.13125214200771737
epoch 9, iter  600, loss 0.1303801415565605
epoch 9, iter  700, loss 0.1311912290039824
epoch 9, iter  800, loss 0.13359734881087937
epoch 9, iter  900, loss 0.13450224351935858
epoch 9, loss 0.13449914363812235
...
eval acc 0.9567

timm Model Inference

timm is a library that contains many computer vision models.

You can run a timm model on MN-Core 2 using run_timm.py. This script classifies an image and prints the result. Please use run_timm.sh to run the Python script; it installs timm in a venv environment and then runs the inference.

$ ./examples/run_timm.sh --model_name resnet50.a1h_in1k --batch_size 16
...
MNCore2 top-5 classes:
- espresso (967)
- cup (968)
- chocolate sauce, chocolate syrup (960)
- consomme (925)
- eggnog (969)
Torch top-5 classes:
- espresso (967)
- cup (968)
- chocolate sauce, chocolate syrup (960)
- eggnog (969)
- consomme (925)

You can use other timm models.

$ ./examples/run_timm.sh --model_name mobilenetv3_small_050.lamb_in1k --batch_size 16
...
MNCore2 top-5 classes:
- cup (968)
- trifle (927)
- face powder (551)
- ice cream, icecream (928)
- coffee mug (504)
Torch top-5 classes:
- cup (968)
- trifle (927)
- ice cream, icecream (928)
- face powder (551)
- coffee mug (504)

Image Generation from Text Prompt

You can get a beautiful image of Fujisan with the following command. We need to skip VAE decoder compilation for now.

$ ./examples/run_stable_diffusion.sh --skip_vae_decoder_compilation --prompt "Fujisan" --device mncore2:auto
...
Output image saved at /tmp/mlsdk_stable_diffusion_out/output.png

If an out-of-memory error happens, you can set the number of threads used for compilation with the num_compiler_threads option, e.g., --num_compiler_threads 32.

Large Language Model (LLM) Inference

The script for running Llama 1B inference is available at llm_infer.py. You need to install the transformers library and set the environment variable MNCORE_USE_EXTERNAL_DATA_FORMAT=1 to run the script. run_llm_infer.sh does this setup in addition to what exec_with_env.sh does.

Currently, the compilation of the prefill phase requires specifying some environment variables. Therefore, we split the example into two parts, one for the compilation of the prefill phase, and the other for the decoding phase. This will be addressed in later releases.

Let's run the inference. The prefill phase runs on MN-Core, and the decoding phase runs on CPU.

You can specify the prompt by additionally passing arguments like --prompt 'What is the meaning of life?'. Note that the --num_compiler_threads option is also available here.

$ MNCORE_USE_LEGACY_ONNX_EXPORTER=1 MNCORE_USE_EXTERNAL_DATA_FORMAT=1 CODEGEN_TIME_SLICE_SCATTERED_INDEXING_BCAST=1 CODEGEN_OP_DEF=Gather=GatherBcast ./examples/run_llm_infer.sh --compile_prefill --prepare_attention_mask_on_cpu --device mncore2:auto
...
=========== Generated with compilation ==========
 </s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s><s> <|system|>
You are a friendly chatbot who is an expert on MN-Core.</s>
<|user|>
The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.</s>
<|assistant|>
Yes, that's correct. The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens using 16 A100-40G GPUs. The training has started on 2023-09
========== Generated with model.generate ==========
 <s> <|system|>
You are a friendly chatbot who is an expert on MN-Core.</s>
<|user|>
The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.</s>
<|assistant|>
Yes, that's correct. The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens using 16 A100-40G GPUs. The training has started on 2023-09
Generated outputs matched.

Let's do the opposite: prefill on CPU, and decode on MN-Core.

$ MNCORE_USE_LEGACY_ONNX_EXPORTER=1 MNCORE_USE_EXTERNAL_DATA_FORMAT=1 ./examples/run_llm_infer.sh --compile_decode --prepare_attention_mask_on_cpu --device mncore2:auto
...
=========== Generated with compilation ==========
 </s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s><s> <|system|>
You are a friendly chatbot who is an expert on MN-Core.</s>
<|user|>
The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.</s>
<|assistant|>
Yes, that's correct. The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens using 16 A100-40G GPUs. The training has started on 2023-09
========== Generated with model.generate ==========
 <s> <|system|>
You are a friendly chatbot who is an expert on MN-Core.</s>
<|user|>
The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.</s>
<|assistant|>
Yes, that's correct. The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens using 16 A100-40G GPUs. The training has started on 2023-09
Generated outputs matched.

After the compilation and execution, the above script compares the generated sequence against the output of the transformers generation API. The results may not match exactly, depending on the model weights and prompt, but they are unlikely to be completely different.

Tools for Debugging and Profiling

When working with MN-Core, you may often need to debug and profile complex scenarios. In this section, we will introduce some useful tools for debugging.

Backends Other than mncore2:auto

pfvm:cpu Backend

The compile function converts the passed function into an intermediate representation called ONNX before executing it. Occasionally, there may be errors in the conversion from the function to ONNX or from ONNX to MN-Core machine instructions. To detect such cases, the pfvm:cpu backend comes in handy.

The pfvm:cpu backend allows you to run ONNX models on the CPU without compiling them for MN-Core. You can use it by specifying "pfvm:cpu" instead of "mncore2:auto" when creating the context.

device = mlsdk.MNDevice("pfvm:cpu")
context = Context(device)
Context.switch_context(context)

Programs that do not work correctly with "pfvm:cpu" will not work with "mncore2:auto" either. Therefore, this backend should be the first thing to try when a program does not work on MN-Core.

emu2 Backend

Even if you do not have an MN-Core 2 device installed on your system, you can still test your program's behavior on MN-Core 2 using an emulator. The "emu2" backend is designed specifically for this purpose: it is essentially an emulator for MN-Core 2 that runs your program in a simulated environment, just as a real MN-Core 2 device would. By using this backend, you can make sure that your code works as expected before running it on the real device.

Just like the other backends, you can use it with the following code.

device = mlsdk.MNDevice("emu2")
context = Context(device)
Context.switch_context(context)

codegen-dashboard

serve Subcommand

codegen-dashboard is a web-based dashboard for the MN-Core compiler. It allows users to view ONNX files, logs, error messages, and more. The executable is named codegen-dashboard-external and is located under the /opt/pfn/pfcomp/codegen/build/codegen-dashboard/ directory. To start codegen-dashboard-external, execute the following command and open the indicated URL in your web browser.

$ /opt/pfn/pfcomp/codegen/build/codegen-dashboard/codegen-dashboard-external serve /tmp/add_two_tensors/add
Started to serve at http://localhost:8327

Please note that this document assumes you are working inside a Kubernetes pod, so http://localhost:8327/ is not accessible from your local machine. To access the page, enable port forwarding with the command below:

$ kubectl port-forward <POD_NAME> 8327:8327

The codegen-dashboard serve <codegen_dir> [flags] command requires the location of the MN-Core compiler output directory, termed the codegen_dir. In Example: Running a Pure Function on MN-Core, this directory is generated at /tmp/add_two_tensors. codegen-dashboard also accepts optional flags such as --port <port number>, --host <hostname>, etc. For a comprehensive list of available flags, refer to codegen-dashboard serve --help.

genhtml Subcommand

codegen-dashboard provides a genhtml <codegen_dir> [flags] subcommand. This command generates HTML files inside the specified codegen_dir, which can be accessed via file URIs. The genhtml subcommand offers a subset of the features available in the serve subcommand, but it is very handy when sharing results with others.

$ /opt/pfn/pfcomp/codegen/build/codegen-dashboard/codegen-dashboard-external genhtml /tmp/add_two_tensors/add
Writing /tmp/add_two_tensors/add/model.onnx.html ...
Writing /tmp/add_two_tensors/add/logviewer.html ...
Writing /tmp/add_two_tensors/add/index.html ...

Use Perfetto for Profiling

(Figure: Perfetto UI thumbnail)

You can profile your code with Perfetto, which loads trace data from a local file and shows a detailed analysis in an interactive way. The module mncore.utils.perfetto_trace provides an easy way to generate files that can be loaded in the Perfetto UI.

To profile your code, first import the trace_scope function from the module:

from mlsdk import trace_scope

Then enclose the entrypoint of your code with:

with trace_scope("trace-file-name.pb"):
  your_entry_point() # e.g. main() or run() or whatever

And that's all! The trace file will be saved to "trace-file-name.pb" when you run the code. Here is a patch that enables Perfetto tracing for add.py, which was introduced earlier in this tutorial.

Apply this patch and save to add_trace.py:

--- add.py
+++ add_trace.py
@@ -1,5 +1,5 @@
 import torch
-from mlsdk import CacheOptions, Context, MNDevice, storage
+from mlsdk import CacheOptions, Context, MNDevice, storage, trace_scope


 def run_add():
@@ -28,4 +28,5 @@


 if __name__ == "__main__":
-    run_add()
+    with trace_scope("trace.pb"):
+        run_add()

Type the following command to check the result:

$ ./exec_with_env.sh python3 add_trace.py
...
result_on_cpu=tensor([[2., 2., 2., 2.],
        [2., 2., 2., 2.],
        [2., 2., 2., 2.]])
[908.203] perfetto.cc:53691       Tracing session 1 ended, total sessions:0
$ ls *.pb
trace.pb

Then you can open the Perfetto UI and load the file (trace.pb in this example) to check the result.