Info

This document is based on registry.pfcomputing.internal/mncore-sdk/mncore-sdk-minimal:v0.2.


How to Use MN-Core with MLSDK

In this document, we explain how to use MN-Core with MLSDK (Machine Learning Software Development Kit).

What is MLSDK?

MLSDK is a software development kit that includes a compiler, a runtime software stack, and documentation to enable using MN-Core from PyTorch. Although the name says "Machine Learning," it can also be used for applications outside the machine learning field, as long as they are written with PyTorch.

Prerequisites

This document assumes the following:

  • You are a user of
    • PFCP (Preferred Computing Platform)
  • You have basic knowledge of the following:
    • PyTorch
    • Kubernetes

Notes on Backward Compatibility

MLSDK is currently unstable; future updates are likely to introduce breaking changes that are not backward compatible.

If you want to keep using a specific version of MLSDK, pin the container image tag accordingly.

MLSDK Images

MLSDK is distributed as container images from the Amazon ECR repository registry.pfcomputing.internal/mncore-sdk. We provide two types of images as of Jan 2025: registry.pfcomputing.internal/mncore-sdk/mncore-sdk-minimal and registry.pfcomputing.internal/mncore-sdk/mncore-sdk-full. The former is a minimal image that contains only the libraries, executables, and scripts that are mandatory to run MN-Core 1/2. The latter is a larger image built on top of the minimal image, adding standard development tools (e.g., gcc, binutils, make, cmake, ...) and development environments (e.g., JupyterLab).

We provide two tag series for the minimal and full images. One tag type is "release", a stable version of MLSDK published roughly once every three months. The other tag type is "snapshot", which delivers the latest development branch (which may not be mature) to users for trial purposes. Release tags follow the vYY.MM format (e.g., v24.12 for a Dec 2024 release), and snapshot tags follow the snapshot_YYYYMMDD_githash format (e.g., snapshot_20241201_0123456789abcdef). In addition to the aforementioned tags, which immutably correspond to a unique image, we also provide the latest and snapshot_latest tags so that users can obtain the latest image of each series.

Launching an MLSDK Container

If you are new to PFCP and want to launch a Kubernetes pod (a collection of running containers) with access to MN-Core 1/2 devices, the tutorial in the PFCP documentation is a good starting point.

Once a pod is properly configured, you can use the command gpfn3-smi list (see the following example) to display the connected MN-Core 2 devices within the pod. The number 0 is the device index of the MN-Core 2 device within the pod, and the string starting with mnc2 is its device ID. If nothing is displayed, there may be an issue with the pod configuration. Please double-check the configuration and try again.

$ gpfn3-smi list
0: mnc2p28s0

How to Use the MLSDK API

From here, we explain how to run existing applications on MN-Core devices using the Python API (MLSDK API) included in MLSDK, with several examples. The API reference is available at ../build/MLSDK/docs/index.txt (raw text version) or ../build/MLSDK/docs/html/index.html (HTML version).

First, we start with an example of running a pure function on MN-Core. Then, we demonstrate how to run inference with a pretrained model. Finally, we introduce an example of model training.

Notes on Running Examples

All examples shown in this document are located in the MLSDK directory, where this README.md exists. In addition, any command that starts with $ can be run as is.

Notes on Environment Variable Settings

You need to set several environment variables to use the MLSDK API. To set these variables, we recommend using the shell scripts /opt/pfn/pfcomp/codegen/build/codegen_preloads.sh and /opt/pfn/pfcomp/codegen/build/codegen_pythonpath.sh. You are advised to write a shell script that sources these two files and then executes any given Python script, like the following:

#! /bin/bash

set -eux -o pipefail

source /opt/pfn/pfcomp/codegen/build/codegen_preloads.sh
source /opt/pfn/pfcomp/codegen/build/codegen_pythonpath.sh

exec "$@"

In this tutorial, we refer to this shell script as exec_with_env.sh.

The MLSDK directory can be found at /opt/pfn/pfcomp/codegen/MLSDK/ in the image. Move to the directory before following the examples below.

Example: Running a Pure Function on MN-Core

First, let's go through an example of running a pure function, i.e., a function with no state, on MN-Core.

Save the following code as add.py:

import torch
from mlsdk import CacheOptions, Context, MNDevice, storage


def run_add():
    device = MNDevice("mncore2:0")
    context = Context(device)
    Context.switch_context(context)

    def add(input: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        x = input["x"]
        y = input["y"]
        return {"out": x + y}

    sample = {"x": torch.randn(3, 4), "y": torch.randn(3, 4)}

    compiled_add = context.compile(
        add,
        sample,
        storage.path("/tmp/add_two_tensors"),
        name="add",
        cache_options=CacheOptions("/tmp/add_two_tensors_cache"),
    )
    result = compiled_add({"x": torch.ones(3, 4), "y": torch.ones(3, 4)})
    result_on_cpu = result["out"].cpu()
    print(f"{result_on_cpu=}")
    assert torch.allclose(result_on_cpu, torch.ones(3, 4) * 2)


if __name__ == "__main__":
    run_add()

And type the following command to run it! You should see the result of the computation done on MN-Core 2.

$ ./examples/exec_with_env.sh python3 examples/add.py
...
result_on_cpu=tensor([[2., 2., 2., 2.],
        [2., 2., 2., 2.],
        [2., 2., 2., 2.]])

The MLSDK API enables the use of MN-Core through a concept called "context." A context is an object used to control and manage the MN-Core device. You can create a context like the following:

device = mlsdk.MNDevice("mncore2:0")
context = Context(device)

In the code above, "mncore2:0" specifies that the context will use the MN-Core 2 device of index 0. If there are multiple MN-Core 2 devices in a pod, you can specify "mncore2:1", "mncore2:2", and so on to use the second, third, and subsequent MN-Core 2 devices, respectively.
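
For example, here is a minimal sketch (assuming the pod has at least two MN-Core 2 devices attached) that targets the second device:

from mlsdk import Context, MNDevice

# Use the second MN-Core 2 device in the pod (device index 1).
device = MNDevice("mncore2:1")
context = Context(device)
Context.switch_context(context)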

To run a program on MN-Core, you need to compile a Python function into a function that runs on the MN-Core device using the compile method of the context. It takes three positional arguments and returns a compiled function.

The first argument is the function to be compiled. The function should have the type of Callable[[dict[str, torch.Tensor]], dict[str, torch.Tensor]].

The second argument is an input example that is passed to the function being compiled. As mentioned later, the compiled function returned by compile always expects inputs with the same shape. Therefore, a dummy input is required to specify the input shape.

The third argument is the directory to store intermediate and final compile artifacts, serialized as binary files. This directory is referred to as the "codegen directory". The files generated in the codegen directory are extremely useful for debugging and profiling. There is a tool called "codegen-dashboard", which is described later, to visualize the information in the codegen directory.

The keyword argument name="add" is not required, but we set it here for later explanations. See the codegen-dashboard section.

The keyword argument cache_options=CacheOptions("/tmp/add_two_tensors_cache") enables caching of the exported ONNX graph and the compilation results of MLSDK, which will be stored in "/tmp/add_two_tensors_cache".

Note that you can port and reuse compiled functions with the caching feature. For example, you can compile a function with the cache_options argument and copy/move the cache directory to a different place, and then reuse the pre-compiled function by passing the cache_options argument so that it refers to the ported cache directory.
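
The following sketch illustrates this workflow (the path /path/to/ported_cache is hypothetical; it stands for a copy of the cache directory produced by an earlier compile call):

import torch
from mlsdk import CacheOptions, Context, MNDevice, storage

device = MNDevice("mncore2:0")
context = Context(device)
Context.switch_context(context)


def add(input: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    return {"out": input["x"] + input["y"]}


sample = {"x": torch.randn(3, 4), "y": torch.randn(3, 4)}

# Suppose /tmp/add_two_tensors_cache was populated by a previous compile()
# call and then copied to /path/to/ported_cache. Pointing cache_options at
# the ported directory lets compile() reuse the pre-compiled artifacts
# instead of compiling from scratch.
compiled_add = context.compile(
    add,
    sample,
    storage.path("/tmp/add_two_tensors"),
    cache_options=CacheOptions("/path/to/ported_cache"),
)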

The value returned by compile is a function with the type Callable[[dict[str, TensorLike]], dict[str, TensorLike]]. TensorLike is an alias for Union[torch.Tensor, TensorProxy]. TensorProxy is an object that represents a torch.Tensor on the device managed by the context or in host memory. If you call the .cpu() method on a TensorProxy, it returns a torch.Tensor that holds the content of the TensorProxy.

The function returned by compile always expects inputs with the same shape. In other words, the arguments passed to the function must always have the same keys, and the value of torch.Tensor that corresponds to each key must always have the same shape.
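
For instance, with the compiled_add function from add.py above, only calls that match the sample's keys and shapes are supported (a sketch under that assumption):

# OK: same keys ("x", "y") and the same (3, 4) shapes as the sample input.
result = compiled_add({"x": torch.ones(3, 4), "y": torch.zeros(3, 4)})

# Not supported: a different shape (or a missing/extra key). Inputs of a
# different shape require compiling a separate function with a matching
# sample input.
# compiled_add({"x": torch.ones(5, 4), "y": torch.ones(5, 4)})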

Example: Running Inference with a Trained Model

Next, let's go through an example of running inference with a trained model. Save the following code as infer.py.

import torch
from mlsdk import Context, MNDevice, set_tensor_name_in_module, storage


def run_infer():
    device = MNDevice("mncore2:0")
    context = Context(device)
    Context.switch_context(context)

    model = torch.nn.Linear(4, 4)
    model.eval()
    set_tensor_name_in_module(model, "model")
    for p in model.parameters():
        context.register_param(p)
    for b in model.buffers():
        context.register_buffer(b)

    def infer(input: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        x = input["x"]
        y = model(x)
        return {"out": y}

    sample = {"x": torch.randn(4, 4)}

    compiled_infer = context.compile(
        infer,
        sample,
        storage.path("/tmp/infer"),
    )
    result = compiled_infer({"x": torch.ones(4, 4)})
    result_on_cpu = result["out"].cpu()
    print(result_on_cpu)


if __name__ == "__main__":
    run_infer()

Then, running ./examples/exec_with_env.sh python3 examples/infer.py will perform inference with the model (model = torch.nn.Linear(4, 4)), which stands in for a pretrained model here. Let's do this.

$ ./examples/exec_with_env.sh python3 examples/infer.py
...
# Because the model weight is initialized randomly, you will see a different result.
tensor([[-0.4468, -0.2559,  0.0939,  0.8574],
        [-0.4468, -0.2559,  0.0939,  0.8574],
        [-0.4468, -0.2559,  0.0939,  0.8574],
        [-0.4468, -0.2559,  0.0939,  0.8574]])

In this example, we introduce new APIs in addition to the ones used in add.py.

  • set_tensor_name_in_module(model, "model")
  • context.register_param(p)
  • context.register_buffer(b)

The first one is relatively simple. set_tensor_name_in_module sets names for all tensors in a given torch.nn.Module. You can retrieve the assigned name using the get_tensor_name function.
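
A minimal sketch of this naming step (here we assume get_tensor_name is importable from mlsdk and takes a tensor, returning the name assigned by set_tensor_name_in_module; check the API reference for the exact signature):

import torch
from mlsdk import get_tensor_name, set_tensor_name_in_module

model = torch.nn.Linear(4, 4)
set_tensor_name_in_module(model, "model")

# Assumed usage: retrieve the name assigned to each parameter tensor.
for p in model.parameters():
    print(get_tensor_name(p))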

context.register_param and context.register_buffer register a given tensor with a given name. After registration, you can synchronize registered tensors between the PyTorch side and the MN-Core side by calling synchronize. The usage of synchronize is explained in the next section.

Example: Training a Model

Finally, let's go through an example of training a model. Save the following code as train.py:

import torch
from mlsdk import (
    Context,
    MNCoreMomentumSGD,
    MNDevice,
    set_buffer_name_in_optimizer,
    set_tensor_name_in_module,
    storage,
)


def run_train():
    class Model(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(10, 1)

        def forward(self, *, x):
            return {"y": self.linear(x)}

    device = MNDevice("mncore2:0")
    context = Context(device)
    Context.switch_context(context)

    sample = {"x": torch.ones(1, 10)}
    model = Model()
    model.linear.weight.data.fill_(1.0)
    model.linear.bias.data.fill_(1.0)
    model.train()

    set_tensor_name_in_module(model, "model")
    for p in model.parameters():
        context.register_param(p)
    for b in model.buffers():
        context.register_buffer(b)

    optimizer = MNCoreMomentumSGD(model.parameters(), 0.1, 0, 0.9, 1.0)
    set_buffer_name_in_optimizer(optimizer, "optim0")
    context.register_optimizer_buffers(optimizer)

    def f(inp):
        optimizer.zero_grad()
        loss = torch.relu(model(**inp)["y"]).sum()
        loss.backward()  # type: ignore[no-untyped-call]
        optimizer.step()
        return {"loss": loss}

    compiled_f = context.compile(
        f,
        sample,
        storage.path("/tmp/train"),
    )

    print(f"Original {model.linear.bias=}")
    for _ in range(10):
        compiled_f(sample)
    context.synchronize()
    print(f"Optimized {model.linear.bias=}")


if __name__ == "__main__":
    run_train()

Let's run this example. You can see that the model parameters are updated by the optimizer.

In this example, we use a new API, synchronize(). It synchronizes the tensors on MN-Core DRAM with the corresponding PyTorch tensors. After calling this function, you can check the optimized parameters in your program.

$ ./examples/exec_with_env.sh python3 examples/train.py
...
Original model.linear.bias=Parameter containing:
tensor([0.9000], requires_grad=True)
Optimized model.linear.bias=Parameter containing:
tensor([-2.0413], requires_grad=True)

How to Port Your Workload to MN-Core 2

Running Your Workload Using pfvm:cpu

Before running your workload on MN-Core 2, we highly recommend running it on pfvm:cpu first. With this backend, you can ensure that your code can be converted to ONNX correctly. For more details, please check the Backends Other than mncore2:0 section.

We, the MN-Core compiler team, believe that users can go through this process without specific knowledge of MN-Core. If the function passed to compile only contains static shapes, the process should be successful.

Note on Static Shape Requirements

When compiling functions for execution on the MN-Core 2 backend, it is important to note that the functions passed to the compiler must only contain static shapes. Static shapes mean that tensor dimensions are fixed at compile time and that the computation graph does not depend on dynamic control flow such as if statements or loops. In particular, the following restrictions apply to the functions:

  • Tensors must have fixed dimensions at compile-time.
  • The shape of each tensor must be known and constant throughout the function.
  • Control flow constructs such as if statements, loops, and recursion are not supported in functions.

If the function contains dynamic constructs that rely on runtime information or user input to determine loop counts or other control flow decisions, the compiled function will always follow the decisions made at trace time with the sample input, regardless of any changes to the user input or runtime context.
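
For illustration, here is a hypothetical function (not one of the examples above) with a data-dependent branch. The branch is resolved once, when the function is traced with the sample input passed to compile, and the compiled function keeps that decision for every subsequent call:

import torch


def f(input: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    x = input["x"]
    # Evaluated once at trace time with the sample input; the compiled
    # function always executes whichever branch was taken then.
    if x.sum() > 0:
        return {"out": x * 2}
    return {"out": x - 1}

# If the sample input has x.sum() > 0, the compiled function computes x * 2
# even for later inputs where x.sum() <= 0.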

Replacing pfvm:cpu with emu2:0 or mncore2:0

After running your workload on pfvm:cpu, let's compile your code for MN-Core 2. If you have MN-Core 2 in your pod, you can use mncore2:0. However, the MN-Core 2 emulator backend, emu2:0, is sufficient for checking whether your code can run on MN-Core 2.

During this process, you may encounter several compiler errors. When your code fails to compile, one option is to adjust the input size. For example, adjusting the batch size or the image size can be effective. Also, it is recommended to use a batch size that is a power of 2.

Practical Examples Using MLSDK API

Here are some examples using the MLSDK API that are included in the MLSDK container image.

MNIST

The script for training MNIST is available at mnist.py.

$ ./examples/exec_with_env.sh python3 examples/mnist.py
...
epoch 8, loss 0.14200307160075182
epoch 9, iter    0, loss 0.3466796875
epoch 9, iter  100, loss 0.12501551845286152
epoch 9, iter  200, loss 0.11559456260643193
epoch 9, iter  300, loss 0.12726582522408106
epoch 9, iter  400, loss 0.12854511630802673
epoch 9, iter  500, loss 0.13125214200771737
epoch 9, iter  600, loss 0.1303801415565605
epoch 9, iter  700, loss 0.1311912290039824
epoch 9, iter  800, loss 0.13359734881087937
epoch 9, iter  900, loss 0.13450224351935858
epoch 9, loss 0.13449914363812235
...
eval acc 0.9567

timm Model Inference

timm is a library that contains many computer vision models.

You can run a timm model on MN-Core 2 using run_timm.py. This script classifies an image and prints the result. Please use run_timm.sh to run the Python script; this shell script installs timm in a venv environment and then runs the inference.

$ ./examples/run_timm.sh --model_name resnet50.a1h_in1k --batch_size 16
...
MNCore2 top-5 classes:
- espresso (967)
- cup (968)
- chocolate sauce, chocolate syrup (960)
- consomme (925)
- eggnog (969)
Torch top-5 classes:
- espresso (967)
- cup (968)
- chocolate sauce, chocolate syrup (960)
- eggnog (969)
- consomme (925)

You can use other timm models.

$ ./examples/run_timm.sh --model_name mobilenetv3_small_050.lamb_in1k --batch_size 16
...
MNCore2 top-5 classes:
- cup (968)
- trifle (927)
- face powder (551)
- ice cream, icecream (928)
- coffee mug (504)
Torch top-5 classes:
- cup (968)
- trifle (927)
- ice cream, icecream (928)
- face powder (551)
- coffee mug (504)

Image Generation from Text Prompt

You can get a beautiful image of Fujisan with the following command. We need to skip VAE decoder compilation for now.

$ ./examples/run_stable_diffusion.sh --skip_vae_decoder_compilation --prompt "Fujisan" --device mncore2:0
...
Output image saved at /tmp/mlsdk_stable_diffusion_out/output.png

If an out-of-memory error occurs, you can set the number of threads used for the compilation with the num_compiler_threads option, e.g., --num_compiler_threads 32.

Large Language Model (LLM) Inference

The script for running Llama 1B inference is available at llm_infer.py. You need to install the transformers library and set the environment variable MNCORE_USE_EXTERNAL_DATA_FORMAT=1 to run the script. run_llm_infer.sh does this setup in addition to what exec_with_env.sh does.

Currently, the compilation of the prefill phase requires specifying some environment variables. Therefore, we split the example into two parts, one for the compilation of the prefill phase, and the other for the decoding phase. This will be addressed in later releases.

Let's run the inference. The prefill phase runs on MN-Core, and the decoding phase runs on CPU.

You can specify the prompt by additionally passing arguments like --prompt 'What is the meaning of life?'. Note that the --num_compiler_threads option is also available here.

$ MNCORE_USE_LEGACY_ONNX_EXPORTER=1 CODEGEN_TIME_SLICE_SCATTERED_INDEXING_BCAST=1 CODEGEN_OP_DEF=Gather=GatherBcast ./examples/run_llm_infer.sh --compile_prefill --prepare_attention_mask_on_cpu --device mncore2:0
...
=========== Generated with compilation ==========
 </s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s><s> <|system|>
You are a friendly chatbot who is an expert on MN-Core.</s>
<|user|>
The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.</s>
<|assistant|>
Yes, that's correct. The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens using 16 A100-40G GPUs. The training has started on 2023-09
========== Generated with model.generate ==========
 <s> <|system|>
You are a friendly chatbot who is an expert on MN-Core.</s>
<|user|>
The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.</s>
<|assistant|>
Yes, that's correct. The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens using 16 A100-40G GPUs. The training has started on 2023-09
Generated outputs matched.

Let's do the opposite: prefill on CPU, and decode on MN-Core.

$ MNCORE_USE_LEGACY_ONNX_EXPORTER=1 ./examples/run_llm_infer.sh --compile_decode --prepare_attention_mask_on_cpu --device mncore2:0
...
=========== Generated with compilation ==========
 </s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s><s> <|system|>
You are a friendly chatbot who is an expert on MN-Core.</s>
<|user|>
The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.</s>
<|assistant|>
Yes, that's correct. The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens using 16 A100-40G GPUs. The training has started on 2023-09
========== Generated with model.generate ==========
 <s> <|system|>
You are a friendly chatbot who is an expert on MN-Core.</s>
<|user|>
The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of "just" 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.</s>
<|assistant|>
Yes, that's correct. The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens using 16 A100-40G GPUs. The training has started on 2023-09
Generated outputs matched.

After the compilation and execution, the above script compares the generated sequence with the output of the generation API of transformers. The results may not match exactly, depending on the model weights and prompt. However, they are unlikely to be completely different.

Tools for Debugging and Profiling

When working with MN-Core, you may often need to debug and profile complex scenarios. In this section, we will introduce some useful tools for debugging.

Backends Other than mncore2:0

pfvm:cpu Backend

The compile function converts the passed function into an intermediate representation called ONNX before executing it. Occasionally, there may be errors in the conversion from the function to ONNX or from ONNX to MN-Core machine instructions. To detect such cases, the pfvm:cpu backend comes in handy.

The pfvm:cpu backend allows you to run ONNX models on the CPU without compiling them for MN-Core. You can use it by specifying "pfvm:cpu" instead of "mncore2:0" when creating the context.

device = mlsdk.MNDevice("pfvm:cpu")
context = Context(device)
Context.switch_context(context)

Programs that do not work correctly with "pfvm:cpu" will not work with "mncore2:0" either. Therefore, this backend should be the first thing to try when a program does not work on MN-Core.

emu2:0 Backend

Even if you don't have an MN-Core 2 device installed on your system, you can still test your program's behavior on MN-Core 2 using an emulator. The "emu2:0" backend is specifically designed for this purpose. This is essentially an emulator for MN-Core 2 that allows your program to run in a simulated environment just like a real MN-Core 2 device would. By using this backend, you can ensure that your code will work as expected before running it on the real device.

Just like the other backends, you can use it with the following code.

device = mlsdk.MNDevice("emu2:0")
context = Context(device)
Context.switch_context(context)
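
Because these backends share the same API, one convenient pattern is to make the device string a parameter of your script, so the same code can be debugged on pfvm:cpu, checked on emu2:0, and finally run on mncore2:0. Here is a minimal sketch, mirroring the --device flag used by example scripts such as run_stable_diffusion.sh (the flag itself is our own convention in this sketch, not part of MLSDK):

import argparse

from mlsdk import Context, MNDevice

parser = argparse.ArgumentParser()
parser.add_argument(
    "--device",
    default="pfvm:cpu",
    choices=["pfvm:cpu", "emu2:0", "mncore2:0"],
)
args = parser.parse_args()

# The rest of the program stays the same regardless of the chosen backend.
device = MNDevice(args.device)
context = Context(device)
Context.switch_context(context)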

codegen-dashboard

serve Subcommand

codegen-dashboard is a web-based dashboard for the MN-Core compiler. It allows users to view ONNX files, logs, error messages, and more. The executable is named codegen-dashboard-external and is located under the /opt/pfn/pfcomp/codegen/build/codegen-dashboard/ directory. To start codegen-dashboard-external, execute the following command and open the indicated URL in your web browser.

$ /opt/pfn/pfcomp/codegen/build/codegen-dashboard/codegen-dashboard-external serve /tmp/add_two_tensors/add
Started to serve at http://localhost:8327

Please note that this document assumes that you are working inside a Kubernetes pod, hence http://localhost:8327/ is inaccessible from your local machine. To access the page, enable port forwarding using the command below:

$ kubectl port-forward <POD_NAME> 8327:8327

The codegen-dashboard serve <codegen_dir> [flags] command requires the location of the MN-Core compiler output directory, referred to as codegen_dir. In Example: Running a Pure Function on MN-Core, the directory is generated at /tmp/add_two_tensors/add¹. codegen-dashboard can also accept optional flags such as --port <port number>, --host <hostname>, etc. For the comprehensive list of available flags, refer to codegen-dashboard serve --help.

genhtml Subcommand

codegen-dashboard provides a genhtml <codegen_dir> [flags] subcommand. This command generates HTML files inside the specified codegen_dir, which can be accessed via file URIs. The genhtml subcommand offers a subset of the features available in the serve command, but is very handy when sharing with others.

$ /opt/pfn/pfcomp/codegen/build/codegen-dashboard/codegen-dashboard-external genhtml /tmp/add_two_tensors/add
Writing /tmp/add_two_tensors/add/model.onnx.html ...
Writing /tmp/add_two_tensors/add/logviewer.html ...
Writing /tmp/add_two_tensors/add/index.html ...

Use Perfetto for Profiling

Perfetto UI

You can profile your code with Perfetto, which loads trace data from a local file and shows a detailed analysis in an interactive way. The module mncore.utils.perfetto_trace provides an easy way to generate files that can be loaded into the Perfetto UI.

To profile your code, first import the trace_scope function:

from mlsdk import trace_scope

Then enclose the entrypoint of your code with:

with trace_scope("trace-file-name.pb"):
    your_entry_point()  # e.g. main() or run() or whatever

And that's all! The trace file will be saved to "trace-file-name.pb" when you run the code. Here is the example of adding two tensors, already introduced in this tutorial, but with tracing enabled. Save the following code as add_trace.py:

import torch
from mlsdk import CacheOptions, Context, MNDevice, storage, trace_scope


def run_add():
    device = MNDevice("mncore2:0")
    context = Context(device)
    Context.switch_context(context)

    def add(input: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
        x = input["x"]
        y = input["y"]
        return {"out": x + y}

    sample = {"x": torch.randn(3, 4), "y": torch.randn(3, 4)}

    compiled_add = context.compile(
        add,
        sample,
        storage.path("/tmp/add_two_tensors"),
        name="add",
        cache_options=CacheOptions("/tmp/add_two_tensors_cache"),
    )
    result = compiled_add({"x": torch.ones(3, 4), "y": torch.ones(3, 4)})
    result_on_cpu = result["out"].cpu()
    print(f"{result_on_cpu=}")
    assert torch.allclose(result_on_cpu, torch.ones(3, 4) * 2)


if __name__ == "__main__":
    with trace_scope("trace.pb"):
        run_add()

Type the following command to check the result:

$ ./exec_with_env.sh python3 add_trace.py
...
result_on_cpu=tensor([[2., 2., 2., 2.],
        [2., 2., 2., 2.],
        [2., 2., 2., 2.]])
[908.203] perfetto.cc:53691       Tracing session 1 ended, total sessions:0
$ ls *.pb
trace.pb

Then you can access Perfetto UI and open the file (trace.pb in this example) to check the result.


  1. If you omit the name keyword argument in the compile() method, the codegen_dir will be generated at /tmp/add_two_tensors/<random string> by default.