Puzzle 17: 1D Convolution Op

Bridging to Python with MAX Graph

You have now entered Part IV of the GPU puzzle journey: bridging to Python with MAX Graph custom ops.

이전 νΌμ¦λ“€μ—μ„œλŠ” Mojo둜 효율적인 GPU 컀널을 μž‘μ„±ν•˜λŠ” 방법을 λ°°μ› μŠ΅λ‹ˆλ‹€. μ΄μ œλΆ€ν„°λŠ” λ‹€μŒμ„ μ•Œμ•„λ΄…λ‹ˆλ‹€:

  • 컀널을 νŒŒμ΄μ¬μ—μ„œ ν˜ΈμΆœν•  수 μžˆλŠ” μ»€μŠ€ν…€ μ—°μ‚°μœΌλ‘œ νŒ¨ν‚€μ§•ν•˜κΈ°
  • MAX κ·Έλž˜ν”„ μ‹œμŠ€ν…œκ³Ό ν†΅ν•©ν•˜μ—¬ λ¨Έμ‹ λŸ¬λ‹μ„ κ°€μ†ν•˜κΈ°
  • ν•˜μ΄λ ˆλ²¨ 파이썬 API와 둜우레벨 GPU μ½”λ“œ μ‚¬μ΄μ˜ κ°„κ·Ή λ©”μš°κΈ°

이λ₯Ό 톡해 μ΅μˆ™ν•œ 파이썬 ν™˜κ²½μ—μ„œ μž‘μ—…ν•˜λ©΄μ„œλ„ Mojo GPU μ»€λ„μ˜ μ„±λŠ₯을 ν™œμš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

Overview

In Puzzle 13: 1D Convolution you implemented a 1D convolution kernel that runs efficiently on the GPU. This time you will turn that kernel into a custom operation that can be called directly from Python through a MAX graph.

The 1D convolution kernel we will use is already implemented:

comptime TPB = 15
comptime BLOCKS_PER_GRID = (2, 1)


fn conv1d_kernel[
    in_layout: Layout,
    out_layout: Layout,
    conv_layout: Layout,
    input_size: Int,
    conv_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, MutAnyOrigin],
    kernel: LayoutTensor[dtype, conv_layout, MutAnyOrigin],
):
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)
    # first: need to account for padding
    shared_a = LayoutTensor[
        dtype,
        Layout.row_major(TPB + conv_size - 1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    shared_b = LayoutTensor[
        dtype,
        Layout.row_major(conv_size),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    if global_i < input_size:
        shared_a[local_i] = input[global_i]

    # second: load elements needed for convolution at block boundary
    if local_i < conv_size - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < input_size:
            shared_a[TPB + local_i] = input[next_idx]
        else:
            shared_a[TPB + local_i] = 0

    if local_i < conv_size:
        shared_b[local_i] = kernel[local_i]

    barrier()

    if global_i < input_size:
        var local_sum: output.element_type = 0

        @parameter
        for j in range(conv_size):
            if local_i + j < TPB + conv_size - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum
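
What each block does can be sketched in pure Python as a sanity check. This is illustrative only: it mirrors the shared-memory tiling logic of the kernel above (load TPB elements plus a halo of conv_size - 1, then take a dot product per output element), not actual GPU execution:

```python
import numpy as np

TPB = 15
input_arr = np.arange(15, dtype=np.float32)
conv = np.array([0, 1, 2, 3], dtype=np.float32)
conv_size = len(conv)
out = np.zeros_like(input_arr)

for block_idx in range(2):  # BLOCKS_PER_GRID = (2, 1)
    # shared_a holds TPB elements plus a halo of conv_size - 1 from the next block
    shared_a = np.zeros(TPB + conv_size - 1, dtype=np.float32)
    start = block_idx * TPB
    for local_i in range(TPB + conv_size - 1):
        global_i = start + local_i
        if global_i < len(input_arr):
            shared_a[local_i] = input_arr[global_i]
    # each "thread" computes one output element from shared memory
    for local_i in range(TPB):
        global_i = start + local_i
        if global_i < len(input_arr):
            out[global_i] = np.dot(shared_a[local_i : local_i + conv_size], conv)

print(out.tolist())
```

The second block contributes nothing here because the whole 15-element input fits in block 0; the halo loads and bounds checks are what keep the boundary elements correct.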


The key concepts in this puzzle are:

  1. Custom op registration: understanding how to expose Mojo functions to Python via the @compiler.register decorator
  2. Custom op packaging: learning how to package Mojo code for use with MAX graphs
  3. Python integration: calling custom operations from Python through a MAX graph
  4. Cross-language data flow: managing data types and memory between Python and the GPU

The custom operation will:

  • Accept NumPy arrays as input from Python
  • Transfer that data to the GPU
  • Execute the optimized convolution kernel
  • Return the results back to Python

Completing this puzzle builds a seamless bridge between Python's rich ecosystem and Mojo's raw GPU performance.

Code to complete

To complete this puzzle, you only need to fill in one line that calls conv1d_kernel in conv1d.mojo:

import compiler
from runtime.asyncrt import DeviceContextPtr
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer
from gpu.host import DeviceBuffer


@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[
        # The kind of device this will be run on: "cpu" or "gpu"
        target: StaticString,
        input_size: Int,
        conv_size: Int,
        dtype: DType = DType.float32,
    ](
        output: OutputTensor[rank=1],
        input: InputTensor[rank = output.rank],
        kernel: InputTensor[rank = output.rank],
        # the context is needed for some GPU calls
        ctx: DeviceContextPtr,
    ) raises:
        output_tensor = output.to_layout_tensor()
        input_tensor = input.to_layout_tensor()
        kernel_tensor = kernel.to_layout_tensor()
        comptime in_layout = input_tensor.layout
        comptime out_layout = output_tensor.layout
        comptime conv_layout = kernel_tensor.layout

        @parameter
        if target == "gpu":
            gpu_ctx = ctx.get_device_context()
            # making sure the output tensor is zeroed out before the kernel is called
            gpu_ctx.enqueue_memset(
                DeviceBuffer[output_tensor.dtype](
                    gpu_ctx,
                    output_tensor.ptr,
                    input_size,
                    owning=False,
                ),
                0,
            )

            # FILL ME IN with 1 line calling our conv1d_kernel

        elif target == "cpu":
            # we can fallback to CPU
            pass
        else:
            raise Error("Unsupported target: " + target)


View the full file: problems/p17/op/conv1d.mojo

You can run the puzzle with:

pixi run p17
pixi run -e amd p17
pixi run -e apple p17
uv run poe p17

On success, you should see output similar to:

Input array: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]
Convolution kernel: [0. 1. 2. 3.]
Expected result (NumPy calculation): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
Compiling 1D convolution graph...
Executing 1D convolution...
1D Convolution result (custom Mojo kernel): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
Verification passed: Custom kernel results match NumPy calculation
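
The "Expected result (NumPy calculation)" line can be reproduced with a short NumPy sketch of the same tail-padded convolution:

```python
import numpy as np

input_arr = np.arange(15, dtype=np.float32)
conv = np.array([0, 1, 2, 3], dtype=np.float32)

# zero-pad the tail so reads past the end contribute 0, as in the kernel
padded = np.concatenate([input_arr, np.zeros(len(conv) - 1, dtype=np.float32)])
expected = [float(np.dot(padded[i : i + len(conv)], conv)) for i in range(len(input_arr))]
print(expected)
```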

This output confirms that the custom MAX graph operation correctly implements the 1D convolution algorithm.

Solution

Solving this puzzle requires integrating the 1D convolution kernel with the MAX graph system. The key is calling the kernel correctly from the execute method of the Conv1DCustomOp struct.

The solution is:

            comptime kernel = conv1d_kernel[
                in_layout, out_layout, conv_layout, input_size, conv_size
            ]
            gpu_ctx.enqueue_function[kernel, kernel](
                output_tensor,
                input_tensor,
                kernel_tensor,
                grid_dim=BLOCKS_PER_GRID,
                block_dim=(TPB, 1),
            )

The important work this single call performs:

  1. Calls enqueue_function on the GPU context (gpu_ctx is of type DeviceContext) to schedule the kernel launch
  2. Passes the required layout and size information as compile-time parameters
  3. Supplies the output, input, and kernel tensors as runtime arguments
  4. Configures the launch grid with the appropriate dimensions

전체 λ§₯λ½μ—μ„œ μ–΄λ–»κ²Œ λ™μž‘ν•˜λŠ”μ§€ μ‚΄νŽ΄λ³΄κ² μŠ΅λ‹ˆλ‹€:

Python-Mojo integration flow

  1. Python side (problems/p17/p17.py):

    • Creates NumPy arrays for the input and the convolution kernel
    • Calls the conv_1d() function, which wraps the operation in a MAX graph
    • Converts the NumPy arrays to MAX driver Buffers via Buffer.from_numpy(input).to(device)
    • Loads the custom op package with custom_extensions=[mojo_kernels]
  2. Graph construction:

    • Defines the input and output tensor types with TensorType
    • Specifies the operation's parameters via parameters={...}
    • Creates the computation graph with Graph("conv_1d_graph", ...)
    • Invokes the custom operation with ops.custom(name="conv1d", ...)
  3. Custom op registration:

    • The @compiler.register("conv1d") decorator exposes the operation to MAX graphs; see @compiler.register
    • The parameters of the execute method define the interface (inputs, outputs, context)
    • The input and output tensors are converted to LayoutTensors so the kernel can work with them
    • The device context manages GPU memory allocation and kernel execution
  4. Kernel execution:

    • When model.execute(...) is called, conv1d_kernel receives the data
    • grid_dim and block_dim configure the GPU thread layout
    • result.to(CPU()) transfers the result back to the CPU
    • NumPy verification compares the result against the expected output

Key components in detail

  1. Custom op struct:

    @compiler.register("conv1d")
    struct Conv1DCustomOp:
        @staticmethod
        fn execute[target: StaticString, input_size: Int, conv_size: Int, dtype: DType = DType.float32](
            output: OutputTensor[rank=1],
            input: InputTensor[dtype = output.dtype, rank = output.rank],
            kernel: InputTensor[dtype = output.dtype, rank = output.rank],
            ctx: DeviceContextPtr,
        ) raises:
            # implementation

    • target indicates the device type ("gpu" or "cpu")
    • input_size and conv_size are parameters passed in from Python
    • The tensor types guarantee correct shape and dtype checking
    • The method is marked raises so errors can be handled properly
  2. Tensor conversion:

    output_tensor = output.to_layout_tensor()
    input_tensor = input.to_layout_tensor()
    kernel_tensor = kernel.to_layout_tensor()

    • Converts the MAX graph tensors into Mojo LayoutTensors
    • Lets the kernel operate on the tensors directly
    • Extracts the layouts for compile-time optimization
  3. Device context usage:

    gpu_ctx = ctx.get_device_context()
    gpu_ctx.enqueue_memset(...)  # initialize the output buffer
    gpu_ctx.enqueue_function[..., ...](...) # schedule the kernel

    • The device context manages GPU resources
    • Memory operations guarantee correct buffer state
    • Enqueueing the function schedules the kernel launch

This solution demonstrates the complete flow of Python data through a MAX graph to GPU execution and back. It leverages Mojo's powerful type system and parametric functions to build an efficient, type-safe accelerated operation.

Understanding MAX Graph custom ops

For more detail, see the tutorials below:

Custom op registration

The heart of creating a custom operation is the @compiler.register decorator and its associated struct:

@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[...](
        output: OutputTensor[rank=1],
        input: InputTensor[dtype = output.dtype, rank = output.rank],
        kernel: InputTensor[dtype = output.dtype, rank = output.rank],
        ctx: DeviceContextPtr,
    ) raises:
        # implementation

Key components of the registration:

  • The name passed to the decorator ("conv1d") is what Python code uses to invoke this operation
  • The struct must provide an execute method with the correct signature
  • The OutputTensor and InputTensor types define the interface with Python data
  • DeviceContextPtr provides access to the execution environment

Custom op packaging

Before a custom operation can be used from Python, it must be packaged:

mojo package op -o op.mojopkg

This command:

  1. Compiles the Mojo code into a deployable package
  2. Generates the metadata MAX graphs need to understand the operation
  3. Produces a binary artifact (op.mojopkg) that can be loaded from Python

The package must be placed where MAX graphs can find it, typically in a directory accessible from your Python code.

Python integration

Here is how the custom operation is used from the Python side:

# Path to the directory containing the Mojo operations
mojo_kernels = Path(__file__).parent / "op"

# Configure the graph with the custom conv1d operation
with Graph(
    "conv_1d_graph",
    input_types=[...],
    custom_extensions=[mojo_kernels],  # Load the custom op package
) as graph:
    # Define the graph's inputs
    input_value, kernel_value = graph.inputs

    # Invoke the custom operation by name
    output = ops.custom(
        name="conv1d",  # Must match the name in @compiler.register
        values=[input_value, kernel_value],
        out_types=[...],
        parameters={
            "input_size": input_tensor.shape[0],
            "conv_size": kernel_tensor.shape[0],
            "dtype": dtype,
        },
    )[0].tensor

The key elements are:

  1. Specify the path to the custom operations with custom_extensions
  2. Call ops.custom with the registered operation name
  3. Pass input values and parameters that match the operation's signature