Puzzle 17: 1D Convolution Op

Bridging to Python with MAX Graph

You have now entered Part IV of the GPU puzzle journey: bridging to Python with MAX Graph custom ops.

이전 νΌμ¦λ“€μ—μ„œλŠ” Mojo둜 효율적인 GPU 컀널을 μž‘μ„±ν•˜λŠ” 방법을 λ°°μ› μŠ΅λ‹ˆλ‹€. μ΄μ œλΆ€ν„°λŠ” λ‹€μŒμ„ μ•Œμ•„λ΄…λ‹ˆλ‹€:

  • 컀널을 νŒŒμ΄μ¬μ—μ„œ ν˜ΈμΆœν•  수 μžˆλŠ” μ»€μŠ€ν…€ μ—°μ‚°μœΌλ‘œ νŒ¨ν‚€μ§•ν•˜κΈ°
  • MAX κ·Έλž˜ν”„ μ‹œμŠ€ν…œκ³Ό ν†΅ν•©ν•˜μ—¬ λ¨Έμ‹ λŸ¬λ‹μ„ κ°€μ†ν•˜κΈ°
  • ν•˜μ΄λ ˆλ²¨ 파이썬 API와 둜우레벨 GPU μ½”λ“œ μ‚¬μ΄μ˜ κ°„κ·Ή λ©”μš°κΈ°

이λ₯Ό 톡해 μ΅μˆ™ν•œ 파이썬 ν™˜κ²½μ—μ„œ μž‘μ—…ν•˜λ©΄μ„œλ„ Mojo GPU μ»€λ„μ˜ μ„±λŠ₯을 ν™œμš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€.

Overview

In Puzzle 13: 1D Convolution you implemented a 1D convolution kernel that runs efficiently on the GPU. This time you will turn that kernel into a custom operation that can be called directly from Python through a MAX graph.

The 1D convolution kernel we will use is already implemented:

comptime TPB = 15
comptime BLOCKS_PER_GRID = (2, 1)


fn conv1d_kernel[
    in_layout: Layout,
    out_layout: Layout,
    conv_layout: Layout,
    input_size: Int,
    conv_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, MutAnyOrigin],
    kernel: LayoutTensor[dtype, conv_layout, MutAnyOrigin],
):
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)
    # first: need to account for padding
    shared_a = LayoutTensor[
        dtype,
        Layout.row_major(TPB + conv_size - 1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    shared_b = LayoutTensor[
        dtype,
        Layout.row_major(conv_size),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    if global_i < input_size:
        shared_a[local_i] = input[global_i]

    # second: load elements needed for convolution at block boundary
    if local_i < conv_size - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < input_size:
            shared_a[TPB + local_i] = input[next_idx]
        else:
            shared_a[TPB + local_i] = 0

    if local_i < conv_size:
        shared_b[local_i] = kernel[local_i]

    barrier()

    if global_i < input_size:
        var local_sum: output.element_type = 0

        @parameter
        for j in range(conv_size):
            if local_i + j < TPB + conv_size - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum
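
What each block does can be sketched in pure Python as a sanity check. This is illustrative only: it mirrors the shared-memory tiling logic of the kernel above (load TPB elements plus a halo of conv_size - 1, then take a dot product per output element), not actual GPU execution:

```python
import numpy as np

TPB = 15
input_arr = np.arange(15, dtype=np.float32)
conv = np.array([0, 1, 2, 3], dtype=np.float32)
conv_size = len(conv)
out = np.zeros_like(input_arr)

for block_idx in range(2):  # BLOCKS_PER_GRID = (2, 1)
    # shared_a holds TPB elements plus a halo of conv_size - 1 from the next block
    shared_a = np.zeros(TPB + conv_size - 1, dtype=np.float32)
    start = block_idx * TPB
    for local_i in range(TPB + conv_size - 1):
        global_i = start + local_i
        if global_i < len(input_arr):
            shared_a[local_i] = input_arr[global_i]
    # each "thread" computes one output element from shared memory
    for local_i in range(TPB):
        global_i = start + local_i
        if global_i < len(input_arr):
            out[global_i] = np.dot(shared_a[local_i : local_i + conv_size], conv)

print(out.tolist())
```

The second block contributes nothing here because the whole 15-element input fits in block 0; the halo loads and bounds checks are what keep the boundary elements correct.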


The key concepts in this puzzle are:

  1. Custom op registration: understanding how to expose Mojo functions to Python via the @compiler.register decorator
  2. Custom op packaging: learning how to package Mojo code for use with MAX graphs
  3. Python integration: calling custom operations from Python through a MAX graph
  4. Cross-language data flow: managing data types and memory between Python and the GPU

The custom operation will:

  • Accept NumPy arrays as input from Python
  • Transfer that data to the GPU
  • Execute the optimized convolution kernel
  • Return the results back to Python

Completing this puzzle builds a seamless bridge between Python's rich ecosystem and Mojo's raw GPU performance.

Code to complete

To complete this puzzle, you only need to fill in one line that calls conv1d_kernel in conv1d.mojo:

import compiler
from runtime.asyncrt import DeviceContextPtr
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer
from gpu.host import DeviceBuffer


@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[
        # The kind of device this will be run on: "cpu" or "gpu"
        target: StaticString,
        input_size: Int,
        conv_size: Int,
        dtype: DType = DType.float32,
    ](
        output: OutputTensor[rank=1],
        input: InputTensor[rank = output.rank],
        kernel: InputTensor[rank = output.rank],
        # the context is needed for some GPU calls
        ctx: DeviceContextPtr,
    ) raises:
        output_tensor = output.to_layout_tensor()
        input_tensor = input.to_layout_tensor()
        kernel_tensor = kernel.to_layout_tensor()
        comptime in_layout = input_tensor.layout
        comptime out_layout = output_tensor.layout
        comptime conv_layout = kernel_tensor.layout

        @parameter
        if target == "gpu":
            gpu_ctx = ctx.get_device_context()
            # making sure the output tensor is zeroed out before the kernel is called
            gpu_ctx.enqueue_memset(
                DeviceBuffer[output_tensor.dtype](
                    gpu_ctx,
                    output_tensor.ptr,
                    input_size,
                    owning=False,
                ),
                0,
            )

            # FILL ME IN with 1 line calling our conv1d_kernel

        elif target == "cpu":
            # we can fallback to CPU
            pass
        else:
            raise Error("Unsupported target: " + target)


View the full file: problems/p17/op/conv1d.mojo

You can run the puzzle with:

pixi run p17
pixi run -e amd p17
pixi run -e apple p17
uv run poe p17

On success, you should see output similar to:

Input array: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]
Convolution kernel: [0. 1. 2. 3.]
Expected result (NumPy calculation): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
Compiling 1D convolution graph...
Executing 1D convolution...
1D Convolution result (custom Mojo kernel): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
Verification passed: Custom kernel results match NumPy calculation
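
The "Expected result (NumPy calculation)" line can be reproduced with a short NumPy sketch of the same tail-padded convolution:

```python
import numpy as np

input_arr = np.arange(15, dtype=np.float32)
conv = np.array([0, 1, 2, 3], dtype=np.float32)

# zero-pad the tail so reads past the end contribute 0, as in the kernel
padded = np.concatenate([input_arr, np.zeros(len(conv) - 1, dtype=np.float32)])
expected = [float(np.dot(padded[i : i + len(conv)], conv)) for i in range(len(input_arr))]
print(expected)
```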

This output confirms that the custom MAX graph operation correctly implements the 1D convolution algorithm.

Solution

Solving this puzzle requires integrating the 1D convolution kernel with the MAX graph system. The key is calling the kernel correctly from the execute method of the Conv1DCustomOp struct.

The solution is:

            comptime kernel = conv1d_kernel[
                in_layout, out_layout, conv_layout, input_size, conv_size
            ]
            gpu_ctx.enqueue_function[kernel, kernel](
                output_tensor,
                input_tensor,
                kernel_tensor,
                grid_dim=BLOCKS_PER_GRID,
                block_dim=(TPB, 1),
            )

The important work this single call performs:

  1. Calls enqueue_function on the GPU context (gpu_ctx is of type DeviceContext) to schedule the kernel launch
  2. Passes the required layout and size information as compile-time parameters
  3. Supplies the output, input, and kernel tensors as runtime arguments
  4. Configures the launch grid with the appropriate dimensions

전체 λ§₯λ½μ—μ„œ μ–΄λ–»κ²Œ λ™μž‘ν•˜λŠ”μ§€ μ‚΄νŽ΄λ³΄κ² μŠ΅λ‹ˆλ‹€:

Python-Mojo integration flow

  1. Python side (problems/p17/p17.py):

    • Creates NumPy arrays for the input and the convolution kernel
    • Calls the conv_1d() function, which wraps the operation in a MAX graph
    • Converts the NumPy arrays to MAX driver Buffers via Buffer.from_numpy(input).to(device)
    • Loads the custom op package with custom_extensions=[mojo_kernels]
  2. Graph construction:

    • Defines the input and output tensor types with TensorType
    • Specifies the operation's parameters via parameters={...}
    • Creates the computation graph with Graph("conv_1d_graph", ...)
    • Invokes the custom operation with ops.custom(name="conv1d", ...)
  3. Custom op registration:

    • The @compiler.register("conv1d") decorator exposes the operation to MAX graphs; see @compiler.register
    • The parameters of the execute method define the interface (inputs, outputs, context)
    • The input and output tensors are converted to LayoutTensors so the kernel can work with them
    • The device context manages GPU memory allocation and kernel execution
  4. Kernel execution:

    • When model.execute(...) is called, conv1d_kernel receives the data
    • grid_dim and block_dim configure the GPU thread layout
    • result.to(CPU()) transfers the result back to the CPU
    • NumPy verification compares the result against the expected output

Key components in detail

  1. Custom op struct:

    @compiler.register("conv1d")
    struct Conv1DCustomOp:
        @staticmethod
        fn execute[target: StaticString, input_size: Int, conv_size: Int, dtype: DType = DType.float32](
            output: OutputTensor[rank=1],
            input: InputTensor[dtype = output.dtype, rank = output.rank],
            kernel: InputTensor[dtype = output.dtype, rank = output.rank],
            ctx: DeviceContextPtr,
        ) raises:
            # implementation

    • target indicates the device type ("gpu" or "cpu")
    • input_size and conv_size are parameters passed in from Python
    • The tensor types guarantee correct shape and dtype checking
    • The method is marked raises so errors can be handled properly
  2. Tensor conversion:

    output_tensor = output.to_layout_tensor()
    input_tensor = input.to_layout_tensor()
    kernel_tensor = kernel.to_layout_tensor()

    • Converts the MAX graph tensors into Mojo LayoutTensors
    • Lets the kernel operate on the tensors directly
    • Extracts the layouts for compile-time optimization
  3. Device context usage:

    gpu_ctx = ctx.get_device_context()
    gpu_ctx.enqueue_memset(...)  # initialize the output buffer
    gpu_ctx.enqueue_function[..., ...](...) # schedule the kernel

    • The device context manages GPU resources
    • Memory operations guarantee correct buffer state
    • Enqueueing the function schedules the kernel launch

This solution demonstrates the complete flow of Python data through a MAX graph to GPU execution and back. It leverages Mojo's powerful type system and parametric functions to build an efficient, type-safe accelerated operation.

Understanding MAX Graph custom ops

For more detail, see the tutorials below:

Custom op registration

The heart of creating a custom operation is the @compiler.register decorator and its associated struct:

@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[...](
        output: OutputTensor[rank=1],
        input: InputTensor[dtype = output.dtype, rank = output.rank],
        kernel: InputTensor[dtype = output.dtype, rank = output.rank],
        ctx: DeviceContextPtr,
    ) raises:
        # implementation

Key components of the registration:

  • The name passed to the decorator ("conv1d") is what Python code uses to invoke this operation
  • The struct must provide an execute method with the correct signature
  • The OutputTensor and InputTensor types define the interface with Python data
  • DeviceContextPtr provides access to the execution environment

Custom op packaging

Before a custom operation can be used from Python, it must be packaged:

mojo package op -o op.mojopkg

This command:

  1. Compiles the Mojo code into a deployable package
  2. Generates the metadata MAX graphs need to understand the operation
  3. Produces a binary artifact (op.mojopkg) that can be loaded from Python

The package must be placed where MAX graphs can find it, typically in a directory accessible from your Python code.

Python integration

Here is how the custom operation is used from the Python side:

# Path to the directory containing the Mojo operations
mojo_kernels = Path(__file__).parent / "op"

# Configure the graph with the custom conv1d operation
with Graph(
    "conv_1d_graph",
    input_types=[...],
    custom_extensions=[mojo_kernels],  # Load the custom op package
) as graph:
    # Define the graph's inputs
    input_value, kernel_value = graph.inputs

    # Invoke the custom operation by name
    output = ops.custom(
        name="conv1d",  # Must match the name in @compiler.register
        values=[input_value, kernel_value],
        out_types=[...],
        parameters={
            "input_size": input_tensor.shape[0],
            "conv_size": kernel_tensor.shape[0],
            "dtype": dtype,
        },
    )[0].tensor

The key elements are:

  1. Specify the path to the custom operations with custom_extensions
  2. Call ops.custom with the registered operation name
  3. Pass input values and parameters that match the operation's signature