Puzzle 17: 1D Convolution Op
Bridging to Python with MAX Graph
Welcome to Part IV of the GPU puzzle journey: bridging to Python with MAX Graph custom ops.
In the previous puzzles, you learned how to write efficient GPU kernels in Mojo. From here on, you'll learn to:
- Package kernels as custom operations that can be called from Python
- Integrate with the MAX Graph system to accelerate machine learning
- Bridge the gap between the high-level Python API and low-level GPU code
This lets you harness the performance of Mojo GPU kernels while working in a familiar Python environment.
Overview
In Puzzle 13: 1D Convolution, you implemented a 1D convolution kernel that runs efficiently on the GPU. This time, you'll turn that kernel into a custom operation that can be called directly from Python through MAX Graph.
The 1D convolution kernel you'll use is already implemented:
comptime TPB = 15
comptime BLOCKS_PER_GRID = (2, 1)


fn conv1d_kernel[
    in_layout: Layout,
    out_layout: Layout,
    conv_layout: Layout,
    input_size: Int,
    conv_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, MutAnyOrigin],
    kernel: LayoutTensor[dtype, conv_layout, MutAnyOrigin],
):
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)
    # first: need to account for padding
    shared_a = LayoutTensor[
        dtype,
        Layout.row_major(TPB + conv_size - 1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    shared_b = LayoutTensor[
        dtype,
        Layout.row_major(conv_size),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    if global_i < input_size:
        shared_a[local_i] = input[global_i]

    # second: load elements needed for convolution at block boundary
    if local_i < conv_size - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < input_size:
            shared_a[TPB + local_i] = input[next_idx]
        else:
            shared_a[TPB + local_i] = 0

    if local_i < conv_size:
        shared_b[local_i] = kernel[local_i]

    barrier()

    if global_i < input_size:
        var local_sum: output.element_type = 0

        @parameter
        for j in range(conv_size):
            if local_i + j < TPB + conv_size - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum
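Before wiring this kernel into a graph, it can help to see its tiling logic in plain Python. The sketch below is our own illustration (the function name `conv1d_tiled` is not part of the puzzle code): it emulates the two blocks of TPB threads, including the (conv_size - 1)-element halo each block loads from the next block's region and the zero padding past the end of the input.

```python
import numpy as np

TPB = 15
BLOCKS = 2  # BLOCKS_PER_GRID = (2, 1) in the Mojo code


def conv1d_tiled(inp: np.ndarray, kern: np.ndarray) -> np.ndarray:
    """Emulate the kernel's per-block shared-memory tiling on the CPU."""
    input_size, conv_size = len(inp), len(kern)
    out = np.zeros(input_size, dtype=inp.dtype)  # mirrors the enqueue_memset
    for block in range(BLOCKS):
        # shared_a holds TPB elements plus a (conv_size - 1)-element halo
        shared_a = np.zeros(TPB + conv_size - 1, dtype=inp.dtype)
        for local_i in range(TPB):
            global_i = block * TPB + local_i
            if global_i < input_size:
                shared_a[local_i] = inp[global_i]
            # threads with local_i < conv_size - 1 also load the halo,
            # zero-filling past the end of the input
            if local_i < conv_size - 1:
                next_idx = global_i + TPB
                shared_a[TPB + local_i] = inp[next_idx] if next_idx < input_size else 0
        # after the barrier, each thread computes one output element
        for local_i in range(TPB):
            global_i = block * TPB + local_i
            if global_i < input_size:
                out[global_i] = sum(
                    shared_a[local_i + j] * kern[j] for j in range(conv_size)
                )
    return out


x = np.arange(15, dtype=np.float32)
k = np.array([0, 1, 2, 3], dtype=np.float32)
print(conv1d_tiled(x, k))  # matches the puzzle's expected result
```

Because the second block's threads all have global_i >= input_size here, the guards leave them idle, exactly as on the GPU.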
The key concepts in this puzzle are:
- Custom op registration: understanding how to expose Mojo functions to Python via the @compiler.register decorator
- Custom op packaging: learning how to package Mojo code so it can be used from a MAX graph
- Python integration: calling the custom operation from Python through a MAX graph
- Cross-language data flow: managing data types and memory between Python and the GPU
The custom operation will:
- Accept NumPy arrays as input from Python
- Transfer that data to the GPU
- Execute the optimized convolution kernel
- Return the result to Python
Completing this puzzle builds a seamless bridge between Python's rich ecosystem and Mojo's raw GPU performance.
Code to complete
To complete this puzzle, you only need to fill in one line calling conv1d_kernel in conv1d.mojo:
import compiler
from runtime.asyncrt import DeviceContextPtr
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer
from gpu.host import DeviceBuffer


@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[
        # The kind of device this will be run on: "cpu" or "gpu"
        target: StaticString,
        input_size: Int,
        conv_size: Int,
        dtype: DType = DType.float32,
    ](
        output: OutputTensor[rank=1],
        input: InputTensor[rank = output.rank],
        kernel: InputTensor[rank = output.rank],
        # the context is needed for some GPU calls
        ctx: DeviceContextPtr,
    ) raises:
        output_tensor = output.to_layout_tensor()
        input_tensor = input.to_layout_tensor()
        kernel_tensor = kernel.to_layout_tensor()
        comptime in_layout = input_tensor.layout
        comptime out_layout = output_tensor.layout
        comptime conv_layout = kernel_tensor.layout

        @parameter
        if target == "gpu":
            gpu_ctx = ctx.get_device_context()
            # making sure the output tensor is zeroed out before the kernel is called
            gpu_ctx.enqueue_memset(
                DeviceBuffer[output_tensor.dtype](
                    gpu_ctx,
                    output_tensor.ptr,
                    input_size,
                    owning=False,
                ),
                0,
            )
            # FILL ME IN with 1 line calling our conv1d_kernel
        elif target == "cpu":
            # we can fallback to CPU
            pass
        else:
            raise Error("Unsupported target: " + target)
View the full file: problems/p17/op/conv1d.mojo
You can run the puzzle with:
pixi run p17
pixi run -e amd p17
pixi run -e apple p17
uv run poe p17
On success, you should see output similar to:
Input array: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.]
Convolution kernel: [0. 1. 2. 3.]
Expected result (NumPy calculation): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14. 0.]
Compiling 1D convolution graph...
Executing 1D convolution...
1D Convolution result (custom Mojo kernel): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14. 0.]
Verification passed: Custom kernel results match NumPy calculation
This output confirms that the custom MAX Graph operation correctly implements the 1D convolution algorithm.
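For reference, the "Expected result (NumPy calculation)" line above can be reproduced with plain NumPy. This is a sketch of one way to compute it (not necessarily how p17.py does it): zero-pad the tail of the input so every position has a full window, then take the dot product of each length-conv_size window with the kernel.

```python
import numpy as np

x = np.arange(15, dtype=np.float32)
k = np.array([0, 1, 2, 3], dtype=np.float32)

# Zero-pad the tail so windows that run past the end read zeros,
# matching the kernel's boundary handling.
padded = np.pad(x, (0, len(k) - 1))
expected = np.array([padded[i : i + len(k)] @ k for i in range(len(x))])
print(expected)
```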
Solution
To solve this puzzle, you need to integrate the 1D convolution kernel with the MAX Graph system. The key is calling the kernel correctly from the execute method of the Conv1DCustomOp struct.
The solution is:
comptime kernel = conv1d_kernel[
    in_layout, out_layout, conv_layout, input_size, conv_size
]
gpu_ctx.enqueue_function[kernel, kernel](
    output_tensor,
    input_tensor,
    kernel_tensor,
    grid_dim=BLOCKS_PER_GRID,
    block_dim=(TPB, 1),
)
This one call does several things:
- Calls enqueue_function on the GPU context (gpu_ctx, of type DeviceContext) to schedule the kernel launch
- Passes the layouts and size information the kernel needs as compile-time parameters
- Provides the output, input, and kernel tensors as runtime arguments
- Configures the launch grid with the appropriate dimensions
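The launch configuration is worth a closer look: grid_dim=BLOCKS_PER_GRID launches 2 blocks of TPB = 15 threads each, i.e. 30 threads for a 15-element input, which is why the kernel guards every access with global_i < input_size. A small sketch of the index arithmetic, using this puzzle's values (illustration only):

```python
TPB = 15
BLOCKS_PER_GRID = (2, 1)
input_size = 15

# global_i = block_dim.x * block_idx.x + thread_idx.x, as in conv1d_kernel
active, idle = [], []
for block_idx in range(BLOCKS_PER_GRID[0]):
    for thread_idx in range(TPB):
        global_i = TPB * block_idx + thread_idx
        (active if global_i < input_size else idle).append(global_i)

# Here the whole second block fails the guard and does no work.
print(len(active), len(idle))
```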
Let's look at how this works in the broader context:
Python-Mojo integration flow
- Python side (problems/p17/p17.py):
  - Creates NumPy arrays for the input and the kernel
  - Calls conv_1d(), which wraps the operation in a MAX graph
  - Converts the NumPy arrays to MAX driver Buffers with Buffer.from_numpy(input).to(device)
  - Loads the custom op package via custom_extensions=[mojo_kernels]
- Graph building:
  - Defines the input and output tensor types with TensorType
  - Specifies operation parameters via parameters={...}
  - Creates the computation graph with Graph("conv_1d_graph", ...)
  - Invokes the custom operation with ops.custom(name="conv1d", ...)
- Custom op registration:
  - The @compiler.register("conv1d") decorator exposes the operation to MAX graphs (see the @compiler.register documentation)
  - The parameters of the execute method define the interface (inputs, outputs, context)
  - The input and output tensors are converted to LayoutTensors so the kernel can use them
  - The device context manages GPU memory allocation and kernel execution
- Kernel execution:
  - When model.execute(...) is called, conv1d_kernel runs on the data
  - grid_dim and block_dim configure the GPU thread organization
  - result.to(CPU()) transfers the result back to the CPU
  - NumPy verification compares the result against the expected output
Key component details
- Custom op struct:
  @compiler.register("conv1d")
  struct Conv1DCustomOp:
      @staticmethod
      fn execute[
          target: StaticString,
          input_size: Int,
          conv_size: Int,
          dtype: DType = DType.float32,
      ](
          output: OutputTensor[rank=1],
          input: InputTensor[dtype = output.dtype, rank = output.rank],
          kernel: InputTensor[dtype = output.dtype, rank = output.rank],
          ctx: DeviceContextPtr,
      ) raises:
          # implementation
  - target indicates the device type ("gpu" or "cpu")
  - input_size and conv_size are parameters passed in from Python
  - The tensor types ensure the correct shapes and dtypes are checked
  - raises enables proper error handling
- Tensor conversion:
  output_tensor = output.to_layout_tensor()
  input_tensor = input.to_layout_tensor()
  kernel_tensor = kernel.to_layout_tensor()
  - Converts MAX Graph tensors to Mojo LayoutTensors
  - Lets the kernel operate on the tensors directly
  - Extracts the layouts for compile-time optimization
- Device context usage:
  gpu_ctx = ctx.get_device_context()
  gpu_ctx.enqueue_memset(...)  # initialize the output buffer
  gpu_ctx.enqueue_function[..., ...](...)  # schedule the kernel
  - The device context manages GPU resources
  - Memory operations ensure correct buffer state
  - Enqueuing the function schedules the kernel for execution
This solution demonstrates the complete flow: Python data passes through a MAX graph, runs on the GPU, and comes back to Python. It leverages Mojo's strong type system and parametric functions to build an efficient, type-safe accelerated operation.
Understanding MAX Graph custom ops
For more details, refer to the following tutorials:
Custom op registration
The heart of creating a custom operation is the @compiler.register decorator and the accompanying struct:
@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[...](
        output: OutputTensor[rank=1],
        input: InputTensor[dtype = output.dtype, rank = output.rank],
        kernel: InputTensor[dtype = output.dtype, rank = output.rank],
        ctx: DeviceContextPtr,
    ) raises:
        # implementation
The key parts of the registration are:
- The name passed to the decorator ("conv1d") is the name used to invoke this operation from Python code
- The struct must provide an execute method with the correct signature
- The OutputTensor and InputTensor types define the interface with Python data
- DeviceContextPtr provides access to the execution environment
Custom op packaging
Before the custom operation can be used from Python, it must be packaged:
mojo package op -o op.mojopkg
This command:
- Compiles the Mojo code into a deployable package
- Generates the metadata MAX Graph needs to understand the operation
- Produces a binary artifact (op.mojopkg) that can be loaded from Python
The package must be placed where MAX Graph can find it, typically in a directory accessible from your Python code.
Python integration
Here is how the custom operation is used from the Python side:
# Path to the directory containing our Mojo operations
mojo_kernels = Path(__file__).parent / "op"

# Configure our graph with the custom conv1d operation
with Graph(
    "conv_1d_graph",
    input_types=[...],
    custom_extensions=[mojo_kernels],  # Load our custom op package
) as graph:
    # Define the inputs to the graph
    input_value, kernel_value = graph.inputs
    # Use our custom operation by name
    output = ops.custom(
        name="conv1d",  # Must match the name in @compiler.register
        values=[input_value, kernel_value],
        out_types=[...],
        parameters={
            "input_size": input_tensor.shape[0],
            "conv_size": kernel_tensor.shape[0],
            "dtype": dtype,
        },
    )[0].tensor
The key elements are:
- custom_extensions specifies the path to the custom op package
- ops.custom is called with the registered operation name
- Input values and parameters are passed to match the operation's signature