vectorize - SIMD Control
Overview
In this puzzle, we explore advanced vectorization techniques for precise control over SIMD operations within GPU kernels, using manual vectorization and vectorize. You will implement the same vectorized operation with two different approaches:
- Manual vectorization: direct SIMD control through explicit index calculations
- Mojo's vectorize function: high-level vectorization with automatic bounds checking
Both approaches build on tiling concepts, but they trade off differently between control, safety, and performance optimization.
Key insight: the right vectorization strategy depends on your performance requirements and complexity budget.
Key concepts
In this puzzle, you'll learn:
- Manual SIMD operations with explicit index management
- Mojo's vectorize function for safe, automatic vectorization
- Chunk-based memory organization for optimal SIMD alignment
- Bounds-checking strategies for edge conditions
- Performance trade-offs between manual control and safety
The same mathematical operation as before: \[\Large \text{output}[i] = a[i] + b[i]\]
But with sophisticated vectorization strategies for maximum performance.
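Since the kernel just adds two vectors, the math is easy to sanity-check on the CPU. A minimal Python model follows; the initialization a[i] = 2i, b[i] = 2i + 1 is an assumption inferred from the expected buffer [1.0, 5.0, 9.0, ..., 4093.0] that the puzzle prints, since output[i] = 4i + 1 matches every shown value.

```python
# CPU-side model of output[i] = a[i] + b[i] (plain Python, not Mojo).
# Assumed initialization: a[i] = 2*i, b[i] = 2*i + 1, so output[i] = 4*i + 1.
SIZE = 1024
a = [2.0 * i for i in range((SIZE))]
b = [2.0 * i + 1.0 for i in range(SIZE)]
out = [a[i] + b[i] for i in range(SIZE)]
print(out[0], out[1], out[2], out[-1])  # 1.0 5.0 9.0 4093.0
```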
Configuration
- Vector size: SIZE = 1024
- Tile size: TILE_SIZE = 32
- Data type: DType.float32
- SIMD width: GPU-dependent
- Layout: Layout.row_major(SIZE) (1D row-major)
1. Manual vectorization approach
Code to complete
fn manual_vectorized_tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size groups of simd_width elements
    comptime chunk_size = tile_size * simd_width

    @parameter
    @always_inline
    fn process_manual_vectorized_tiles[
        num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        print("tile_id:", tile_id)
        output_tile = output.tile[chunk_size](tile_id)
        a_tile = a.tile[chunk_size](tile_id)
        b_tile = b.tile[chunk_size](tile_id)
        # FILL IN (7 lines at most)

    # Number of tiles needed: each tile processes chunk_size elements
    num_tiles = (size + chunk_size - 1) // chunk_size
    elementwise[
        process_manual_vectorized_tiles, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)
View the full file: problems/p23/p23.mojo
Tips
1. Understand the chunk organization
comptime chunk_size = tile_size * simd_width  # 32 * 4 = 128 elements per chunk
Each tile now contains multiple SIMD groups, not just sequential elements.
2. Global index calculation
global_start = tile_id * chunk_size + i * simd_width
This computes the exact global position of each SIMD vector within the chunk.
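To see what this formula produces, here is a small Python sketch (a model of the index math only, not Mojo) that enumerates the SIMD-group start positions per tile:

```python
# Enumerate the SIMD-group start positions within each chunk (Python model).
TILE_SIZE, SIMD_WIDTH = 32, 4
chunk_size = TILE_SIZE * SIMD_WIDTH  # 128 elements per chunk

def global_starts(tile_id):
    # Each of the TILE_SIZE loop iterations advances by one SIMD group.
    return [tile_id * chunk_size + i * SIMD_WIDTH for i in range(TILE_SIZE)]

print(global_starts(0)[:4], global_starts(0)[-1])  # [0, 4, 8, 12] 124
print(global_starts(1)[:4], global_starts(1)[-1])  # [128, 132, 136, 140] 252
```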
3. Direct tensor access
a_vec = a.load[simd_width](global_start, 0)  # load from the global tensor
output.store[simd_width](global_start, 0, ret)  # store to the global tensor
Note: access the original tensors, not the tile views.
4. Key characteristics
- More control and more complexity, with direct global-tensor access
- Perfect SIMD alignment with the hardware
- Manual bounds checking required
Running manual vectorization
pixi run p23 --manual-vectorized
pixi run -e amd p23 --manual-vectorized
pixi run -e apple p23 --manual-vectorized
uv run poe p23 --manual-vectorized
When the puzzle is not yet solved, the output looks like:
SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0
tile_id: 1
tile_id: 2
tile_id: 3
tile_id: 4
tile_id: 5
tile_id: 6
tile_id: 7
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])
Manual vectorization solution
fn manual_vectorized_tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size groups of simd_width elements
    comptime chunk_size = tile_size * simd_width

    @parameter
    @always_inline
    fn process_manual_vectorized_tiles[
        num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        output_tile = output.tile[chunk_size](tile_id)
        a_tile = a.tile[chunk_size](tile_id)
        b_tile = b.tile[chunk_size](tile_id)

        @parameter
        for i in range(tile_size):
            global_start = tile_id * chunk_size + i * simd_width
            a_vec = a.aligned_load[simd_width](Index(global_start))
            b_vec = b.aligned_load[simd_width](Index(global_start))
            ret = a_vec + b_vec
            # print("tile:", tile_id, "simd_group:", i, "global_start:", global_start, "a_vec:", a_vec, "b_vec:", b_vec, "result:", ret)
            output.store[simd_width](Index(global_start), ret)

    # Number of tiles needed: each tile processes chunk_size elements
    num_tiles = (size + chunk_size - 1) // chunk_size
    elementwise[
        process_manual_vectorized_tiles, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)
Manual vectorization deep dive
Manual vectorization gives direct control over SIMD operations through explicit index calculations:
- Chunk-based organization: chunk_size = tile_size * simd_width
- Global indexing: memory locations computed directly
- Manual bounds management: edge conditions handled by hand
Architecture and memory layout:
comptime chunk_size = tile_size * simd_width  # 32 * 4 = 128
Chunk organization (TILE_SIZE=32, SIMD_WIDTH=4):
Original array: [0, 1, 2, 3, ..., 1023]
Chunk 0 (thread 0): [0:128]    → 128 elements = 32 SIMD groups of 4
Chunk 1 (thread 1): [128:256]  → next 128 elements
Chunk 2 (thread 2): [256:384]  → next 128 elements
...
Chunk 7 (thread 7): [896:1024] → final 128 elements
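These chunk ranges can be reproduced with a few lines of Python (a model of the indexing only, not of the GPU execution):

```python
# Verify that 8 chunks of 128 elements exactly tile the 1024-element array.
SIZE, TILE_SIZE, SIMD_WIDTH = 1024, 32, 4
chunk_size = TILE_SIZE * SIMD_WIDTH                  # 128
num_chunks = (SIZE + chunk_size - 1) // chunk_size   # ceiling division -> 8
ranges = [(t * chunk_size, min((t + 1) * chunk_size, SIZE))
          for t in range(num_chunks)]
print(num_chunks, ranges[0], ranges[-1])  # 8 (0, 128) (896, 1024)
```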
Processing within one chunk:
@parameter
for i in range(tile_size):  # i = 0, 1, 2, ..., 31
    global_start = tile_id * chunk_size + i * simd_width
    # For tile_id=0: global_start = 0, 4, 8, 12, ..., 124
    # For tile_id=1: global_start = 128, 132, 136, 140, ..., 252
Performance characteristics:
- Thread count: 8 threads (1024 ÷ 128 = 8)
- Work per thread: 128 elements (32 SIMD operations of 4 elements each)
- Memory pattern: large chunks with perfect SIMD alignment
- Overhead: minimal - direct mapping to hardware
- Safety: manual bounds checking required
Key advantages:
- Predictable indexing: precise control over memory access patterns
- Optimal alignment: SIMD operations line up perfectly with the hardware
- Maximum throughput: no overhead from safety checks
- Hardware optimization: maps directly to GPU SIMD units
Key challenges:
- Index complexity: global positions must be computed by hand
- Bounds responsibility: edge conditions must be handled manually
- Debugging difficulty: correctness is harder to verify
2. Mojo vectorize approach
Code to complete
fn vectorize_within_tiles_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size elements (not SIMD groups)
    @parameter
    @always_inline
    fn process_tile_with_vectorize[
        num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        tile_start = tile_id * tile_size
        tile_end = min(tile_start + tile_size, size)
        actual_tile_size = tile_end - tile_start
        print(
            "tile_id:",
            tile_id,
            "tile_start:",
            tile_start,
            "tile_end:",
            tile_end,
            "actual_tile_size:",
            actual_tile_size,
        )
        # FILL IN (9 lines at most)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[
        process_tile_with_vectorize, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)
View the full file: problems/p23/p23.mojo
Tips
1. Tile boundary calculation
tile_start = tile_id * tile_size
tile_end = min(tile_start + tile_size, size)
actual_tile_size = tile_end - tile_start
This handles the case where the last tile may be smaller than tile_size.
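The same three lines can be modeled in Python to see the shrinking last tile. With SIZE = 1024 every tile is full, so a hypothetical size of 1000 (not used in the puzzle) is shown below to exercise the partial-tile case:

```python
def tile_bounds(tile_id, tile_size, size):
    # Mirrors the kernel's boundary math: clamp the tile end to the array size.
    tile_start = tile_id * tile_size
    tile_end = min(tile_start + tile_size, size)
    return tile_start, tile_end, tile_end - tile_start

print(tile_bounds(31, 32, 1024))  # (992, 1024, 32) -- full last tile
print(tile_bounds(31, 32, 1000))  # (992, 1000, 8)  -- partial last tile
```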
2. Vectorized function pattern
fn vectorized_add[
    width: Int
](i: Int) unified {read tile_start, read a, read b, mut output}:
    global_idx = tile_start + i
    if global_idx + width <= size:  # bounds check
        # SIMD operation code
The width parameter is determined automatically by the vectorize function.
3. The vectorize call
vectorize[simd_width](actual_tile_size, vectorized_add)
This automatically drives the vectorization loop with the given SIMD width.
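A simplified Python model of what that call does is sketched below. The exact remainder strategy is an assumption: the real vectorize may step down through smaller SIMD widths, while this model simply processes leftovers one element at a time.

```python
def model_vectorize(simd_width, total, body):
    # Full-width calls first, then any leftover elements at width 1.
    i = 0
    while i + simd_width <= total:
        body(simd_width, i)
        i += simd_width
    while i < total:
        body(1, i)
        i += 1

calls = []
# total=30 is deliberately not a multiple of 4 to show remainder handling.
model_vectorize(4, 30, lambda width, i: calls.append((width, i)))
print(calls[:2], calls[-2:])  # [(4, 0), (4, 4)] [(1, 28), (1, 29)]
```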
4. Key characteristics
- Automatic remainder handling, built-in safety, tile-based access
- Explicit SIMD width parameter
- Built-in bounds checking with automatic handling of leftover elements
Running Mojo vectorize
uv run poe p23 --vectorized
pixi run p23 --vectorized
When the puzzle is not yet solved, the output looks like:
SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0 tile_start: 0 tile_end: 32 actual_tile_size: 32
tile_id: 1 tile_start: 32 tile_end: 64 actual_tile_size: 32
tile_id: 2 tile_start: 64 tile_end: 96 actual_tile_size: 32
tile_id: 3 tile_start: 96 tile_end: 128 actual_tile_size: 32
...
tile_id: 29 tile_start: 928 tile_end: 960 actual_tile_size: 32
tile_id: 30 tile_start: 960 tile_end: 992 actual_tile_size: 32
tile_id: 31 tile_start: 992 tile_end: 1024 actual_tile_size: 32
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])
Mojo vectorize solution
fn vectorize_within_tiles_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size elements (not SIMD groups)
    @parameter
    @always_inline
    fn process_tile_with_vectorize[
        num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        tile_start = tile_id * tile_size
        tile_end = min(tile_start + tile_size, size)
        actual_tile_size = tile_end - tile_start

        fn vectorized_add[
            width: Int
        ](i: Int) unified {read tile_start, read a, read b, mut output}:
            global_idx = tile_start + i
            if global_idx + width <= size:
                a_vec = a.aligned_load[width](Index(global_idx))
                b_vec = b.aligned_load[width](Index(global_idx))
                result = a_vec + b_vec
                output.store[width](Index(global_idx), result)

        # Use vectorize within each tile
        vectorize[simd_width](actual_tile_size, vectorized_add)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[
        process_tile_with_vectorize, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)
Mojo vectorize deep dive
Mojo's vectorize function provides automatic vectorization with built-in safety:
- Explicit SIMD width parameter: you specify the simd_width to use
- Built-in bounds checking: buffer overruns are prevented automatically
- Automatic remainder handling: leftover elements are processed for you
- Nested function pattern: clean separation of the vectorization logic
Tile-based organization:
tile_start = tile_id * tile_size # 0, 32, 64, 96, ...
tile_end = min(tile_start + tile_size, size)
actual_tile_size = tile_end - tile_start
Automatic vectorization mechanism:
fn vectorized_add[
    width: Int
](i: Int) unified {read tile_start, read a, read b, mut output}:
    global_idx = tile_start + i
    if global_idx + width <= size:
        # automatic SIMD optimization
How vectorize works:
- Automatic chunk splitting: divides actual_tile_size into chunks of the given simd_width
- Remainder handling: leftover elements are processed automatically at smaller widths
- Bounds safety: buffer overruns are prevented automatically
- Loop management: the vectorization loop is handled for you
Execution visualization (TILE_SIZE=32, SIMD_WIDTH=4):
Processing tile 0:
vectorize call 0: elements [0:4]   processed with SIMD_WIDTH=4
vectorize call 1: elements [4:8]   processed with SIMD_WIDTH=4
...
vectorize call 7: elements [28:32] processed with SIMD_WIDTH=4
Total: 8 automatic SIMD operations
Performance characteristics:
- Thread count: 32 threads (1024 ÷ 32 = 32)
- Work per thread: 32 elements (automatic SIMD chunking)
- Memory pattern: small tiles with automatic vectorization
- Overhead: slight - automatic optimization and bounds checking
- Safety: built-in bounds checking and edge-condition handling
Performance comparison and best practices
When to choose each approach
Choose manual vectorization when:
- Maximum performance is critical
- Data patterns are predictable and aligned
- You need expert-level control over memory access
- You can guarantee bounds safety by hand
- Hardware-specific optimization is required
Choose Mojo vectorize when:
- Development speed and safety are priorities
- You are dealing with irregular or dynamic data sizes
- You want automatic remainder handling instead of manual edge-case management
- Bounds-checking complexity could introduce bugs
- You prefer clean vectorization patterns over manual loop management
Advanced optimization insights
Memory bandwidth utilization:
Manual:    8 threads × 32 SIMD operations = 256 SIMD operations total
vectorize: 32 threads × 8 SIMD operations = 256 SIMD operations total
Both achieve similar total throughput, but with different parallelism strategies.
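The operation counts above are easy to verify: with 1024 elements and a SIMD width of 4, any full decomposition must issue 256 width-4 operations in total.

```python
# Both strategies issue the same number of width-4 SIMD operations overall.
SIZE, SIMD_WIDTH = 1024, 4
manual_ops = 8 * 32      # 8 threads, 32 SIMD operations each
vectorize_ops = 32 * 8   # 32 threads, 8 SIMD operations each
print(manual_ops, vectorize_ops, SIZE // SIMD_WIDTH)  # 256 256 256
```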
Cache behavior:
- Manual: large chunks may exceed L1 cache, but access is perfectly sequential
- vectorize: small tiles fit better in cache, with automatic remainder handling
Hardware mapping:
- Manual: direct control over thread utilization and SIMD unit mapping
- vectorize: streamlined vectorization through automatic loop and remainder management
Best practices summary
Manual vectorization best practices:
- Always validate index calculations carefully
- Use compile-time constants for chunk_size where possible
- Profile memory access patterns for cache optimization
- Consider alignment requirements for optimal SIMD performance
Mojo vectorize best practices:
- Choose a SIMD width appropriate to your data and hardware
- Focus on algorithmic clarity rather than micro-optimization
- Use nested parameter functions for clean vectorization logic
- Rely on automatic bounds checking and remainder handling for edge conditions
Both approaches are valid strategies in your GPU performance-optimization toolkit. Manual vectorization offers maximum control, while Mojo's vectorize offers safety and automatic remainder handling.
Next steps
Once you understand all three fundamental patterns:
- 🧠 GPU threading vs. SIMD concepts: understanding the execution hierarchy
- 📊 Mojo benchmarking: performance analysis and optimization
💡 Key takeaway: choose your vectorization strategy based on performance requirements. Manual vectorization offers maximum control, while Mojo's vectorize function offers safety and automatic remainder handling. Pick based on your specific performance needs and development constraints.