Introduction to LayoutTensor
After dealing with manual indexing, bounds checking, and growing complexity in the previous puzzles, it’s time to introduce a powerful abstraction that will make GPU programming more intuitive and safer.
Why LayoutTensor?
Let’s look at the challenges we’ve faced:
# Puzzle 1: Simple indexing
out[local_i] = a[local_i] + 10.0
# Puzzle 2: Multiple array management
out[local_i] = a[local_i] + b[local_i]
# Puzzle 3: Bounds checking
if local_i < size:
    out[local_i] = a[local_i] + 10.0
As dimensions grow, code becomes more complex:
# Traditional 2D indexing for a row-major 2D matrix
idx = row * WIDTH + col
if row < HEIGHT and col < WIDTH:
    out[idx] = a[idx] + 10.0
The LayoutTensor solution
LayoutTensor provides:
- Natural Indexing: Use tensor[i, j] instead of manual offset calculations (sketched below)
- Automatic Bounds Checking: Built-in protection against out-of-bounds access
- Flexible Memory Layouts: Support for row-major, column-major, and tiled organizations
- Performance Optimization: Efficient memory access patterns for GPU
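For instance, the bounds-checked 2D addition from earlier could be written with natural indexing roughly like this. This is only a sketch: the kernel name add_10_2d, the SIZE alias, the thread-index mapping, and the exact signature are illustrative assumptions rather than a definitive API.
from gpu import thread_idx
from layout import Layout, LayoutTensor

alias SIZE = 2
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE, SIZE)

# Hypothetical kernel: each thread adds 10 to one element, guarded by a bounds check
fn add_10_2d(
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        output[row, col] = a[row, col] + 10.0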
Basic usage
from layout import Layout, LayoutTensor
# Define layout
alias HEIGHT = 2
alias WIDTH = 3
alias layout = Layout.row_major(HEIGHT, WIDTH)
# Create tensor
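# `buffer` is assumed to be an existing device buffer, e.g. created with
# DeviceContext.enqueue_create_buffer as in the quick example at the end of this page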
tensor = LayoutTensor[dtype, layout](buffer.unsafe_ptr())
# Access elements naturally
tensor[0, 0] = 1.0 # First element
tensor[1, 2] = 2.0 # Last element
Memory layout control
LayoutTensor supports different memory organizations:
# Row-major (default)
layout_row = Layout.row_major(HEIGHT, WIDTH)
# Column-major
layout_col = Layout.col_major(HEIGHT, WIDTH)
# Tiled (for better cache utilization)
layout_tiled = Layout.tiled[4, 4](HEIGHT, WIDTH)
Understanding memory layouts
Memory layout affects performance dramatically. LayoutTensor supports:
- Row-major: Elements in a row are contiguous
# [1 2 3]
# [4 5 6] -> [1 2 3 4 5 6]
layout_row = Layout.row_major(2, 3)
- Column-major: Elements in a column are contiguous
# [1 2 3]
# [4 5 6] -> [1 4 2 5 3 6]
layout_col = Layout.col_major(2, 3)
- Tiled: Elements grouped in tiles for cache efficiency
# [[1 2] [3 4]] in 2x2 tiles
layout_tiled = Layout.tiled[2, 2](4, 4)
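As a quick sanity check, here is a small worked example in plain Mojo (hypothetical, using index arithmetic rather than the Layout API) showing where the element at position [0, 2] of the 2 x 3 matrix above lands in linear memory under each ordering:
alias H = 2
alias W = 3

def main():
    i = 0
    j = 2
    # Row-major: index = i * W + j -> 0 * 3 + 2 = 2, i.e. the value 3 in [1 2 3 4 5 6]
    print("row-major offset:", i * W + j)
    # Column-major: index = j * H + i -> 2 * 2 + 0 = 4, i.e. the value 3 in [1 4 2 5 3 6]
    print("col-major offset:", j * H + i)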
Benefits over traditional approach
- Readability:
# Traditional
out[row * WIDTH + col] = a[row * WIDTH + col] + 10.0
# LayoutTensor
out[row, col] = a[row, col] + 10.0
- Flexibility:
  - Easy to change memory layouts without modifying computation code (see the sketch after this list)
  - Support for complex access patterns
  - Built-in optimizations
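To illustrate the first flexibility point, here is a minimal sketch modeled on the quick example at the end of this page. The kernel name fill_kernel and the written value are hypothetical; the point is that swapping the layout alias between Layout.row_major and Layout.col_major requires no change to the kernel's indexing code.
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

alias HEIGHT = 2
alias WIDTH = 3
alias dtype = DType.float32
# Swap this single alias to Layout.col_major(HEIGHT, WIDTH); the kernel below is unchanged
alias layout = Layout.row_major(HEIGHT, WIDTH)

# Hypothetical kernel: writes one element using layout-agnostic indexing
fn fill_kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[mut=True, dtype, layout]):
    tensor[1, 2] = 42.0

def main():
    ctx = DeviceContext()
    buf = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH).enqueue_fill(0)
    tensor = LayoutTensor[mut=True, dtype, layout](buf.unsafe_ptr())
    ctx.enqueue_function[fill_kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)
    ctx.synchronize()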
Advanced features preview
While we’ll start with basic operations, LayoutTensor’s true power shines in advanced GPU optimizations:
1. Memory hierarchy management
# Shared memory allocation
shared_mem = tb[dtype]().row_major[BM, BK]().shared().alloc()
# Register allocation
reg_tile = tb[dtype]().row_major[TM, TN]().local().alloc()
2. Tiling strategies
# Block tiling
block_tile = tensor.tile[BM, BN](block_idx.y, block_idx.x)
# Register tiling
reg_tile = block_tile.tile[TM, TN](thread_row, thread_col)
3. Memory access patterns
# Vectorized access
vec_tensor = tensor.vectorize[1, simd_width]()
# Asynchronous transfers
copy_dram_to_sram_async[thread_layout=layout](dst, src)
4. Hardware acceleration
# Tensor Core operations (coming in later puzzles)
mma_op = TensorCore[dtype, out_type, Index(M, N, K)]()
result = mma_op.mma_op(a_reg, b_reg, c_reg)
💡 Looking Ahead: As we progress through the puzzles, you’ll learn how to:
- Use shared memory for faster data access
- Implement efficient tiling strategies
- Leverage vectorized operations
- Utilize hardware accelerators
- Optimize memory access patterns
Each puzzle will introduce these concepts gradually, building on the fundamentals to create highly optimized GPU code.
Ready to start your journey from basic operations to advanced GPU programming? Let’s begin with the fundamentals!
Quick example
Let’s put everything together with a simple example that demonstrates the basics of LayoutTensor:
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor
alias HEIGHT = 2
alias WIDTH = 3
alias dtype = DType.float32
alias layout = Layout.row_major(HEIGHT, WIDTH)
fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[mut=True, dtype, layout]):
    print("Before:")
    print(tensor)
    tensor[0, 0] += 1
    print("After:")
    print(tensor)

def main():
    ctx = DeviceContext()
    a = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH).enqueue_fill(0)
    tensor = LayoutTensor[mut=True, dtype, layout](a.unsafe_ptr())
    # Note: since `tensor` is a device tensor we can't print it without the kernel wrapper
    ctx.enqueue_function[kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)
    ctx.synchronize()
When we run this code with `magic run layout_tensor_intro`, we see:
Before:
0.0 0.0 0.0
0.0 0.0 0.0
After:
1.0 0.0 0.0
0.0 0.0 0.0
Let’s break down what’s happening:
- We create a 2 x 3 tensor with row-major layout
- Initially, all elements are zero
- Using natural indexing, we modify a single element
- The change is reflected in our output
This simple example demonstrates key LayoutTensor benefits:
- Clean syntax for tensor creation and access
- Automatic memory layout handling
- Built-in bounds checking
- Natural multi-dimensional indexing
While this example is straightforward, the same patterns will scale to complex GPU operations in upcoming puzzles. You’ll see how these basic concepts extend to:
- Multi-threaded GPU operations
- Shared memory optimizations
- Complex tiling strategies
- Hardware-accelerated computations
Ready to start your GPU programming journey with LayoutTensor? Let’s dive into the puzzles!
💡 Tip: Keep this example in mind as we progress - we’ll build upon these fundamental concepts to create increasingly sophisticated GPU programs.