Introduction to TileTensor

Let’s take a quick break from solving puzzles to preview a powerful abstraction that will make our GPU programming journey more enjoyable: … the TileTensor.

šŸ’” This is a motivational overview of TileTensor’s capabilities. Don’t worry about understanding everything now; we’ll explore each feature in depth as we progress through the puzzles.

The challenge: Growing complexity

Let’s look at the challenges we’ve faced so far:

# Puzzle 1: Simple indexing
output[i] = a[i] + 10.0

# Puzzle 2: Multiple array management
output[i] = a[i] + b[i]

# Puzzle 3: Bounds checking
if i < size:
    output[i] = a[i] + 10.0

As dimensions grow, code becomes more complex:

# Traditional manual indexing for a row-major 2D matrix
idx = row * WIDTH + col
if row < HEIGHT and col < WIDTH:
    output[idx] = a[idx] + 10.0

The solution: A peek at TileTensor

TileTensor will help us tackle these challenges with elegant solutions. Here’s a glimpse of what’s coming:

  1. Natural Indexing: Use tensor[i, j] instead of manual offset calculations
  2. Flexible Memory Layouts: Support for row-major, column-major, and tiled organizations
  3. Performance Optimization: Efficient memory access patterns for GPU
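The layout idea in point 2 boils down to strides: a layout is a rule mapping a logical coordinate (i, j) to a flat memory offset. A minimal Python sketch of the two most common rules (hypothetical helper names, not the TileTensor API):

```python
# Stride-based layout sketch (hypothetical helpers, not the TileTensor API).
def row_major_offset(i, j, width):
    """Rows are contiguous: stepping one column moves 1 element in memory."""
    return i * width + j

def col_major_offset(i, j, height):
    """Columns are contiguous: stepping one row moves 1 element in memory."""
    return j * height + i

HEIGHT, WIDTH = 2, 3
# The same logical element (0, 2) lands at different flat offsets
# depending on which layout rule is in effect.
print(row_major_offset(0, 2, WIDTH))   # row-major flat offset
print(col_major_offset(0, 2, HEIGHT))  # column-major flat offset
```

Because the layout is a separate object, code written against `tensor[i, j]` never changes when the underlying memory organization does; only the layout rule is swapped.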

A taste of what’s ahead

Let’s look at a few examples of what TileTensor can do. Don’t worry about understanding all the details now; we’ll cover each feature thoroughly in upcoming puzzles.

Basic usage example

from layout import TileTensor
from layout.tile_layout import row_major

# Define layout
comptime HEIGHT = 2
comptime WIDTH = 3
comptime layout = row_major[HEIGHT, WIDTH]()
comptime LayoutType = type_of(layout)

# Create tensor
tensor = TileTensor(buffer, layout)

# Access elements naturally
tensor[0, 0] = 1.0  # First element
tensor[1, 2] = 2.0  # Last element

To learn more about Layout and TileTensor, see the corresponding guides in the Mojo manual.

Quick example

Let’s put everything together with a simple example that demonstrates the basics of TileTensor:

# ===----------------------------------------------------------------------=== #
#
# This file is Modular Inc proprietary.
#
# ===----------------------------------------------------------------------=== #
from std.gpu.host import DeviceContext
from layout import TileTensor
from layout.tile_layout import row_major

comptime HEIGHT = 2
comptime WIDTH = 3
comptime dtype = DType.float32
comptime layout = row_major[HEIGHT, WIDTH]()
comptime LayoutType = type_of(layout)


def kernel(
    tensor: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
):
    print("Before:")
    print(tensor)
    tensor[0, 0] += 1
    print("After:")
    print(tensor)


def main() raises:
    ctx = DeviceContext()

    a = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH)
    a.enqueue_fill(0)
    tensor = TileTensor(a, layout)
    # Note: since `tensor` is a device tensor we can't print it without the kernel wrapper
    ctx.enqueue_function[kernel](tensor, grid_dim=1, block_dim=1)

    ctx.synchronize()

When we run this code with one of the following commands (depending on your environment):

pixi run tile_tensor_intro
pixi run -e amd tile_tensor_intro
pixi run -e apple tile_tensor_intro
uv run poe tile_tensor_intro

we see the following output:
Before:
0.0 0.0 0.0
0.0 0.0 0.0
After:
1.0 0.0 0.0
0.0 0.0 0.0

Let’s break down what’s happening:

  1. We create a 2 x 3 tensor with row-major layout
  2. Initially, all elements are zero
  3. Using natural indexing, we modify a single element
  4. The change is reflected in our output
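The four steps above can be mimicked on the CPU with plain Python lists. This is purely an analogy to build intuition; TileTensor performs the real work in device memory:

```python
# CPU analogy of the example above (plain Python lists, not TileTensor).
HEIGHT, WIDTH = 2, 3

# Steps 1-2: a 2 x 3 "tensor" with every element initialized to zero.
tensor = [[0.0] * WIDTH for _ in range(HEIGHT)]

print("Before:", tensor)
tensor[0][0] += 1.0       # Step 3: natural indexing modifies one element
print("After:", tensor)   # Step 4: only the first element changed
```

The GPU version follows the same logical steps; the difference is that the buffer lives on the device, so reads and writes must happen inside a kernel.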

This simple example demonstrates key TileTensor benefits:

  • Clean syntax for tensor creation and access
  • Automatic memory layout handling
  • Natural multi-dimensional indexing

While this example is straightforward, the same patterns will scale to complex GPU operations in upcoming puzzles. You’ll see how these basic concepts extend to:

  • Multi-threaded GPU operations
  • Shared memory optimizations
  • Complex tiling strategies
  • Hardware-accelerated computations

Ready to start your GPU programming journey with TileTensor? Let’s dive into the puzzles!

šŸ’” Tip: Keep this example in mind as we progress; we’ll build upon these fundamental concepts to create increasingly sophisticated GPU programs.