Trivial comments:
Will spend some time thinking about this IR and trying to understand the semantics more deeply.
Ok, so after staring at the IR I think I get the model here ... though it's definitely verbose (unavoidably so? Maybe we can grow assembly format shorthands?) and I'm still confused about how distribution is meant to work.
Struct felt like a fairly familiar term, and just a dictionary of types sounds a lot like a struct to me. Also, the struct does not refer to the threadgroup here; it refers to the group-local fields carried by the threadgroup. Collective or composite are also ok, but the name is purely reflected on the field and in docs, so I'm not too concerned with the name here. If something else sounds better that works for me.
They are as arbitrary as a symbol name. So we can pick whatever we want, but the purpose is to give clusters a unique identifier we can use to convert them. One way to think of it is that they are a symbolic reference to the condition under which a thread within the larger group executes the cluster.
We could drop the threadgroup type, but the threadgroup is such a fundamental special case of a cluster (i.e. all the threads) that it made more sense to me for it to be its own type.
I made it so that the split indices are the middle values of a range that covers all of the threads. So it's like a prefix sum of the counts.
Basically along control flow edges. So if you have a value passed from a cluster, through a control flow branch, to the operand of another cluster, the type you distribute that value to must be consistent before and after the branch. So "equivalent" is only referring to the type here. Good question, that definitely wasn't clear reading it a second time.
This proposal covers new IR, transforms, and pipeline changes for integrating
the PCF dialect into GPU code generation, first focusing on LLVMGPU.
There are a few goals here, both short term and longer term, preparing for future changes.
As it stands LLVMGPU codegen has been building up and maintaining
two main lowering pipelines: VectorDistribute and TileAndFuse. VectorDistribute
was introduced initially for the purpose of handling attention and modern matrix
multiplication intrinsics, both of which heavily feature subgroup level operations.
TileAndFuse was added with the primary goal of being a backbone pipeline that could
begin to consolidate the numerous bespoke lowering pipelines that existed in the
LLVMGPU backend at the time (many of these pipelines still exist :( ).
Supporting new hardware and new problems has given us a chance to review our lowering
pipelines and try to unify them. This proposal takes steps toward unifying the
two pipelines, and tries to make it easier to handle:
And to eventually make it easier to handle:
A short list of problems that are outside the scope of this proposal but planned for
future changes:
The general direction is to start weaning the compiler off of static
parameters everywhere and instead design the top level for dynamism from the ground
up. This proposal only covers the control-flow side, but the expectation is that we
will want to take a closer look at the ops and types we use for the meat of a dispatch
afterwards.
PCF Baseline
This section summarizes the landed portions of the PCF dialect. The ops give a structured
way to represent parallel control flow with more flexibility. The core pieces are:
- `!pcf.sref<shape x element_type, scope>`: a shaped reference whose memory space and synchronization meaning come from its scope. `sync(#scope)` is the shorthand used when the reference is synchronized on return from the parent scope.
- `pcf.generic`: a parallel region over the native worker set of a scope. Its body receives worker ids and worker counts. Results are snapshots of tied `sref` arguments after all workers return.
- `pcf.loop`: a scoped parallel loop with explicit `count(...)` operands. Its body receives iteration ids instead of native worker ids, and the scope decides how those iterations map to workers. This can be thought of as an analog for `scf.forall` but without the structured terminator.
- `pcf.alloc`, `pcf.read_slice`, and `pcf.write_slice`: scoped allocation and slice access on `sref` values.
- `pcf.yield`, `pcf.return`, and `pcf.br.cond_return`: the structural terminators and branch support needed to keep PCF ops region based.
The current lowering story for the core is also already in place: tokens are
resolved first, `pcf.sref` is converted to concrete `memref` types, and the
remaining structural PCF ops are lowered to `scf`/`cf`. Existing fusion
utilities can move producers and consumers across `pcf.generic`/`pcf.loop`
boundaries by rewriting `pcf.read_slice` and `pcf.write_slice`. A workgroup
tile can therefore be represented today as a `pcf.loop` over the workgroup
tile grid:
What is missing is a way to ease the distribution process. `pcf.generic` already means
"each worker runs this body with an id"; it does not directly model "this body is
a cooperative program for this group" or the named phases inside that cooperative
program. Directly jumping to distributed code has proven challenging, especially
when considering algorithms that need cross-worker communication.
On its own, PCF as it stands does not provide much beyond what existing
dialects provide. It mainly supports more advanced fusion options. There are a
few key missing pieces:
Essentially, what exists now is the base, the IR proposed here extends it, and
the new transforms + pipeline start connecting the pieces.
New IR
This section covers new types + ops proposed, primarily for specialization.
The extension keeps the existing scoped `sref` model and adds a collective
execution layer. The key distinction is:
- `pcf.generic`/`pcf.loop`: distributed worker code. Worker identity is part of the region interface.
- `pcf.shared_executor`: collective code for a group. The region receives a group handle, and later distribution decides how each operation maps to workers.
Types
`!pcf.threadgroup`
`!pcf.threadgroup<#scope>` is a handle for a reentrant cooperating worker group at
a scope. The threadgroup is a special case covering all workers at the scope. It
deliberately does not expose worker ids. It carries a struct field to ferry values
between regions of execution:
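Something like the following sketch (hypothetical syntax; the field names, scope attribute, and struct printing are made up for illustration):

```mlir
// A threadgroup handle at subgroup scope. The default struct carries values
// held collectively by all workers; the private struct carries one copy per
// worker.
!tg = !pcf.threadgroup<#iree_gpu.subgroup_scope,
                       struct<acc : tensor<128x128xf32>>,   // collective payload
                       private struct<offset : index>>      // per-worker payload
```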
The default struct field carries collective values, values held collectively across
all workers in the group. `!pcf.threadgroup` also carries a private struct field for
values held locally within one worker at the scope (i.e. every worker carries one copy
of the struct). Because the intended common case is to use this type with collective
values, we give that struct type prettier printing.
Another way to interpret this type is as a class that contains the thread id as a private
field which can only be accessed through the structured ops described below.
`!pcf.cluster`
`!pcf.cluster<#scope, bounds, id>` represents a rectangular subset of a
threadgroup's worker grid. Bounds are half-open affine ranges. The affine map
uses `d0`, `d1`, ... for explicit SSA-dependent split values and `s0`, `s1`,
... for the implicit worker counts of the scope.
`left`, `right`, and `tg.c01` are cluster ids. They uniquely identify the cluster
within the region or IR where it is used.
Clusters are a more general case of a threadgroup, and thus have the same overall
structure. They can carry either private or shared struct payloads:
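For example, a sketch of both payload kinds (hypothetical syntax; the bounds maps, ids, and struct fields here are illustrative stand-ins):

```mlir
// Two clusters splitting the first worker dimension in half. s0 is the
// implicit worker count of the scope; bounds are half-open ranges.
!left  = !pcf.cluster<#iree_gpu.subgroup_scope,
                      affine_map<()[s0] -> (0, s0 floordiv 2)>, tg.left,
                      struct<acc : tensor<128x128xf32>>>      // shared payload
!right = !pcf.cluster<#iree_gpu.subgroup_scope,
                      affine_map<()[s0] -> (s0 floordiv 2, s0)>, tg.right,
                      private struct<frag : vector<4xf32>>>   // private payload
```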
The private/shared split is important because it keeps "per worker" values and
"collective" values separate in the type system, which tells us which kinds of values
are valid to forward to a region executing them.
Cluster ids use `#pcf.ns_sym` internally and print as dot-qualified names such
as `tg.left` or `outer.inner.leaf`. This ties the cluster to the parent operation
that defined it, giving it a unique identifier within the region. The bounds map
alone is not enough to identify a cluster uniquely because it can depend on SSA
values. Private namespaces give the compiler a stable identity for region-wide
conversion without relying on globally unique string names.
The purpose of the bounds map is to link the global id to the cluster's local id.
Requiring contiguous rectangular threadblocks is more restrictive than we'll
realistically want eventually, but for now it makes that mapping easy.
Ops
`pcf.shared_executor`
`pcf.shared_executor` is a collective region at a scope. It has the same
structure as `pcf.generic` (initializer -> execute), but the execute region
receives a `!pcf.threadgroup` instead of id/count arguments. Readonly operands
use `<-` and produce no result. Readwrite operands use `=`, are tied to
tensor/memref inits, and are snapshotted as results, matching the existing PCF
tied-result model.
Every operation within the execute region of the shared_executor is a collective.
The semantics for how that operation is executed across the workers are either
up to the operation itself or unspecified. In simpler terms, this can be thought
of as an undistributed version of `pcf.generic`. More examples are shown below.
`pcf.shared_executor.tile_group`
`tile_group` partitions a threadgroup or cluster into named clusters. This is
the structural operation that makes warp/wave specialization representable:
different clusters can retain different bodies until distribution and lowering.
The lowering computes the current worker's cluster index from the scope worker
ids and emits `scf.index_switch`, cloning only the matching cluster's code into
each case. For example:
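(The following sketch is illustrative; the `split[...]` form and the elided cluster types are guessed from the prose, not the landed assembly format.)

```mlir
// Partition a threadgroup into two named clusters with distinct bodies.
pcf.shared_executor.tile_group %tg split[%mid] {
^bb0(%left: !pcf.cluster<..., tg.left, ...>, %right: !pcf.cluster<..., tg.right, ...>):
  pcf.shared_executor.run_cluster %left {
    // collective copy phase, executed only by tg.left workers
  }
  pcf.shared_executor.run_cluster %right {
    // collective compute phase, executed only by tg.right workers
  }
}
```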
After distribution, each worker computes the cluster it belongs to and the two
cluster bodies become separate switch cases:
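A sketch of the lowered form, assuming an even half split at `%mid` (the index math and case bodies here are illustrative):

```mlir
// Each worker computes which cluster it belongs to from its scope worker id,
// then runs only the body cloned for that cluster.
%case = arith.divui %worker_id, %mid : index  // 0 below the split, 1 above
scf.index_switch %case
case 0 {
  // tg.left body, now ordinary per-worker code
  scf.yield
}
case 1 {
  // tg.right body, now ordinary per-worker code
  scf.yield
}
default {
  scf.yield
}
```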
Mechanically this is implemented by replicating the `tile_group` region for
each cluster and then converting each region with a simple type converter that
either forwards the struct fields or zeroes out the type.
`pcf.shared_executor.run_cluster`
`run_cluster` marks a collective phase over a cluster. It consumes shared
payloads from its source clusters and can yield a new shared payload:
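For instance, something like this sketch (hypothetical syntax; the payload operand form and the `compute.step` op are stand-ins):

```mlir
// A collective phase over tg.left: consumes the cluster's shared payload and
// yields an updated one. Ops inside run over just the tg.left worker subset.
%next = pcf.shared_executor.run_cluster %left(%acc) {
^bb0(%payload: tensor<128x128xf32>):
  %updated = "compute.step"(%payload) : (tensor<128x128xf32>) -> tensor<128x128xf32>
  pcf.yield %updated : tensor<128x128xf32>
}
```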
Importantly, this is the op that actually shifts from the parent context to just the
worker subset specified by the cluster, not the `tile_group` typically wrapping it.
`pcf.shared_executor.run_thread`
`run_thread` is the per-worker counterpart to `run_cluster`. It consumes private
cluster payloads, provides tile-relative thread ids, and yields private payloads.
It is the form produced after a collective region has been distributed to
individual workers.
`pcf.subview` and `pcf.expand_shape`
These are `sref` view ops analogous to `memref.subview` and `tensor.expand_shape`.
They are more for completeness than anything but were left out of earlier
iterations of the dialect design and ultimately proved necessary for
handling MMAs.
Examples
This section shows a few IR samples of what we can represent with the above.
Horizontal Warp/Wave Specialization
A common pattern in high performance kernels involves a basic producer/consumer
wave split: say half of the waves load data while the other half does the
compute. Written by hand with thread-level code, this would look something like this:
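A sketch of the shape of that code, using standard `gpu`/`scf`/`arith` ops rather than anything from this proposal (constants and the copy/compute bodies are elided):

```mlir
// Hand-written wave split: the first half of the subgroups copies tiles into
// shared memory while the second half computes from the previous tile.
%sg   = gpu.subgroup_id : index
%n    = gpu.num_subgroups : index
%half = arith.divui %n, %c2 : index
%is_producer = arith.cmpi ult, %sg, %half : index
scf.for %k = %c0 to %K step %c1 {
  scf.if %is_producer {
    // global -> shared memory copy for iteration %k
  } else {
    // mma compute from the shared memory written at iteration %k - 1
  }
  gpu.barrier
}
```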
There are a few issues with a representation like this. First, we're typically ingesting
the program as a single instruction stream. Earmarking operations for different groups,
distributing, and splitting the streams in one go is too large of a jump for a single
transformation. We would much prefer to break it up. Second, it is substantially harder
to match related regions of code with this representation. It requires matching
synchronization and control flow between the two regions. Instead, with reentrant cluster
types we can write the same thing as follows: each cluster runs
different code while still being represented inside one structured region.
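An illustrative sketch of that structured form (the assembly format and cluster types are guessed, not the landed syntax):

```mlir
// The same producer/consumer split, but structural: the two bodies are named
// clusters inside one region instead of divergent branches on thread ids.
pcf.shared_executor scope(#iree_gpu.subgroup_scope) {
^bb0(%tg: !pcf.threadgroup<#iree_gpu.subgroup_scope>):
  pcf.shared_executor.tile_group %tg split[%half] {
  ^bb0(%producers: !pcf.cluster<..., tg.left, ...>,
       %consumers: !pcf.cluster<..., tg.right, ...>):
    pcf.shared_executor.run_cluster %producers {
      // collective global -> shared copies
    }
    pcf.shared_executor.run_cluster %consumers {
      // collective compute from shared memory
    }
  }
  pcf.return
}
```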
Vertical Warp/Wave Specialization
The previous example shows "horizontal" specialization (different workers
performing completely different tasks towards a larger goal). This example
is meant to illustrate vertical specialization (different workers doing the
same task but in different orders). This is useful for overlapping execution
of different instruction types. This lets us do something similar to what
we do in the pingpong ukernels in the ROCM backend. Start with a collective
matmul with outlined distinct regions:
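A sketch of the starting point (illustrative only; `pcf.copy_token`/`pcf.wait` and the `compute.mma_tile` op are stand-ins, per the note below):

```mlir
// A collective K loop with the copy outlined behind a token, so the copy and
// compute phases are distinct regions that can later be rescheduled.
pcf.shared_executor scope(#iree_gpu.subgroup_scope) {
^bb0(%tg: !pcf.threadgroup<#iree_gpu.subgroup_scope>):
  %acc = scf.for %k = %c0 to %K step %c64 iter_args(%a = %init) -> (tensor<128x128xf32>) {
    %tok = "pcf.copy_token"(%lhs, %rhs, %k) : (tensor<?x?xf16>, tensor<?x?xf16>, index) -> !pcf.token
    "pcf.wait"(%tok) : (!pcf.token) -> ()  // barrier-like: copies become visible
    %next = "compute.mma_tile"(%a) : (tensor<128x128xf32>) -> tensor<128x128xf32>
    scf.yield %next : tensor<128x128xf32>
  }
  pcf.return
}
```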
(Note: some of the IR here, namely `copy_token`, is included for illustration
purposes. It is just a suggestion and isn't part of the proposal.)
Because the worker count itself is dynamic, code running on a threadgroup can
be split by just splitting the threadgroup. That would look something like this:
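Sketched (illustrative syntax; the `tg.c0`/`tg.c1` ids and split operand are stand-ins):

```mlir
// The same loop, replicated verbatim onto two clusters covering each half of
// the threadgroup. Both halves still run identical code.
pcf.shared_executor.tile_group %tg split[%half] {
^bb0(%lo: !pcf.cluster<..., tg.c0, ...>, %hi: !pcf.cluster<..., tg.c1, ...>):
  pcf.shared_executor.run_cluster %lo {
    // copy(k); wait; mma(k) -- identical loop body
  }
  pcf.shared_executor.run_cluster %hi {
    // copy(k); wait; mma(k) -- identical loop body
  }
}
```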
Here the code is doing exactly the same thing as in the first sample, just
in two parts. After splitting it, though, we can now shear the clusters
by moving half of them one step past the barrier:
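Again as a sketch (illustrative; only the comments change relative to the previous form):

```mlir
// After shearing: one cluster is advanced half an iteration, so its mma
// overlaps with the other cluster's copy, pingpong style.
pcf.shared_executor.tile_group %tg split[%half] {
^bb0(%lo: !pcf.cluster<..., tg.c0, ...>, %hi: !pcf.cluster<..., tg.c1, ...>):
  pcf.shared_executor.run_cluster %lo {
    // copy(k); wait; mma(k)
  }
  pcf.shared_executor.run_cluster %hi {
    // mma(k - 1); copy(k); wait -- shifted one step past the barrier
  }
}
```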
Lowerings
Cluster Resolution
Cluster resolution lowers cluster groups to ordinary control flow. This pass is
only expected to see fully distributed clusters. By the time it runs, all
collective work has already been distributed to the current worker level, so
resolution does not perform new distribution decisions. It just replicates the
cluster region once per case and dialect-converts each clone.
Consider a two-cluster region that contains nested control flow:
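For example (illustrative sketch; cluster types elided):

```mlir
// Two clusters, each with its own nested control flow, inside one region.
pcf.shared_executor.tile_group %tg split[%mid] {
^bb0(%left: !pcf.cluster<..., tg.left, ...>, %right: !pcf.cluster<..., tg.right, ...>):
  pcf.shared_executor.run_cluster %left {
    scf.for %i = %c0 to %n step %c1 {
      // left-only loop body
    }
  }
  pcf.shared_executor.run_cluster %right {
    scf.if %cond {
      // right-only conditional body
    }
  }
}
```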
The first step computes the current worker's cluster id and replicates the
region into one case per cluster. Immediately after replication, the cloned
regions still contain both clusters; only the case mapping has changed. The
cloned block arguments are shown with the same names as the source region:
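Sketched (illustrative; the index math assumes the half split from the example above):

```mlir
// After replication: one case per cluster, but each clone still holds the
// full two-cluster region. Nothing has been converted yet.
%case = arith.divui %worker_id, %mid : index  // 0 for tg.left, 1 for tg.right
scf.index_switch %case
case 0 {
  // clone tagged for tg.left: still contains both run_cluster ops, with the
  // same block argument names as the source region
  scf.yield
}
case 1 {
  // clone tagged for tg.right: identical contents, different case mapping
  scf.yield
}
default {
  scf.yield
}
```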
Then each case is dialect-converted with a case-specific cluster type converter.
The active cluster converts to ordinary per-worker code, while the inactive
cluster converts away:
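A sketch of the converted result (illustrative bodies):

```mlir
// After conversion: the active cluster becomes ordinary per-worker code and
// the inactive cluster converts away entirely.
scf.index_switch %case
case 0 {
  scf.for %i = %c0 to %n step %c1 {
    // left-only loop body, now per-worker
  }
  scf.yield
}
case 1 {
  scf.if %cond {
    // right-only conditional body, now per-worker
  }
  scf.yield
}
default {
  scf.yield
}
```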
The important part is that control flow is not interpreted specially. Loops,
conditionals, and nested regions are simply cloned with the cluster case and
then converted like we would expect from any standard lowering. Also notice
that values crossing control flow edges are handled naturally this way.
For example, a loop may carry a cluster payload from only one cluster:
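Sketched (illustrative; the cluster payload type and `run_thread` operand form are guessed from the prose):

```mlir
// A loop that carries a private payload belonging only to tg.left.
%r = scf.for %i = %c0 to %n step %c1 iter_args(%p = %init)
    -> (!pcf.cluster<..., tg.left, private struct<frag : vector<4xf32>>>) {
  %next = pcf.shared_executor.run_thread %p {
    // per-worker update of the tg.left fragment
  }
  scf.yield %next : !pcf.cluster<..., tg.left, private struct<frag : vector<4xf32>>>
}
```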
After replication and conversion, both cases keep the same loop structure. The
active-left case carries the converted private struct fields, while the
active-right case has no loop-carried payload:
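Sketched (illustrative; the `compute.update` op is a stand-in):

```mlir
// Active-left case: the payload converts to its per-worker struct fields.
%r = scf.for %i = %c0 to %n step %c1 iter_args(%frag = %init) -> (vector<4xf32>) {
  %next = "compute.update"(%frag) : (vector<4xf32>) -> vector<4xf32>
  scf.yield %next : vector<4xf32>
}
```

```mlir
// Active-right case: the payload converts away; the loop carries nothing.
scf.for %i = %c0 to %n step %c1 {
  // right-only body
}
```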
Distribution
The previous section focused on how to deal with resolving the surrounding structure
of distributable regions. The other side of the same coin is how to distribute the
contents of each region. This is a fairly nuanced question with many competing
implementations and IR, so the expectation is that trying to fix a one-size-fits-all
approach right now is a bad choice. Instead we can try to formulate an interface through
which distribution can happen and provide a baseline implementation for it. Consider
a chain of two distributable regions:
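Sketched (illustrative; the `compute.produce`/`compute.consume` ops are stand-ins):

```mlir
// A shared payload produced by one phase and consumed by the next. Both
// regions must agree on how %v is distributed.
%v = pcf.shared_executor.run_cluster %left {
  %t = "compute.produce"() : () -> tensor<64xf32>
  pcf.yield %t : tensor<64xf32>
}
pcf.shared_executor.run_cluster %left(%v) {
^bb0(%arg: tensor<64xf32>):
  "compute.consume"(%arg) : (tensor<64xf32>) -> ()
}
```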
Because there is a shared value passed between the two clusters, we have
to distribute them together. Distributing both regions at once
means that any implementation responsible for distribution would need to
walk the regions to determine how to resolve those two regions together. We
don't want to force that onto every implementation, so instead we provide
an equivalence analysis that walks control flow and determines when two
threadgroup-carried values are equivalent. The above example is obviously very
simple, but with more complex control flow, like loop-carried values, it's more relevant.
For example, from the perspective of a distribution implementation, this:
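(Illustrative sketch; `!payload` is a stand-in alias for a cluster payload type.)

```mlir
// A cluster payload carried through a structured loop.
%r = scf.for %i = %c0 to %n step %c1 iter_args(%p = %init) -> (!payload) {
  %next = pcf.shared_executor.run_cluster %left(%p) {
    // collective update of the payload
  }
  scf.yield %next : !payload
}
```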
is the same as distributing this CFG:
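(Same stand-in `!payload` alias; the `check.done` op is illustrative.)

```mlir
// The equivalent CFG: the analysis must treat %init, %p, and %next as the
// same threadgroup-carried value across the branch edges.
^entry:
  cf.br ^loop(%init : !payload)
^loop(%p: !payload):
  %next = pcf.shared_executor.run_cluster %left(%p) {
    // collective update of the payload
  }
  %done = "check.done"(%next) : (!payload) -> i1
  cf.cond_br %done, ^exit, ^loop(%next : !payload)
^exit:
  // continue
```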
In principle this extends across function boundaries too, though the IR is
currently represented as local regions, meaning functions are (currently) unsupported.
To support proper function calls we'll likely need some flavor of wrapping module, but
that is future work.
GPU Pipeline Rework
This walkthrough uses the run-cluster distribution path for a dynamic
`f16xf16xf32` matmul. The target workgroup tile is 256x128x64,
with a 2x2 subgroup split and lane-local MFMA fragments chosen only
after the collective phase structure has been made explicit.
The main point of the pipeline is to keep two things visible at the
right time: tensor-level structure while choosing tiles, then PCF
phase structure while distributing copy, compute, and writeback work.
1. Create Dispatch Config
This stage records the target pipeline choice and launch shape. The IR is still
the original dispatch-level matmul over dynamic tensors, with the HAL interface
materialized and the RunClusterDistribute translation selected. This is no different
than existing pipelines, and the intent is to reuse the configuration attributes other
pipelines employ.
This example picks a fairly arbitrary 128x256x64 target tile size. This is not
implying that we only support static tile sizes; it's just that tensor doesn't
distinguish between different dynamic sizes well at all, which makes it difficult
to parse what's happening in an IR dump.
IR after CreateDispatchConfigPass
2. Apply Workgroup Tiling
Workgroup tiling looks a bit different here, targeting `pcf.loop` instead of `scf.forall`.
The reason is twofold. First, we need to see the boundary between global and
workgroup-local values; the way to do this is with readonly memory that we read from.
Second, we want to prepare for future strategies that are incompatible with
`scf.forall`, like stream-k.
With that said, for the time being the implementation is still tile + fuse based, so the
core transformation is largely the same.
IR after GPUApplyWorkgroupTilingPass
3. Wrap Workgroup Body
This step introduces the collective subgroup-scope region without changing the
body schedule yet. The workgroup `pcf.loop` still owns the global workgroup
tile, while the actual tile computation is now nested under
`pcf.shared_executor scope(#iree_gpu.subgroup_scope)`.
IR after GPUWrapWorkgroupBodyPass
4. Form Run-Cluster Flow
This splits the shared executor body into explicit phases. Initialization,
compute, and writeback become `run_cluster` regions, and the K loop carries
the accumulator as a threadgroup payload.
IR after GPUFormRunClusterMatmulFlowPass
5. Introduce Shared Memory Copies
This step allocates subgroup shared-memory panels and splits each K iteration
into a copy cluster followed by a compute cluster. The plan is that this is where
shared-memory-level padding would happen; vectorization is responsible for
instruction-level padding.
(Yes, synchronization for the memory is missing here; I haven't decided what it
looks like yet.)
IR after GPUIntroduceSharedMemoryCopiesPass
6. Distribute to Subgroups
This distributes the workgroup tile to a 2x2 subgroup grid. The outer
`run_cluster` structure is preserved at lane scope so the next distribution
step can still see the phase boundaries and shared payload flow.
Note that after this step the overall cluster structure is retained; however,
the scope is now on lanes instead of subgroups. This is because we want
the exact same structure for lane-level distribution as for subgroup-level
distribution.
IR after PCFDistributeRunClusterToSubgroupsPass
7. Distribute to Lanes
This lowers the remaining lane-scope `run_cluster` regions into lane-local
code. It chooses the lane copy chunks and packed tensor fragments and leaves
the program tensor-shaped for vectorization.
IR after PCFDistributeRunClusterToLanesPass
8. Vectorize Lane Code
This vectorizes the lane-local tensors into concrete vector transfers and
MFMA-shaped fragments. The distributed matmul becomes `iree_codegen.inner_tiled`
over vector operands, and the accumulated vectors are written back through PCF
views.
IR after GenericVectorizationPass