Trivial comments:
Will spend some time thinking about this IR and trying to understand the semantics more deeply.
Ok, so after staring at the IR I think I get the model here ... though it's definitely verbose (unavoidably so? Maybe we can grow assembly format shorthands?) and I'm still confused about how distribution is meant to work.
Struct felt like a fairly familiar term, and just a dictionary of types sounds a lot like a struct to me. Also, the struct does not refer to the threadgroup here; it refers to the group-local fields carried by the threadgroup. Collective or composite are also ok, but the name is purely reflected on the field and in docs, so I'm not too concerned with the name here. If something else sounds better that works for me.
They are as arbitrary as a symbol name. So we can pick whatever we want, but the purpose is to give clusters a unique identifier we can use to convert them. One way to think of it is that they are a symbolic reference to the condition under which a thread within the larger group executes the cluster.
We could drop the threadgroup type, but the threadgroup is such a fundamental special case of a cluster (i.e. all the threads) that it made more sense to me for it to be its own type.
I made it so that the split indices are the middle values of a range that covers all of the threads. So it's like a prefix sum of the counts.
Basically along control flow edges. So if you have a value passed from a cluster, through a control flow branch, to the operand of another cluster, the type you distribute that value to must be consistent before and after the branch. So "equivalent" is only referring to the type here. Good question, that definitely wasn't clear reading it a second time.
This proposal covers new IR, transforms, and pipeline changes for integrating
the PCF dialect into GPU code generation, first focusing on LLVMGPU.
There are a few goals here, both short term and longer term, preparing for future changes.
As it stands LLVMGPU codegen has been building up and maintaining
two main lowering pipelines: VectorDistribute and TileAndFuse. VectorDistribute
was introduced initially for the purpose of handling attention and modern matrix
multiplication intrinsics, both of which heavily feature subgroup level operations.
TileAndFuse was added with the primary goal of being a backbone pipeline that could
begin to consolidate the numerous bespoke lowering pipelines that existed in the
LLVMGPU backend at the time (many of these pipelines still exist :( ).
Supporting new hardware and new problems has given us a chance to review our lowering
pipelines and try to unify them. This proposal takes steps toward unifying the
two pipelines, and tries to make it easier to handle:
And to eventually make it easier to handle:
A short list of problems that are outside the scope of this proposal but planned for
future changes:
The general direction is to start weaning the compiler off of static
parameters everywhere and instead design the top level for dynamism from the ground
up. This proposal only covers the control-flow side, but the expectation is that we
will want to take a closer look at the ops and types we use for the meat of a dispatch
afterwards.
PCF Baseline
This section summarizes the landed portions of the PCF dialect. The ops give a structured
way to represent parallel control flow with more flexibility. The core pieces are:
- `!pcf.sref<shape x element_type, scope>`: a shaped reference whose memory space and synchronization meaning come from its scope. `sync(#scope)` is the shorthand used when the reference is synchronized on return from the parent scope.
- `pcf.generic`: a parallel region over the native worker set of a scope. Its body receives worker ids and worker counts. Results are snapshots of tied `sref` arguments after all workers return.
- `pcf.loop`: a scoped parallel loop with explicit `count(...)` operands. Its body receives iteration ids instead of native worker ids, and the scope decides how those iterations map to workers. This can be thought of as an analog for `scf.forall` but without the structured terminator.
- `pcf.alloc`, `pcf.read_slice`, and `pcf.write_slice`: scoped allocation and slice access on `sref` values.
- `pcf.yield`, `pcf.return`, and `pcf.br.cond_return`: the structural terminators and branch support needed to keep PCF ops region based.
The current lowering story for the core is also already in place: tokens are
resolved first, `pcf.sref` is converted to concrete `memref` types, and the
remaining structural PCF ops are lowered to `scf`/`cf`. Existing fusion
utilities can move producers and consumers across `pcf.generic`/`pcf.loop`
boundaries by rewriting `pcf.read_slice` and `pcf.write_slice`. A workgroup
tile can therefore be represented today as a `pcf.loop` over the workgroup
tile grid:
What is missing is a way to ease the distribution process. `pcf.generic` already means
"each worker runs this body with an id"; it does not directly model "this body is
a cooperative program for this group" or the named phases inside that cooperative
program. Directly jumping to distributed code has proven challenging, especially
when considering algorithms that need cross-worker communication.
On its own, PCF as it stands does not provide much beyond what existing
dialects provide. It mainly supports more advanced fusion options. There are a
few key missing pieces:
Essentially, what exists now is the base, the IR proposed here extends it, and
the new transforms + pipeline start connecting the pieces.
New IR
This section covers new types + ops proposed, primarily for specialization.
The extension keeps the existing scoped `sref` model and adds a collective
execution layer. The key distinction is:
- `pcf.generic`/`pcf.loop`: distributed worker code. Worker identity is part of the region interface.
- `pcf.shared_executor`: collective code for a group. The region receives a group handle, and later distribution decides how each operation maps to workers.
Types
`!pcf.threadgroup`
`!pcf.threadgroup<#scope>` is a handle for a reentrant cooperating worker group at
a scope. The threadgroup is a special case covering all workers at the scope. It
deliberately does not expose worker ids. It carries a struct field to ferry values
between regions of execution:
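Something like the following sketch (hypothetical syntax; the field names, scope attribute, and struct printing are made up for illustration):

```mlir
// A threadgroup handle at subgroup scope. The default struct carries values
// held collectively by all workers; the private struct carries one copy per
// worker.
!tg = !pcf.threadgroup<#iree_gpu.subgroup_scope,
                       struct<acc : tensor<128x128xf32>>,   // collective payload
                       private struct<offset : index>>      // per-worker payload
```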
The default struct field carries collective values, values held collectively across
all workers in the group. `!pcf.threadgroup` also carries a private struct field for
values held locally within one worker at the scope (i.e. every worker carries one copy
of the struct). Because the intended common case is to use this type with collective
values, we give that struct type prettier printing.
Another way to interpret this type is as a class that contains the thread id as a private
field which can only be accessed through the structured ops described below.
`!pcf.cluster`
`!pcf.cluster<#scope, bounds, id>` represents a rectangular subset of a
threadgroup's worker grid. Bounds are half-open affine ranges. The affine map
uses `d0`, `d1`, ... for explicit SSA-dependent split values and `s0`, `s1`,
... for the implicit worker counts of the scope.
`left`, `right`, and `tg.c01` are cluster ids. They uniquely identify the cluster
within the region or IR where it is used.
Clusters are a more general case of a threadgroup, and thus have the same overall
structure. They can carry either private or shared struct payloads:
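For example, a sketch of both payload kinds (hypothetical syntax; the bounds maps, ids, and struct fields here are illustrative stand-ins):

```mlir
// Two clusters splitting the first worker dimension in half. s0 is the
// implicit worker count of the scope; bounds are half-open ranges.
!left  = !pcf.cluster<#iree_gpu.subgroup_scope,
                      affine_map<()[s0] -> (0, s0 floordiv 2)>, tg.left,
                      struct<acc : tensor<128x128xf32>>>      // shared payload
!right = !pcf.cluster<#iree_gpu.subgroup_scope,
                      affine_map<()[s0] -> (s0 floordiv 2, s0)>, tg.right,
                      private struct<frag : vector<4xf32>>>   // private payload
```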
The private/shared split is important because it keeps "per worker" values and
"collective" values separate in the type system, which tells us which kinds of values
are valid to forward to a region executing them.
Cluster ids use `#pcf.ns_sym` internally and print as dot-qualified names such
as `tg.left` or `outer.inner.leaf`. This ties the cluster to the parent operation
that defined it, giving it a unique identifier within the region. The bounds map
alone is not enough to identify a cluster uniquely because it can depend on SSA
values. Private namespaces give the compiler a stable identity for region-wide
conversion without relying on globally unique string names.
The purpose of the bounds map is to link the global id to the cluster's local id.
Requiring contiguous rectangular threadblocks is more restrictive than we'll
realistically want eventually, but for now it makes that mapping easy.
Ops
`pcf.shared_executor`
`pcf.shared_executor` is a collective region at a scope. It has the same
structure as `pcf.generic` (initializer -> execute), but the execute region
receives a `!pcf.threadgroup` instead of id/count arguments. Readonly operands
use `<-` and produce no result. Readwrite operands use `=`, are tied to
tensor/memref inits, and are snapshotted as results, matching the existing PCF
tied-result model.
Every operation within the execute region of the shared_executor is a collective.
The semantics for how that operation is executed across the workers are either
up to the operation itself or unspecified. In simpler terms, this can be thought
of as an undistributed version of `pcf.generic`. More examples are shown below.
`pcf.shared_executor.tile_group`
`tile_group` partitions a threadgroup or cluster into named clusters. This is
the structural operation that makes warp/wave specialization representable:
different clusters can retain different bodies until distribution and lowering.
The lowering computes the current worker's cluster index from the scope worker
ids and emits `scf.index_switch`, cloning only the matching cluster's code into
each case. For example:
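(The following sketch is illustrative; the `split[...]` form and the elided cluster types are guessed from the prose, not the landed assembly format.)

```mlir
// Partition a threadgroup into two named clusters with distinct bodies.
pcf.shared_executor.tile_group %tg split[%mid] {
^bb0(%left: !pcf.cluster<..., tg.left, ...>, %right: !pcf.cluster<..., tg.right, ...>):
  pcf.shared_executor.run_cluster %left {
    // collective copy phase, executed only by tg.left workers
  }
  pcf.shared_executor.run_cluster %right {
    // collective compute phase, executed only by tg.right workers
  }
}
```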
After distribution, each worker computes the cluster it belongs to and the two
cluster bodies become separate switch cases:
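A sketch of the lowered form, assuming an even half split at `%mid` (the index math and case bodies here are illustrative):

```mlir
// Each worker computes which cluster it belongs to from its scope worker id,
// then runs only the body cloned for that cluster.
%case = arith.divui %worker_id, %mid : index  // 0 below the split, 1 above
scf.index_switch %case
case 0 {
  // tg.left body, now ordinary per-worker code
  scf.yield
}
case 1 {
  // tg.right body, now ordinary per-worker code
  scf.yield
}
default {
  scf.yield
}
```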
Mechanically this is implemented by replicating the `tile_group` region for
each cluster and then converting each region with a simple type converter that
either forwards the struct fields or zeroes out the type.
`pcf.shared_executor.run_cluster`
`run_cluster` marks a collective phase over a cluster. It consumes shared
payloads from its source clusters and can yield a new shared payload:
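For instance, something like this sketch (hypothetical syntax; the payload operand form and the `compute.step` op are stand-ins):

```mlir
// A collective phase over tg.left: consumes the cluster's shared payload and
// yields an updated one. Ops inside run over just the tg.left worker subset.
%next = pcf.shared_executor.run_cluster %left(%acc) {
^bb0(%payload: tensor<128x128xf32>):
  %updated = "compute.step"(%payload) : (tensor<128x128xf32>) -> tensor<128x128xf32>
  pcf.yield %updated : tensor<128x128xf32>
}
```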
Importantly, this is the op that actually shifts from the parent context to just the
worker subset specified by the cluster, not the `tile_group` typically wrapping it.
`pcf.shared_executor.run_thread`
`run_thread` is the per-worker counterpart to `run_cluster`. It consumes private
cluster payloads, provides tile-relative thread ids, and yields private payloads.
It is the form produced after a collective region has been distributed to
individual workers.
`pcf.subview` and `pcf.expand_shape`
These are `sref` view ops analogous to `memref.subview` and `tensor.expand_shape`.
They are more for completeness than anything but were left out of earlier
iterations of the dialect design and ultimately proved necessary for
handling MMAs.
Examples
This section shows a few IR samples of what we can represent with the above.
Horizontal Warp/Wave Specialization
A common pattern in high performance kernels involves a basic producer/consumer
wave split: say half of the waves load data while the other half does the
compute. Written by hand with thread-level code, this would look something like this:
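A sketch of the shape of that code, using standard `gpu`/`scf`/`arith` ops rather than anything from this proposal (constants and the copy/compute bodies are elided):

```mlir
// Hand-written wave split: the first half of the subgroups copies tiles into
// shared memory while the second half computes from the previous tile.
%sg   = gpu.subgroup_id : index
%n    = gpu.num_subgroups : index
%half = arith.divui %n, %c2 : index
%is_producer = arith.cmpi ult, %sg, %half : index
scf.for %k = %c0 to %K step %c1 {
  scf.if %is_producer {
    // global -> shared memory copy for iteration %k
  } else {
    // mma compute from the shared memory written at iteration %k - 1
  }
  gpu.barrier
}
```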
There are a few issues with a representation like this. First, we're typically ingesting
the program as a single instruction stream. Earmarking operations for different groups,
distributing, and splitting the streams in one go is too large of a jump for a single
transformation. We would much prefer to break it up. Second, it is substantially harder
to match related regions of code with this representation. It requires matching
synchronization and control flow between the two regions. Instead, with reentrant cluster
types we can write the same thing as follows: each cluster runs
different code while still being represented inside one structured region.
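An illustrative sketch of that structured form (the assembly format and cluster types are guessed, not the landed syntax):

```mlir
// The same producer/consumer split, but structural: the two bodies are named
// clusters inside one region instead of divergent branches on thread ids.
pcf.shared_executor scope(#iree_gpu.subgroup_scope) {
^bb0(%tg: !pcf.threadgroup<#iree_gpu.subgroup_scope>):
  pcf.shared_executor.tile_group %tg split[%half] {
  ^bb0(%producers: !pcf.cluster<..., tg.left, ...>,
       %consumers: !pcf.cluster<..., tg.right, ...>):
    pcf.shared_executor.run_cluster %producers {
      // collective global -> shared copies
    }
    pcf.shared_executor.run_cluster %consumers {
      // collective compute from shared memory
    }
  }
  pcf.return
}
```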
Vertical Warp/Wave Specialization
The previous example shows "horizontal" specialization (different workers
performing completely different tasks towards a larger goal). This example
is meant to illustrate vertical specialization (different workers doing the
same task but in different orders). This is useful for overlapping execution
of different instruction types. This lets us do something similar to what
we do in the pingpong ukernels in the ROCM backend. Start with a collective
matmul with outlined distinct regions:
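A sketch of the starting point (illustrative only; `pcf.copy_token`/`pcf.wait` and the `compute.mma_tile` op are stand-ins, per the note below):

```mlir
// A collective K loop with the copy outlined behind a token, so the copy and
// compute phases are distinct regions that can later be rescheduled.
pcf.shared_executor scope(#iree_gpu.subgroup_scope) {
^bb0(%tg: !pcf.threadgroup<#iree_gpu.subgroup_scope>):
  %acc = scf.for %k = %c0 to %K step %c64 iter_args(%a = %init) -> (tensor<128x128xf32>) {
    %tok = "pcf.copy_token"(%lhs, %rhs, %k) : (tensor<?x?xf16>, tensor<?x?xf16>, index) -> !pcf.token
    "pcf.wait"(%tok) : (!pcf.token) -> ()  // barrier-like: copies become visible
    %next = "compute.mma_tile"(%a) : (tensor<128x128xf32>) -> tensor<128x128xf32>
    scf.yield %next : tensor<128x128xf32>
  }
  pcf.return
}
```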
(Note: some of the IR here, namely `copy_token`, is included for illustration
purposes. It is just a suggestion and isn't part of the proposal.)
Because the worker count itself is dynamic, code running on a threadgroup can
be split by just splitting the threadgroup. That would look something like this:
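Sketched (illustrative syntax; the `tg.c0`/`tg.c1` ids and split operand are stand-ins):

```mlir
// The same loop, replicated verbatim onto two clusters covering each half of
// the threadgroup. Both halves still run identical code.
pcf.shared_executor.tile_group %tg split[%half] {
^bb0(%lo: !pcf.cluster<..., tg.c0, ...>, %hi: !pcf.cluster<..., tg.c1, ...>):
  pcf.shared_executor.run_cluster %lo {
    // copy(k); wait; mma(k) -- identical loop body
  }
  pcf.shared_executor.run_cluster %hi {
    // copy(k); wait; mma(k) -- identical loop body
  }
}
```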
Here the code is doing exactly the same thing as in the first sample, just
in two parts. After splitting it, though, we can now shear the clusters
by moving half of them one step past the barrier:
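Again as a sketch (illustrative; only the comments change relative to the previous form):

```mlir
// After shearing: one cluster is advanced half an iteration, so its mma
// overlaps with the other cluster's copy, pingpong style.
pcf.shared_executor.tile_group %tg split[%half] {
^bb0(%lo: !pcf.cluster<..., tg.c0, ...>, %hi: !pcf.cluster<..., tg.c1, ...>):
  pcf.shared_executor.run_cluster %lo {
    // copy(k); wait; mma(k)
  }
  pcf.shared_executor.run_cluster %hi {
    // mma(k - 1); copy(k); wait -- shifted one step past the barrier
  }
}
```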
Lowerings
Cluster Resolution
Cluster resolution lowers cluster groups to ordinary control flow. This pass is
only expected to see fully distributed clusters. By the time it runs, all
collective work has already been distributed to the current worker level, so
resolution does not perform new distribution decisions. It just replicates the
cluster region once per case and dialect-converts each clone.
Consider a two-cluster region that contains nested control flow:
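For example (illustrative sketch; cluster types elided):

```mlir
// Two clusters, each with its own nested control flow, inside one region.
pcf.shared_executor.tile_group %tg split[%mid] {
^bb0(%left: !pcf.cluster<..., tg.left, ...>, %right: !pcf.cluster<..., tg.right, ...>):
  pcf.shared_executor.run_cluster %left {
    scf.for %i = %c0 to %n step %c1 {
      // left-only loop body
    }
  }
  pcf.shared_executor.run_cluster %right {
    scf.if %cond {
      // right-only conditional body
    }
  }
}
```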
The first step computes the current worker's cluster id and replicates the
region into one case per cluster. Immediately after replication, the cloned
regions still contain both clusters; only the case mapping has changed. The
cloned block arguments are shown with the same names as the source region:
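Sketched (illustrative; the index math assumes the half split from the example above):

```mlir
// After replication: one case per cluster, but each clone still holds the
// full two-cluster region. Nothing has been converted yet.
%case = arith.divui %worker_id, %mid : index  // 0 for tg.left, 1 for tg.right
scf.index_switch %case
case 0 {
  // clone tagged for tg.left: still contains both run_cluster ops, with the
  // same block argument names as the source region
  scf.yield
}
case 1 {
  // clone tagged for tg.right: identical contents, different case mapping
  scf.yield
}
default {
  scf.yield
}
```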
Then each case is dialect-converted with a case-specific cluster type converter.
The active cluster converts to ordinary per-worker code, while the inactive
cluster converts away:
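A sketch of the converted result (illustrative bodies):

```mlir
// After conversion: the active cluster becomes ordinary per-worker code and
// the inactive cluster converts away entirely.
scf.index_switch %case
case 0 {
  scf.for %i = %c0 to %n step %c1 {
    // left-only loop body, now per-worker
  }
  scf.yield
}
case 1 {
  scf.if %cond {
    // right-only conditional body, now per-worker
  }
  scf.yield
}
default {
  scf.yield
}
```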
The important part is that control flow is not interpreted specially. Loops,
conditionals, and nested regions are simply cloned with the cluster case and
then converted like we would expect from any standard lowering. Also notice
that values crossing control flow edges are handled naturally this way.
For example, a loop may carry a cluster payload from only one cluster:
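Sketched (illustrative; the cluster payload type and `run_thread` operand form are guessed from the prose):

```mlir
// A loop that carries a private payload belonging only to tg.left.
%r = scf.for %i = %c0 to %n step %c1 iter_args(%p = %init)
    -> (!pcf.cluster<..., tg.left, private struct<frag : vector<4xf32>>>) {
  %next = pcf.shared_executor.run_thread %p {
    // per-worker update of the tg.left fragment
  }
  scf.yield %next : !pcf.cluster<..., tg.left, private struct<frag : vector<4xf32>>>
}
```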
After replication and conversion, both cases keep the same loop structure. The
active-left case carries the converted private struct fields, while the
active-right case has no loop-carried payload:
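Sketched (illustrative; the `compute.update` op is a stand-in):

```mlir
// Active-left case: the payload converts to its per-worker struct fields.
%r = scf.for %i = %c0 to %n step %c1 iter_args(%frag = %init) -> (vector<4xf32>) {
  %next = "compute.update"(%frag) : (vector<4xf32>) -> vector<4xf32>
  scf.yield %next : vector<4xf32>
}
```

```mlir
// Active-right case: the payload converts away; the loop carries nothing.
scf.for %i = %c0 to %n step %c1 {
  // right-only body
}
```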
Distribution
The previous section focused on how to deal with resolving the surrounding structure
of distributable regions. The other side of the same coin is how to distribute the
contents of each region. This is a fairly nuanced question with many competing
implementations and IR, so the expectation is that trying to fix a one-size-fits-all
approach right now is a bad choice. Instead we can try to formulate an interface through
which distribution can happen and provide a baseline implementation for it. Consider
a chain of two distributable regions:
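Sketched (illustrative; the `compute.produce`/`compute.consume` ops are stand-ins):

```mlir
// A shared payload produced by one phase and consumed by the next. Both
// regions must agree on how %v is distributed.
%v = pcf.shared_executor.run_cluster %left {
  %t = "compute.produce"() : () -> tensor<64xf32>
  pcf.yield %t : tensor<64xf32>
}
pcf.shared_executor.run_cluster %left(%v) {
^bb0(%arg: tensor<64xf32>):
  "compute.consume"(%arg) : (tensor<64xf32>) -> ()
}
```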
Because there is a shared value passed between the two clusters, we have
to distribute them together. Distributing both regions at once
means that any implementation responsible for distribution would need to
walk the regions to determine how to resolve those two regions together. We
don't want to force that onto every implementation, so instead we provide
an equivalence analysis that walks control flow and determines when two
threadgroup-carried values are equivalent. The above example is obviously very
simple, but with more complex control flow, like loop-carried values, it's more relevant.
For example, from the perspective of a distribution implementation, this:
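(Illustrative sketch; `!payload` is a stand-in alias for a cluster payload type.)

```mlir
// A cluster payload carried through a structured loop.
%r = scf.for %i = %c0 to %n step %c1 iter_args(%p = %init) -> (!payload) {
  %next = pcf.shared_executor.run_cluster %left(%p) {
    // collective update of the payload
  }
  scf.yield %next : !payload
}
```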
is the same as distributing this CFG:
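(Same stand-in `!payload` alias; the `check.done` op is illustrative.)

```mlir
// The equivalent CFG: the analysis must treat %init, %p, and %next as the
// same threadgroup-carried value across the branch edges.
^entry:
  cf.br ^loop(%init : !payload)
^loop(%p: !payload):
  %next = pcf.shared_executor.run_cluster %left(%p) {
    // collective update of the payload
  }
  %done = "check.done"(%next) : (!payload) -> i1
  cf.cond_br %done, ^exit, ^loop(%next : !payload)
^exit:
  // continue
```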
In principle this extends across function boundaries too, though the IR is
currently represented as local regions, meaning functions are (currently) unsupported.
To support proper function calls we'll likely need some flavor of wrapping module, but
that is future work.
GPU Pipeline Rework
This walkthrough uses the run-cluster distribution path for a dynamic
`f16xf16xf32` matmul. The target workgroup tile is 256x128x64,
with a 2x2 subgroup split and lane-local MFMA fragments chosen only
after the collective phase structure has been made explicit.
The main point of the pipeline is to keep two things visible at the
right time: tensor-level structure while choosing tiles, then PCF
phase structure while distributing copy, compute, and writeback work.
1. Create Dispatch Config
This stage records the target pipeline choice and launch shape. The IR is still
the original dispatch-level matmul over dynamic tensors, with the HAL interface
materialized and the RunClusterDistribute translation selected. This is no different
than existing pipelines, and the intent is to reuse the configuration attributes other
pipelines employ.
This example picks a fairly arbitrary 128x256x64 target tile size. This is not
implying that we only support static tile sizes; it's just that tensor doesn't
distinguish between different dynamic sizes well at all, which makes it difficult
to parse what's happening in an IR dump.
IR after CreateDispatchConfigPass
2. Apply Workgroup Tiling
Workgroup tiling looks a bit different here, targeting `pcf.loop` instead of `scf.forall`.
The reason is twofold. First, we need to see the boundary between global and
workgroup-local values; the way to do this is with readonly memory that we read from.
Second, we want to prepare for future strategies that are incompatible with
`scf.forall`, like stream-k.
With that said, for the time being the implementation is still tile + fuse based, so the
core transformation is largely the same.
IR after GPUApplyWorkgroupTilingPass
3. Wrap Workgroup Body
This step introduces the collective subgroup-scope region without changing the
body schedule yet. The workgroup `pcf.loop` still owns the global workgroup
tile, while the actual tile computation is now nested under
`pcf.shared_executor scope(#iree_gpu.subgroup_scope)`.
IR after GPUWrapWorkgroupBodyPass
4. Form Run-Cluster Flow
This splits the shared executor body into explicit phases. Initialization,
compute, and writeback become `run_cluster` regions, and the K loop carries
the accumulator as a threadgroup payload.
IR after GPUFormRunClusterMatmulFlowPass
5. Introduce Shared Memory Copies
This step allocates subgroup shared-memory panels and splits each K iteration
into a copy cluster followed by a compute cluster. The plan is that this is where
shared-memory-level padding would happen; vectorization is responsible for
instruction-level padding.
(Yes, synchronization for the memory is missing here; I haven't decided what it
looks like yet.)
IR after GPUIntroduceSharedMemoryCopiesPass
6. Distribute to Subgroups
This distributes the workgroup tile to a 2x2 subgroup grid. The outer
`run_cluster` structure is preserved at lane scope so the next distribution
step can still see the phase boundaries and shared payload flow.
Note that after this step the overall cluster structure is retained; however,
the scope is now on lanes instead of subgroups. This is because we want
the exact same structure for lane-level distribution as for subgroup-level
distribution.
IR after PCFDistributeRunClusterToSubgroupsPass
7. Distribute to Lanes
This lowers the remaining lane-scope `run_cluster` regions into lane-local
code. It chooses the lane copy chunks and packed tensor fragments and leaves
the program tensor-shaped for vectorization.
IR after PCFDistributeRunClusterToLanesPass
8. Vectorize Lane Code
This vectorizes the lane-local tensors into concrete vector transfers and
MFMA-shaped fragments. The distributed matmul becomes `iree_codegen.inner_tiled`
over vector operands, and the accumulated vectors are written back through PCF
views.
IR after GenericVectorizationPass