Primitive Ordered Pixel Shading
Primitive Ordered Pixel Shading (POPS) is a feature, available starting from GFX9, that provides the Fragment Shader Interlock or Fragment Shader Ordering functionality.
It allows a part of a fragment shader — an ordered section (or a critical section) — to be executed sequentially in rasterization order for different invocations covering the same pixel position.
This article describes how POPS is set up in shader code and the registers. The information here is currently provided for architecture generations up to GFX11.
Note that the information in this article is not official and may contain inaccuracies, as well as incomplete or incorrect assumptions. It is based on the shader code output of the Radeon GPU Analyzer for Rasterizer Ordered View usage in Direct3D shaders, AMD’s Platform Abstraction Library (PAL), ISA references, and experimentation with the hardware.
Shader code
With POPS, a wave can dynamically execute up to one ordered section. It is fine for a wave not to enter an ordered section at all if it doesn’t need ordering on its execution path, however.
The setup of the ordered section consists of three parts:
Entering the ordered section in the current wave — awaiting the completion of ordered sections in overlapped waves.
Resolving overlap within the current wave — intrawave collisions (optional and GFX9–10.3 only).
Exiting the ordered section — resuming overlapping waves trying to enter their ordered sections.
GFX9–10.3: Entering the ordered section in the wave
Awaiting the completion of ordered sections in overlapped waves is performed by setting the POPS packer hardware register, and then polling the volatile pops_exiting_wave_id ALU operand source until its value exceeds the newest overlapped wave ID for the current wave.

The information needed for the wave to perform the waiting is provided to it via the SGPR argument COLLISION_WAVEID. Its loading needs to be enabled in the SPI_SHADER_PGM_RSRC2_PS and PA_SC_SHADER_CONTROL registers (note that, unlike various other arguments, the POPS arguments specifically need to be enabled not only in RSRC, but in PA_SC_SHADER_CONTROL as well).
The collision wave ID argument contains the following unsigned values:
[31]: Whether overlap has occurred.
[29:28] (GFX10+) / [28] (GFX9): ID of the packer the wave should be associated with.
[25:16]: Newest overlapped wave ID.
[9:0]: Current wave ID.
The 2020 RDNA and RDNA 2 ISA references list incorrect offsets and widths for these fields, possibly from an early development iteration, but their described meanings are accurate.
The wait must not be performed if the “did overlap” bit 31 is set to 0; otherwise, a hang will occur. A zero bit also indicates that the current wave has neither wave overlap nor intrawave collisions, so if the bit is 0, it’s safe for the wave to skip all of the POPS logic completely and, as a potential additional optimization, execute the contents of the ordered section as usual with unordered access. The packer hardware register, however, may safely be set even without overlap; it’s only the wait loop that must not be executed when no overlap was reported.
The packer ID needs to be passed to the packer hardware register using s_setreg_b32 so the wave can poll pops_exiting_wave_id on that packer.

On GFX9, the MODE (1) hardware register has two bits specifying which packer the wave is associated with:
[25]: The wave is associated with packer 1.
[24]: The wave is associated with packer 0.
Initially, both of these bits are set to 0, meaning that POPS is disabled for the wave. If the wave needs to enter the ordered section, it must set bit 24 to 1 if the packer ID in COLLISION_WAVEID is 0, or bit 25 to 1 if the packer ID is 1.
Starting from GFX10, the POPS_PACKER (25) hardware register is used instead, containing the following fields:
[2:1]: Packer ID.
[0]: POPS enabled for the wave.
Initially, POPS is disabled for a wave. To start entering the ordered section, bits 2:1 must be set to the packer ID from COLLISION_WAVEID, and bit 0 needs to be set to 1.
The wave IDs, both in COLLISION_WAVEID and pops_exiting_wave_id, are 10-bit values wrapping around on overflow: consecutive waves are numbered 1022, 1023, 0, 1… This wraparound needs to be taken into account when comparing the exiting wave ID and the newest overlapped wave ID.
Specifically, until the current wave exits the ordered section, its ID can’t be smaller than the newest overlapped wave ID or the exiting wave ID. So current_wave_id + 1 can be subtracted from 10-bit wave IDs to remap them to monotonically increasing unsigned values. In this case, the largest value, 0xFFFFFFFF, will correspond to the current wave; 10-bit values up to the current wave ID will be in a range near 0xFFFFFFFF growing towards it, and wave IDs from before the last wraparound will be near 0, increasing away from it. Subtracting current_wave_id + 1 is equivalent to adding ~current_wave_id.
GFX9 has an off-by-one error in the newest overlapped wave ID: if the 10-bit newest overlapped wave ID is greater than the 10-bit current wave ID (meaning that it’s behind the last wraparound point), 1 needs to be added to the newest overlapped wave ID before using it in the comparison. This was corrected in GFX10.
The exiting wave ID (not to be confused with “exited”: the exiting wave ID is the wave that will exit the ordered section next) is queried via the pops_exiting_wave_id ALU operand source, numbered 239. Normally, it will be one of the arguments of s_add_i32 that remaps it from a wrapping 10-bit wave ID to a monotonically increasing one.
It’s a volatile operand, and it needs to be read in a loop until its value becomes greater than the newest overlapped wave ID (after remapping both to monotonic). However, if it’s too early for the current wave to enter the ordered section, it needs to yield execution, via s_sleep, to other waves that may potentially be overlapped. GFX9 requires a finite delay to be specified; AMD uses 3. Starting from GFX10, exiting the ordered section wakes up the waiting waves, so the maximum delay of 0xFFFF can be used.
In pseudocode, the entering logic would look like this:
bool did_overlap = collision_wave_id[31];
if (did_overlap) {
   if (gfx_level >= GFX10) {
      uint packer_id = collision_wave_id[29:28];
      s_setreg_b32(HW_REG_POPS_PACKER[2:0], 1 | (packer_id << 1));
   } else {
      uint packer_id = collision_wave_id[28];
      s_setreg_b32(HW_REG_MODE[25:24], packer_id ? 0b10 : 0b01);
   }
   uint current_10bit_wave_id = collision_wave_id[9:0];
   // Or -(current_10bit_wave_id + 1).
   uint wave_id_remap_offset = ~current_10bit_wave_id;
   uint newest_overlapped_10bit_wave_id = collision_wave_id[25:16];
   if (gfx_level < GFX10 &&
       newest_overlapped_10bit_wave_id > current_10bit_wave_id) {
      ++newest_overlapped_10bit_wave_id;
   }
   uint newest_overlapped_wave_id =
      newest_overlapped_10bit_wave_id + wave_id_remap_offset;
   while (!(src_pops_exiting_wave_id + wave_id_remap_offset >
            newest_overlapped_wave_id)) {
      s_sleep(gfx_level >= GFX10 ? 0xFFFF : 3);
   }
}
The SPIR-V fragment shader interlock specification requires an invocation (an individual invocation, not the whole subgroup) to execute OpBeginInvocationInterlockEXT exactly once. However, if there are multiple begin instructions, or even multiple begin/end pairs, under divergent conditions, a wave may end up waiting for the overlapped waves multiple times. Thankfully, it’s safe to set the POPS packer hardware register to the same value, or to run the wait loop, multiple times during the wave’s execution, as long as the wave doesn’t exit the ordered section in between.
GFX11: Entering the ordered section in the wave
Instead of exposing wave IDs to shaders, GFX11 uses the “export ready” wave status flag to report that the wave may enter the ordered section. It’s awaited by the s_wait_event instruction, with bit 0 (“don’t wait for export_ready”) of the immediate operand set to 0. On GFX11 specifically, AMD passes 0 as the whole immediate operand.
The “export ready” wait can be done multiple times safely.
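In the same pseudocode style as the entering example above, the GFX11 entry therefore reduces to a single instruction:

```
// Bit 0 ("don't wait for export_ready") = 0: block until export_ready is set.
s_wait_event(0);
```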
GFX9–10.3: Resolving intrawave collisions
On GFX9–10.3, it’s possible for overlapping fragment shader invocations to be placed not only in different waves, but also in the same wave, with the shader code making sure that the ordered section is executed for overlapping invocations in order.
This functionality is optional: it can be activated by enabling loading of the INTRAWAVE_COLLISION SGPR argument in SPI_SHADER_PGM_RSRC2_PS and PA_SC_SHADER_CONTROL.
The lower 8 or 16 (depending on the wave size) bits of INTRAWAVE_COLLISION contain a mask of whether each quad in the wave starts a new layer of overlapping invocations, and thus whether the ordered section code for that quad needs to be executed only after running it for all lanes with indices preceding that quad index multiplied by 4. The rest of the bits in the argument need to be ignored; AMD explicitly masks them out in shader code (although this is not necessary if the shader uses “find first 1” to obtain the start of the next set of overlapping quads or expands this quad mask into a lane mask).
For example, if the intrawave collision mask is 0b0000001110000100, or (1 << 2) | (1 << 7) | (1 << 8) | (1 << 9), the code of the ordered section needs to be executed first only for quads 1:0 (lanes 7:0), then only for quads 6:2 (lanes 27:8), then for quad 7 (lanes 31:28), then for quad 8 (lanes 35:32), and then for the remaining quads 15:9 (lanes 63:36).
This effectively causes the ordered section to be executed as smaller “sub-subgroups” within the original subgroup.
However, this is not always compatible with the execution model of SPIR-V or GLSL fragment shaders, so enabling intrawave collisions and wrapping a part of the shader in a loop may be unsafe in some cases. One particular example is when the shader uses subgroup operations influenced by lanes outside the current quad. In this case, the code outside and inside the ordered section may be executed with different sets of active invocations, affecting the results of subgroup operations. But in SPIR-V and GLSL, fragment shader interlock is not supposed to modify the set of active invocations in any way. So the intrawave collision loop may break the results of subgroup operations in unpredictable ways, even outside the driver’s compiler infrastructure. Even if the driver splits the subgroup exactly at OpBeginInvocationInterlockEXT and makes the lane subsets rejoin exactly at OpEndInvocationInterlockEXT, the application and the compilers that created the source shader are still not aware of that happening: the input SPIR-V or GLSL shader might have already gone through various optimizations, such as common subexpression elimination, which might have considered a subgroup operation before OpBeginInvocationInterlockEXT and one after it equivalent.
The idea behind reporting intrawave collisions to shaders is to reduce the impact on the parallelism of the part of the shader that doesn’t depend on the ordering, to avoid wasting lanes in the wave and to allow the code outside the ordered section in different invocations to run in parallel lanes as usual. This may be especially helpful if the ordered section is small compared to the rest of the shader — for instance, a custom blending equation in the end of the usual fragment shader for a surface in the world.
However, whether handling intrawave collisions is preferred is not a question with one universal answer. Intrawave collisions are pretty uncommon without multisampling, or when using sample interlock with multisampling, although they’re highly frequent with pixel interlock with multisampling, when adjacent primitives cover the same pixels along the shared edge (though that’s an extremely expensive situation in general). But resolving intrawave collisions adds some overhead costs to the shader. If intrawave overlap is unlikely to happen often, or even more importantly, if the majority of the shader is inside the ordered section, handling it in the shader may cause more harm than good.
GFX11 removes this concept entirely; instead, overlapping invocations are always placed in different waves.
GFX9–10.3: Exiting the ordered section in the wave
To exit the ordered section and let overlapping waves resume execution and enter their ordered sections, the wave needs to send the ORDERED_PS_DONE message (7) using s_sendmsg.
If the wave has enabled POPS by setting the packer hardware register, it must not execute s_endpgm without having sent ORDERED_PS_DONE once, so the message must be sent on all execution paths after the packer register setup. However, if the wave exits before having configured the packer register, sending the message is not required, though it’s still fine to send it regardless of that.
Note that if the shader has multiple OpEndInvocationInterlockEXT instructions executed in the same wave (depending on a divergent condition, for example), it must still be ensured that ORDERED_PS_DONE is sent by the wave only once, and especially not before any awaiting of overlapped waves.
Before the message is sent, all counters for memory accesses that need to be primitive-ordered must be awaited: both writes and reads (in case something after the ordered section depends on the per-pixel data, for instance, the tail blending fallback in order-independent transparency). Those may include vm, vs, and in some cases lgkm. (Normally, primitive-ordered memory accesses will be done through VMEM with divergent addresses, not SMEM, as there’s no synchronization between fragments at different pixel coordinates; but it’s still technically possible for a shader, even though pointless and nonoptimal, to explicitly perform them in a waterfall loop, for instance, and that must work correctly too.) Without that, a race condition will occur when the newly resumed waves start accessing the memory locations to which there still are outstanding accesses in the current wave.
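In the same pseudocode style as the entering example, the exit sequence could be sketched like this (the exact instruction spellings vary between generations; the separate store counter exists starting from GFX10):

```
// Await all outstanding primitive-ordered memory accesses first.
s_waitcnt(vmcnt(0) & lgkmcnt(0));
s_waitcnt_vscnt(0); // GFX10+: stores are counted separately in vscnt.
// Resume the overlapping waves.
s_sendmsg(ORDERED_PS_DONE);
```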
Another option for exiting is the s_endpgm_ordered_ps_done instruction, which combines waiting for all the counters, sending the ORDERED_PS_DONE message, and ending the program. Generally, however, it’s desirable to resume overlapping waves as early as possible, including before the export, as it may stall the wave for some time too.
GFX11: Exiting the ordered section in the wave
The overlapping waves are resumed when the wave performs the last export (with the done flag).
The same requirements for awaiting the memory access counters as on GFX9–10.3 still apply.
Memory access requirements
The compiler needs to ensure that entering the ordered section implements acquire semantics, and exiting it implements release semantics, in the fragment interlock memory scope for the UniformMemory and ImageMemory SPIR-V storage classes.
A fragment interlock memory scope instance includes overlapping fragment shader invocations executed by commands inside a single subpass. It may be considered a subset of a queue family memory scope instance from the perspective of memory barriers.
Fragment shader interlock doesn’t perform implicit memory availability or visibility operations. Shaders must do them by themselves for accesses requiring primitive ordering, such as via coherent (queuefamilycoherent) in GLSL or MakeAvailable and MakeVisible in at least the QueueFamily scope in SPIR-V.
On AMD hardware, this means that the accessed memory locations must be made available or visible between waves that may be executed on any compute unit — so accesses must go directly to the global L2 cache, bypassing L0$ via the GLC flag and L1$ via DLC.
However, it should be noted that memory accesses in the ordered section may be expected by the application to be done in primitive order even if they don’t have the GLC and DLC flags. Coherent access not only bypasses, but also invalidates the lower-level caches for the accessed memory locations. Considering that per-pixel data is normally accessed exclusively by the invocation executing the ordered section, it’s not necessary to make all reads and writes of one memory location in the ordered section GLC/DLC — just the first read and the last write: it doesn’t matter if per-pixel data is cached in L0/L1 in the middle of a dependency chain in the ordered section, as long as it’s invalidated in them in the beginning and flushed to L2 in the end. Therefore, optimizations in the compiler must not simply assume that only coherent accesses need primitive ordering; moreover, the compiler must also take into account that the same data may be accessed through different bindings.
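A minimal sketch of this access pattern, in the same pseudocode style as the earlier examples (the load/store spellings and the modify step are illustrative, not actual ISA):

```
// First read of the per-pixel data: GLC/DLC bypass and invalidate L0/L1.
value = load(per_pixel_address, GLC | DLC);
// Intermediate accesses in the dependency chain may use the caches freely.
value = modify(value);
// Last write: GLC/DLC again, so the result is flushed through to the global L2.
store(per_pixel_address, value, GLC | DLC);
```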
Export requirements
With POPS, on all hardware generations, the shader must have at least one export, though it can be a null or an off, off, off, off one. Also, even if the shader doesn’t need to export any real data, the export skipping that was added in GFX10 must not be used, and some space must be allocated in the export buffer, such as by setting SPI_SHADER_COL_FORMAT for some color output to SPI_SHADER_32_R.
Without this, the shader will be executed without the needed synchronization on GFX10, and will hang on GFX11.
Drawing context setup
Configuring POPS
Most of the configuration is performed via the DB_SHADER_CONTROL register.

To enable POPS for the draw, DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER should be set to 1.
On GFX9–10.3, DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES controls which fragment shader invocations are considered overlapping:
For pixel interlock, it must be set to 0 (1 sample).
If sample interlock is sufficient (only synchronizing between invocations that have any common sample mask bits), it may be set to PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES — the number of sample coverage mask bits passed to the shader, which is expected to use the sample mask to determine whether it’s allowed to access the data for each of the samples. As of April 2023, PAL for some reason doesn’t use non-1x POPS_OVERLAP_NUM_SAMPLES at all, even when using Direct3D Rasterizer Ordered Views or GL_INTEL_fragment_shader_ordering with sample shading (those APIs tie the interlock granularity to the shading frequency — Vulkan and OpenGL fragment shader interlock, however, allow specifying the interlock granularity independently of it, making it possible both to ask for finer synchronization guarantees and to require stronger ones than Direct3D ROVs can provide). However, with MSAA, on AMD hardware, pixel interlock generally performs massively, sometimes prohibitively, slower than sample interlock, because it causes fragment shader invocations along the common edge of adjacent primitives to be ordered as they cover the same pixels (even though they don’t cover any common samples). So it’s highly desirable for the driver to provide sample interlock, and to set POPS_OVERLAP_NUM_SAMPLES accordingly, if the shader declares that it’s enough for it via the execution mode.
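As a sketch, the GFX9–10.3 granularity setup described above comes down to the following (register and field names are from the text; the condition variable is illustrative):

```
DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER = 1;
if (sample_interlock_is_sufficient) {
   // Sample interlock: match the exposed coverage sample count.
   DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES =
      PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES;
} else {
   // Pixel interlock: 0 means 1 sample.
   DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES = 0;
}
```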
On GFX11, when POPS is enabled, DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE is used in place of DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES from the earlier architecture generations (and has a different bit offset in the register), and DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE_ENABLE must be set to 1. The GFX11 blending performance workaround overriding the intrinsic rate must not be applied if POPS is used in the draw; the intrinsic rate override must be used solely to control the interlock granularity in this case.
No explicit flushes/synchronization are needed when changing the pipeline state variables that may be involved in POPS, such as the rasterization sample count. POPS automatically keeps synchronizing invocations even between draws with different sample counts (invocations with common coverage mask bits are considered overlapping by the hardware, regardless of what those samples actually are — only the indices are important).
Also, on GFX11, POPS uses DB_Z_INFO.NUM_SAMPLES to determine the coverage sample count, and it must be equal to PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES even if there’s no depth/stencil target.
Hardware bug workarounds
Early revisions of GFX9 — CHIP_VEGA10 and CHIP_RAVEN — contain a hardware bug that may result in a hang, and need a workaround to be enabled. Specifically, if POPS is used with 8 or more rasterization samples, or with 8 or more depth/stencil target samples, DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP must be set to 1 for draws that satisfy this condition. In PAL, this is the waMiscPopsMissedOverlap workaround. It results in somewhat lower performance, increasing the frame time by around 1.5 to 2 times in nvpro-samples/vk_order_independent_transparency on the RX Vega 10, but it applies only in a pretty rare case (8x+ MSAA) and is mandatory to ensure stability.
Also, even though DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP is not required on chips other than the CHIP_VEGA10 and CHIP_RAVEN GFX9 revisions, if it’s enabled for some reason on GFX10.1 (CHIP_NAVI10, CHIP_NAVI12, CHIP_NAVI14), and the draw uses POPS, DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL must be set to PSLC_ON_HANG_ONLY to avoid a hang (see waStalledPopsMode in PAL).
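Both workarounds above can be summarized in pseudocode (the chip and draw state variables are illustrative):

```
// waMiscPopsMissedOverlap: early GFX9 revisions hang with 8x+ MSAA POPS.
if ((chip == CHIP_VEGA10 || chip == CHIP_RAVEN) &&
    (rasterization_samples >= 8 || depth_stencil_samples >= 8)) {
   DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP = 1;
}
// waStalledPopsMode: GFX10.1 with drain-on-overlap enabled for any reason.
if (gfx_level == GFX10_1 && DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP) {
   DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL = PSLC_ON_HANG_ONLY;
}
```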
Out-of-order rasterization interaction
This is a largely unresearched topic currently. However, considering that POPS is primarily the functionality of the Depth Block, similarity to the behavior of out-of-order rasterization in depth/stencil testing may possibly be expected.
If the shader specifies an ordered interlock execution mode, out-of-order rasterization likely must not be enabled implicitly.
As of April 2023, PAL doesn’t have any rules specifically for POPS in the logic determining whether out-of-order rasterization can be enabled automatically. Some of the POPS usage cases may possibly be covered by the rule that always disables out-of-order rasterization if the shader writes to Unordered Access Views (storage resources), though fragment shader interlock can be used for read-only purposes too (for ordering between draws that only read per-pixel data and draws that may write it), so that may be an oversight.
Explicitly enabled relaxed rasterization order modifies the concept of rasterization order itself in Vulkan, so from the point of view of the specification of fragment shader interlock, relaxed rasterization order should still be applicable regardless of whether the shader requests ordered interlock. PAL also doesn’t make any POPS-specific exceptions here as of April 2023.
Variable-rate shading interaction
On GFX10.3, enabling DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER forces the shading rate to be 1x1, thus the fragmentShadingRateWithFragmentShaderInterlock Vulkan device property must be false.
On GFX11, by default, POPS itself can work with non-1x1 shading rates, and the fragmentShadingRateWithFragmentShaderInterlock property must be true. However, if PA_SC_VRS_SURFACE_CNTL_1.FORCE_SC_VRS_RATE_FINE_POPS is set, enabling POPS will force 1x1 shading rate.
The widest interlock granularity available on GFX11 — with the lowest possible Depth Block intrinsic rate, 1x — is per-fine-pixel, however. There’s no synchronization between coarse fragment shader invocations if they don’t cover common fine pixels, so the fragmentShaderShadingRateInterlock Vulkan device feature is not available.
Additional configuration
These are some largely unresearched options found in the register declarations. PAL doesn’t use them, so it’s unknown if they make any significant difference. No effect was found in nvpro-samples/vk_order_independent_transparency during testing on GFX9 CHIP_RAVEN and GFX11 CHIP_NAVI31.
DB_SHADER_CONTROL.EXEC_IF_OVERLAPPED on GFX9–10.3.
PA_SC_BINNER_CNTL_0.BIN_MAPPING_MODE = BIN_MAP_MODE_POPS on GFX10+.