Primitive Ordered Pixel Shading

Primitive Ordered Pixel Shading (POPS) is a feature available starting from GFX9 that provides the Fragment Shader Interlock or Fragment Shader Ordering functionality.

It allows a part of a fragment shader — an ordered section (or a critical section) — to be executed sequentially in rasterization order for different invocations covering the same pixel position.

This article describes how POPS is set up in shader code and the registers. The information here is currently provided for architecture generations up to GFX11.

Note that the information in this article is not official and may contain inaccuracies, as well as incomplete or incorrect assumptions. It is based on the shader code output of the Radeon GPU Analyzer for Rasterizer Ordered View usage in Direct3D shaders, AMD’s Platform Abstraction Library (PAL), ISA references, and experimentation with the hardware.

Shader code

With POPS, a wave can dynamically execute at most one ordered section. A wave that doesn’t need ordering on its execution path may skip the ordered section entirely.

The setup of the ordered section consists of three parts:

  1. Entering the ordered section in the current wave — awaiting the completion of ordered sections in overlapped waves.

  2. Resolving overlap within the current wave — intrawave collisions (optional and GFX9–10.3 only).

  3. Exiting the ordered section — resuming overlapping waves trying to enter their ordered sections.

GFX9–10.3: Entering the ordered section in the wave

Awaiting the completion of ordered sections in overlapped waves is performed by setting the POPS packer hardware register, and then polling the volatile pops_exiting_wave_id ALU operand source until its value exceeds the newest overlapped wave ID for the current wave.

The information needed for the wave to perform the waiting is provided to it via the SGPR argument COLLISION_WAVEID. Its loading needs to be enabled in both the SPI_SHADER_PGM_RSRC2_PS and PA_SC_SHADER_CONTROL registers. Unlike various other arguments, which are enabled in RSRC alone, the POPS arguments must be enabled in PA_SC_SHADER_CONTROL as well.

The collision wave ID argument contains the following unsigned values:

  • [31]: Whether overlap has occurred.

  • [29:28] (GFX10+) / [28] (GFX9): ID of the packer the wave should be associated with.

  • [25:16]: Newest overlapped wave ID.

  • [9:0]: Current wave ID.
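
As an illustration, the field layout listed above could be unpacked as in the following C sketch. The struct and function names are hypothetical (they are not taken from PAL or any driver), and the is_gfx10_plus flag is an assumed way of selecting between the 2-bit and the 1-bit packer ID field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded fields; names are illustrative. */
struct pops_collision_wave_id {
   bool did_overlap;           /* [31] */
   uint32_t packer_id;         /* [29:28] on GFX10+, [28] on GFX9 */
   uint32_t newest_overlapped; /* [25:16], 10-bit wave ID */
   uint32_t current;           /* [9:0], 10-bit wave ID */
};

/* Extracts the fields of the COLLISION_WAVEID SGPR argument. */
static struct pops_collision_wave_id
decode_collision_wave_id(uint32_t sgpr, bool is_gfx10_plus)
{
   struct pops_collision_wave_id f;
   f.did_overlap = (sgpr >> 31) & 0x1;
   f.packer_id = (sgpr >> 28) & (is_gfx10_plus ? 0x3 : 0x1);
   f.newest_overlapped = (sgpr >> 16) & 0x3FF;
   f.current = sgpr & 0x3FF;
   return f;
}
```

For instance, the value 0xA3FC0005 decodes on GFX10+ to "did overlap" set, packer ID 2, newest overlapped wave ID 1020 and current wave ID 5.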

The 2020 RDNA and RDNA 2 ISA references contain incorrect offsets and widths of the fields, possibly from an early development iteration, but the meanings of them are accurate there.

The wait must not be performed if the “did overlap” bit 31 is 0; doing so results in a hang. A value of 0 also indicates that there is neither wave overlap nor any intrawave collision for the current wave, so, as a potential additional optimization, the wave can safely skip all of the POPS logic and execute the contents of the ordered section as usual, with unordered access. The packer hardware register, however, may safely be set even without overlap; it’s only the wait loop that must not be executed when no overlap was reported.

The packer ID needs to be passed to the packer hardware register using s_setreg_b32 so the wave can poll pops_exiting_wave_id on that packer.

On GFX9, the MODE (1) hardware register has two bits specifying which packer the wave is associated with:

  • [25]: The wave is associated with packer 1.

  • [24]: The wave is associated with packer 0.

Initially, both of these bits are set to 0, meaning that POPS is disabled for the wave. If the wave needs to enter the ordered section, it must set bit 24 to 1 if the packer ID in COLLISION_WAVEID is 0, or set bit 25 to 1 if the packer ID is 1.

Starting from GFX10, the POPS_PACKER (25) hardware register is used instead, containing the following fields:

  • [2:1]: Packer ID.

  • [0]: POPS enabled for the wave.

Initially, POPS is disabled for a wave. To start entering the ordered section, bits 2:1 must be set to the packer ID from COLLISION_WAVEID, and bit 0 needs to be set to 1.
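
The two register encodings described above can be summarized in a small C sketch. The function names are hypothetical; each function only computes the value that would be written with s_setreg_b32 to the relevant bit range and does not touch any hardware.

```c
#include <assert.h>
#include <stdint.h>

/* GFX9: value for bits [25:24] of the MODE hardware register.
 * Bit 24 selects packer 0, bit 25 selects packer 1. */
static uint32_t gfx9_mode_pops_bits(uint32_t packer_id)
{
   assert(packer_id < 2); /* GFX9 has a 1-bit packer ID */
   return packer_id ? 0x2u : 0x1u;
}

/* GFX10 and later: value for bits [2:0] of the POPS_PACKER hardware
 * register: the enable bit 0 plus the packer ID in bits [2:1]. */
static uint32_t gfx10_pops_packer_bits(uint32_t packer_id)
{
   assert(packer_id < 4); /* 2-bit packer ID */
   return 0x1u | (packer_id << 1);
}
```

With packer ID 1, the GFX9 helper yields 0b10 and the GFX10 helper yields 0b011.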

The wave IDs, both in COLLISION_WAVEID and pops_exiting_wave_id, are 10-bit values wrapping around on overflow — consecutive waves are numbered 1022, 1023, 0, 1… This wraparound needs to be taken into account when comparing the exiting wave ID and the newest overlapped wave ID.

Specifically, until the current wave exits the ordered section, its ID can’t be smaller than the newest overlapped wave ID or the exiting wave ID. So current_wave_id + 1 can be subtracted from 10-bit wave IDs to remap them to monotonically increasing unsigned values. In this case, the largest value, 0xFFFFFFFF, will correspond to the current wave, 10-bit values up to the current wave ID will be in a range near 0xFFFFFFFF growing towards it, and wave IDs from before the last wraparound will be near 0 increasing away from it. Subtracting current_wave_id + 1 is equivalent to adding ~current_wave_id.
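
The remapping arithmetic can be checked with a short C sketch. The function name is hypothetical, and the sketch assumes 32-bit unsigned arithmetic with the usual modulo 2^32 wraparound.

```c
#include <stdint.h>

/* Remaps a wrapping 10-bit wave ID relative to the 10-bit ID of the
 * current wave. Adding ~current is the same as subtracting (current + 1)
 * modulo 2^32. */
static uint32_t remap_wave_id(uint32_t wave_id_10bit, uint32_t current_10bit)
{
   return wave_id_10bit + ~current_10bit;
}
```

With a current wave ID of 5, the current wave itself maps to 0xFFFFFFFF, wave 3 maps to 0xFFFFFFFD, and waves 1020 and 1023 from before the last wraparound map to 1014 and 1017, so the remapped values grow in the order in which the waves were launched.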

GFX9 has an off-by-one error in the newest overlapped wave ID: if the 10-bit newest overlapped wave ID is greater than the 10-bit current wave ID (meaning that it’s behind the last wraparound point), 1 needs to be added to the newest overlapped wave ID before using it in the comparison. This was corrected in GFX10.

The exiting wave ID (not to be confused with “exited”: the exiting wave ID is the ID of the wave that will exit the ordered section next) is queried via the pops_exiting_wave_id ALU operand source, numbered 239. Normally, it will be one of the arguments of the s_add_i32 that remaps it from a wrapping 10-bit wave ID to a monotonically increasing one.

It’s a volatile operand, and it needs to be read in a loop until its value becomes greater than the newest overlapped wave ID (after remapping both to monotonic values). While it’s still too early for the current wave to enter the ordered section, the wave needs to yield execution to other waves that may be overlapped, via s_sleep. GFX9 requires a finite delay to be specified; AMD uses 3. Starting from GFX10, exiting the ordered section wakes up the waiting waves, so the maximum delay of 0xFFFF can be used.

In pseudocode, the entering logic would look like this:

// Bit 31 reports whether any overlap (between waves or within the wave)
// has occurred. Without overlap, the wait loop must not be executed.
bool did_overlap = collision_wave_id[31];
if (did_overlap) {
   // Associate the wave with its packer, which enables POPS for the wave.
   if (gfx_level >= GFX10) {
      uint packer_id = collision_wave_id[29:28];
      s_setreg_b32(HW_REG_POPS_PACKER[2:0], 1 | (packer_id << 1));
   } else {
      uint packer_id = collision_wave_id[28];
      s_setreg_b32(HW_REG_MODE[25:24], packer_id ? 0b10 : 0b01);
   }

   uint current_10bit_wave_id = collision_wave_id[9:0];
   // Or -(current_10bit_wave_id + 1).
   uint wave_id_remap_offset = ~current_10bit_wave_id;

   uint newest_overlapped_10bit_wave_id = collision_wave_id[25:16];
   // GFX9 off-by-one correction for IDs from before the last wraparound.
   if (gfx_level < GFX10 &&
       newest_overlapped_10bit_wave_id > current_10bit_wave_id) {
      ++newest_overlapped_10bit_wave_id;
   }
   uint newest_overlapped_wave_id =
      newest_overlapped_10bit_wave_id + wave_id_remap_offset;

   // Poll the volatile exiting wave ID, yielding to other waves until it
   // exceeds the newest overlapped wave ID.
   while (!(src_pops_exiting_wave_id + wave_id_remap_offset >
            newest_overlapped_wave_id)) {
      s_sleep(gfx_level >= GFX10 ? 0xFFFF : 3);
   }
}
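
To make the loop condition easier to experiment with on a CPU, the comparison can be modeled as a pure C function. This is only a sketch: the function name is hypothetical, the exiting_10bit parameter stands in for a single read of pops_exiting_wave_id, and the register setup and s_sleep are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the wave would stop waiting (or must not wait at all).
 * exiting_10bit models one read of the pops_exiting_wave_id operand. */
static bool pops_may_enter(uint32_t collision_wave_id, uint32_t exiting_10bit,
                           bool is_gfx9)
{
   /* No overlap reported: the wait loop must be skipped entirely. */
   if (!((collision_wave_id >> 31) & 1u))
      return true;

   uint32_t current = collision_wave_id & 0x3FF;
   uint32_t newest = (collision_wave_id >> 16) & 0x3FF;
   uint32_t offset = ~current; /* same as -(current + 1) */

   /* GFX9 off-by-one correction for IDs behind the wraparound point. */
   if (is_gfx9 && newest > current)
      ++newest;

   return exiting_10bit + offset > newest + offset;
}
```

For example, with a current wave ID of 2 and a newest overlapped wave ID of 1022, a GFX10 wave keeps waiting while the exiting wave ID is 1022 and proceeds once it reads 1023, whereas a GFX9 wave with the same argument proceeds only once it reads 0.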

The SPIR-V fragment shader interlock specification requires an invocation — an individual invocation, not the whole subgroup — to execute OpBeginInvocationInterlockEXT exactly once. However, if there are multiple begin instructions, or even multiple begin/end pairs, under divergent conditions, a wave may end up waiting for the overlapped waves multiple times. Thankfully, it’s safe to set the POPS packer hardware register to the same value, or to run the wait loop, multiple times during the wave’s execution, as long as the ordered section isn’t exited in between by the wave.

GFX11: Entering the ordered section in the wave

Instead of exposing wave IDs to shaders, GFX11 uses the “export ready” wave status flag to report that the wave may enter the ordered section. The flag is awaited with the s_wait_event instruction, with bit 0 (“don’t wait for export_ready”) of the immediate operand set to 0. On GFX11 specifically, AMD passes 0 as the whole immediate operand.

The “export ready” wait can be done multiple times safely.

GFX9–10.3: Resolving intrawave collisions

On GFX9–10.3, it’s possible for overlapping fragment shader invocations to be placed not only in different waves, but also in the same wave, with the shader code making sure that the ordered section is executed for overlapping invocations in order.

This functionality is optional — it can be activated by enabling loading of the INTRAWAVE_COLLISION SGPR argument in SPI_SHADER_PGM_RSRC2_PS and PA_SC_SHADER_CONTROL.

The lower 8 or 16 bits of INTRAWAVE_COLLISION (depending on the wave size) form a per-quad mask. A set bit means that the quad starts a new layer of overlapping invocations, so the ordered section code for it needs to be executed only after it has run for all lanes with indices below that quad index multiplied by 4. The rest of the bits in the argument need to be ignored. AMD explicitly masks them out in shader code, although this is not necessary if the shader uses “find first 1” to obtain the start of the next set of overlapping quads, or expands the quad mask into a lane mask.

For example, if the intrawave collision mask is 0b0000001110000100, or (1 << 2) | (1 << 7) | (1 << 8) | (1 << 9), the code of the ordered section needs to be executed first only for quads 1:0 (lanes 7:0), then only for quads 6:2 (lanes 27:8), then for quad 7 (lanes 31:28), then for quad 8 (lanes 35:32), and then for the remaining quads 15:9 (lanes 63:36).
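
The grouping in this example can be reproduced with a C sketch that expands the quad mask into one lane mask per sequential step. The function name is hypothetical, num_quads is assumed to be 8 or 16, and the sketch ignores which lanes are actually active in the wave.

```c
#include <stdint.h>

/* Returns the lane mask of the group_index-th set of lanes that must run
 * the ordered section together, or 0 if there are fewer groups.
 * num_quads is assumed to be 8 (wave32) or 16 (wave64). */
static uint64_t intrawave_group_lanes(uint32_t collision_mask,
                                      unsigned num_quads,
                                      unsigned group_index)
{
   /* Bits above the quad count carry no meaning and are dropped. */
   collision_mask &= (1u << num_quads) - 1u;

   unsigned current_group = 0;
   uint64_t lanes = 0;
   for (unsigned quad = 0; quad < num_quads; ++quad) {
      /* A set bit starts a new layer, except at quad 0, which has no
       * preceding lanes to wait for. */
      if (quad != 0 && ((collision_mask >> quad) & 1u))
         ++current_group;
      if (current_group == group_index)
         lanes |= (uint64_t)0xF << (quad * 4);
   }
   return lanes;
}
```

For the mask 0b0000001110000100 in a 64-lane wave, groups 0 to 4 come out as 0xFF, 0x0FFFFF00, 0xF0000000, 0xF00000000 and 0xFFFFFFF000000000, matching the lane ranges listed above. A shader would typically run the ordered section in a loop, enabling only one such group per iteration.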

This effectively causes the ordered section to be executed as smaller “sub-subgroups” within the original subgroup.

However, this is not always compatible with the execution model of SPIR-V or GLSL fragment shaders, so enabling intrawave collisions and wrapping a part of the shader in a loop may be unsafe in some cases. One particular example is when the shader uses subgroup operations influenced by lanes outside the current quad. In this case, the code outside and inside the ordered section may be executed with different sets of active invocations, affecting the results of subgroup operations. But in SPIR-V and GLSL, fragment shader interlock is not supposed to modify the set of active invocations in any way. So the intrawave collision loop may break the results of subgroup operations in unpredictable ways, even outside the driver’s compiler infrastructure.

Even if the driver splits the subgroup exactly at OpBeginInvocationInterlockEXT and makes the lane subsets rejoin exactly at OpEndInvocationInterlockEXT, the application and the compilers that created the source shader are still not aware of that happening. The input SPIR-V or GLSL shader might have already gone through various optimizations, such as common subexpression elimination, which might have considered a subgroup operation before OpBeginInvocationInterlockEXT and one after it equivalent.

The idea behind reporting intrawave collisions to shaders is to reduce the impact on the parallelism of the part of the shader that doesn’t depend on the ordering, to avoid wasting lanes in the wave and to allow the code outside the ordered section in different invocations to run in parallel lanes as usual. This may be especially helpful if the ordered section is small compared to the rest of the shader, for instance, a custom blending equation at the end of the usual fragment shader for a surface in the world.

However, whether handling intrawave collisions is preferred is not a question with one universal answer. Intrawave collisions are uncommon without multisampling, and also when using sample interlock with multisampling. They are highly frequent, however, with pixel interlock and multisampling, where adjacent primitives cover the same pixels along their shared edge (though that’s an extremely expensive situation in general). Resolving intrawave collisions also adds overhead to the shader. If intrawave overlap is unlikely to happen often, or, even more importantly, if the majority of the shader is inside the ordered section, handling it in the shader may cause more harm than good.

GFX11 removes this concept entirely; instead, overlapping invocations are always placed in different waves.

GFX9–10.3: Exiting the ordered section in the wave

To exit the ordered section and let overlapping waves resume execution and enter their ordered sections, the wave needs to send the ORDERED_PS_DONE message (7) using s_sendmsg.

If the wave has enabled POPS by setting the packer hardware register, it must not execute s_endpgm without having sent ORDERED_PS_DONE once, so the message must be sent on all execution paths after the packer register setup. However, if the wave exits before having configured the packer register, sending the message is not required, though it’s still fine to send it regardless of that.

Note that if the shader has multiple OpEndInvocationInterlockEXT instructions executed in the same wave (depending on a divergent condition, for example), it must still be ensured that ORDERED_PS_DONE is sent by the wave only once, and especially not before any awaiting of overlapped waves.

Before the message is sent, all counters for memory accesses that need to be primitive-ordered must be awaited. This covers writes and also reads, in case something after the ordered section depends on the per-pixel data (for instance, the tail blending fallback in order-independent transparency). The counters may include vm, vs, and in some cases lgkm. Normally, primitive-ordered memory accesses are done through VMEM with divergent addresses rather than SMEM, as there’s no synchronization between fragments at different pixel coordinates. It’s still technically possible, though pointless and nonoptimal, for a shader to perform them explicitly through SMEM, in a waterfall loop for instance, and that must work correctly too. Without the wait, a race condition will occur when the newly resumed waves start accessing memory locations to which the current wave still has outstanding accesses.

Another option for exiting is the s_endpgm_ordered_ps_done instruction, which combines waiting for all the counters, sending the ORDERED_PS_DONE message, and ending the program. Generally, however, it’s desirable to resume overlapping waves as early as possible, including before the export, as the export itself may stall the wave for some time too.

GFX11: Exiting the ordered section in the wave

The overlapping waves are resumed when the wave performs the last export (with the done flag).

The same requirements for awaiting the memory access counters as on GFX9–10.3 still apply.

Memory access requirements

The compiler needs to ensure that entering the ordered section implements acquire semantics, and exiting it implements release semantics, in the fragment interlock memory scope for UniformMemory and ImageMemory SPIR-V storage classes.

A fragment interlock memory scope instance includes overlapping fragment shader invocations executed by commands inside a single subpass. It may be considered a subset of a queue family memory scope instance from the perspective of memory barriers.

Fragment shader interlock doesn’t perform implicit memory availability or visibility operations. Shaders must do them by themselves for accesses requiring primitive ordering, such as via coherent (queuefamilycoherent) in GLSL or MakeAvailable and MakeVisible in at least the QueueFamily scope in SPIR-V.

On AMD hardware, this means that the accessed memory locations must be made available or visible between waves that may be executed on any compute unit — so accesses must go directly to the global L2 cache, bypassing L0$ via the GLC flag and L1$ via DLC.

However, memory accesses in the ordered section may be expected by the application to be done in primitive order even if they don’t have the GLC and DLC flags. Coherent access not only bypasses the lower-level caches, but also invalidates them for the accessed memory locations. Since per-pixel data is normally accessed exclusively by the invocation executing the ordered section, not every read or write of a memory location in the ordered section has to be GLC/DLC; the first read and the last write are enough. It doesn’t matter if per-pixel data is cached in L0/L1 in the middle of a dependency chain in the ordered section, as long as it’s invalidated in those caches at the beginning and flushed to L2 at the end. Therefore, compiler optimizations must not simply assume that only coherent accesses need primitive ordering. The compiler must also take into account that the same data may be accessed through different bindings.

Export requirements

With POPS, on all hardware generations, the shader must have at least one export, though it can be a null or an off, off, off, off one.

Also, even if the shader doesn’t need to export any real data, the export skipping that was added in GFX10 must not be used, and some space must be allocated in the export buffer, such as by setting SPI_SHADER_COL_FORMAT for some color output to SPI_SHADER_32_R.

Without this, the shader will be executed without the needed synchronization on GFX10, and will hang on GFX11.

Drawing context setup

Configuring POPS

Most of the configuration is performed via the DB_SHADER_CONTROL register.

To enable POPS for the draw, DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER should be set to 1.

On GFX9–10.3, DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES controls which fragment shader invocations are considered overlapping:

  • For pixel interlock, it must be set to 0 (1 sample).

  • If sample interlock is sufficient (only synchronizing between invocations that have any common sample mask bits), it may be set to PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES, the number of sample coverage mask bits passed to the shader. The shader is expected to use the sample mask to determine whether it’s allowed to access the data for each of the samples.

As of April 2023, PAL for some reason doesn’t use non-1x POPS_OVERLAP_NUM_SAMPLES at all, even when using Direct3D Rasterizer Ordered Views or GL_INTEL_fragment_shader_ordering with sample shading. Those APIs tie the interlock granularity to the shading frequency. Vulkan and OpenGL fragment shader interlock, however, allows specifying the interlock granularity independently of it, making it possible both to ask for finer synchronization guarantees and to require stronger ones than Direct3D ROVs can provide.

With MSAA on AMD hardware, pixel interlock is generally far slower than sample interlock, sometimes prohibitively so, because it causes fragment shader invocations along the common edge of adjacent primitives to be ordered as they cover the same pixels (even though they don’t cover any common samples). So it’s highly desirable for the driver to provide sample interlock, and to set POPS_OVERLAP_NUM_SAMPLES accordingly, if the shader declares that it’s enough for it via the execution mode.
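
The choice of the field value on GFX9–10.3 could be sketched in C as follows. The function names are hypothetical, and the sketch assumes that the field uses a log2 encoding of the sample count, as suggested by 0 meaning 1 sample; that encoding should be verified against the register headers before relying on it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Integer log2 of a power-of-two sample count (1, 2, 4, 8, 16). */
static uint32_t log2_sample_count(uint32_t samples)
{
   uint32_t log2 = 0;
   while (samples > 1) {
      samples >>= 1;
      ++log2;
   }
   return log2;
}

/* Value for DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES, ASSUMING a log2
 * encoding shared with PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES.
 * Pixel interlock always uses 0 (1 sample). */
static uint32_t pops_overlap_num_samples(bool sample_interlock_sufficient,
                                         uint32_t exposed_samples)
{
   if (!sample_interlock_sufficient)
      return 0;
   return log2_sample_count(exposed_samples);
}
```

Under that assumption, a shader requesting sample interlock in a 4x MSAA draw would get the value 2, while any shader requesting pixel interlock would get 0.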

On GFX11, when POPS is enabled, DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE is used in place of DB_SHADER_CONTROL.POPS_OVERLAP_NUM_SAMPLES from the earlier architecture generations (and has a different bit offset in the register), and DB_SHADER_CONTROL.OVERRIDE_INTRINSIC_RATE_ENABLE must be set to 1. The GFX11 blending performance workaround overriding the intrinsic rate must not be applied if POPS is used in the draw — the intrinsic rate override must be used solely to control the interlock granularity in this case.

No explicit flushes/synchronization are needed when changing the pipeline state variables that may be involved in POPS, such as the rasterization sample count. POPS automatically keeps synchronizing invocations even between draws with different sample counts (invocations with common coverage mask bits are considered overlapping by the hardware, regardless of what those samples actually are — only the indices are important).

Also, on GFX11, POPS uses DB_Z_INFO.NUM_SAMPLES to determine the coverage sample count, and it must be equal to PA_SC_AA_CONFIG.MSAA_EXPOSED_SAMPLES even if there’s no depth/stencil target.

Hardware bug workarounds

Early revisions of GFX9 (CHIP_VEGA10 and CHIP_RAVEN) contain a hardware bug that may result in a hang, and need a workaround to be enabled. Specifically, if POPS is used with 8 or more rasterization samples, or with 8 or more depth/stencil target samples, DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP must be set to 1 for draws that satisfy this condition. In PAL, this is the waMiscPopsMissedOverlap workaround. It lowers performance in those cases, increasing the frame time by around 1.5 to 2 times in nvpro-samples/vk_order_independent_transparency on the RX Vega 10, but it’s only needed in a fairly rare case (8x+ MSAA) and is mandatory to ensure stability.

Also, even though DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP is not required on chips other than the CHIP_VEGA10 and CHIP_RAVEN GFX9 revisions, if it’s enabled for some reason on GFX10.1 (CHIP_NAVI10, CHIP_NAVI12, CHIP_NAVI14), and the draw uses POPS, DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL must be set to PSLC_ON_HANG_ONLY to avoid a hang (see waStalledPopsMode in PAL).
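
The two workaround conditions can be written down as predicates, as in the following C sketch. The enum is a local stand-in that merely reuses the chip names from the text, and the function names are hypothetical; a real driver would use its own chip identification.

```c
#include <stdbool.h>

/* Local stand-ins for the chip identifiers named in the text. */
enum pops_chip {
   CHIP_VEGA10,
   CHIP_RAVEN,
   CHIP_NAVI10,
   CHIP_NAVI12,
   CHIP_NAVI14,
   CHIP_OTHER,
};

/* Whether DB_DFSM_CONTROL.POPS_DRAIN_PS_ON_OVERLAP must be 1 for a draw. */
static bool needs_pops_drain_ps_on_overlap(enum pops_chip chip, bool uses_pops,
                                           unsigned raster_samples,
                                           unsigned depth_stencil_samples)
{
   bool affected = chip == CHIP_VEGA10 || chip == CHIP_RAVEN;
   return affected && uses_pops &&
          (raster_samples >= 8 || depth_stencil_samples >= 8);
}

/* Whether DB_RENDER_OVERRIDE2.PARTIAL_SQUAD_LAUNCH_CONTROL must be set to
 * PSLC_ON_HANG_ONLY for a draw. */
static bool needs_pslc_on_hang_only(enum pops_chip chip, bool uses_pops,
                                    bool drain_ps_on_overlap_enabled)
{
   bool gfx10_1 = chip == CHIP_NAVI10 || chip == CHIP_NAVI12 ||
                  chip == CHIP_NAVI14;
   return gfx10_1 && uses_pops && drain_ps_on_overlap_enabled;
}
```

Both predicates are per-draw, so a driver following this sketch would re-evaluate them whenever the sample counts or the POPS usage of the bound pipeline change.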

Out-of-order rasterization interaction

This is currently a largely unresearched topic. However, considering that POPS is primarily the functionality of the Depth Block, similarity to the behavior of out-of-order rasterization in depth/stencil testing may be expected.

If the shader specifies an ordered interlock execution mode, out-of-order rasterization likely must not be enabled implicitly.

As of April 2023, PAL doesn’t have any rules specifically for POPS in the logic determining whether out-of-order rasterization can be enabled automatically. Some of the POPS usage cases may be covered by the rule that always disables out-of-order rasterization if the shader writes to Unordered Access Views (storage resources). Fragment shader interlock can be used for read-only purposes too, however (for ordering between draws that only read per-pixel data and draws that may write it), so that may be an oversight.

Explicitly enabled relaxed rasterization order modifies the concept of rasterization order itself in Vulkan, so from the point of view of the specification of fragment shader interlock, relaxed rasterization order should still be applicable regardless of whether the shader requests ordered interlock. PAL also doesn’t make any POPS-specific exceptions here as of April 2023.

Variable-rate shading interaction

On GFX10.3, enabling DB_SHADER_CONTROL.PRIMITIVE_ORDERED_PIXEL_SHADER forces the shading rate to be 1x1, thus the fragmentShadingRateWithFragmentShaderInterlock Vulkan device property must be false.

On GFX11, by default, POPS itself can work with non-1x1 shading rates, and the fragmentShadingRateWithFragmentShaderInterlock property must be true. However, if PA_SC_VRS_SURFACE_CNTL_1.FORCE_SC_VRS_RATE_FINE_POPS is set, enabling POPS will force 1x1 shading rate.

The widest interlock granularity available on GFX11 — with the lowest possible Depth Block intrinsic rate, 1x — is per-fine-pixel, however. There’s no synchronization between coarse fragment shader invocations if they don’t cover common fine pixels, so the fragmentShaderShadingRateInterlock Vulkan device feature is not available.

Additional configuration

These are some largely unresearched options found in the register declarations. PAL doesn’t use them, so it’s unknown if they make any significant difference. No effect was found in nvpro-samples/vk_order_independent_transparency during testing on GFX9 CHIP_RAVEN and GFX11 CHIP_NAVI31.

  • DB_SHADER_CONTROL.EXEC_IF_OVERLAPPED on GFX9–10.3.

  • PA_SC_BINNER_CNTL_0.BIN_MAPPING_MODE = BIN_MAP_MODE_POPS on GFX10+.