The WAVE Specification

Wide Architecture Virtual Encoding

Version: 0.2 (Working Draft, Revised)
Authors: Ojima Abraham, Onyinye Okoli
Date: March 23, 2026
Status: Working Draft (Revised)

0. Conformance Language

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

1. Introduction

1.1 Purpose

This specification defines WAVE (Wide Architecture Virtual Encoding), a vendor-neutral instruction set architecture for general-purpose GPU computation. It specifies an abstract execution model, a register model, a memory model, structured control flow semantics, an instruction set, and a capability query system.

The specification follows the thin abstraction principle: it defines what a compliant implementation MUST be able to do, not how it must do it. Implementations MAY use any microarchitectural technique to achieve compliance.

1.2 Scope

This specification covers general-purpose compute workloads. Graphics pipeline operations (rasterization, tessellation, pixel export, ray tracing) are out of scope and MAY be addressed by future extensions.

1.3 Design Principles

  1. Thin abstraction. Every requirement traces to a hardware-invariant primitive observed across all four major GPU vendors. No requirement is imposed for software convenience alone.
  2. Queryable parameters. Values that differ across implementations (wave width, register count, scratchpad size) are exposed as queryable constants, not fixed in the specification.
  3. Structured divergence. The specification defines control flow semantics but not divergence mechanisms. Implementations are free to use any technique (execution masks, predication, hardware stacks, per-thread program counters) to achieve the specified behavior.
  4. Mandatory minimums. Every queryable parameter has a minimum value. A compliant implementation MUST meet or exceed all minimums.

1.4 Relationship to Other Standards

This specification is complementary to existing standards. SPIR-V MAY be used as a distribution format for programs targeting this ISA. OpenCL and Vulkan MAY serve as host APIs for dispatching workloads. The distinction is that this specification defines the hardware execution model, while existing standards define host-device interaction.

1.5 Changes from v0.1

This version incorporates corrections and clarifications discovered during implementation of the reference toolchain (assembler, disassembler, and emulator). Key changes:

  1. Register encoding widened from 5-bit to 8-bit (Section 8.2). The v0.1 encoding used 5-bit register fields, limiting addressable registers to 32. This conflicted with Section 7.1, which specifies MAX_REGISTERS minimum of 64. The encoding is now 8-bit, supporting up to 256 registers per thread.
  2. Minimum divergence stack depth specified (Section 5.4, Section 7.1). v0.1 did not define a minimum nesting depth for structured control flow. Implementations MUST now support at least 32 levels of nested control flow.
  3. Predicate negation semantics clarified (Section 5.1). v0.1 did not specify behavior of negated predicates on break and continue instructions. This is now explicitly defined.
  4. Per-Wave control flow state requirement added (Section 5.5). v0.1 did not specify whether control flow state (divergence stack) is per-Wave or shared. Each Wave MUST have independent control flow state.
  5. Full opcode table provided (Appendix A). v0.1 deferred this to a future version.
  6. Conformance test suite referenced (Section 9.4). The reference test suite (102 tests) is now a companion artifact.

2. Execution Model

2.1 Overview

A compliant processor consists of one or more Cores. Each Core is an independent compute unit capable of executing multiple Workgroups concurrently. Cores are not addressable by software. The hardware assigns Workgroups to Cores, and the programmer MUST NOT assume any particular mapping.

2.2 Thread Hierarchy

The execution model defines five levels, four mandatory and one optional:

Level 0: Thread. The smallest unit of execution. A Thread has a private register file, a scalar program counter, and a position within the hierarchy identified by hardware-populated identity values. A Thread executes a sequential stream of instructions.

Level 1: Wave. A group of exactly W Threads that execute a single instruction simultaneously, where W is a hardware constant queryable at compile time (see Section 7). All Threads in a Wave share a program counter for the purpose of instruction fetch. When Threads in a Wave disagree on a branch condition, the implementation MUST ensure that both paths execute with inactive Threads producing no architectural side effects. The mechanism by which this is achieved is not specified.

A Wave is the fundamental scheduling unit. The hardware scheduler operates on Waves, not individual Threads. Each Wave MUST maintain independent control flow state (see Section 5.5).

Level 2: Workgroup. A group of one or more Waves, containing up to MAX_WORKGROUP_SIZE Threads. All Waves in a Workgroup execute on the same Core, share access to Local Memory, and may synchronize via Barriers.

The number of Waves per Workgroup is ceil(workgroup_thread_count / W).

Workgroup dimensions are specified at dispatch time as a 3-dimensional size (x, y, z) where x * y * z <= MAX_WORKGROUP_SIZE.
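
The arithmetic above can be sketched as follows (illustrative Python, not part of the specification; the function names are invented for this example):

```python
import math

def dispatch_valid(x, y, z, max_workgroup_size=256):
    # a dispatch whose thread count exceeds MAX_WORKGROUP_SIZE must be
    # rejected (Section 2.6); 256 is the minimum-compliant value
    return x * y * z <= max_workgroup_size

def waves_per_workgroup(x, y, z, wave_width):
    # ceil(workgroup_thread_count / W) from Section 2.2
    return math.ceil(x * y * z / wave_width)
```

For example, a 16x4x1 Workgroup on a W = 8 implementation occupies 8 Waves.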

Level 3: Grid. The complete dispatch of Workgroups. A Grid is specified as a 3-dimensional count of Workgroups. Workgroups within a Grid MAY execute in any order, on any Core, at any time. No synchronization is available between Workgroups within a single Grid dispatch.

Level 2.5 (Optional): Cluster. A group of Workgroups guaranteed to execute concurrently on adjacent Cores with access to each other's Local Memory. The Cluster size is queryable via CLUSTER_SIZE. If the implementation does not support Clusters, CLUSTER_SIZE is 1 and Cluster-scope operations behave identically to Workgroup-scope operations.

2.3 Thread Identifiers

Every Thread has the following hardware-populated, read-only values available as special registers:

Identifier Type Description
thread_id.{x,y,z} uint32 Thread position within Workgroup (3D)
wave_id uint32 Wave index within Workgroup
lane_id uint32 Thread position within Wave (0 to W-1)
workgroup_id.{x,y,z} uint32 Workgroup position within Grid (3D)
workgroup_size.{x,y,z} uint32 Workgroup dimensions
grid_size.{x,y,z} uint32 Grid dimensions (in Workgroups)
num_waves uint32 Number of Waves in this Workgroup
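
A common use of these identifiers is computing a Thread's global position within the Grid. A minimal sketch (Python, illustrative only; this flattening convention is conventional GPU practice, not mandated by this specification):

```python
def global_thread_id(workgroup_id, workgroup_size, thread_id):
    # global position along one dimension; apply independently per x, y, z
    return workgroup_id * workgroup_size + thread_id
```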

2.4 Core Resources

Each Core provides the following resources:

Register File. A fixed-size on-chip storage of F bytes, partitioned among all simultaneously resident Threads. Each Thread receives R registers (declared at compile time). The maximum number of simultaneously resident Waves is bounded by:

max_resident_waves = floor(F / (R * W * 4))

where 4 is the register width in bytes (32 bits). This is the occupancy equation. Implementations MUST support at least 64 registers per Thread, the minimum value of MAX_REGISTERS (see Section 7.1).
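
The occupancy equation can be made concrete with a small sketch (Python, illustrative):

```python
def max_resident_waves(register_file_bytes, regs_per_thread, wave_width,
                       reg_width_bytes=4):
    # floor(F / (R * W * 4)); Python integer division implements the floor
    return register_file_bytes // (regs_per_thread * wave_width * reg_width_bytes)
```

At the minimum-compliant REGISTER_FILE_SIZE of 16384 bytes with W = 8, a kernel declaring R = 32 allows 16 resident Waves; doubling R to 64 halves that to 8, which is why compilers SHOULD minimize R (Section 3.6).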

Local Memory. A fixed-size on-chip scratchpad of S bytes, shared among all Waves in the same Workgroup. Local Memory is explicitly addressed via load and store instructions. There is no automatic caching or data placement. Local Memory contents are undefined at Workgroup start and are not preserved across Workgroup boundaries. Implementations MUST provide at least 16384 bytes, the minimum value of LOCAL_MEMORY_SIZE (see Section 7.1).

Hardware Scheduler. Selects a ready Wave for execution each cycle. When a Wave stalls on a memory access, barrier, or other long-latency operation, the scheduler MUST be able to select another resident Wave without software-visible overhead. The scheduling policy is implementation-defined.

2.5 Execution Guarantees

  1. All Threads in a Wave execute the same instruction in the same cycle, or appear to from the programmer's perspective.
  2. Within a Wave, instruction execution is in program order.
  3. Between Waves in the same Workgroup, no execution order is guaranteed unless explicitly synchronized via Barriers or memory ordering operations.
  4. Between Workgroups, no execution order is guaranteed under any circumstances.
  5. A Workgroup that uses Local Memory or Barriers MUST have all its Waves resident on a single Core simultaneously.
  6. The implementation MUST guarantee forward progress for at least one Wave per Core at all times (no deadlock from scheduling).
  7. A Wave that has been dispatched MUST eventually complete, assuming the program terminates (no starvation).
  8. Execution MUST be deterministic: given identical inputs, program binary, and dispatch configuration, the same program MUST produce identical results across runs on the same implementation. (Non-determinism across different implementations is permitted for implementation-defined behaviors.)

2.6 Dispatch

A dispatch operation launches a Grid of Workgroups. The dispatch specifies:

  • The kernel (program entry point)
  • Grid dimensions
  • Workgroup dimensions
  • Register count per Thread (R)
  • Local Memory size required (in bytes)
  • Kernel arguments (buffer addresses, constants)

The implementation MUST reject a dispatch if the requested resources exceed Core capacity.

3. Register Model

3.1 General-Purpose Registers

Each Thread has access to R general-purpose registers, where R is declared at compile time and MUST NOT exceed MAX_REGISTERS. Registers are 32 bits wide. They are untyped at the hardware level; the instruction determines how the register contents are interpreted (integer, float, bitfield). Registers are named r0 through r{R-1}.

3.2 Sub-Register Access

Each 32-bit register MAY be accessed as two 16-bit halves: r{N}.lo (bits [15:0]) and r{N}.hi (bits [31:16]). This enables efficient F16 and BF16 operations without consuming additional registers.

3.3 Register Pairs

Two consecutive registers MAY be used as a 64-bit value: r{N}:r{N+1} with r{N} as the low 32 bits. This is used for F64 operations (where supported) and 64-bit integer operations.
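
The sub-register and register-pair views of Sections 3.2 and 3.3 amount to simple bit selections. A sketch of the semantics (Python, illustrative):

```python
MASK32 = 0xFFFFFFFF

def reg_lo(v):
    # r{N}.lo, bits [15:0]
    return v & 0xFFFF

def reg_hi(v):
    # r{N}.hi, bits [31:16]
    return (v >> 16) & 0xFFFF

def pair_read(r_n, r_n1):
    # r{N}:r{N+1}, with r{N} as the low 32 bits
    return (r_n & MASK32) | ((r_n1 & MASK32) << 32)

def pair_write(value64):
    # returns the (r{N}, r{N+1}) contents for a 64-bit result
    return value64 & MASK32, (value64 >> 32) & MASK32
```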

3.4 Special Registers

Hardware-populated read-only registers for thread identity:

  • sr_thread_id_x, sr_thread_id_y, sr_thread_id_z
  • sr_wave_id, sr_lane_id
  • sr_workgroup_id_x, sr_workgroup_id_y, sr_workgroup_id_z
  • sr_workgroup_size_x, sr_workgroup_size_y, sr_workgroup_size_z
  • sr_grid_size_x, sr_grid_size_y, sr_grid_size_z
  • sr_wave_width, sr_num_waves

3.5 Predicate Registers

The implementation MUST provide at least 4 predicate registers (p0 through p3), each 1 bit wide per Thread. Predicates are set by comparison instructions and consumed by conditional branch instructions.

3.6 Allocation and Occupancy

The register count R is declared per-kernel at compile time. The implementation allocates R registers per Thread for all Threads in all resident Waves. The occupancy equation determines how many Waves can be resident simultaneously. Compilers SHOULD minimize R to maximize occupancy.

4. Memory Model

4.1 Memory Spaces

Three mandatory memory spaces:

  • Register Memory: Per-Thread, on-chip, single-cycle, not addressable
  • Local Memory: Per-Workgroup, on-chip, explicitly addressed, size S
  • Device Memory: Global, off-chip or unified, cached by hardware-managed caches, persistent across dispatches

4.2 Local Memory Details

Local Memory is organized as a flat byte-addressable array of S bytes. The base address is 0. Addresses outside [0, S) produce undefined behavior.

Local Memory is banked. When multiple Threads in a Wave access the same bank in the same cycle, a bank conflict MAY occur. Accesses to the same address within a bank (broadcast) MUST NOT cause a conflict.
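
The banking rule can be modeled as follows (Python sketch; the bank count and bank width are implementation-defined, and the values below are assumptions for illustration only):

```python
def bank_conflict_degree(addresses, num_banks=32, bank_width_bytes=4):
    # worst-case serialization factor for one Wave-wide access:
    # distinct addresses mapping to the same bank serialize, while
    # threads reading the exact same address broadcast without conflict
    per_bank = {}
    for addr in addresses:
        bank = (addr // bank_width_bytes) % num_banks
        per_bank.setdefault(bank, set()).add(addr)
    return max(len(addrs) for addrs in per_bank.values())
```

Under these assumed parameters, stride-1 32-bit accesses are conflict-free (degree 1), while a 128-byte stride lands every Thread in the same bank.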

Local Memory supports access widths of 8, 16, 32, and 64 bits.

4.3 Device Memory Details

Device Memory is byte-addressable with a 64-bit virtual address space. The implementation MUST support aligned loads and stores of 8, 16, 32, 64, and 128 bits. Coalescing of contiguous accesses is implementation-defined. Cache hierarchy is transparent to the ISA.

4.4 Memory Ordering

Default ordering is relaxed. Ordering is achieved through scoped fence operations at four scopes:

  • scope_wave
  • scope_workgroup
  • scope_device
  • scope_system

Fence semantics:

  • fence_acquire(scope) ensures subsequent loads see values at least as recent as those visible at the scope
  • fence_release(scope) ensures prior stores are visible at the scope
  • fence_acq_rel(scope) combines both

Store-to-load ordering within a Thread is always guaranteed.

4.5 Atomic Operations

Atomic operations perform indivisible read-modify-write sequences on Local Memory and Device Memory. Required operations:

  • atomic_add (i32, u32, f32)
  • atomic_sub (i32, u32)
  • atomic_min (i32, u32)
  • atomic_max (i32, u32)
  • atomic_and (u32)
  • atomic_or (u32)
  • atomic_xor (u32)
  • atomic_exchange (u32)
  • atomic_compare_swap (u32)

Each takes a scope parameter. 64-bit atomics are OPTIONAL.
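
atomic_compare_swap is the universal primitive in this list: each of the other read-modify-write operations can be expressed as a retry loop around it. A sequential sketch of that construction (Python, illustrative; real hardware performs each sequence indivisibly):

```python
def atomic_compare_swap(mem, addr, expected, desired):
    # indivisible in hardware; shown sequentially for illustration
    old = mem[addr]
    if old == expected:
        mem[addr] = desired
    return old

def atomic_add_via_cas(mem, addr, value):
    # retry until no other thread modified the location in between
    while True:
        old = mem[addr]
        if atomic_compare_swap(mem, addr, old, old + value) == old:
            return old  # atomics return the old value
```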

5. Control Flow

5.1 Structured Control Flow

The ISA defines structured control flow primitives. All control flow MUST be expressible through these primitives. The implementation MUST NOT require the programmer to manage divergence masks, execution masks, or reconvergence points.

Conditional:

if (predicate)
    <then-body>
else
    <else-body>
endif

When Threads in a Wave evaluate the predicate differently, the implementation MUST execute both paths. Threads for which the predicate is false MUST NOT produce side effects during the then-body, and vice versa for the else-body. After endif, all Threads that were active before the if are active again.
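
One way to model these semantics is the execution-mask style of Section 5.3. A sketch (Python, illustrative; this is not a mandated mechanism):

```python
def exec_if_else(active, pred, then_body, else_body):
    # inactive threads in each path produce no architectural side effects
    then_mask = [a and p for a, p in zip(active, pred)]
    else_mask = [a and not p for a, p in zip(active, pred)]
    then_body(then_mask)
    else_body(else_mask)
    # endif: the original active set is restored
    return active
```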

Loop:

loop
    <body>
    break (predicate)    // exit loop for Threads where predicate is true
    continue (predicate) // skip to next iteration for Threads where predicate is true
endloop

A loop executes until all active Threads have exited via break. The implementation MUST guarantee forward progress: if at least one Thread remains in the loop, execution continues.

Predicate negation on break and continue: When a break or continue instruction uses a negated predicate (e.g., break !p0), the instruction applies to Threads where the predicate is false. That is, break !p0 causes Threads where p0 is false to exit the loop. The negation is applied before evaluating which Threads are affected.
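
In mask terms, break updates the set of Threads still in the loop, and the negation is applied to the predicate before the mask update. A sketch (Python, illustrative):

```python
def apply_break(live, pred, negated=False):
    # break p : threads where p is true exit the loop
    # break !p: negation applied first, so threads where p is false exit
    def exits(p):
        return (not p) if negated else p
    return [l and not exits(p) for l, p in zip(live, pred)]
```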

Function call:

call <function>
return

Function calls push the return address onto an implementation-managed call stack. The call stack depth MUST support at least MAX_CALL_DEPTH levels of nesting (see Section 7.1). Recursion is OPTIONAL (see Section 7).

5.2 Uniform Branches

If all Threads in a Wave evaluate a branch identically (a uniform branch), the implementation SHOULD avoid executing the not-taken path. This is a performance optimization, not a correctness requirement.

5.3 Divergence and Reconvergence

The implementation is free to use any mechanism to implement the structured control flow semantics of Section 5.1:

  • Compiler-managed execution masks (AMD approach)
  • Hardware per-thread program counters (NVIDIA approach)
  • Compiler-generated predicated instructions (Intel approach)
  • Hardware divergence stack (Apple approach)
  • Any other mechanism that preserves the specified semantics

The ISA does not expose or constrain the divergence mechanism.

5.4 Minimum Divergence Depth

Implementations MUST support nested control flow (if/else/endif, loop/break/endloop) to a depth of at least MIN_DIVERGENCE_DEPTH levels (see Section 7.1). This means a program may have up to MIN_DIVERGENCE_DEPTH nested if/else/endif blocks, or nested loops, or any combination thereof.

If a program exceeds the implementation's maximum divergence depth, the behavior is undefined.

5.5 Per-Wave Control Flow State

Each Wave MUST maintain independent control flow state. This includes, but is not limited to, the divergence stack (active mask history), loop iteration state, and reconvergence points. Two Waves executing the same program binary at different points in a loop or branch MUST NOT interfere with each other's control flow state.

Rationale: This requirement was added in v0.2 after the reference emulator discovered that sharing control flow state across Waves in a Workgroup causes deadlock when Waves reach barriers at different loop iterations.

6. Instruction Set

6.1 Instruction Format

All instructions operate on registers. There are no memory-to-register or memory-to-memory instructions (except explicit load/store).

Instructions are specified in this document in assembly notation:

opcode destination, source1, source2

Predicated instructions are written as:

@predicate opcode destination, source1, source2     // execute if predicate is true
@!predicate opcode destination, source1, source2    // execute if predicate is false

A predicated instruction executes only for Threads where the predicate condition is met. Threads where the condition is not met are unaffected — their destination registers retain their previous values and no side effects (memory stores, atomics) occur.
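
A per-Thread model of predicated execution (Python sketch, illustrative):

```python
def predicated(op, dest, src1, src2, pred, pred_neg=False):
    # threads whose predicate condition is not met keep their old
    # destination values and perform no side effects
    out = list(dest)
    for i, p in enumerate(pred):
        fires = (not p) if pred_neg else p
        if fires:
            out[i] = op(src1[i], src2[i])
    return out
```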

The binary encoding of instructions is defined in Section 8.

6.2 Integer Arithmetic

All integer operations are performed per-Thread. Full set of integer arithmetic for i32/u32:

  • iadd, isub, imul, imul_hi, imad
  • idiv, imod, ineg, iabs
  • imin, imax, iclamp

Integer arithmetic uses wrapping semantics for overflow. Division by zero produces undefined behavior.

6.3 Bitwise Operations

  • and, or, xor, not
  • shl, shr (logical), sar (arithmetic)
  • bitcount, bitfind, bitrev
  • bfe (bit field extract), bfi (bit field insert)

Shift amounts are masked to 5 bits (shift by rs2 & 0x1F).
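
The masking rule means shift amounts are taken modulo 32. Reference semantics for the three shifts on 32-bit values (Python sketch, illustrative):

```python
def shl(a, b):
    # logical shift left, result truncated to 32 bits
    return (a << (b & 0x1F)) & 0xFFFFFFFF

def shr(a, b):
    # logical shift right, zero-filled
    return (a & 0xFFFFFFFF) >> (b & 0x1F)

def sar(a, b):
    # arithmetic shift right, sign bit replicated
    b &= 0x1F
    signed = a - (1 << 32) if a & 0x80000000 else a
    return (signed >> b) & 0xFFFFFFFF
```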

6.4 Floating-Point Arithmetic (F32) — REQUIRED

IEEE 754 single precision. Operations:

  • fadd, fsub, fmul, fma, fdiv
  • fneg, fabs, fmin, fmax, fclamp
  • fsqrt, frsqrt, frcp
  • ffloor, fceil, fround, ftrunc, ffract, fsat

Transcendentals: fsin, fcos, fexp2, flog2 (maximum error of 2 ULP). Denormals MAY be flushed to zero.

6.5 Floating-Point Arithmetic (F16) — REQUIRED

F16 operations on register halves. Packed 2xF16 operations for throughput:

  • hadd2, hmul2, hma2

6.6 Floating-Point Arithmetic (F64) — OPTIONAL

F64 operations on register pairs:

  • dadd, dsub, dmul, dma, ddiv, dsqrt

6.7 Type Conversion

Type conversion between i32, u32, f16, f32, f64.

6.8 Comparison and Select

Comparison instructions set predicate registers. Select instruction for conditional moves.

6.9 Memory Operations

Local memory: local_load/local_store for u8, u16, u32, u64.

Device memory: device_load/device_store for u8, u16, u32, u64, u128. Device loads are asynchronous; use wait or fence before consuming.

Optional cache hints: .cached, .uncached, .streaming.

6.10 Atomic Operations

Atomic instructions on Local and Device Memory with scope suffixes (.wave, .workgroup, .device, .system). Atomics return the old value; implementations SHOULD optimize non-returning variants (where the result is unused).

6.11 Wave Operations — REQUIRED

Wave operations communicate between Threads within a Wave without going through memory.

  • wave_shuffle (by lane), wave_shuffle_up, wave_shuffle_down, wave_shuffle_xor
  • wave_broadcast
  • wave_ballot, wave_any, wave_all
  • wave_prefix_sum (exclusive)
  • wave_reduce_add, wave_reduce_min, wave_reduce_max

For shuffle operations, if the source lane is out of bounds (< 0 or >= W) or the source lane is inactive, the result is implementation-defined.

Wave operations operate only on active Threads. Inactive Threads (masked by divergence) do not participate in reductions, ballots, or prefix sums, and do not have their registers modified.
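
The active-Threads rule matters most for ballot and prefix sum. A sketch of the intended semantics (Python, illustrative):

```python
def wave_prefix_sum(values, active):
    # exclusive scan over active lanes only; inactive lanes keep
    # their register contents unmodified
    out = list(values)
    running = 0
    for lane, a in enumerate(active):
        if a:
            out[lane] = running  # sum of earlier active lanes
            running += values[lane]
    return out

def wave_ballot(pred, active):
    # bit i is set iff lane i is active and its predicate is true
    return sum(1 << lane
               for lane, (p, a) in enumerate(zip(pred, active)) if a and p)
```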

6.12 Synchronization

  • barrier — Workgroup-scope barrier. All Waves in the Workgroup MUST reach this point before any Wave proceeds past it. Memory operations before the barrier are visible to all Waves in the Workgroup after the barrier.
  • fence_acquire/fence_release/fence_acq_rel (scoped)
  • wait (for async loads)

Barrier restriction: A barrier instruction MUST NOT appear inside a divergent control flow path. That is, when a Wave reaches a barrier, every Thread that was active at the enclosing non-divergent scope must reach the same barrier. Barriers inside uniform if blocks (where all Threads agree on the condition) are permitted. Barriers inside loops are permitted provided all Waves in the Workgroup execute the same number of barrier instructions per loop iteration.

6.13 Control Flow Instructions

Instruction    Description
if pd          Begin conditional block (Threads where pd is false become inactive)
else           Switch active/inactive Threads
endif          End conditional block (restore original active set)
loop           Begin loop
break pd       Threads where pd is true exit the loop
break !pd      Threads where pd is false exit the loop
continue pd    Threads where pd is true skip to next iteration
continue !pd   Threads where pd is false skip to next iteration
endloop        End loop (branch back to loop if any Threads still active)
call <label>   Call function
return         Return from function
halt           Terminate this Thread

6.14 Matrix Multiply-Accumulate — OPTIONAL

  • mma_f16_f32, mma_bf16_f32, mma_f32_f32

Tile dimensions are queryable (see Section 7.3).

6.15 Miscellaneous

  • mov, mov_imm, nop

7. Capability System

7.1 Required Constants

Constant                 Minimum  Description
WAVE_WIDTH               8        Threads per Wave
MAX_REGISTERS            64       Maximum registers per Thread
REGISTER_FILE_SIZE       16384    Total register file size (bytes)
LOCAL_MEMORY_SIZE        16384    Local memory per Workgroup (bytes)
MAX_WORKGROUP_SIZE       256      Maximum Threads per Workgroup
MAX_WORKGROUPS_PER_CORE  1        Maximum concurrent Workgroups per Core
MAX_WAVES_PER_CORE       4        Maximum concurrent Waves per Core
DEVICE_MEMORY_SIZE       —        Total device memory (bytes)
CLUSTER_SIZE             1        Workgroups per Cluster
MAX_CALL_DEPTH           8        Maximum function call depth
MIN_DIVERGENCE_DEPTH     32       Minimum nested control flow depth

7.2 Optional Capabilities

  • CAP_F64 — 64-bit floating point
  • CAP_ATOMIC_64 — 64-bit atomics
  • CAP_ATOMIC_F32 — F32 atomic add
  • CAP_MMA — Matrix multiply-accumulate
  • CAP_RECURSION — Recursive function calls
  • CAP_CLUSTER — Cluster support

7.3 Matrix MMA Parameters

When CAP_MMA is present, the following are queryable:

  • MMA_M, MMA_N, MMA_K — Tile dimensions
  • MMA_TYPES — Supported input/output type combinations

7.4 Query Mechanism

Host API provides query_constant and query_capability functions.

8. Binary Encoding

8.1 Overview

Instructions are encoded as fixed-width 48-bit (6-byte) words. Some instructions require an additional 32-bit word for immediate values or extended operands (80 bits / 10 bytes total).

v0.2 change: The v0.1 encoding used 32-bit base instructions with 5-bit register fields (max 32 registers). This conflicted with MAX_REGISTERS = 64. The encoding has been widened to 48 bits with 8-bit register fields (max 256 registers).

8.2 Base Instruction Format (48-bit)

Bits Field Description
[47:40] opcode 8 bits — 256 primary opcodes
[39:32] rd 8 bits — destination register (0-255)
[31:24] rs1 8 bits — source register 1 (0-255)
[23:16] rs2 8 bits — source register 2 (0-255)
[15:12] modifier 4 bits — instruction-specific sub-opcode
[11:10] scope 2 bits — memory scope (00=wave, 01=workgroup, 10=device, 11=system)
[9:8] pred 2 bits — predicate register selector (p0-p3)
[7] pred_neg 1 bit — negate predicate (0=normal, 1=negated)
[6] pred_en 1 bit — predication enable (0=unpredicated, 1=predicated)
[5:0] flags 6 bits — instruction-specific
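
The field layout above can be checked with a pack/unpack round trip (Python sketch, illustrative; it builds the 48-bit word as an integer and takes no position on byte order, which the field table does not specify):

```python
def encode_base(opcode, rd, rs1, rs2, modifier=0, scope=0,
                pred=0, pred_neg=0, pred_en=0, flags=0):
    # field positions from the Section 8.2 table; widths sum to 48 bits
    return (((opcode & 0xFF) << 40) | ((rd & 0xFF) << 32) |
            ((rs1 & 0xFF) << 24) | ((rs2 & 0xFF) << 16) |
            ((modifier & 0xF) << 12) | ((scope & 0x3) << 10) |
            ((pred & 0x3) << 8) | ((pred_neg & 1) << 7) |
            ((pred_en & 1) << 6) | (flags & 0x3F))

def decode_base(word):
    return {
        "opcode":   (word >> 40) & 0xFF,
        "rd":       (word >> 32) & 0xFF,
        "rs1":      (word >> 24) & 0xFF,
        "rs2":      (word >> 16) & 0xFF,
        "modifier": (word >> 12) & 0xF,
        "scope":    (word >> 10) & 0x3,
        "pred":     (word >> 8)  & 0x3,
        "pred_neg": (word >> 7)  & 1,
        "pred_en":  (word >> 6)  & 1,
        "flags":    word & 0x3F,
    }
```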

8.3 Extended Instruction Format (80-bit)

For instructions requiring a third source register or a 32-bit immediate:

  • Word 0 (48 bits): Base instruction as above, with flags indicating extended format
  • Word 1 (32 bits): [31:0] rs3 (8 bits) + imm24, or full imm32

8.4 Opcode Map

The 8-bit opcode field provides 256 primary opcodes, organized as:

Range      Category
0x00-0x0F  Integer arithmetic
0x10-0x1F  Floating-point arithmetic (F32)
0x20-0x27  Bitwise operations
0x28-0x2F  Comparison and select
0x30-0x37  Local memory operations
0x38-0x3F  Device memory operations
0x40-0x4F  Atomic operations
0x50-0x5F  Wave operations
0x60-0x6F  Control flow and synchronization
0x70-0x7F  Type conversion
0x80-0x8F  F16 arithmetic
0x90-0x9F  F64 arithmetic (optional)
0xA0-0xAF  Matrix MMA (optional)
0xB0-0xEF  Reserved for future extensions
0xF0-0xFF  Miscellaneous (mov, mov_imm, nop, halt)

See Appendix A for the complete opcode-to-mnemonic mapping.

9. Conformance

9.1 Required Behavior

A compliant implementation MUST:

  1. Support all mandatory instructions (Sections 6.2 through 6.15; instruction groups marked OPTIONAL are exempt).
  2. Meet or exceed all minimum values in Section 7.1.
  3. Implement the memory ordering semantics of Section 4.4.
  4. Implement the structured control flow semantics of Section 5.1, including per-Wave control flow state (Section 5.5).
  5. Satisfy all execution guarantees of Section 2.5.
  6. Correctly report all capabilities of Section 7.2.
  7. Support nested control flow to at least MIN_DIVERGENCE_DEPTH levels (Section 5.4).

9.2 Implementation-Defined Behavior

The following behaviors are implementation-defined (valid implementations may differ):

  • Denormal floating-point handling (flush to zero or preserve)
  • Bank conflict penalty in Local Memory
  • Device Memory coalescing policy
  • Cache hierarchy structure, sizes, and policies
  • Scheduling policy for Wave selection
  • Transcendental function precision beyond the specified minimum
  • Out-of-bounds shuffle source lane result
  • Shuffle from inactive source lane result
  • Unaligned memory access behavior
  • Wave scheduling order within a Workgroup

9.3 Undefined Behavior

The following constitute undefined behavior (no guarantees):

  • Accessing Local Memory outside [0, S)
  • Accessing Device Memory outside allocated regions
  • Using optional capabilities on hardware that does not support them
  • Exceeding MAX_CALL_DEPTH
  • Exceeding the implementation's maximum divergence depth
  • Data races on Device Memory without proper fencing
  • Infinite loops with no forward progress
  • Barrier inside a divergent control flow path (where threads in a wave disagree on whether to execute the barrier)
  • Integer division by zero

9.4 Conformance Testing

A reference conformance test suite consisting of 102 tests is provided as a companion artifact in the WAVE toolchain repository. The test suite verifies:

  1. Correct execution of all mandatory instructions, including edge cases (overflow, NaN, infinity)
  2. Memory ordering compliance across scopes
  3. Barrier semantics with multi-wave workgroups, including barriers inside loops
  4. Atomic operation correctness on both local and device memory
  5. Structured control flow behavior under divergence, including nested divergence to depth 32
  6. Wave operations under divergence (shuffle, ballot, reduce with inactive threads)
  7. Capability reporting accuracy
  8. Real GPU program correctness (tiled GEMM, parallel reduction, histogram, prefix sum)

An implementation passes conformance if all 102 tests in the mandatory suite produce correct results. The test suite is versioned alongside the specification.

A. Full Opcode Table

Integer Arithmetic (0x00-0x0F)

Opcode     Mnemonic  Format    Description
0x00       iadd      Base      rd = rs1 + rs2
0x01       isub      Base      rd = rs1 - rs2
0x02       imul      Base      rd = (rs1 * rs2) & 0xFFFFFFFF
0x03       imul_hi   Base      rd = (rs1 * rs2) >> 32
0x04       imad      Extended  rd = rs1 * rs2 + rs3
0x05       idiv      Base      rd = rs1 / rs2
0x06       imod      Base      rd = rs1 % rs2
0x07       ineg      Base      rd = -rs1
0x08       iabs      Base      rd = abs(rs1)
0x09       imin      Base      rd = min(rs1, rs2) (signed)
0x0A       imax      Base      rd = max(rs1, rs2) (signed)
0x0B       iclamp    Extended  rd = clamp(rs1, rs2, rs3)
0x0C       umin      Base      rd = min(rs1, rs2) (unsigned)
0x0D       umax      Base      rd = max(rs1, rs2) (unsigned)
0x0E-0x0F  Reserved

Floating-Point Arithmetic F32 (0x10-0x1F)

Opcode  Mnemonic  Format    Description
0x10    fadd      Base      rd = rs1 + rs2
0x11    fsub      Base      rd = rs1 - rs2
0x12    fmul      Base      rd = rs1 * rs2
0x13    fma       Extended  rd = rs1 * rs2 + rs3
0x14    fdiv      Base      rd = rs1 / rs2
0x15    fneg      Base      rd = -rs1
0x16    fabs      Base      rd = abs(rs1)
0x17    fmin      Base      rd = min(rs1, rs2)
0x18    fmax      Base      rd = max(rs1, rs2)
0x19    fclamp    Extended  rd = clamp(rs1, rs2, rs3)
0x1A    fsqrt     Base      rd = sqrt(rs1)
0x1B    frsqrt    Base      rd = 1/sqrt(rs1)
0x1C    frcp      Base      rd = 1/rs1
0x1D    fround    Base      modifier: 0=floor, 1=ceil, 2=round, 3=trunc
0x1E    ffract    Base      rd = fract(rs1)
0x1F    ftransc   Base      modifier: 0=sin, 1=cos, 2=exp2, 3=log2

Bitwise Operations (0x20-0x27)

Opcode  Mnemonic  Format    Description
0x20    and       Base      rd = rs1 & rs2
0x21    or        Base      rd = rs1 | rs2
0x22    xor       Base      rd = rs1 ^ rs2
0x23    not       Base      rd = ~rs1
0x24    shift     Base      modifier: 0=shl, 1=shr, 2=sar
0x25    bitop     Base      modifier: 0=bitcount, 1=bitfind, 2=bitrev
0x26    bfe       Extended  Extract bit field
0x27    bfi       Extended  Insert bit field

Comparison and Select (0x28-0x2F)

Opcode     Mnemonic  Format  Description
0x28       icmp      Base    modifier: 0=eq, 1=ne, 2=lt, 3=le, 4=gt, 5=ge
0x29       ucmp      Base    modifier: 0=lt, 1=le
0x2A       fcmp      Base    modifier: 0=eq, 1=lt, 2=le, 3=gt, 4=ne, 5=ord, 6=unord
0x2B       select    Base    rd = pred ? rs1 : rs2
0x2C       fsat      Base    rd = clamp(rs1, 0.0, 1.0)
0x2D-0x2F  Reserved

Local Memory (0x30-0x37)

Opcode     Mnemonic     Format  Description
0x30       local_load   Base    modifier: 0=u8, 1=u16, 2=u32, 3=u64
0x31       local_store  Base    modifier: 0=u8, 1=u16, 2=u32, 3=u64
0x32-0x37  Reserved

Device Memory (0x38-0x3F)

Opcode     Mnemonic      Format  Description
0x38       device_load   Base    modifier: 0=u8, 1=u16, 2=u32, 3=u64, 4=u128
0x39       device_store  Base    modifier: 0=u8, 1=u16, 2=u32, 3=u64, 4=u128
0x3A-0x3F  Reserved

Atomic Operations (0x40-0x4F)

Opcode     Mnemonic         Format    Description
0x40       atomic_add       Extended  Atomic add (scope in scope field)
0x41       atomic_sub       Extended  Atomic subtract
0x42       atomic_min       Extended  Atomic minimum
0x43       atomic_max       Extended  Atomic maximum
0x44       atomic_and       Extended  Atomic bitwise AND
0x45       atomic_or        Extended  Atomic bitwise OR
0x46       atomic_xor       Extended  Atomic bitwise XOR
0x47       atomic_exchange  Extended  Atomic swap
0x48       atomic_cas       Extended  Compare-and-swap
0x49-0x4F  Reserved

Wave Operations (0x50-0x5F)

Opcode     Mnemonic           Format  Description
0x50       wave_shuffle       Base    rd = rs1 from lane rs2
0x51       wave_shuffle_up    Base    rd = rs1 from lane (lane_id - rs2)
0x52       wave_shuffle_down  Base    rd = rs1 from lane (lane_id + rs2)
0x53       wave_shuffle_xor   Base    rd = rs1 from lane (lane_id ^ rs2)
0x54       wave_broadcast     Base    rd = rs1 from lane rs2 (all threads)
0x55       wave_ballot        Base    rd = bitmask of pd across active threads
0x56       wave_any           Base    pd_dst = any active thread has pd_src true
0x57       wave_all           Base    pd_dst = all active threads have pd_src true
0x58       wave_prefix_sum    Base    Exclusive prefix sum
0x59       wave_reduce        Base    modifier: 0=add, 1=min, 2=max
0x5A-0x5F  Reserved

Control Flow and Synchronization (0x60-0x6F)

Opcode     Mnemonic  Format    Description
0x60       if        Base      Begin conditional block
0x61       else      Base      Switch active/inactive
0x62       endif     Base      End conditional, restore mask
0x63       loop      Base      Begin loop
0x64       break     Base      Exit loop for predicated threads
0x65       continue  Base      Skip to next iteration for predicated threads
0x66       endloop   Base      End loop, branch back if any active
0x67       call      Extended  Call function at imm32 address
0x68       return    Base      Return from function
0x69       barrier   Base      Workgroup barrier
0x6A       fence     Base      modifier: 0=acquire, 1=release, 2=acq_rel
0x6B       wait      Base      Wait for async loads
0x6C       halt      Base      Terminate thread
0x6D-0x6F  Reserved

Type Conversion (0x70-0x7F)

Opcode     Mnemonic     Format  Description
0x70       cvt_f32_i32  Base    Signed int to float
0x71       cvt_f32_u32  Base    Unsigned int to float
0x72       cvt_i32_f32  Base    Float to signed int
0x73       cvt_u32_f32  Base    Float to unsigned int
0x74       cvt_f32_f16  Base    F16 to F32
0x75       cvt_f16_f32  Base    F32 to F16
0x76       cvt_f32_f64  Base    F64 to F32 (requires CAP_F64)
0x77       cvt_f64_f32  Base    F32 to F64 (requires CAP_F64)
0x78-0x7F  Reserved

F16 Arithmetic (0x80-0x8F)

Opcode     Mnemonic  Format    Description
0x80       hadd      Base      F16 add
0x81       hsub      Base      F16 subtract
0x82       hmul      Base      F16 multiply
0x83       hma       Extended  F16 fused multiply-add
0x84       hadd2     Base      Packed 2xF16 add
0x85       hmul2     Base      Packed 2xF16 multiply
0x86       hma2      Extended  Packed 2xF16 fused multiply-add
0x87-0x8F  Reserved

F64 Arithmetic (0x90-0x9F) — Requires CAP_F64

Opcode     Mnemonic  Format    Description
0x90       dadd      Base      F64 add
0x91       dsub      Base      F64 subtract
0x92       dmul      Base      F64 multiply
0x93       dma       Extended  F64 fused multiply-add
0x94       ddiv      Base      F64 divide
0x95       dsqrt     Base      F64 square root
0x96-0x9F  Reserved

Matrix MMA (0xA0-0xAF) — Requires CAP_MMA

Opcode     Mnemonic      Format    Description
0xA0       mma_f16_f32   Extended  D = A*B+C, A/B F16, C/D F32
0xA1       mma_bf16_f32  Extended  D = A*B+C, A/B BF16, C/D F32
0xA2       mma_f32_f32   Extended  D = A*B+C, all F32
0xA3-0xAF  Reserved

Miscellaneous (0xF0-0xFF)

Opcode     Mnemonic     Format    Description
0xF0       mov          Base      rd = rs1
0xF1       mov_imm      Extended  rd = imm32
0xF2       mov_special  Base      rd = special register (rs1 encodes which)
0xF3       nop          Base      No operation
0xF4-0xFF  Reserved

B.Vendor Mapping Reference

Abstract Concept  NVIDIA         AMD RDNA          AMD CDNA          Intel              Apple
Core              SM             WGP               CU                Xe-core            GPU core
Wave              Warp (32)      Wavefront (32)    Wavefront (64)    Sub-group (8-16)   SIMD-group (32)
Register          PTX register   VGPR              VGPR              GRF entry          GPR
Local Memory      Shared memory  LDS               LDS               SLM                Threadgroup mem
Barrier           bar.sync       S_BARRIER         S_BARRIER         barrier            threadgroup_barrier
Shuffle           __shfl_sync    DPP/ds_permute    DPP/ds_permute    sub_group_shuffle  simd_shuffle
Atomic            atom/red       ds/buffer atomic  ds/buffer atomic  atomic_ref (SEND)  atomic_fetch_add
Device Memory     Global memory  VRAM              VRAM              GDDR/HBM           LPDDR (unified)
Fence             fence.scope    S_WAITCNT         S_WAITCNT         scoreboard         wait_for_loads

C.Revision History

Version Date Changes
0.1 2026-03-22 Initial draft
0.2 2026-03-23 Register encoding widened to 8-bit; minimum divergence depth specified (32); predicate negation on break/continue clarified; per-Wave control flow state required; full opcode table added; deterministic execution guarantee added; barrier divergence restriction documented; integer overflow and division semantics specified; conformance test suite (102 tests) referenced
Defensive Publication Statement: This specification is published as a defensive publication. The architectural concepts described herein are placed in the public domain for the purpose of establishing prior art and preventing proprietary claims on vendor-neutral GPU compute primitives.

The WAVE Specification

Wide Architecture Virtual Encoding

Version: 0.1 (Working Draft)
Authors: Ojima Abraham, Onyinye Okoli
Date: March 22, 2026
Status: Working Draft

0.Conformance Language

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this document are to be interpreted as described in RFC 2119.

1.Introduction

1.1Purpose

This specification defines WAVE (Wide Architecture Virtual Encoding), a vendor-neutral instruction set architecture for general-purpose GPU computation. It specifies an abstract execution model, a register model, a memory model, structured control flow semantics, an instruction set, and a capability query system.

The specification follows the thin abstraction principle: it defines what a compliant implementation MUST be able to do, not how it must do it. Implementations MAY use any microarchitectural technique to achieve compliance.

1.2Scope

This specification covers general-purpose compute workloads. Graphics pipeline operations (rasterization, tessellation, pixel export, ray tracing) are out of scope and MAY be addressed by future extensions.

1.3Design Principles

  1. Thin abstraction. Every requirement traces to a hardware-invariant primitive observed across all four major GPU vendors. No requirement is imposed for software convenience alone.
  2. Queryable parameters. Values that differ across implementations (wave width, register count, scratchpad size) are exposed as queryable constants, not fixed in the specification.
  3. Structured divergence. The specification defines control flow semantics but not divergence mechanisms. Implementations are free to use any technique (execution masks, predication, hardware stacks, per-thread program counters) to achieve the specified behavior.
  4. Mandatory minimums. Every queryable parameter has a minimum value. A compliant implementation MUST meet or exceed all minimums.

1.4Relationship to Other Standards

This specification is complementary to existing standards. SPIR-V MAY be used as a distribution format for programs targeting this ISA. OpenCL and Vulkan MAY serve as host APIs for dispatching workloads. The distinction is that this specification defines the hardware execution model, while existing standards define host-device interaction.

2.Execution Model

2.1Overview

A compliant processor consists of one or more Cores. Each Core is an independent compute unit capable of executing multiple Workgroups concurrently. Cores are not addressable by software. The hardware assigns Workgroups to Cores, and the programmer MUST NOT assume any particular mapping.

2.2Thread Hierarchy

The execution model defines five levels, four mandatory and one optional:

Level 0: Thread. The smallest unit of execution. A Thread has a private register file, a scalar program counter, and a position within the hierarchy identified by hardware-populated identity values. A Thread executes a sequential stream of instructions.

Level 1: Wave. A group of exactly W Threads that execute a single instruction simultaneously, where W is a hardware constant queryable at compile time (see Section 7). All Threads in a Wave share a program counter for the purpose of instruction fetch. When Threads in a Wave disagree on a branch condition, the implementation MUST ensure that both paths execute with inactive Threads producing no architectural side effects. The mechanism by which this is achieved is not specified.

A Wave is the fundamental scheduling unit. The hardware scheduler operates on Waves, not individual Threads.

Level 2: Workgroup. A group of one or more Waves, containing up to MAX_WORKGROUP_SIZE Threads. All Waves in a Workgroup execute on the same Core, share access to Local Memory, and may synchronize via Barriers.

The number of Waves per Workgroup is ceil(workgroup_thread_count / W).

Workgroup dimensions are specified at dispatch time as a 3-dimensional size (x, y, z) where x * y * z <= MAX_WORKGROUP_SIZE.

Level 3: Grid. The complete dispatch of Workgroups. A Grid is specified as a 3-dimensional count of Workgroups. Workgroups within a Grid MAY execute in any order, on any Core, at any time. No synchronization is available between Workgroups within a single Grid dispatch.

Level 2.5 (Optional): Cluster. A group of Workgroups guaranteed to execute concurrently on adjacent Cores with access to each other's Local Memory. The Cluster size is queryable via CLUSTER_SIZE. If the implementation does not support Clusters, CLUSTER_SIZE is 1 and Cluster-scope operations behave identically to Workgroup-scope operations.

2.3Thread Identifiers

Every Thread has the following hardware-populated, read-only values available as special registers:

Identifier Type Description
thread_id.{x,y,z} uint32 Thread position within Workgroup (3D)
wave_id uint32 Wave index within Workgroup
lane_id uint32 Thread position within Wave (0 to W-1)
workgroup_id.{x,y,z} uint32 Workgroup position within Grid (3D)
workgroup_size.{x,y,z} uint32 Workgroup dimensions
grid_size.{x,y,z} uint32 Grid dimensions (in Workgroups)
num_waves uint32 Number of Waves in this Workgroup
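As an illustration, these identifiers can be combined into a global linear index for a one-dimensional dispatch. The sketch below uses the assembly notation of Section 6 and assumes special registers may appear directly as mov source operands; the register choices are arbitrary:

```
// Global linear index for a 1-D dispatch (sketch)
mov         r0, %ctaid.x      // Workgroup ID within the Grid
mov         r1, %ntid.x       // Workgroup width in Threads
mov         r2, %tid.x        // Thread ID within the Workgroup
mad.lo.s32  r3, r0, r1, r2    // r3 = ctaid.x * ntid.x + tid.x
```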

2.4Core Resources

Each Core provides the following resources:

Register File. A fixed-size on-chip storage of F bytes, partitioned among all simultaneously resident Threads. Each Thread receives R registers (declared at compile time). The maximum number of simultaneously resident Waves is bounded by:

max_resident_waves = floor(F / (R * W * 4))

where 4 is the register width in bytes (32 bits). This is the occupancy equation. Implementations MUST support at least MIN_MAX_REGISTERS registers per Thread.

Local Memory. A fixed-size on-chip scratchpad of S bytes, shared among all Waves in the same Workgroup. Local Memory is explicitly addressed via load and store instructions. There is no automatic caching or data placement. Local Memory contents are undefined at Workgroup start and are not preserved across Workgroup boundaries. Implementations MUST provide at least MIN_LOCAL_MEMORY_SIZE bytes.

Hardware Scheduler. Selects a ready Wave for execution each cycle. When a Wave stalls on a memory access, barrier, or other long-latency operation, the scheduler MUST be able to select another resident Wave without software-visible overhead. The scheduling policy is implementation-defined.

2.5Execution Guarantees

  1. All Threads in a Wave execute the same instruction in the same cycle, or appear to from the programmer's perspective.
  2. Inactive Threads (those on the non-taken path of a divergent branch) produce no architectural side effects (no memory writes, no register updates visible to other Threads).
  3. All Threads in a Workgroup that reach a Barrier will synchronize at that Barrier before any proceeds past it.
  4. Memory operations to Local Memory are visible to all Threads in the Workgroup after a Barrier.
  5. Memory operations to Device Memory are visible according to the memory ordering rules (Section 4).
  6. Wave-level operations (reductions, broadcasts, shuffles) operate only on active Threads.
  7. A compliant implementation MUST eventually make progress on at least one Wave per Core.

2.6Dispatch

A dispatch consists of specifying:

  1. A program (compiled binary containing instructions for this ISA)
  2. Grid dimensions (number of Workgroups in x, y, z)
  3. Workgroup dimensions (number of Threads in x, y, z)
  4. Kernel arguments (Device Memory pointers, constants)

The implementation distributes Workgroups to Cores and creates Waves within each Workgroup. The distribution policy is implementation-defined.

3.Register Model

3.1General-Purpose Registers

Each Thread has access to R general-purpose registers, where R is declared at compile time and MUST be at least MIN_MAX_REGISTERS. Registers are 32 bits wide and are denoted r0, r1, ..., r(R-1). Register contents are undefined at Thread start unless explicitly initialized.

3.2Sub-Register Access

A 32-bit register may be accessed as two 16-bit halves:

  • r0.lo – bits [15:0]
  • r0.hi – bits [31:16]

Or as four 8-bit bytes (optional capability):

  • r0.b0 – bits [7:0]
  • r0.b1 – bits [15:8]
  • r0.b2 – bits [23:16]
  • r0.b3 – bits [31:24]

3.3Register Pairs

64-bit operations use register pairs. The notation r0:r1 denotes a 64-bit value where r0 holds the low 32 bits and r1 holds the high 32 bits. Register pairs MUST use adjacent even-odd register numbers (e.g., r0:r1, r2:r3, but not r1:r2).
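The pairing rule can be sketched with mul.wide.s32 from Section 6.2:

```
// 32x32 -> 64-bit multiply into an even-odd pair (sketch)
mul.wide.s32  r4:r5, r0, r1    // legal: the pair starts on an even register
// mul.wide.s32 r5:r6, r0, r1  // illegal: odd-even pairing is not allowed
```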

3.4Special Registers

The following read-only registers are populated by hardware:

  • %tid.x, %tid.y, %tid.z – Thread ID within Workgroup
  • %wid – Wave ID within Workgroup
  • %lid – Lane ID within Wave
  • %ctaid.x, %ctaid.y, %ctaid.z – Workgroup ID within Grid
  • %ntid.x, %ntid.y, %ntid.z – Workgroup dimensions
  • %nctaid.x, %nctaid.y, %nctaid.z – Grid dimensions
  • %nwaves – Number of Waves in this Workgroup
  • %clock – Cycle counter (implementation-defined resolution)

3.5Predicate Registers

Implementations MUST provide at least 8 predicate registers, denoted p0 through p7. Each predicate register holds a boolean value (1 bit per Thread). Predicate registers are used for conditional execution and branch conditions.

3.6Allocation and Occupancy

Compilers declare the number of registers required by a kernel. The runtime uses this to compute occupancy:

occupancy = min(
    max_waves_per_core,
    floor(register_file_size / (registers_per_thread * wave_width * 4)),
    floor(local_memory_size / local_memory_per_workgroup)
)

Higher occupancy generally improves latency hiding but reduces registers and Local Memory available per Thread.

4.Memory Model

4.1Memory Spaces

The specification defines four memory spaces:

Space     Scope          Lifetime                    Typical Implementation
Private   Single Thread  Thread lifetime             Registers or stack
Local     Workgroup      Workgroup lifetime          On-chip SRAM
Device    All Threads    Kernel lifetime (at least)  VRAM
Constant  All Threads    Kernel lifetime             Cached read-only

4.2Local Memory Details

Local Memory MUST support atomic operations. Access latency is implementation-defined but SHOULD be significantly lower than Device Memory. Bank conflicts MAY increase access latency but MUST NOT affect correctness. The implementation MUST provide at least MIN_LOCAL_MEMORY_SIZE bytes per Workgroup.

4.3Device Memory Details

Device Memory is the largest but slowest memory space. It persists across kernel launches (until explicitly deallocated by the host). Device Memory MUST support atomic operations. Coalescing (combining multiple Thread accesses into fewer memory transactions) is implementation-defined but strongly encouraged.

4.4Memory Ordering

Within a single Thread, memory operations appear to execute in program order. Across Threads, the ordering is relaxed by default. Explicit ordering is achieved via:

Fences. A fence instruction establishes ordering between memory operations before and after the fence. Fences are parameterized by scope:

  • fence.wave – ordering visible to all Threads in the Wave
  • fence.workgroup – ordering visible to all Threads in the Workgroup
  • fence.device – ordering visible to all Threads on the device

Barriers. A Barrier (barrier) implies a Workgroup-scope fence. All memory operations before the Barrier are visible to all Threads in the Workgroup after the Barrier.
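A common use of device-scope fencing is a flag-based handoff. The producer side can be sketched as follows; the register assignments (r0 = data pointer, r1 = flag pointer, r2 = payload, r3 = 1) are hypothetical:

```
// Producer side of a flag-based handoff (sketch)
st.global.b32         [r0], r2    // publish the payload
fence.device                      // payload ordered before the flag
atom.global.exch.b32  r4, [r1], r3  // set the ready flag
```

A consumer would poll the flag and issue a matching fence.device before reading the payload.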

4.5Atomic Operations

Atomic operations are indivisible read-modify-write operations. Implementations MUST support the following atomic operations on both Local and Device Memory:

  • atom.add, atom.sub – addition/subtraction
  • atom.min, atom.max – minimum/maximum (signed and unsigned)
  • atom.and, atom.or, atom.xor – bitwise operations
  • atom.exch – exchange
  • atom.cas – compare-and-swap

64-bit atomics are an optional capability.
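atom.cas can synthesize read-modify-write operations the hardware does not provide natively. A sketch of an f32 atomic add built from the primitives above, using the structured loop of Section 5 (r0 = address, r3 = increment):

```
// Emulating an f32 atomic add with atom.cas (sketch)
loop
    ld.global.b32        r1, [r0]          // current value
    add.f32              r2, r1, r3        // proposed new value
    atom.global.cas.b32  r4, [r0], r1, r2  // swap if still equal to r1
    setp.eq.s32          p0, r4, r1        // succeeded if unchanged
    break p0
endloop
```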

5.Control Flow

5.1Structured Control Flow

All control flow MUST be structured. The specification defines the following control flow primitives:

  • if pd, label_else, label_endif – conditional branch based on predicate pd
  • else – marks the start of the else block
  • endif – marks the end of the if/else construct
  • loop label_end – begins a loop, label_end marks the end
  • endloop – marks the end of a loop
  • break pd – exits the innermost loop if predicate is true
  • continue pd – jumps to the next iteration of the innermost loop

Arbitrary goto is not supported. All control flow graphs MUST be reducible.
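As a minimal example of the structured constructs, the following computes r1 = |r0|. (abs.s32 would do this directly; the sketch only illustrates divergence handling. The label operands shown in the if form above are omitted here for brevity, as in the instruction listing of Section 6.13, and mov/mov_imm follow the mnemonics of the binary encoding tables.)

```
// r1 = |r0| via structured control flow (sketch)
mov_imm      r2, 0
setp.lt.s32  p0, r0, r2
if p0
    neg.s32  r1, r0
else
    mov      r1, r0
endif
```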

5.2Uniform Branches

A branch is uniform if all active Threads in a Wave take the same direction. Uniform branches have no divergence overhead. Compilers SHOULD annotate branches as uniform when known, allowing optimizations.

The syntax for annotating a branch as uniform is @uniform if pd, .... If a branch annotated as uniform diverges at runtime, behavior is undefined.

5.3Divergence and Reconvergence

When Threads in a Wave disagree on a branch condition, the Wave is divergent. The implementation MUST execute both paths, but the order and mechanism are implementation-defined. Possible mechanisms include:

  • Execution masks (execute both paths with some Threads inactive)
  • Stack-based reconvergence
  • Per-Thread program counters

The specification only requires that:

  1. All Threads reconverge after structured control flow constructs (at endif, endloop).
  2. Inactive Threads produce no side effects.
  3. Wave-level operations operate only on active Threads.

6.Instruction Set

6.1Instruction Format

All instructions follow the format:

[predicate] opcode[.modifiers] destination, source1, source2, ...

Where:

  • predicate (optional): @p0...@p7 for conditional execution
  • opcode: instruction mnemonic
  • modifiers: type suffixes, rounding modes, etc.
  • destination: target register
  • source: source operands (registers, immediates)

6.2Integer Arithmetic

Instruction Description Operation
add.s32 rd, ra, rb Signed 32-bit add rd = ra + rb
add.u32 rd, ra, rb Unsigned 32-bit add rd = ra + rb
sub.s32 rd, ra, rb Signed 32-bit subtract rd = ra - rb
mul.lo.s32 rd, ra, rb Signed multiply (low 32 bits) rd = (ra * rb)[31:0]
mul.hi.s32 rd, ra, rb Signed multiply (high 32 bits) rd = (ra * rb)[63:32]
mul.wide.s32 rd:rd+1, ra, rb Signed 32x32→64 multiply rd:rd+1 = ra * rb
mad.lo.s32 rd, ra, rb, rc Multiply-add (low bits) rd = (ra * rb)[31:0] + rc
div.s32 rd, ra, rb Signed division rd = ra / rb
rem.s32 rd, ra, rb Signed remainder rd = ra % rb
neg.s32 rd, ra Negate rd = -ra
abs.s32 rd, ra Absolute value rd = |ra|
min.s32 rd, ra, rb Minimum rd = min(ra, rb)
max.s32 rd, ra, rb Maximum rd = max(ra, rb)

6.3Bitwise Operations

Instruction Description Operation
and.b32 rd, ra, rb Bitwise AND rd = ra & rb
or.b32 rd, ra, rb Bitwise OR rd = ra | rb
xor.b32 rd, ra, rb Bitwise XOR rd = ra ^ rb
not.b32 rd, ra Bitwise NOT rd = ~ra
shl.b32 rd, ra, rb Shift left rd = ra << rb
shr.u32 rd, ra, rb Logical shift right rd = ra >>> rb
shr.s32 rd, ra, rb Arithmetic shift right rd = ra >> rb
popc.b32 rd, ra Population count rd = popcount(ra)
clz.b32 rd, ra Count leading zeros rd = clz(ra)
brev.b32 rd, ra Bit reverse rd = reverse_bits(ra)
bfe.u32 rd, ra, rb, rc Bit field extract rd = (ra >> rb) & mask(rc)
bfi.b32 rd, ra, rb, rc, re Bit field insert Insert rc bits of ra at position rb into re

6.4Floating-Point (F32)

All implementations MUST support IEEE 754-2008 single-precision (binary32) with the following exceptions:

  • Denormal inputs MAY be flushed to zero.
  • Denormal outputs MAY be flushed to zero.
  • NaN payloads are implementation-defined.

Instruction Description
add.f32 rd, ra, rb Addition
sub.f32 rd, ra, rb Subtraction
mul.f32 rd, ra, rb Multiplication
fma.f32 rd, ra, rb, rc Fused multiply-add (rd = ra*rb + rc)
div.f32 rd, ra, rb Division
rcp.f32 rd, ra Reciprocal (approximate, 1 ULP)
sqrt.f32 rd, ra Square root
rsqrt.f32 rd, ra Reciprocal square root (approximate)
neg.f32 rd, ra Negate
abs.f32 rd, ra Absolute value
min.f32 rd, ra, rb Minimum (IEEE semantics)
max.f32 rd, ra, rb Maximum (IEEE semantics)
sin.f32 rd, ra Sine (approximate)
cos.f32 rd, ra Cosine (approximate)
exp2.f32 rd, ra 2^x (approximate)
log2.f32 rd, ra log₂(x) (approximate)

6.5Floating-Point (F16)

Half-precision support is an optional capability. When present:

  • Two f16 values are packed into a single 32-bit register.
  • Operations operate on pairs (vec2) or scalars.

Instruction Description
add.f16x2 rd, ra, rb Packed f16 addition
mul.f16x2 rd, ra, rb Packed f16 multiplication
fma.f16x2 rd, ra, rb, rc Packed f16 FMA

6.6Floating-Point (F64)

Double-precision support is an optional capability. When present, operations use register pairs.

Instruction Description
add.f64 rd:rd+1, ra:ra+1, rb:rb+1 f64 addition
mul.f64 rd:rd+1, ra:ra+1, rb:rb+1 f64 multiplication
fma.f64 rd:rd+1, ra:ra+1, rb:rb+1, rc:rc+1 f64 FMA
div.f64 rd:rd+1, ra:ra+1, rb:rb+1 f64 division

6.7Type Conversion

Instruction Description
cvt.f32.s32 rd, ra Signed int to float
cvt.f32.u32 rd, ra Unsigned int to float
cvt.s32.f32 rd, ra Float to signed int (truncate)
cvt.rni.s32.f32 rd, ra Float to signed int (round nearest)
cvt.f16.f32 rd, ra f32 to f16 (packed)
cvt.f32.f16 rd, ra f16 to f32 (unpacked)
cvt.f64.f32 rd:rd+1, ra f32 to f64
cvt.f32.f64 rd, ra:ra+1 f64 to f32

6.8Comparison and Select

Instruction Description
setp.eq.s32 pd, ra, rb Set predicate if equal
setp.ne.s32 pd, ra, rb Set predicate if not equal
setp.lt.s32 pd, ra, rb Set predicate if less than
setp.le.s32 pd, ra, rb Set predicate if less or equal
setp.gt.s32 pd, ra, rb Set predicate if greater than
setp.ge.s32 pd, ra, rb Set predicate if greater or equal
selp.s32 rd, ra, rb, pd Select: rd = pd ? ra : rb
slct.s32.f32 rd, ra, rb, rc Select based on sign: rd = (rc >= 0) ? ra : rb

Floating-point comparisons include special handling for NaN:

  • setp.eq.f32: false if either operand is NaN
  • setp.neu.f32: true if either operand is NaN (unordered not-equal)
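setp and selp together give branch-free selection. A sketch computing r3 = max(r0, 0) without divergence (mov_imm follows the mnemonic of the binary encoding tables):

```
// Branch-free r3 = max(r0, 0) (sketch)
mov_imm      r1, 0
setp.gt.s32  p0, r0, r1
selp.s32     r3, r0, r1, p0    // r3 = p0 ? r0 : r1
```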

6.9Memory Operations

Instruction Description
ld.local.b32 rd, [ra] Load 32 bits from Local Memory
ld.local.b64 rd:rd+1, [ra] Load 64 bits from Local Memory
ld.global.b32 rd, [ra] Load 32 bits from Device Memory
ld.const.b32 rd, [ra] Load 32 bits from Constant Memory
st.local.b32 [ra], rb Store 32 bits to Local Memory
st.global.b32 [ra], rb Store 32 bits to Device Memory

Vector loads/stores are supported for efficiency:

  • ld.global.v2.b32 {rd, rd+1}, [ra] – load 2x32 bits
  • ld.global.v4.b32 {rd, rd+1, rd+2, rd+3}, [ra] – load 4x32 bits

6.10Atomic Operations

Instruction Description
atom.local.add.u32 rd, [ra], rb Atomic add to Local Memory
atom.global.add.u32 rd, [ra], rb Atomic add to Device Memory
atom.global.cas.b32 rd, [ra], rb, rc Compare-and-swap
atom.global.exch.b32 rd, [ra], rb Exchange
atom.global.min.s32 rd, [ra], rb Atomic minimum
atom.global.max.s32 rd, [ra], rb Atomic maximum

All atomic operations return the value before the operation was applied.

6.11Wave Operations

Wave operations perform computation across all active Threads in a Wave.

Instruction Description
wave.reduce.add.u32 rd, ra Sum of ra across all active lanes
wave.reduce.min.s32 rd, ra Minimum of ra across all active lanes
wave.reduce.max.s32 rd, ra Maximum of ra across all active lanes
wave.reduce.and.b32 rd, ra Bitwise AND across all active lanes
wave.reduce.or.b32 rd, ra Bitwise OR across all active lanes
wave.broadcast.b32 rd, ra, rb Broadcast ra from lane rb to all lanes
wave.shuffle.b32 rd, ra, rb rd = ra from lane rb
wave.shuffle.xor.b32 rd, ra, rb rd = ra from lane (lane_id ^ rb)
wave.shuffle.up.b32 rd, ra, rb rd = ra from lane (lane_id - rb)
wave.shuffle.down.b32 rd, ra, rb rd = ra from lane (lane_id + rb)
wave.prefix.add.u32 rd, ra Exclusive prefix sum
wave.ballot.b32 rd, pd rd = bitmask of pd across all lanes
wave.any pd, ps pd = true if any active lane has ps=true
wave.all pd, ps pd = true if all active lanes have ps=true
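Combining these with Local Memory yields a workgroup-wide reduction. The sketch below produces one partial sum per Wave; a second pass (not shown) would reduce the partials. It assumes special registers may appear as operands and uses mov_imm from the binary encoding tables:

```
// Per-Wave partial sums into Local Memory (sketch)
wave.reduce.add.u32  r1, r0          // sum of r0 over active lanes
mov_imm              r2, 0
setp.eq.s32          p0, %lid, r2    // lane 0 of each Wave
mov_imm              r3, 4
mul.lo.s32           r4, %wid, r3    // byte offset = wave_id * 4
@p0 st.local.b32     [r4], r1        // store this Wave's partial
barrier                              // partials visible workgroup-wide
```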

6.12Synchronization

Instruction Description
barrier Workgroup barrier + memory fence
fence.wave Memory fence, Wave scope
fence.workgroup Memory fence, Workgroup scope
fence.device Memory fence, Device scope

6.13Control Flow Instructions

Instruction Description
if pd Begin if block
else Begin else block
endif End if/else block
loop Begin loop
endloop End loop
break pd Break from loop if predicate true
continue pd Continue to next iteration if predicate true
ret Return from kernel

6.14Matrix MMA (Optional)

Matrix multiply-accumulate (MMA) operations are an optional capability. When present, implementations support tensor core-style operations:

mma.m16n8k8.f32.f16 {d0,d1,d2,d3}, {a0,a1}, {b0}, {c0,c1,c2,c3}

The exact shapes and data types are queryable via the capability system (see Section 7).

7.Capability System

7.1Required Constants

Every implementation MUST provide the following queryable constants:

Constant Minimum Description
WAVE_WIDTH 16 Number of Threads per Wave
MAX_WORKGROUP_SIZE 256 Maximum Threads per Workgroup
MAX_REGISTERS 64 Maximum registers per Thread
LOCAL_MEMORY_SIZE 16384 Bytes of Local Memory per Workgroup
MAX_WAVES_PER_CORE 16 Maximum resident Waves per Core
PREDICATE_REGISTERS 8 Number of predicate registers
CLUSTER_SIZE 1 Workgroups per Cluster (1 = no clusters)

7.2Optional Capabilities

Implementations MAY support the following optional features:

Capability Description
CAP_F16 Half-precision floating-point
CAP_F64 Double-precision floating-point
CAP_ATOMIC64 64-bit atomic operations
CAP_MMA Matrix multiply-accumulate
CAP_DP4A 4-element dot product (int8)
CAP_SUBGROUPS Subgroup operations (partial wave)
CAP_CLUSTER Cluster-scope operations

7.3MMA Parameters

If CAP_MMA is present, the following constants define supported MMA shapes:

Constant Description
MMA_M M dimension (rows of output)
MMA_N N dimension (columns of output)
MMA_K K dimension (inner dimension)
MMA_INPUT_TYPES Bitmask of supported input types
MMA_OUTPUT_TYPES Bitmask of supported output types

7.4Query Mechanism

The host runtime provides a query function:

wave_result wave_get_capability(wave_device device, wave_capability cap, void* value, size_t size);

Where cap is an enumeration of all queryable constants and capabilities. The function writes the value to the provided buffer and returns a success/error code.

Example usage:

uint32_t wave_width;
wave_get_capability(device, WAVE_CAP_WAVE_WIDTH, &wave_width, sizeof(wave_width));

8.Binary Encoding

8.1Overview

Instructions are encoded in either a 32-bit base format or a 64-bit extended format. The base format accommodates common operations with register operands. The extended format adds support for immediate values, additional operands, and modifiers.

8.2Base Format (32-bit)

31      26 25    21 20    16 15    11 10     6 5       0
┌─────────┬────────┬────────┬────────┬────────┬─────────┐
│ opcode  │   rd   │   ra   │   rb   │  pred  │  flags  │
│ (6 bits)│(5 bits)│(5 bits)│(5 bits)│(5 bits)│ (6 bits)│
└─────────┴────────┴────────┴────────┴────────┴─────────┘
  • opcode: Primary operation code (64 base opcodes)
  • rd: Destination register
  • ra, rb: Source registers
  • pred: Predicate register (0 = unconditional, 1-7 = p1-p7, 8-15 = !p0-!p7)
  • flags: Operation-specific flags (saturation, rounding, etc.)

8.3Extended Format (64-bit)

63      61 60    56 55    51 50    46 45    41 40    36 35    32
┌─────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│  1 1 1  │ opcode │   rd   │   ra   │   rb   │   rc   │reserved│
│ (3 bits)│(5 bits)│(5 bits)│(5 bits)│(5 bits)│(5 bits)│(4 bits)│
└─────────┴────────┴────────┴────────┴────────┴────────┴────────┘
31                                                     0
┌──────────────────────────────────────────────────────┐
│                     immediate                         │
│                     (32 bits)                         │
└──────────────────────────────────────────────────────┘

The extended format is indicated by the top three bits of the instruction being 111 (bits [63:61] above, i.e., bits [31:29] of the first 32-bit word). This allows for:

  • 4-operand instructions (FMA, MMA)
  • 32-bit immediate values
  • Extended opcodes (224 additional operations)

9.Conformance

9.1Required Behavior

A conformant implementation MUST:

  1. Execute all required instructions with correct semantics.
  2. Meet or exceed all minimum capability values.
  3. Provide correct memory ordering per Section 4.
  4. Correctly handle divergence per Section 5.
  5. Report capabilities accurately via Section 7.

9.2Implementation-Defined

The following are implementation-defined:

  • Scheduling policy (Wave selection, Workgroup assignment)
  • Memory coalescing behavior
  • Bank conflict penalties
  • Divergence mechanism
  • NaN payload values
  • Clock counter resolution

9.3Undefined Behavior

The following result in undefined behavior:

  • Barrier within divergent control flow
  • Uniform branch that diverges at runtime
  • Access to unallocated registers
  • Out-of-bounds memory access
  • Data races (concurrent access without synchronization where at least one is a write)

9.4Conformance Testing

The reference test suite (forthcoming) will include:

  • Instruction correctness tests for all required operations
  • Memory ordering tests
  • Divergence correctness tests
  • Capability query tests
  • Edge case tests (overflow, NaN handling, etc.)

A.Full Opcode Table

Deferred to a future version of this specification.

B.Vendor Mapping

This appendix provides suggested mappings to vendor ISAs for reference implementations:

WAVE Concept   NVIDIA (PTX)   AMD (RDNA)    Intel (Xe)        Apple (M-series)
Wave           Warp (32)      Wave (32/64)  SIMD (8/16)       SIMD-group (32)
Workgroup      Thread Block   Work-group    Thread Group      Threadgroup
Local Memory   Shared Memory  LDS           SLM               Threadgroup Memory
Device Memory  Global Memory  VRAM          Global Memory     Device Memory
wave.shuffle   shfl.sync      ds_permute    mov (cross-lane)  quad_shuffle
Barrier        bar.sync       s_barrier     barrier           threadgroup_barrier
Fence          fence.sc       S_WAITCNT     scoreboard        wait_for_loads

C.Revision History

Version Date Changes
0.1 2026-03-22 Initial draft