I've been exploring the basics of RISC-V Vector Extension (RVV) recently. My main focus is understanding its core concepts and comparing it with other SIMD architectures, especially in terms of compiler support. To that end, I've also studied some basics of SVE for comparison.
During my study, I found RVV to be a highly flexible architecture that differs significantly from NEON, AVX, SVE or other SIMD architectures. Additionally, the compiler support for RVV is still in its early stages, which presents both challenges and opportunities.
Just like other RISC-V ISA extensions, RVV is powerful yet simple. The current specification of the extension spans only 111 pages, so the basics are quick to go through. Here are some key concepts and features of RVV.
RVV can be regarded as similar to Arm's SVE, as they both support scalable vectors. However, RVV is more flexible than SVE.
The word "scalable" means that the vector length is not fixed by the ISA specification, but chosen by the implementation. SVE allows the vector length to be between 128 and 2048 bits, as long as it is a multiple of 128 bits. By doing so, the hardware implementation, the ISA and the software can be decoupled. The hardware vendor can choose a suitable vector length for the target application, without worrying about the toolchain or compatibility with existing software.
In the sense of vendor-defined vector length, RVV is similar to SVE and can be called "scalable". However, RVV provides a more flexible configuration interface for the vector length. The vector length of RVV can be configured by the software at runtime, which is not supported by SVE. Because of that, RVV is sometimes regarded as a "dynamic"-length vector architecture [1].
The flexibility of RVV introduces some new concepts, instructions and constants, which confused me at the beginning. Some of the basic concepts are:

- `VLEN`: The implementation-defined size of a vector register, in bits.
- `ELEN`: The maximum element size supported by the implementation.
- `LMUL`: The number of vector registers in a group. This multiplier can also be fractional. According to the specification, an implementation must support the non-fractional `LMUL` values 1, 2, 4 and 8; for fractional `LMUL`, the requirement depends on the minimum element size.
- `SEW`: The element size to be operated on.
- `VLMAX`: The maximum number of elements in a register group, given the current `LMUL` and `SEW`.
- `AVL`: Application Vector Length, which is the number of elements that the software wants to operate on.

There is also a CSR called `vl` that records the current vector length. After reading some examples utilizing RVV, I found that there are actually two core aspects that enable the flexibility of RVV:

- `AVL` does not necessarily equal `VLMAX`.
- `vl` does not necessarily equal `AVL`.

This seems awkward at first. To understand it, two questions need to be answered:

1. How does the software choose the `AVL`?
2. How does the software tell the hardware the `AVL`?

As we have mentioned, `AVL` is the number of elements that the software wants to operate on. So the most straightforward way is to use the element count as the `AVL`.
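To make the relationship between `VLEN`, `SEW`, `LMUL` and `VLMAX` concrete, here is a small Python sketch. It relies on the specification's formula `VLMAX = LMUL * VLEN / SEW`; the function name `vlmax` and the 128-bit `VLEN` are assumptions chosen purely for illustration, not part of any real toolchain.

```python
from fractions import Fraction

def vlmax(vlen_bits: int, sew_bits: int, lmul: Fraction) -> int:
    """Return VLMAX = LMUL * VLEN / SEW for one register group."""
    result = Fraction(vlen_bits) * lmul / sew_bits
    # A group that cannot hold at least one whole element is not usable.
    if result < 1 or result.denominator != 1:
        raise ValueError("unsupported SEW/LMUL combination for this VLEN")
    return int(result)

# Hypothetical implementation with 128-bit vector registers.
VLEN = 128
print(vlmax(VLEN, 32, Fraction(1)))     # 4
print(vlmax(VLEN, 32, Fraction(8)))     # 32
print(vlmax(VLEN, 8, Fraction(1, 2)))   # 8
```

The same `SEW` yields a different `VLMAX` on hardware with a different `VLEN`, which is why the software asks for an `AVL` instead of hard-coding a vector length.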
Then it comes to the second question. How can the software pass its vector length requirement to the hardware? The `vset{i}vl{i}` instructions can be used to set the vector length [2]:

    vsetvli rd, rs1, SEW, LMUL, ...

Here `rs1` is the source register that contains the `AVL`, `SEW` is the element size, `LMUL` is the number of vector registers in a group, and `rd` receives the resulting `vl`. The hardware calculates `VLMAX` from the given `SEW` and `LMUL`, and then compares it with the `AVL` to determine the actual vector length:

- `vl = AVL` if `AVL <= VLMAX`
- `ceil(AVL / 2) <= vl <= VLMAX` if `AVL < (2 * VLMAX)`
- `vl = VLMAX` if `AVL >= (2 * VLMAX)`

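The constraints on `vl` can be written down as a small checker. This is only a sketch: `is_legal_vl` and `simple_vl` are hypothetical helper names, the third rule is taken to apply when `AVL >= 2 * VLMAX` as in the specification, and clamping `AVL` to `VLMAX` is just one policy that satisfies the constraints, not necessarily what a given implementation does.

```python
def is_legal_vl(avl: int, vlmax: int, vl: int) -> bool:
    """Check a candidate vl against the three constraints on vsetvli results."""
    if avl <= vlmax:
        return vl == avl
    if avl < 2 * vlmax:
        # (avl + 1) // 2 is ceil(avl / 2) in integer arithmetic.
        return (avl + 1) // 2 <= vl <= vlmax
    return vl == vlmax

def simple_vl(avl: int, vlmax: int) -> int:
    """One legal policy (an assumption for illustration): clamp AVL to VLMAX."""
    return min(avl, vlmax)

# With VLMAX = 8 and AVL = 10, any vl from 5 to 8 is allowed.
print([vl for vl in range(0, 11) if is_legal_vl(10, 8, vl)])  # [5, 6, 7, 8]
print(simple_vl(10, 8))                                       # 8
```

The middle case is what lets an implementation split the last two iterations of a loop more evenly, for example 5 + 5 elements instead of 8 + 2.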
So now let's get back to the flexibility of RVV. We get a `vl` by telling the hardware how many elements we want to process. And what's next? The `vl` may not be what we asked for, so how do we proceed with the computation? Usually, when it comes to vectorization, we use a mask to indicate the active elements in the vector register. But in RVV, a mask is not necessary for this. Instead, we use the `vl` to indicate the active elements and, more importantly, the step of the loop. Assume that we have `n` elements to process; here is pseudocode for the vectorized loop:
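
    loop:
        vl = vsetvl n          # request n elements; the hardware returns vl
        vector load            # loads vl elements
        ...
        vector computation
        ...
        vector store           # stores vl elements
        advance pointers by vl
        n = n - vl             # elements still to be processed
        if n > 0 goto loop
    exit:
        ...
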
This method is known as strip mining. It leverages RVV's dynamic vector length (instead of a more general mask) to control the loop step.
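The loop can be simulated in plain Python to see how `vl` drives both the number of active elements and the loop step. This is a sketch under stated assumptions: `vsetvl` here is a stand-in that uses the simple policy `vl = min(AVL, VLMAX)`, list slices stand in for vector loads and stores, and `VLMAX = 4` is an arbitrary choice.

```python
def vsetvl(avl: int, vlmax: int) -> int:
    # Stand-in for vsetvli: assumes the simple policy vl = min(AVL, VLMAX).
    return min(avl, vlmax)

def stripmined_add(a, b, vlmax=4):
    """Element-wise add of two equal-length lists, vl elements per iteration."""
    assert len(a) == len(b)
    out = [0] * len(a)
    n = len(a)  # remaining elements, used as the AVL of each iteration
    i = 0       # stands in for the pointers that are advanced each iteration
    while n > 0:
        vl = vsetvl(n, vlmax)
        va = a[i:i + vl]                                 # vector load
        vb = b[i:i + vl]                                 # vector load
        out[i:i + vl] = [x + y for x, y in zip(va, vb)]  # compute and store
        i += vl
        n -= vl
    return out

print(stripmined_add([1, 2, 3, 4, 5, 6, 7], [10, 20, 30, 40, 50, 60, 70]))
# [11, 22, 33, 44, 55, 66, 77]
```

With seven elements and `VLMAX = 4`, the loop runs twice with `vl = 4` and then `vl = 3`; no separate scalar tail loop or mask is needed.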
There is already scalable vector support in LLVM with `vscale` and in MLIR with types like `vector<[4]xi32>`. However, these scalable vectors are merely compatible features: they don't fully exploit RVV's dynamism.
The difficulty arises from the fact that vector operations depend on implicit state: CSRs need to be accessed to determine how an instruction executes. The C intrinsics of RVV require the programmer to pass `vl` as an explicit argument, and leave it to the compiler to reduce the number of `vset{i}vl{i}` instructions. Of course, current compilers like clang can do an excellent job optimizing manually crafted code, but it is still difficult to model the runtime length of vectors in a general way, which makes high-level optimizations harder.
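To illustrate what "reducing the number of `vset{i}vl{i}` instructions" means, here is a toy Python sketch. It is not how clang or any real compiler implements this: `insert_vsetvli` is a hypothetical helper that handles only straight-line code and assumes the register holding the `AVL` is not modified between operations. Each operation carries the configuration it needs, and a `vsetvli` is emitted only when that configuration differs from the one currently in effect.

```python
from typing import List, Optional, Tuple

# (AVL register, SEW, LMUL), e.g. ("a0", "e32", "m1")
Config = Tuple[str, str, str]

def insert_vsetvli(ops: List[Tuple[str, Config]]) -> List[str]:
    """Emit a vsetvli only when the required configuration changes."""
    out: List[str] = []
    current: Optional[Config] = None
    for text, cfg in ops:
        if cfg != current:
            avl, sew, lmul = cfg
            out.append(f"vsetvli t0, {avl}, {sew}, {lmul}, ta, ma")
            current = cfg
        out.append(text)
    return out

cfg32 = ("a0", "e32", "m1")
ops = [
    ("vle32.v v8, (a1)", cfg32),
    ("vle32.v v9, (a2)", cfg32),
    ("vadd.vv v10, v8, v9", cfg32),
    ("vse32.v v10, (a3)", cfg32),
]
result = insert_vsetvli(ops)
print("\n".join(result))
```

All four operations share one configuration, so a single `vsetvli` is emitted in front of them instead of one per operation. Doing this across branches and loops, where the state can differ per path, is the harder part.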
There are some proposed or experimental methods in LLVM and MLIR to add support for RVV.

- The `get_vector_length` intrinsic, which might be able to model the dynamic vector length in RVV.
- The `vector_exp` dialect.
- `vector<?xi32>` in MLIR [1]. This is also implemented in the `vector_exp` dialect in the Buddy Compiler.

To me, the most elegant solution would be to introduce a new dynamic vector type in MLIR, together with a `set_vl` operation which encloses all dynamic vector operations inside its region [3]:

    func.func @vector_add(%in1: memref<?xi32>, %in2: memref<?xi32>, %out: memref<?xi32>) {
      %c0 = arith.constant 0 : index
      %dim_size = memref.dim %in1, %c0 : memref<?xi32>
      vector.set_vl %dim_size : index {
        %vec_input1 = vector.load %in1[%c0] : memref<?xi32>, vector<?xi32>
        %vec_input2 = vector.load %in2[%c0] : memref<?xi32>, vector<?xi32>
        %vec_output = arith.addi %vec_input1, %vec_input2 : vector<?xi32>
        vector.store %vec_output %out[%c0] : memref<?xi32>, vector<?xi32>
      }
    }

However, this is currently just a proposal and has not been upstreamed yet.
RVV is really a flexible and powerful SIMD architecture. The dynamic vector length and the strip mining method distinguish it from other SIMD architectures. However, the compiler support for RVV is still maturing, and many challenges remain. I hope that the compiler support for RVV can be improved in the future, and I am looking forward to seeing further advancement in its application.

[1] See the discussion on the LLVM Discourse: "[RFC] Dynamic Vector Semantics for the MLIR Vector Dialect".

[2] The `vset{i}vl{i}` instructions also update `vtype`, and there are some special cases when an operand of `vset{i}vl{i}` is `x0`. These details can be found in the specification.

[3] This code snippet is from the discussion of the dynamic vector semantics.