Class Vector<E>

java.lang.Object
jdk.incubator.vector.Vector<E>
Type Parameters:
E - the boxed version of ETYPE, the element type of a vector
Direct Known Subclasses:
ByteVector, DoubleVector, FloatVector, IntVector, LongVector, ShortVector

public abstract class Vector<E> extends Object
A sequence of a fixed number of lanes, all of some fixed element type such as byte, long, or float. Each lane contains an independent value of the element type. Operations on vectors are typically lane-wise, distributing some scalar operator (such as addition) across the lanes of the participating vectors, usually generating a vector result whose lanes contain the various scalar results. When run on a supporting platform, lane-wise operations can be executed in parallel by the hardware. This style of parallelism is called Single Instruction Multiple Data (SIMD) parallelism.

In the SIMD style of programming, most of the operations within a vector lane are unconditional, but the effect of conditional execution may be achieved using masked operations such as blend(), under the control of an associated VectorMask. Data motion other than strictly lane-wise flow is achieved using cross-lane operations, often under the control of an associated VectorShuffle. Lane data and/or whole vectors can be reformatted using various kinds of lane-wise conversions and byte-wise reinterpretations, often under the control of a reflective VectorSpecies object which selects an alternative vector format different from that of the input vector.

Vector<E> declares a set of vector operations (methods) that are common to all element types. These common operations include generic access to lane values, data selection and movement, reformatting, and certain arithmetic and logical operations (such as addition or comparison) that are common to all primitive types.

Public subtypes of Vector correspond to specific element types. These declare further operations that are specific to that element type, including unboxed access to lane values, bitwise operations on values of integral element types, or transcendental operations on values of floating point element types.

Some lane-wise operations, such as the add operator, are defined as a full-service named operation, where a corresponding method on Vector comes in masked and unmasked overloadings, and (in subclasses) also comes in covariant overrides (returning the subclass) and additional scalar-broadcast overloadings (both masked and unmasked). Other lane-wise operations, such as the min operator, are defined as a partially serviced (not a full-service) named operation, where a corresponding method on Vector and/or a subclass provides some but not all possible overloadings and overrides (commonly the unmasked variant with scalar-broadcast overloadings). Finally, all lane-wise operations (those named as previously described, or otherwise unnamed method-wise) have a corresponding operator token declared as a static constant on VectorOperators. Each operator token defines a symbolic Java expression for the operation, such as a + b for the ADD operator token. General lane-wise operation-token accepting methods, such as for a unary lane-wise operation, are provided on Vector and come in the same variants as a full-service named operation.

This package contains a public subtype of Vector corresponding to each supported element type: ByteVector, ShortVector, IntVector, LongVector, FloatVector, and DoubleVector.

The element type of a vector, referred to as ETYPE, is one of the primitive types byte, short, int, long, float, or double.

The type E in Vector<E> is the boxed version of ETYPE. For example, in the type Vector<Integer>, the E parameter is Integer and the ETYPE is int. In such a vector, each lane carries a primitive int value. This pattern continues for the other primitive types as well. (See also sections 5.1.7 and 5.1.8 of The Java Language Specification.)

The length of a vector is the lane count, the number of lanes it contains. This number is also called VLENGTH when the context makes clear which vector it belongs to. Each vector has its own fixed VLENGTH but different instances of vectors may have different lengths. VLENGTH is an important number, because it estimates the SIMD performance gain of a single vector operation as compared to scalar execution of the VLENGTH scalar operators which underlie the vector operation.

Shapes and species

The information capacity of a vector is determined by its vector shape, also called its VSHAPE. Each possible VSHAPE is represented by a member of the VectorShape enumeration, and represents an implementation format shared in common by all vectors of that shape. Thus, the size in bits of a vector is determined by appealing to its vector shape.

Some Java platforms give special support to only one shape, while others support several. A typical platform is not likely to support all the shapes described by this API. For this reason, most vector operations work on a single input shape and produce the same shape on output. Operations which change shape are clearly documented as shape-changing, while the majority of operations are shape-invariant, to avoid disadvantaging platforms which support only one shape. There are queries to discover, for the current Java platform, the preferred shape for general SIMD computation, or the largest available shape for any given lane type. To be portable, code using this API should start by querying a supported shape, and then process all data with shape-invariant operations, within the selected shape.

Each unique combination of element type and vector shape determines a unique vector species. A vector species is represented by a fixed instance of VectorSpecies<E> shared in common by all vectors of the same shape and ETYPE.

Unless otherwise documented, lane-wise vector operations require that all vector inputs have exactly the same VSHAPE and VLENGTH, which is to say that they must have exactly the same species. This allows corresponding lanes to be paired unambiguously. The check() method provides an easy way to perform this check explicitly.

Vector shape, VLENGTH, and ETYPE are all mutually constrained, so that VLENGTH times the bit-size of each lane must always match the bit-size of the vector's shape. Thus, reinterpreting a vector may double its length if and only if it either halves the lane size, or else changes the shape. Likewise, reinterpreting a vector may double the lane size if and only if it either halves the length, or else changes the shape of the vector.
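
The constraint can be checked with simple arithmetic. The following is a minimal plain-Java sketch of that arithmetic, not a use of this API; the class name LaneGeometry and the method laneCount are hypothetical names introduced only for illustration.

```java
// Plain-Java model of the size constraint; it does not use jdk.incubator.vector.
// LaneGeometry and laneCount are hypothetical names used only for this sketch.
final class LaneGeometry {
    /** Returns VLENGTH for a shape size and a lane size, both given in bits. */
    static int laneCount(int shapeBits, int laneBits) {
        if (laneBits <= 0 || shapeBits % laneBits != 0) {
            throw new IllegalArgumentException("lane size must evenly divide shape size");
        }
        return shapeBits / laneBits;
    }

    public static void main(String[] args) {
        // Same 256-bit shape, different lane sizes: halving the lane size doubles the length.
        System.out.println("int lanes in 256 bits:   " + laneCount(256, Integer.SIZE));
        System.out.println("short lanes in 256 bits: " + laneCount(256, Short.SIZE));
        // Same lane size, different shape: doubling the shape doubles the length.
        System.out.println("int lanes in 512 bits:   " + laneCount(512, Integer.SIZE));
    }
}
```

In this model, a reinterpretation from int lanes to short lanes within a 256-bit shape goes from 8 lanes to 16, which is the "halves the lane size, doubles the length" case described above.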

Vector subtypes

Vector declares a set of vector operations (methods) that are common to all element types (such as addition). Sub-classes of Vector with a concrete element type declare further operations that are specific to that element type (such as access to element values in lanes, logical operations on values of integral element types, or transcendental operations on values of floating point element types). There are six abstract sub-classes of Vector corresponding to the supported set of element types, ByteVector, ShortVector, IntVector, LongVector, FloatVector, and DoubleVector. Along with type-specific operations these classes support creation of vector values (instances of Vector). They expose static constants corresponding to the supported species, and static methods on these types generally take a species as a parameter. For example, FloatVector.fromArray creates and returns a float vector of the specified species, with elements loaded from the specified float array. It is recommended that Species instances be held in static final fields for optimal creation and usage of Vector values by the runtime compiler.

As an example of static constants defined by the typed vector classes, constant FloatVector.SPECIES_256 is the unique species whose lanes are floats and whose vector size is 256 bits. Again, the constant FloatVector.SPECIES_PREFERRED is the species which best supports processing of float vector lanes on the currently running Java platform.

As another example, a broadcast scalar value of (double)0.5 can be obtained by calling DoubleVector.broadcast(dsp, 0.5), but the argument dsp is required to select the species (and hence the shape and length) of the resulting vector.

Lane-wise operations

We use the term lanes when defining operations on vectors. The number of lanes in a vector is the number of scalar elements it holds. For example, a vector of type float and shape S_256_BIT has eight lanes, since 32*8=256.

Most operations on vectors are lane-wise, which means the operation is composed of an underlying scalar operator, which is repeated for each distinct lane of the input vector. If there are additional vector arguments of the same type, their lanes are aligned with the lanes of the first input vector. (They must all have a common VLENGTH.) For most lane-wise operations, the output resulting from a lane-wise operation will have a VLENGTH which is equal to the VLENGTH of the input(s) to the operation. Thus, such lane-wise operations are length-invariant, in their basic definitions.

The principle of length-invariance is combined with another basic principle, that most length-invariant lane-wise operations are also shape-invariant, meaning that the inputs and the output of a lane-wise operation will have a common VSHAPE. When the principles conflict, because a logical result (with an invariant VLENGTH) does not fit into the invariant VSHAPE, the resulting expansions and contractions are handled explicitly with special conventions.

Vector operations can be grouped into various categories and their behavior can be generally specified in terms of underlying scalar operators. In the examples below, ETYPE is the element type of the operation (such as int.class) and EVector is the corresponding concrete vector type (such as IntVector.class).

  • A lane-wise unary operation, such as w = v0.neg(), takes one input vector, distributing a unary scalar operator across the lanes, and produces a result vector of the same type and shape. For each lane of the input vector a, the underlying scalar operator is applied to the lane value. The result is placed into the vector result in the same lane. The following pseudocode illustrates the behavior of this operation category:
    
     ETYPE scalar_unary_op(ETYPE s);
     EVector a = ...;
     VectorSpecies<E> species = a.species();
     ETYPE[] ar = new ETYPE[a.length()];
     for (int i = 0; i < ar.length; i++) {
         ar[i] = scalar_unary_op(a.lane(i));
     }
     EVector r = EVector.fromArray(species, ar, 0);
     
  • A lane-wise binary operation, such as w = v0.add(v1), takes two input vectors, distributing a binary scalar operator across the lanes, and produces a result vector of the same type and shape. For each lane of the two input vectors a and b, the underlying scalar operator is applied to the lane values. The result is placed into the vector result in the same lane. The following pseudocode illustrates the behavior of this operation category:
    
     ETYPE scalar_binary_op(ETYPE s, ETYPE t);
     EVector a = ...;
     VectorSpecies<E> species = a.species();
     EVector b = ...;
     b.check(species);  // must have same species
     ETYPE[] ar = new ETYPE[a.length()];
     for (int i = 0; i < ar.length; i++) {
         ar[i] = scalar_binary_op(a.lane(i), b.lane(i));
     }
     EVector r = EVector.fromArray(species, ar, 0);
     
  • Generalizing from unary and binary operations, a lane-wise n-ary operation takes N input vectors v[j], distributing an n-ary scalar operator across the lanes, and produces a result vector of the same type and shape. Except for a few ternary operations, such as w = v0.fma(v1,v2), this API has no support for lane-wise n-ary operations. For each lane of all of the input vectors v[j], the underlying scalar operator is applied to the lane values. The result is placed into the vector result in the same lane. The following pseudocode illustrates the behavior of this operation category:
    
     ETYPE scalar_nary_op(ETYPE... args);
     EVector[] v = ...;
     int N = v.length;
     VectorSpecies<E> species = v[0].species();
     for (EVector arg : v) {
         arg.check(species);  // all must have same species
     }
     ETYPE[] ar = new ETYPE[v[0].length()];
     for (int i = 0; i < ar.length; i++) {
         ETYPE[] args = new ETYPE[N];
         for (int j = 0; j < N; j++) {
             args[j] = v[j].lane(i);
         }
         ar[i] = scalar_nary_op(args);
     }
     EVector r = EVector.fromArray(species, ar, 0);
     
  • A lane-wise conversion operation, such as w0 = v0.convert(VectorOperators.I2D, 0), takes one input vector, distributing a unary scalar conversion operator across the lanes, and produces a logical result of the converted values. The logical result (or at least a part of it) is presented in a vector of the same shape as the input vector.

    Unlike other lane-wise operations, conversions can change lane type, from the input (domain) type to the output (range) type. The lane size may change along with the type. In order to manage the size changes, lane-wise conversion methods can produce partial results, under the control of a part parameter, which is explained elsewhere. (Following the example above, the second group of converted lane values could be obtained as w1 = v0.convert(VectorOperators.I2D, 1).)

    The following pseudocode illustrates the behavior of this operation category in the specific example of a conversion from int to double, retaining either lower or upper lanes (depending on part) to maintain shape-invariance:

    
     IntVector a = ...;
     int VLENGTH = a.length();
     int part = ...;  // 0 or 1
     VectorShape VSHAPE = a.shape();
     double[] arlogical = new double[VLENGTH];
     for (int i = 0; i < VLENGTH; i++) {
         int e = a.lane(i);
         arlogical[i] = (double) e;
     }
     VectorSpecies<Double> rs = VSHAPE.withLanes(double.class);
     int M = Double.SIZE / Integer.SIZE;  // expansion factor
     int offset = part * (VLENGTH / M);
     DoubleVector r = DoubleVector.fromArray(rs, arlogical, offset);
     assert r.length() == VLENGTH / M;
     
  • A cross-lane reduction operation, such as e = v0.reduceLanes(VectorOperators.ADD), operates on all the lane elements of an input vector. An accumulation function is applied to all the lane elements to produce a scalar result. If the reduction operation is associative then the result may be accumulated by operating on the lane elements in any order using a specified associative scalar binary operation and identity value. Otherwise, the reduction operation specifies the order of accumulation. The following pseudocode illustrates the behavior of this operation category if it is associative:
    
     ETYPE assoc_scalar_binary_op(ETYPE s, ETYPE t);
     EVector a = ...;
     ETYPE r = <identity value>;
     for (int i = 0; i < a.length(); i++) {
         r = assoc_scalar_binary_op(r, a.lane(i));
     }
     
  • A cross-lane movement operation, such as w = v0.rearrange(shuffle), operates on all the lane elements of an input vector and moves them in a data-dependent manner into different lanes in an output vector. The movement is steered by an auxiliary datum, such as a VectorShuffle or a scalar index defining the origin of the movement. The following pseudocode illustrates the behavior of this operation category, in the case of a shuffle:
    
     EVector a = ...;
     VectorShuffle<E> s = ...;
     ETYPE[] ar = new ETYPE[a.length()];
     for (int i = 0; i < ar.length; i++) {
         int source = s.laneSource(i);
         ar[i] = a.lane(source);
     }
     EVector r = EVector.fromArray(a.species(), ar, 0);
     
  • A masked operation is one which is a variation on one of the previous operations (either lane-wise or cross-lane), where the operation takes an extra trailing VectorMask argument. In lanes where the mask is set, the operation behaves as if the mask argument were absent, but in lanes where the mask is unset, the underlying scalar operation is suppressed. Masked operations are explained in greater detail elsewhere.
  • A very special case of a masked lane-wise binary operation is a blend, which operates lane-wise on two input vectors a and b, selecting lane values from one input or the other depending on a mask m. In lanes where m is set, the corresponding value from b is selected into the result; otherwise the value from a is selected. Thus, a blend acts as a vectorized version of Java's ternary selection expression m?b:a:
    
     ETYPE[] ar = new ETYPE[a.length()];
     for (int i = 0; i < ar.length; i++) {
         boolean isSet = m.laneIsSet(i);
         ar[i] = isSet ? b.lane(i) : a.lane(i);
     }
     EVector r = EVector.fromArray(a.species(), ar, 0);
     
  • A lane-wise binary test operation, such as m = v0.lt(v1), takes two input vectors, distributing a binary scalar comparison across the lanes, and produces, not a vector of booleans, but rather a vector mask. For each lane of the two input vectors a and b, the underlying scalar comparison operator is applied to the lane values. The resulting boolean is placed into the vector mask result in the same lane. The following pseudocode illustrates the behavior of this operation category:
    
     boolean scalar_binary_test_op(ETYPE s, ETYPE t);
     EVector a = ...;
     VectorSpecies<E> species = a.species();
     EVector b = ...;
     b.check(species);  // must have same species
     boolean[] mr = new boolean[a.length()];
     for (int i = 0; i < mr.length; i++) {
         mr[i] = scalar_binary_test_op(a.lane(i), b.lane(i));
     }
     VectorMask<E> m = VectorMask.fromArray(species, mr, 0);
     
  • Similarly to a binary comparison, a lane-wise unary test operation, such as m = v0.test(IS_FINITE), takes one input vector, distributing a scalar predicate (a test function) across the lanes, and produces a vector mask.

If a vector operation does not belong to one of the above categories then the method documentation explicitly specifies how it processes the lanes of input vectors, and where appropriate illustrates the behavior using pseudocode.
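
The conversion category is the least obvious of those listed above, because of its part parameter. The following plain-Java sketch models the int-to-double case on arrays standing in for vectors; it does not call this API. The names ConvertPartModel and convertI2D are hypothetical, the input length is assumed to be even, and the exception type thrown for an out-of-range part is this sketch's own choice.

```java
import java.util.Arrays;

// Scalar model of a shape-invariant int-to-double conversion with a part number.
// ConvertPartModel and convertI2D are hypothetical names; arrays stand in for vectors.
final class ConvertPartModel {
    /** Returns the block of converted lanes selected by part (0 or 1). */
    static double[] convertI2D(int[] lanes, int part) {
        int m = Double.SIZE / Integer.SIZE;      // expansion factor, here 2
        if (part < 0 || part >= m) {
            throw new IllegalArgumentException("part out of range: " + part);
        }
        int outLength = lanes.length / m;        // same shape, wider lanes, fewer lanes
        int offset = part * outLength;           // first input lane of the selected block
        double[] out = new double[outLength];
        for (int i = 0; i < outLength; i++) {
            out[i] = (double) lanes[offset + i]; // lane-wise scalar conversion
        }
        return out;
    }

    public static void main(String[] args) {
        int[] v0 = {10, 20, 30, 40};
        System.out.println(Arrays.toString(convertI2D(v0, 0))); // earlier half of the logical result
        System.out.println(Arrays.toString(convertI2D(v0, 1))); // later half of the logical result
    }
}
```

Taken together, the two parts cover the whole logical result, so no converted lane is lost when both are requested.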

Most lane-wise binary and comparison operations offer convenience overloadings which accept a scalar as the second input, in place of a vector. In this case the scalar value is promoted to a vector by broadcasting it into the same lane structure as the first input. For example, to multiply all lanes of a double vector by a scalar value 1.1, the expression v.mul(1.1) is easier to work with than an equivalent expression with an explicit broadcast operation, such as v.mul(v.broadcast(1.1)) or v.mul(DoubleVector.broadcast(v.species(), 1.1)). Unless otherwise specified the scalar variant always behaves as if each scalar value is first transformed to a vector of the same species as the first vector input, using the appropriate broadcast operation.
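
The "as if broadcast first" rule can be modeled in a few lines. This is a plain-Java sketch with arrays standing in for vectors, not a use of this API; BroadcastModel and its methods are hypothetical names.

```java
import java.util.Arrays;

// Scalar model of the scalar-broadcast overloading rule.
// BroadcastModel, broadcast, and mul are hypothetical names; arrays stand in for vectors.
final class BroadcastModel {
    /** Fills every lane of a new "vector" of the given length with the scalar s. */
    static double[] broadcast(int vlength, double s) {
        double[] r = new double[vlength];
        Arrays.fill(r, s);
        return r;
    }

    /** Lane-wise multiply of two inputs that must have the same length. */
    static double[] mul(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("inputs must have the same lane count");
        }
        double[] r = new double[a.length];
        for (int i = 0; i < r.length; i++) {
            r[i] = a[i] * b[i];
        }
        return r;
    }

    /** Scalar variant: defined as broadcast followed by the vector variant. */
    static double[] mul(double[] a, double s) {
        return mul(a, broadcast(a.length, s));
    }

    public static void main(String[] args) {
        double[] v = {1.0, 2.0, 4.0, 8.0};
        System.out.println(Arrays.toString(mul(v, 0.5)));
    }
}
```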

Masked operations

Many vector operations accept an optional mask argument, selecting which lanes participate in the underlying scalar operator. If present, the mask argument appears at the end of the method argument list.

Each lane of the mask argument is a boolean which is either in the set or unset state. For lanes where the mask argument is unset, the underlying scalar operator is suppressed. In this way, masks allow vector operations to emulate scalar control flow operations, without losing SIMD parallelism, except where the mask lane is unset.

An operation suppressed by a mask will never cause an exception or side effect of any sort, even if the underlying scalar operator can potentially do so. For example, an unset lane that seems to access an out of bounds array element or divide an integral value by zero will simply be ignored. Values in suppressed lanes never participate or appear in the result of the overall operation.

Result lanes corresponding to a suppressed operation will be filled with a default value which depends on the specific operation, as follows:

  • If the masked operation is a unary, binary, or n-ary arithmetic or logical operation, suppressed lanes are filled from the first vector operand (i.e., the vector receiving the method call), as if by a blend.
  • If the masked operation is a memory load or a slice() from another vector, suppressed lanes are not loaded, and are filled with the default value for the ETYPE, which in every case consists of all zero bits. An unset lane can never cause an exception, even if the hypothetical corresponding memory location does not exist (because it is out of an array's index range).
  • If the operation is a cross-lane operation with an operand which supplies lane indexes (of type VectorShuffle or Vector), suppressed lanes are not computed, and are filled with the zero default value. Normally, invalid lane indexes elicit an IndexOutOfBoundsException, but if a lane is unset, the zero value is quietly substituted, regardless of the index. This rule is similar to the previous rule, for masked memory loads.
  • If the masked operation is a memory store or an unslice() into another vector, suppressed lanes are not stored, and the corresponding memory or vector locations (if any) are unchanged.

    (Note: Memory effects such as race conditions never occur for suppressed lanes. That is, implementations will not secretly re-write the existing value for unset lanes. In the Java Memory Model, reassigning a memory variable to its current value is not a no-op; it may quietly undo a racing store from another thread.)

  • If the masked operation is a reduction, suppressed lanes are ignored in the reduction. If all lanes are suppressed, a suitable neutral value is returned, depending on the specific reduction operation, and documented by the masked variant of that method. (This means that users can obtain the neutral value programmatically by executing the reduction on a dummy vector with an all-unset mask.)
  • If the masked operation is a comparison operation, suppressed output lanes in the resulting mask are themselves unset, as if the suppressed comparison operation returned false regardless of the suppressed input values. In effect, it is as if the comparison operation were performed unmasked, and then the result intersected with the controlling mask.
  • In other cases, such as masked cross-lane movements, the specific effects of masking are documented by the masked variant of the method.
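
Two of the rules above, for masked loads and masked reductions, can be modeled as follows. This is a plain-Java sketch with arrays standing in for vectors and masks; it does not call this API, and MaskedOpsModel and its methods are hypothetical names. The reduction shown is addition, whose neutral value is assumed to be zero.

```java
import java.util.Arrays;

// Scalar model of masked loads and masked reductions.
// MaskedOpsModel and its methods are hypothetical names; arrays stand in for vectors and masks.
final class MaskedOpsModel {
    /** Mask whose lane i is set exactly when offset + i is a valid index below limit. */
    static boolean[] inRange(int vlength, int offset, int limit) {
        boolean[] m = new boolean[vlength];
        for (int i = 0; i < vlength; i++) {
            m[i] = offset + i >= 0 && offset + i < limit;
        }
        return m;
    }

    /** Masked load: unset lanes are never read from src and are left as zero. */
    static int[] maskedLoad(int[] src, int offset, boolean[] mask) {
        int[] lanes = new int[mask.length];      // Java zero-fills new arrays
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                lanes[i] = src[offset + i];      // only set lanes touch memory
            }
        }
        return lanes;
    }

    /** Masked ADD reduction: unset lanes are ignored; all unset yields the neutral value 0. */
    static int maskedReduceAdd(int[] lanes, boolean[] mask) {
        int r = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                r += lanes[i];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6};
        boolean[] tail = inRange(4, 4, data.length);   // only two lanes fall inside the array
        int[] v = maskedLoad(data, 4, tail);           // no exception for the two unset lanes
        System.out.println(Arrays.toString(v));
        System.out.println(maskedReduceAdd(v, tail));
    }
}
```

The tail mask in main is one common reason to use masked loads: the last chunk of an array can be processed with the same code as the full chunks.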

As an example, a masked binary operation on two input vectors a and b suppresses the binary operation for lanes where the mask is unset, and retains the original lane value from a. The following pseudocode illustrates this behavior:


 ETYPE scalar_binary_op(ETYPE s, ETYPE t);
 EVector a = ...;
 VectorSpecies<E> species = a.species();
 EVector b = ...;
 b.check(species);  // must have same species
 VectorMask<E> m = ...;
 m.check(species);  // must have same species
 ETYPE[] ar = new ETYPE[a.length()];
 for (int i = 0; i < ar.length; i++) {
     if (m.laneIsSet(i)) {
         ar[i] = scalar_binary_op(a.lane(i), b.lane(i));
     } else {
         ar[i] = a.lane(i);  // from first input
     }
 }
 EVector r = EVector.fromArray(species, ar, 0);
 

Lane order and byte order

The number of lane values stored in a given vector is referred to as its vector length or VLENGTH. It is useful to consider vector lanes as ordered sequentially from first to last, with the first lane numbered 0, the next lane numbered 1, and so on to the last lane numbered VLENGTH-1. This is a temporal order, where lower-numbered lanes are considered earlier than higher-numbered (later) lanes. This API uses these terms in preference to spatial terms such as "left", "right", "high", and "low".

Temporal terminology works well for vectors because they (usually) represent small fixed-sized segments in a long sequence of workload elements, where the workload is conceptually traversed in time order from beginning to end. (This is a mental model: it does not exclude multicore divide-and-conquer techniques.) Thus, when a scalar loop is transformed into a vector loop, adjacent scalar items (one earlier, one later) in the workload end up as adjacent lanes in a single vector (again, one earlier, one later). At a vector boundary, the last lane item in the earlier vector is adjacent to (and just before) the first lane item in the immediately following vector.

Vectors are also sometimes thought of in spatial terms, where the first lane is placed at an edge of some virtual paper, and subsequent lanes are presented in order next to it. When using spatial terms, all directions are equally plausible: Some vector notations present lanes from left to right, and others from right to left; still others present from top to bottom or vice versa. Using the language of time (before, after, first, last) instead of space (left, right, high, low) is often more likely to avoid misunderstandings.

A second reason to prefer temporal to spatial language about vector lanes is the fact that the terms "left", "right", "high" and "low" are widely used to describe the relations between bits in scalar values. The leftmost or highest bit in a given type is likely to be a sign bit, while the rightmost or lowest bit is likely to be the arithmetically least significant, and so on. Applying these terms to vector lanes risks confusion, however, because it is relatively rare to find algorithms where, given two adjacent vector lanes, one lane is somehow more arithmetically significant than its neighbor, and even in those cases, there is no general way to know which neighbor is the more significant.

Putting the terms together, we view the information structure of a vector as a temporal sequence of lanes ("first", "next", "earlier", "later", "last", etc.) of bit-strings which are internally ordered spatially (either "low" to "high" or "right" to "left"). The primitive values in the lanes are decoded from these bit-strings, in the usual way. Most vector operations, like most Java scalar operators, treat primitive values as atomic values, but some operations reveal the internal bit-string structure.

When a vector is loaded from or stored into memory, the order of vector lanes is always consistent with the inherent ordering of the memory container. This is true whether or not individual lane elements are subject to "byte swapping" due to details of byte order. Thus, while the scalar lane elements of a vector might be "byte swapped", the lanes themselves are never reordered, except by an explicit method call that performs cross-lane reordering.

When vector lane values are stored to Java variables of the same type, byte swapping is performed if and only if the implementation of the vector hardware requires such swapping. It is therefore unconditional and invisible.

As a useful fiction, this API presents a consistent illusion that vector lane bytes are composed into larger lane scalars in little endian order. This means that storing a vector into a Java byte array will reveal the successive bytes of the vector lane values in little-endian order on all platforms, regardless of native memory order, and also regardless of byte order (if any) within vector unit registers.

This hypothetical little-endian ordering also appears when a reinterpretation cast is applied in such a way that lane boundaries are discarded and redrawn differently, while maintaining vector bits unchanged. In such an operation, two adjacent lanes will contribute bytes to a single new lane (or vice versa), and the sequential order of the two lanes will determine the arithmetic order of the bytes in the single lane. In this case, the little-endian convention provides portable results, so that on all platforms earlier lanes tend to contribute lower (rightward) bits, and later lanes tend to contribute higher (leftward) bits. The reinterpretation casts between ByteVectors and the other non-byte vectors use this convention to clarify their portable semantics.
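
The effect of the little-endian convention on a reinterpretation can be reproduced with java.nio.ByteBuffer. The sketch below is a plain-Java model, not a use of this API: it writes int lanes as little-endian bytes and reads the same bytes back as long lanes. The names LittleEndianReinterpretModel and intsAsLongs are hypothetical, and the input is assumed to hold an even number of lanes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Scalar model of a reinterpretation that redraws lane boundaries over unchanged bits.
// LittleEndianReinterpretModel and intsAsLongs are hypothetical names.
final class LittleEndianReinterpretModel {
    /** Views each adjacent pair of int lanes as one long lane, in little-endian order. */
    static long[] intsAsLongs(int[] lanes) {
        ByteBuffer bytes = ByteBuffer.allocate(lanes.length * Integer.BYTES)
                                     .order(ByteOrder.LITTLE_ENDIAN);
        for (int lane : lanes) {
            bytes.putInt(lane);                  // earlier lanes occupy earlier bytes
        }
        bytes.flip();
        long[] out = new long[lanes.length / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = bytes.getLong();            // same bytes, wider lane boundaries
        }
        return out;
    }

    public static void main(String[] args) {
        long[] r = intsAsLongs(new int[] {0x11223344, 0x55667788});
        // The earlier lane supplies the lower 32 bits, the later lane the upper 32 bits.
        System.out.println(Long.toHexString(r[0]));   // prints 5566778811223344
    }
}
```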

The little-endian fiction for relating lane order to per-lane byte order is slightly preferable to an equivalent big-endian fiction, because some related formulas are much simpler, specifically those which renumber bytes after lane structure changes. The earliest byte is invariantly earliest across all lane structure changes, but only if little-endian conventions are used. The root cause of this is that bytes in scalars are numbered from the least significant (rightmost) to the most significant (leftmost), and almost never vice-versa. If we habitually numbered sign bits as zero (as on some computers) then this API would reach for big-endian fictions to create unified addressing of vector bytes.

Memory operations

As was already mentioned, vectors can be loaded from memory and stored back. An optional mask can control which individual memory locations are read from or written to. The shape of a vector determines how much memory it will occupy. An implementation typically has the property, in the absence of masking, that lanes are stored as a dense sequence of back-to-back values in memory, the same as a dense (gap-free) series of single scalar values in an array of the scalar type. In such cases memory order corresponds exactly to lane order. The first vector lane value occupies the first position in memory, and so on, up to the length of the vector. Further, the memory order of stored vector lanes corresponds to increasing index values in a Java array or in a MemorySegment.

Byte order for lane storage is chosen such that the stored vector values can be read or written as single primitive values, within the array or segment that holds the vector, producing the same values as the lane-wise values within the vector. This fact is independent of the convenient fiction that lane values inside of vectors are stored in little-endian order.

For example, FloatVector.fromArray(fsp,fa,i) creates and returns a float vector of some particular species fsp, with elements loaded from some float array fa. The first lane is loaded from fa[i] and the last lane is loaded from fa[i+VL-1], where VL is the length of the vector as derived from the species fsp. Then, fv=fv.add(fv2) will produce another float vector of that species fsp, given a vector fv2 of the same species fsp. Next, mnz=fv.compare(NE, 0.0f) tests which lanes of the result are non-zero, yielding a mask mnz. The non-zero lanes (and only those lanes) can then be stored back into the original array elements using the statement fv.intoArray(fa,i,mnz).
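
The sequence just described (load, add, compare, masked store) can be followed step by step in a scalar model. The sketch below is plain Java with arrays standing in for vectors and masks; it does not call this API, and MaskedStoreModel and addAndStoreNonZero are hypothetical names.

```java
import java.util.Arrays;

// Scalar model of: load VL lanes, add a second vector, test for non-zero, masked store.
// MaskedStoreModel and addAndStoreNonZero are hypothetical names.
final class MaskedStoreModel {
    static void addAndStoreNonZero(float[] fa, int i, float[] fv2) {
        int vl = fv2.length;                     // VL, the common lane count
        float[] fv = new float[vl];
        for (int k = 0; k < vl; k++) {
            fv[k] = fa[i + k];                   // models fromArray: lane k comes from fa[i+k]
        }
        boolean[] mnz = new boolean[vl];
        for (int k = 0; k < vl; k++) {
            fv[k] = fv[k] + fv2[k];              // models the lane-wise add
            mnz[k] = fv[k] != 0.0f;              // models compare(NE, 0.0f)
        }
        for (int k = 0; k < vl; k++) {
            if (mnz[k]) {
                fa[i + k] = fv[k];               // models the masked store: unset lanes untouched
            }
        }
    }

    public static void main(String[] args) {
        float[] fa = {1f, 2f, 3f, 4f};
        addAndStoreNonZero(fa, 0, new float[] {-1f, 1f, -3f, 1f});
        System.out.println(Arrays.toString(fa));
    }
}
```

In main, lanes 0 and 2 sum to zero, so fa[0] and fa[2] keep their original values while fa[1] and fa[3] receive the sums.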

Expansions, contractions, and partial results

Since vectors are fixed in size, occasions often arise where the logical result of an operation is not the same as the physical size of the proposed output vector. To encourage user code that is as portable and predictable as possible, this API has a systematic approach to the design of such resizing vector operations.

As a basic principle, lane-wise operations are length-invariant, unless clearly marked otherwise. Length-invariance simply means that if VLENGTH lanes go into an operation, the same number of lanes come out, with nothing discarded and no extra padding.

As a second principle, sometimes in tension with the first, lane-wise operations are also shape-invariant, unless clearly marked otherwise. Shape-invariance means that VSHAPE is constant for typical computations. Keeping the same shape throughout a computation helps ensure that scarce vector resources are efficiently used. (On some hardware platforms shape changes could cause unwanted effects like extra data movement instructions, round trips through memory, or pipeline bubbles.)

Tension between these principles arises when an operation produces a logical result that is too large for the required output VSHAPE. In other cases, when a logical result is smaller than the capacity of the output VSHAPE, the positioning of the logical result is open to question, since the physical output vector must contain a mix of logical result and padding.

In the first case, of a too-large logical result being crammed into a too-small output VSHAPE, we say that data has expanded. In other words, an expansion operation has caused the output shape to overflow. Symmetrically, in the second case of a small logical result fitting into a roomy output VSHAPE, the data has contracted, and the contraction operation has required the output shape to pad itself with extra zero lanes.

In both cases we can speak of a parameter M which measures the expansion ratio or contraction ratio between the logical result size (in bits) and the bit-size of the actual output shape. When vector shapes are changed, and lane sizes are not, M is just the integral ratio of the output shape to the logical result. (With the possible exception of the maximum shape, all vector sizes are powers of two, and so the ratio M is always an integer. In the hypothetical case of a non-integral ratio, the value M would be rounded up to the next integer, and then the same general considerations would apply.)

If the logical result is larger than the physical output shape, such a shape change must inevitably drop result lanes (all but 1/M of the logical result). If the logical size is smaller than the output, the shape change must introduce zero-filled lanes of padding (all but 1/M of the physical output). The first case, with dropped lanes, is an expansion, while the second, with padding lanes added, is a contraction.

Similarly, consider a lane-wise conversion operation which leaves the shape invariant but changes the lane size by a ratio of M. If the logical result is larger than the output (or input), this conversion must reduce the VLENGTH of the output by a factor of M, dropping all but 1/M of the logical result lanes. As before, the dropping of lanes is the hallmark of an expansion. A lane-wise operation which contracts lane size by a ratio of M must increase the VLENGTH by the same factor M, filling the extra lanes with a zero padding value; because padding must be added this is a contraction.

It is also possible (though somewhat confusing) to change both lane size and container size in one operation which performs both lane conversion and reshaping. If this is done, the same rules apply, but the logical result size is the product of the input size times any expansion or contraction ratio arising from the change in lane size.

For completeness, we can also speak of in-place operations for the frequent case when resizing does not occur. With an in-place operation, the data is simply copied from logical output to its physical container with no truncation or padding. The ratio parameter M in this case is unity.

Note that the classification of contraction vs. expansion depends on the relative sizes of the logical result and the physical output container. The size of the input container may be larger or smaller than either of the other two values, without changing the classification. For example, a conversion from a 128-bit shape to a 256-bit shape will be a contraction in many cases, but it would be an expansion if it were combined with a conversion from byte to long, since in that case the logical result would be 1024 bits in size. This example also illustrates that a logical result does not need to correspond to any particular platform-supported vector shape.
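The classification rule and the 128-bit to 256-bit example above can be checked with a little arithmetic. The following is a minimal sketch, assuming sizes are given in bits and are powers of two; the class ResizeRatio and its methods are hypothetical helpers, not part of the API.

```java
// Hypothetical helper that classifies a resizing operation by comparing
// the logical result size with the physical output shape size.
final class ResizeRatio {
    enum Kind { EXPANSION, CONTRACTION, IN_PLACE }

    // Logical result size in bits: output element size times input VLENGTH.
    static int logicalBits(int inputShapeBits, int inputElemBits, int outputElemBits) {
        int vlength = inputShapeBits / inputElemBits;
        return outputElemBits * vlength;
    }

    static Kind classify(int logicalBits, int outputShapeBits) {
        if (logicalBits > outputShapeBits) return Kind.EXPANSION;   // lanes must be dropped
        if (logicalBits < outputShapeBits) return Kind.CONTRACTION; // padding must be added
        return Kind.IN_PLACE;
    }

    // Ratio M between the larger and the smaller size, rounded up.
    static int ratio(int logicalBits, int outputShapeBits) {
        int big = Math.max(logicalBits, outputShapeBits);
        int small = Math.min(logicalBits, outputShapeBits);
        return (big + small - 1) / small;
    }

    public static void main(String[] args) {
        // 128-bit vector of bytes converted to longs, output shape 256 bits.
        int logical = logicalBits(128, 8, 64);
        System.out.println(logical + " bits, " + classify(logical, 256)
                + ", M=" + ratio(logical, 256));
    }
}
```

The input container size enters only through VLENGTH; the classification itself compares the logical result with the output shape and nothing else.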

Although lane-wise masked operations can be viewed as producing partial operations, they are not classified (in this API) as expansions or contractions. A masked load from an array surely produces a partial vector, but there is no meaningful "logical output vector" that this partial result was contracted from.

Some care is required with these terms, because it is the data, not the container size, that is expanding or contracting, relative to the size of its output container. Thus, resizing a 128-bit input into a 512-bit vector has the effect of a contraction. Though the 128 bits of payload hasn't changed in size, we can say it "looks smaller" in its new 512-bit home, and this will capture the practical details of the situation.

If a vector method might expand its data, it accepts an extra int parameter called part, or the "part number". The part number must be in the range [0..M-1], where M is the expansion ratio. The part number selects one of M contiguous disjoint equally-sized blocks of lanes from the logical result and fills the physical output vector with this block of lanes.

Specifically, the lanes selected from the logical result of an expansion are numbered in the range [R..R+L-1], where L is the VLENGTH of the physical output vector, and the origin of the block, R, is part*L.
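The block selection R = part*L can be illustrated with a scalar model of a shape-invariant byte-to-int conversion, which expands by M=4. This is a sketch under stated assumptions: arrays stand in for vectors, the input length is a multiple of 4, and the class and method names are hypothetical rather than API methods.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical scalar model of an expanding, shape-invariant conversion.
final class ExpandingConvertModel {
    // The logical result has in.length int lanes; the physical output keeps
    // the input shape, so it has only L = in.length / M int lanes.
    static int[] convertBytesToInts(byte[] in, int part) {
        int m = Integer.BYTES / Byte.BYTES; // expansion ratio M = 4
        int l = in.length / m;              // VLENGTH of the physical output
        Objects.checkIndex(part, m);        // part must be in [0..M-1]
        int r = part * l;                   // origin R of the selected block
        int[] out = new int[l];
        for (int k = 0; k < l; k++) {
            out[k] = in[r + k];             // logical lanes R .. R+L-1, widened
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] in = new byte[16];
        for (int k = 0; k < in.length; k++) in[k] = (byte) k;
        for (int part = 0; part < 4; part++) {
            System.out.println(part + ": " + Arrays.toString(convertBytesToInts(in, part)));
        }
    }
}
```

Calling the model once for each part number in [0..M-1] recovers every lane of the logical result, M output vectors in all.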

A similar convention applies to any vector method that might contract its data. Such a method also accepts an extra part number parameter (again called part) which steers the contracted data lanes into one of M contiguous disjoint equally-sized blocks of lanes in the physical output vector. The remaining lanes are filled with zero, or as specified by the method.

Specifically, the data is steered into the lanes numbered in the range [R..R+L-1], where L is the VLENGTH of the logical result vector, and the origin of the block, R, is again a multiple of L selected by the part number, specifically |part|*L.

In the case of a contraction, the part number must be in the non-positive range [-M+1..0]. This convention is adopted because some methods can perform both expansions and contractions, in a data-dependent manner, and the extra sign on the part number serves as an error check. If a vector method takes a part number and is invoked to perform an in-place operation (neither contracting nor expanding), the part parameter must be exactly zero. Part numbers outside the allowed ranges will elicit an indexing exception. Note that in all cases a zero part number is valid, and corresponds to an operation which preserves as many lanes as possible from the beginning of the logical result, and places them into the beginning of the physical output container. This is often a desirable default, so a part number of zero is safe in all cases and useful in most cases.
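The contraction convention can be modelled in the same way, here with a shape-invariant int-to-byte conversion that contracts by M=4. Again this is only a sketch: arrays stand in for vectors, and the class ContractingConvertModel and its method are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical scalar model of a contracting, shape-invariant conversion.
final class ContractingConvertModel {
    // The logical result has L = in.length byte lanes; the physical output
    // keeps the input shape, so it has M * L byte lanes, mostly padding.
    static byte[] convertIntsToBytes(int[] in, int part) {
        int m = Integer.BYTES / Byte.BYTES; // contraction ratio M = 4
        int l = in.length;                  // VLENGTH of the logical result
        if (part > 0 || part <= -m) {       // part must be in [-M+1..0]
            throw new IndexOutOfBoundsException(
                    "part " + part + " not in [" + (-m + 1) + "..0]");
        }
        int r = Math.abs(part) * l;         // origin R = |part| * L
        byte[] out = new byte[m * l];       // remaining lanes stay zero
        for (int k = 0; k < l; k++) {
            out[r + k] = (byte) in[k];      // narrowed lanes land in R .. R+L-1
        }
        return out;
    }

    public static void main(String[] args) {
        int[] in = {1, 2, 3, 4};
        System.out.println(Arrays.toString(convertIntsToBytes(in, 0)));
        System.out.println(Arrays.toString(convertIntsToBytes(in, -1)));
    }
}
```

A positive part number is rejected here even when its magnitude is small, mirroring the sign check described above.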

The various resizing operations of this API contract or expand their data as follows:

  • Vector.convert() will expand (respectively, contract) its operand by ratio M if the element size of its output is larger (respectively, smaller) by a factor of M. If the element sizes of input and output are the same, then convert() is an in-place operation.
  • Vector.convertShape() will expand (respectively, contract) its operand by ratio M if the bit-size of its logical result is larger (respectively, smaller) than the bit-size of its output shape. The size of the logical result is defined as the element size of the output, times the VLENGTH of its input. Depending on the ratio of the changed lane sizes, the logical size may be (in various cases) either larger or smaller than the input vector, independently of whether the operation is an expansion or contraction.
  • Since Vector.castShape() is a convenience method for convertShape(), its classification as an expansion or contraction is the same as for convertShape().
  • Vector.reinterpretShape() is an expansion (respectively, contraction) by ratio M if the vector bit-size of its input is crammed into a smaller (respectively, dropped into a larger) output container by a factor of M. Otherwise it is an in-place operation. Since this method is a reinterpretation cast that can erase and redraw lane boundaries as well as modify shape, the input vector's lane size and lane count are irrelevant to its classification as expanding or contracting.
  • The unslice() methods expand by a ratio of M=2, because the single input slice is positioned and inserted somewhere within two consecutive background vectors. The part number selects the first or second background vector, as updated by the inserted slice. Note that the corresponding slice() methods, although inverse to the unslice() methods, do not contract their data and thus require no part number. This is because slice() delivers a slice of exactly VLENGTH lanes extracted from two input vectors.

The method partLimit() on VectorSpecies can be used, before any expanding or contracting operation is performed, to query the limiting value on a part parameter for a proposed expansion or contraction. The value returned from partLimit() is positive for expansions, negative for contractions, and zero for in-place operations. Its absolute value is the parameter M, and so it serves as an exclusive limit on valid part number arguments for the relevant methods. Thus, for expansions, the partLimit() value M is the exclusive upper limit for part numbers, while for contractions the partLimit() value -M is the exclusive lower limit.
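The sign convention for the limit, and the way it bounds the part number, can be sketched as follows. This models only the convention described here: the class PartLimitModel and its bit-size parameters are hypothetical, the real partLimit() method takes different arguments, and sizes are assumed to be powers of two so that the divisions are exact.

```java
// Hypothetical model of the signed limit convention for part numbers.
final class PartLimitModel {
    // +M for an expansion, -M for a contraction, 0 for an in-place operation.
    static int partLimit(int logicalBits, int outputShapeBits) {
        if (logicalBits > outputShapeBits) return logicalBits / outputShapeBits;
        if (logicalBits < outputShapeBits) return -(outputShapeBits / logicalBits);
        return 0;
    }

    // The limit is exclusive, and zero is always a valid part number.
    static boolean isValidPart(int part, int limit) {
        if (limit > 0) return part >= 0 && part < limit; // [0..M-1]
        if (limit < 0) return part <= 0 && part > limit; // [-M+1..0]
        return part == 0;                                // in-place
    }

    public static void main(String[] args) {
        int limit = partLimit(1024, 256);
        for (int part = -1; part <= limit; part++) {
            System.out.println("part " + part + " valid: " + isValidPart(part, limit));
        }
    }
}
```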

Moving data across lane boundaries

The cross-lane methods which do not redraw lanes or change species are more regularly structured and easier to reason about. These operations are:
  • The slice() family of methods, which extract a contiguous slice of VLENGTH fields from a given origin point within a concatenated pair of vectors.
  • The unslice() family of methods, which insert a contiguous slice of VLENGTH fields into a concatenated pair of vectors at a given origin point.
  • The rearrange() family of methods, which select an arbitrary set of VLENGTH lanes from one or two input vectors, and assemble them in an arbitrary order. The selection and order of lanes is controlled by a VectorShuffle object, which acts as a routing table mapping source lanes to destination lanes. A VectorShuffle can encode a mathematical permutation as well as many other patterns of data movement.
  • The compress(VectorMask) and expand(VectorMask) methods, which select up to VLENGTH lanes from an input vector, and assemble them in lane order. The selection of lanes is controlled by a VectorMask, with set lane elements mapping, by compression or expansion in lane order, source lanes to destination lanes.
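The data movement performed by these families can be pictured with scalar models over int arrays. The sketch below makes several assumptions: arrays stand in for vectors, an int[] of source indexes stands in for a VectorShuffle, a boolean[] stands in for a VectorMask, lanes not written are left as zero, and the slice origin lies in [0..VLENGTH]. The class CrossLaneModel is hypothetical, and the two-vector form of rearrange and the unslice() family are not modelled.

```java
import java.util.Arrays;

// Hypothetical scalar models of cross-lane data movement.
final class CrossLaneModel {
    // VLENGTH lanes starting at origin within the concatenation v1 ++ v2.
    static int[] slice(int origin, int[] v1, int[] v2) {
        int vl = v1.length;
        int[] both = Arrays.copyOf(v1, 2 * vl);
        System.arraycopy(v2, 0, both, vl, vl);
        return Arrays.copyOfRange(both, origin, origin + vl);
    }

    // Output lane k takes the input lane named by shuffle[k].
    static int[] rearrange(int[] v, int[] shuffle) {
        int[] out = new int[v.length];
        for (int k = 0; k < out.length; k++) out[k] = v[shuffle[k]];
        return out;
    }

    // Selected lanes are packed, in lane order, into the lowest output lanes.
    static int[] compress(int[] v, boolean[] m) {
        int[] out = new int[v.length];
        int next = 0;
        for (int k = 0; k < v.length; k++) if (m[k]) out[next++] = v[k];
        return out;
    }

    // The lowest input lanes are spread, in lane order, to the set mask lanes.
    static int[] expand(int[] v, boolean[] m) {
        int[] out = new int[v.length];
        int next = 0;
        for (int k = 0; k < v.length; k++) if (m[k]) out[k] = v[next++];
        return out;
    }

    public static void main(String[] args) {
        int[] v = {10, 20, 30, 40};
        boolean[] m = {false, true, false, true};
        System.out.println(Arrays.toString(slice(1, new int[] {0, 1, 2, 3}, new int[] {4, 5, 6, 7})));
        System.out.println(Arrays.toString(rearrange(v, new int[] {3, 2, 1, 0})));
        System.out.println(Arrays.toString(compress(v, m)));
        System.out.println(Arrays.toString(expand(compress(v, m), m)));
    }
}
```

In every case exactly VLENGTH lanes come out, which is why none of these models needs a part number.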

Some vector operations are not lane-wise, but rather move data across lane boundaries. Such operations are typically rare in SIMD code, though they are sometimes necessary for specific algorithms that manipulate data formats at a low level, and/or require SIMD data to move in complex local patterns. (Local movement in a small window of a large array of data is relatively unusual, although some highly patterned algorithms call for it.) In this API such methods are always clearly recognizable, so that simpler lane-wise reasoning can be confidently applied to the rest of the code.

In some cases, vector lane boundaries are discarded and "redrawn from scratch", so that data in a given input lane might appear (in several parts) distributed through several output lanes, or (conversely) data from several input lanes might be consolidated into a single output lane. The fundamental method which can redraw lane boundaries is reinterpretShape(). Built on top of this method, certain convenience methods such as reinterpretAsBytes() or reinterpretAsInts() will (potentially) redraw lane boundaries, while retaining the same overall vector shape.
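Redrawing lane boundaries can be pictured by viewing the same bytes through a ByteBuffer. This is a sketch only: arrays stand in for vectors, the class ReinterpretModel is hypothetical, and little-endian byte order is an assumption of the model rather than something stated in this section.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical model of reinterpretation: same bits, new lane boundaries.
final class ReinterpretModel {
    // Four consecutive byte lanes are consolidated into each int lane.
    static int[] reinterpretAsInts(byte[] lanes) {
        ByteBuffer bb = ByteBuffer.wrap(lanes).order(ByteOrder.LITTLE_ENDIAN); // assumed order
        int[] out = new int[lanes.length / Integer.BYTES];
        for (int k = 0; k < out.length; k++) {
            out[k] = bb.getInt(k * Integer.BYTES);
        }
        return out;
    }

    // Each int lane is distributed through four consecutive byte lanes.
    static byte[] reinterpretAsBytes(int[] lanes) {
        ByteBuffer bb = ByteBuffer.allocate(lanes.length * Integer.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN); // assumed order
        for (int v : lanes) {
            bb.putInt(v);
        }
        return bb.array();
    }

    public static void main(String[] args) {
        byte[] bytes = {1, 0, 0, 0, 0, 1, 0, 0};
        System.out.println(Arrays.toString(reinterpretAsInts(bytes)));
        System.out.println(Arrays.toString(reinterpretAsBytes(new int[] {0x04030201})));
    }
}
```

The total bit-size is unchanged in both directions; only the lane count and lane size differ, which is why the shape is retained.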

Operations which produce or consume a scalar result can be viewed as very simple cross-lane operations. Methods in the reduceLanes() family fold together all lanes (or mask-selected lanes) of a vector and return a single result. As an inverse, the broadcast family of methods can be thought of as crossing lanes in the other direction, from a scalar to all lanes of the output vector. Single-lane access methods such as lane(I) or withLane(I,E) might also be regarded as very simple cross-lane operations.
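These scalar-producing and scalar-consuming operations are simple enough to model directly. The sketch below assumes an additive reduction and uses arrays in place of vectors and masks; the class ScalarCrossLaneModel and its method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical scalar models of reduction, broadcast, and single-lane update.
final class ScalarCrossLaneModel {
    // Folds the mask-selected lanes together with addition.
    static int reduceAdd(int[] v, boolean[] m) {
        int sum = 0;
        for (int k = 0; k < v.length; k++) {
            if (m[k]) sum += v[k];
        }
        return sum;
    }

    // Copies one scalar into every lane of a new vector.
    static int[] broadcast(int e, int vlength) {
        int[] out = new int[vlength];
        Arrays.fill(out, e);
        return out;
    }

    // Returns a new vector with lane i replaced; the input is not modified.
    static int[] withLane(int[] v, int i, int e) {
        int[] out = v.clone();
        out[i] = e;
        return out;
    }

    public static void main(String[] args) {
        int[] v = {1, 2, 3, 4};
        System.out.println(reduceAdd(v, new boolean[] {true, false, true, true}));
        System.out.println(Arrays.toString(broadcast(7, 4)));
        System.out.println(Arrays.toString(withLane(v, 2, 9)) + " from " + Arrays.toString(v));
    }
}
```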

Likewise, a method which moves a non-byte vector to or from a byte array could be viewed as a cross-lane operation, because the vector lanes must be distributed into separate bytes, or (in the other direction) consolidated from array bytes.

Implementation Note:

Hardware platform dependencies and limitations

The Vector API is designed to accelerate computations in the style of Single Instruction Multiple Data (SIMD), using available hardware resources such as vector hardware registers and vector hardware instructions. The API is designed to make effective use of multiple SIMD hardware platforms.

This API will also work correctly even on Java platforms which do not include specialized hardware support for SIMD computations. The Vector API is not likely to provide any special performance benefit on such platforms.

Currently the implementation is optimized to work best on:

  • Intel x64 platforms supporting at least AVX2 up to AVX-512. Masking using mask registers and mask-accepting hardware instructions on AVX-512 is not currently supported.
  • ARM AArch64 platforms supporting NEON. Although the API has been designed to ensure ARM SVE instructions can be supported (vector sizes between 128 and 2048 bits) there is currently no implementation of such instructions and the general masking capability.

The implementation currently supports masked lane-wise operations in a cross-platform manner by composing the unmasked lane-wise operation with blend as in the expression a.blend(a.lanewise(op, b), m), where a and b are vectors, op is the vector operation, and m is the mask.
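The composition a.blend(a.lanewise(op, b), m) can be traced with a scalar model. This sketch assumes int lanes, with arrays standing in for vectors and an IntBinaryOperator standing in for the vector operation op; the class MaskedLanewiseModel is hypothetical and says nothing about how the implementation is actually coded.

```java
import java.util.Arrays;
import java.util.function.IntBinaryOperator;

// Hypothetical scalar model of a masked lane-wise operation built from blend.
final class MaskedLanewiseModel {
    // Unmasked lane-wise operation: every lane is computed.
    static int[] lanewise(IntBinaryOperator op, int[] a, int[] b) {
        int[] out = new int[a.length];
        for (int k = 0; k < a.length; k++) out[k] = op.applyAsInt(a[k], b[k]);
        return out;
    }

    // Lanes where the mask is set come from v, all others from a.
    static int[] blend(int[] a, int[] v, boolean[] m) {
        int[] out = new int[a.length];
        for (int k = 0; k < a.length; k++) out[k] = m[k] ? v[k] : a[k];
        return out;
    }

    // Models a.blend(a.lanewise(op, b), m).
    static int[] maskedLanewise(IntBinaryOperator op, int[] a, int[] b, boolean[] m) {
        return blend(a, lanewise(op, a, b), m);
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 10, 10, 10};
        boolean[] m = {true, false, true, false};
        System.out.println(Arrays.toString(maskedLanewise(Integer::sum, a, b, m)));
    }
}
```

The unmasked operation is computed in every lane, and the mask only decides which results are kept; unset lanes retain the value from a.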

The implementation does not currently support optimal vectorized instructions for floating point transcendental functions (such as operators SIN and LOG).

No boxing of primitives

Although a vector type like Vector<Integer> may seem to work with boxed Integer values, the overheads associated with boxing are avoided by having each vector subtype work internally on lane values of the actual ETYPE, such as int.

Value-based classes and identity operations

Vector, along with all of its subtypes and many of its helper types like VectorMask and VectorShuffle, is a value-based class.

Once created, a vector is never mutated, not even if only a single lane is changed. A new vector is always created to hold a new configuration of lane values. The unavailability of mutative methods is a necessary consequence of suppressing the object identity of all vectors, as value-based classes.

With Vector, identity-sensitive operations such as == may yield unpredictable results, or reduced performance. Oddly enough, v.equals(w) is likely to be faster than v==w, since equals is not an identity-sensitive method. Also, these objects can be stored in locals and parameters and as static final constants, but storing them in other Java fields or in array elements, while semantically valid, may incur performance penalties.