I decided to write a blog on this as I was doing a problem on local judge and I decided to try to speed up my brute force code. However, it was quite difficult to find resources on SIMD vectorisation, so I decided to try to compile some of the resources I found together to hopefully allow more people to learn <s>to scam brute force solutions</s>↵
↵
Thanks to ~iloveioi,2022-01-02 and ~jamessngg,2022-01-02 for proof reading.↵
↵
Introduction↵
==================↵
↵
SIMD stands for single instruction, multiple data. SIMD allows us to give vector instructions which will allow the code to run faster. Vector instructions are instructions that handle short (length 2-16) vectors of integers / floats / characters in a parallel way by making use of the extra bits of space to do operations simultaneously.↵
↵
The most common form of vectorisation is making use of pragmas such as↵
↵
~~~~~↵
#pragma GCC optimize("O3,unroll-loops")↵
#pragma GCC target("avx2,bmi,bmi2,lzcnt,popcnt")↵
~~~~~↵
↵
This form of vectorisation is known as auto-vectorisation as the compiler vectorises the code for you. However, for more complicated examples, the compiler might be unable to detect what to vectorise automatically, so in this case we have to vectorise our code manually by using SIMD vectorisation.↵
↵
Syntax↵
==================↵
↵
The compilation of all the code given in the syntax section is given here↵
↵
<spoiler summary="Code">↵
↵
~~~~~↵
#include <nmmintrin.h>↵
↵
#pragma GCC target("avx2")↵
↵
int main() {↵
/****** LOADING ******/↵
↵
__m128i zero = _mm_setzero_si128(); // set everything to 0↵
__m128i eight = _mm_set1_epi32(8); // set the vector of 4 integers to be equal to 8↵
↵
__m128i pi = _mm_setr_epi32(1, 4, 1,33, 1, 4, 1); // NOTE: set and setr are opposites of each other↵
// mm_setr_epi32(3, 1, 4, 1) -> first value == 3, second value == 1, third value == 4, forth value == 1↵
// mm_set_epi32(3, 1, 4, 1) -> first value == 1, second value == 4, third value == 1, forth value == 3↵
↵
int arr[8] = {0, 1, 2, 3, 4, 5, 6, 7};↵
__m128i a0 = _mm_loadu_si128((__m128i*) &arr[0]); // [0, 1, 2, 3]↵
__m128i a4 = _mm_loadu_si128((__m128i*) &arr[4]); // [4, 5, 6, 7]↵
↵
__m128i a2 = _mm_loadu_si128((__m128i*) &arr[2]); // [2, 3, 4, 5]↵
// _mm_insert_epi32(a, i, j) changes j-th number of a to value i↵
a2 = _mm_insert_epi32(a2, 99, 1); // [2, 99, 4, 5]↵
↵
/****** ARITHMETIC ******/↵
↵
__m128i sum = _mm_add_epi32(a0, a4); // [4, 6, 8, 10] (sum i-th element of a0 with i-th element of a4)↵
__m128isbmx = _mm_submax_epi32(pi, a0); // [1, 3, -1, 0] (subtract4, 3] (maximum of the i-th element of a0 frompi and i-th element of pia0)↵
__m128imxsb = _mm_maxsub_epi32(pi, a0); // [1, 3, -1, 4, 3] (maximum of the0] (subtract i-th element of pi anda0 from i-th element of a0pi)↵
↵
__m128i mul = _mm_mul_epi32(sum, mx); // [12, 32] (4*3=12, 8*4=32) (more details below)↵
↵
/****** LOGICAL ARITHMETIC ******/↵
↵
__m128i three = _mm_set1_epi32(3);↵
__m128i mskl = _mm_cmplt_epi32(pi, three); // contains [0, 2^32-1, 0, 2^32-1] as only 1 < 3↵
__m128i mskg = _mm_cmpgt_epi32(pi, three); // contains [0, 0, 2^32-1, 0]↵
__m128i mske = _mm_cmpeq_epi32(pi, three); // contains [2^32-1, 0, 0, 0]↵
↵
__m128i mix = _mm_blendv_epi8(eight, pi, mskl); // contains [8, 1, 8, 1]↵
↵
/****** REORDERING ******/↵
↵
__m128i a = _mm_setr_epi32(1, 2, 3, 4);↵
a = _mm_shuffle_epi32(a, 39); // [4, 2, 3, 1]↵
↵
a = _mm_setr_epi32(1, 2, 3, 4);↵
a = _mm_slli_si128(a, 8); // [0, 0, 1, 2]↵
↵
a = _mm_setr_epi32(1, 2, 3, 4);↵
a = _mm_srli_si128(a, 4); // [2, 3, 4, 0]↵
↵
a = _mm_setr_epi32(1, 2, 3, 4);↵
__m128i b = _mm_setr_epi32(5, 6, 7, 8);↵
a = _mm_alignr_epi8(a, b, 8); // [7, 8, 1, 2]↵
↵
/****** EXTRACTING ******/↵
↵
int mxarr[4];↵
_mm_storeu_si128((__m128i*) mxarr, mx); // stores values of mx into mxarr↵
long long mularr[2];↵
_mm_storeu_si128((__m128i*) mularr, mul); // stores values of mul into mularr↵
↵
// mul = [12, 32]↵
long long mul0 = _mm_extract_epi64(mul, 0); // extract the 0-th element of mul (= 12)↵
// sm = [4, 6, 8, 10]↵
int sum0 = _mm_extract_epi32(sum, 2); // extract the 2-thnd element of sum (= 8)↵
}↵
~~~~~↵
↵
</spoiler>↵
↵
To make use of SIMD, we have to add the following at the top of the code.↵
↵
~~~~~↵
#include <nmmintrin.h>↵
↵
#pragma GCC target("avx2")↵
~~~~~↵
↵
Data types↵
------------------↵
↵
There are three 128 bit data types for integers, floats and doubles respectively.↵
↵
~~~~~↵
__m128i i; // stores 4 integers since each integer is 32 bit and 128/32=4↵
__m128 f; // stores 4 floats since each float is 32 bit and 128/32=4↵
__m128d d; // stores 2 doubles since each double is 64 bit and 128/64=2↵
~~~~~↵
↵
Note that `__m128i` can also be used to store 32 characters.↵
↵
Intrinsics↵
------------------↵
↵
In order to do operations on these data types, we will make use of SIMD intrinsics.↵
↵
All the intrinsic functions have the following form `_mm_instruction_datatype`, for example, `_mm_add_epi32(a, b)` is a function to add 32-bit integers from $a$ and $b$ together.↵
↵
Some examples of data-types are↵
↵
- `si128` — signed 128-bit integer↵
↵
- `epi8`, `epi32`, `epi64`, `epu8`, `epu32`, `epu64` — vector of signed 8-bit integers (character), signed 32-bit integers, signed 64-bit integers, together with the unsigned variations.↵
↵
- `ps` — 32-bit floats↵
↵
- `pd` — 64-bit doubles↵
↵
In the following examples, I will be focusing on signed 32-bit integers (i.e. `epi32`) as it is most commonly used.↵
↵
### Loading↵
↵
To make use of our `__m128i` data structure, we first have to load data into it.↵
↵
~~~~~↵
__m128i zero = _mm_setzero_si128(); // set everything to 0↵
__m128i eight = _mm_set1_epi32(8); // set the vector of 4 integers to be equal to 8↵
↵
__m128i pi = _mm_setr_epi32(3, 1, 4, 1, 3); // NOTE: set and setr are opposites of each other↵
// mm_setr_epi32(3, 1, 4, 1) -> first value == 3, second value == 1, third value == 4, forth value == 1↵
// mm_set_epi32(3, 1, 4, 1) -> first value == 1, second value == 4, third value == 1, forth value == 3↵
↵
int arr[8] = {0, 1, 2, 3, 4, 5, 6, 7};↵
__m128i a0 = _mm_loadu_si128((__m128i*) &arr[0]); // [0, 1, 2, 3]↵
__m128i a4 = _mm_loadu_si128((__m128i*) &arr[4]); // [4, 5, 6, 7]↵
↵
__m128i a2 = _mm_loadu_si128((__m128i*) &arr[2]); // [2, 3, 4, 5]↵
// _mm_insert_epi32(a, i, j) changes j-th number of a to value i↵
a2 = _mm_insert_epi32(a2, 99, 1); // [2, 99, 4, 5]↵
~~~~~↵
↵
### Arithmetic↵
↵
~~~~~↵
// Continue from previous code block↵
__m128i sum = _mm_add_epi32(a0, a4); // [4, 6, 8, 10] (sum i-th element of a0 with i-th element of a4)↵
__m128i mx = _mm_max_epi32(pi, a0); // [3, 1, 4, 3] (maximum of the i-th element of pi and i-th element of a0)↵
__m128i sb = _mm_sub_epi32(pi, a0); // [1, 3, -1, 0] (subtract i-th element of a0 from i-th element of pi)↵
↵
__m128i mxul = _mm_maxul_epi32(pi, a0sm, mx); // [3, 1, 4, 3] (maximum of the i-th element of pi and i-th element of a0)↵
12, 32] (4*3=12, 8*4=32) (more details below)↵
~~~~~↵
↵
Most of the simpler arithmetic operations work by doing the arithmetic operation on each index from 0 to 3. However, for `_mm_mul_epi32`, since the multiplication of two 32-bit integers will become 64-bit integers, we can only store two 64-bit integers in the result. Hence, `_mm_mul_epi32` works by multiplying the 0-th index and 2-nd index.↵
↵
### Logical Arithmetic↵
↵
~~~~~↵
// Continue from previous code block↵
__m128i three = _mm_set1_epi32(3);↵
__m128i muskl = _mm_mulcmplt_epi32(sum, mx); // [12, 32] (4*3=12, 8*4=32) (more details below)↵
~~~~~↵
↵
Most of the simpler arithmetic operations work by doing the arithmetic operation on each index from 0 to 3. However,pi, three); // contains [0, 2^32-1, 0, 2^32-1] as only 1 < 3↵
__m128i mskg = _mm_cmpgt_epi32(pi, three); // contains [0, 0, 2^32-1, 0]↵
__m128i mske = _mm_cmpeq_epi32(pi, three); // contains [2^32-1, 0, 0, 0]↵
↵
__m128i mix = _mm_blendv_epi8(eight, pi, mskl); // contains [8, 1, 8, 1] (more details below)↵
~~~~~↵
↵
Most comparison functions are in the form `_mm_mulcmpxx_epi32`, s. Since the multiplication of two 32-bit integers will become 64-bit integers, we can only sfunctions compare 32-bit integers, if the comparison evaluates to true, all 32 bits will become 1, which is why it is equals to $2^{32}-1$.↵
↵
`__m128i _mm_blendv_epi8 (__m128i a, __m128i b, __m128i mask)`↵
↵
This function blends 2 vectores two 64-bit integers in the result. Hence, `_mm_mul_epi32` works by multiplying the 0-th index and 2-nd indexogether by using the mask. If position $i$ of the mask is $0$, position $i$ of the result will be equals to $a$. Otherwise if position $i$ of mask is $2^{32}-1$, position $i$ of the result will be equals to $b$.↵
↵
### Reordering↵
↵
Sometimes we might need to change the order of the 4 numbers↵
↵
`_mm128i _mm_shuffle_epi32 (__m128i a, int imm8)`↵
↵
imm8 is an 8-bit mask that represents which value is at which position. The position of the first element will be the value formed by the first two bits of imm8, the position of the second element will be the value formed by the third and fourth bits, and so on. ↵
↵
For example, if $a = [1, 2, 3, 4]$ and $\text{imm8} = 39 = \text{0b00100111}$, the result is equal to $[4, 2, 3, 1]$ as the first 2 bits is $\text{0b11} = 3$, the next 2 bits is $\text{0b01} = 1$, the following 2 bits is $\text{0b10} = 2$ and the last 2 bits is $\text{0b00} = 0$↵
↵
`_mm128i _mm_slli_si128 (__m128i a, int imm8)`↵
↵
This function can shift the numbers to the left (think of it as left-shift in binary form). However, this function shifts 8 bits at a time, so since we are using 32-bit integers, if we want to shift by 1 position, imm8 has to be equals to 4. For example, if $a = [1, 2, 3, 4]$ and $\text{imm8} = 4$, the result is equal to $[0, 1, 2, 3]$, and if $\text{imm8} = 8$, the result is equal to $[0, 0, 1, 2]$.↵
↵
`_mm128i _mm_srli_si128 (__m128i a, int imm8)`↵
↵
Instead of shifting to the left, this function shifts to the right. For example, if $a = [1, 2, 3, 4]$ and $\text{imm8} = 4$, the result is equal to $[2, 3, 4, 0]$.↵
↵
`_mm128i _mm_alignr_epi8 (__m128i a, __m128i b, int imm8)`↵
↵
This function concatenates $a$ and $b$ to form 8 elements $[b_0, b_1, b_2, b_3, a_0, a_1, a_2, a_3]$, then right-shift the 256 bits by $\text{imm8} \times 8$ bits, and then return the lowest 128 bits. **Note that b is before a** . Similar to the previous function, since this function shifts 8 bits at a time, if we want to shift by 1 position, imm8 has to be equals to 4.↵
↵
For example, if $a = [1, 2, 3, 4]$, $b = [5, 6, 7, 8]$ and $\text{imm8} = 4$, the result is equal to $[6, 7, 8, 1]$. If $\text{imm8} = 8$ instead, the result is equal to $[7, 8, 1, 2]$↵
↵
### Extracting↵
↵
~~~~~↵
// Continue from previous code block↵
int mxarr[4];↵
_mm_storeu_si128((__m128i*) mxarr, mx); // stores values of mx into mxarr↵
long long mularr[2];↵
_mm_storeu_si128((__m128i*) mularr, mul); // stores values of mul into mularr↵
↵
// mul = [12, 32]↵
long long mul0 = _mm_extract_epi64(mul, 0); // extract the 0-th element of mul (= 12)↵
// sm = [4, 6, 8, 10]↵
int sum0 = _mm_extract_epi32(sum, 2); // extract the 2-thnd element of sum (= 8)↵
~~~~~↵
↵
Examples↵
==================↵
↵
Multiplication↵
------------------↵
↵
Given 2 arrays $A$ and $B$ of length $N$, compute array $C$ where for all $1 \le i \le N$, $C_i = A_i \cdot B_i$.↵
↵
#### Solution↵
↵
Since `_mm_mult_epi32` can only multiply 2 numbers at a time, we make use of `_mm_shuffle_epi32` to load the integers to the correct position and store the result after multiplication using `mm_storeu_si128`.↵
↵
<spoiler summary="Code">↵
↵
Take note to buffer the array by at least 3 spaces 4 integers are processed at a time↵
↵
~~~~~↵
#include <bits/stdc++.h> ↵
using namespace std;↵
↵
#include <nmmintrin.h>↵
↵
#pragma GCC target("avx2")↵
↵
int n;↵
int arr[200005], brr[200005];↵
long long crr[200005];↵
↵
int main() {↵
scanf("%d", &n);↵
for (int i = 0; i < n; i++) {↵
scanf("%d", &arr[i]);↵
}↵
for (int i = 0; i < n; i++) {↵
scanf("%d", &brr[i]);↵
}↵
for (int i = 0; i < n; i += 4) {↵
__m128i a = _mm_loadu_si128((__m128i*) &arr[i]),↵
b = _mm_loadu_si128((__m128i*) &brr[i]); ↵
// 0b|00|01|00|00 first element stores a[0] and third stores a[1]↵
// 0b|00|11|00|10 first element stores a[2] and third stores a[3]↵
__m128i afirst = _mm_shuffle_epi32(a, 0b00010000),↵
asecond = _mm_shuffle_epi32(a, 0b00110010),↵
bfirst = _mm_shuffle_epi32(b, 0b00010000),↵
bsecond = _mm_shuffle_epi32(b, 0b00110010);↵
a = _mm_mul_epi32(afirst, bfirst);↵
b = _mm_mul_epi32(asecond, bsecond);↵
_mm_storeu_si128((__m128i*) &crr[i], a);↵
_mm_storeu_si128((__m128i*) &crr[i + 2], b);↵
}↵
for (int i = 0; i < n; i++) {↵
printf("%lld ", crr[i]);↵
}↵
printf("\n");↵
return 0;↵
}↵
~~~~~↵
↵
</spoiler>↵
↵
Maximum Prefix↵
------------------↵
↵
Given an array $A$ of length $N$, we define the prefix sum of $A$ as an array $P$ where $P_i = \sum_{j=1}^i A_j$ for all $1 \le i \le N$. Find the maximum value in array $P$.↵
↵
#### Solution↵
↵
Let us observe how we can get the prefix sums of 4 integers $[a, b, c, d]$ quickly. If we follow a fenwick tree / binary-indexed tree structure, we can observe that we only have to do the following↵
↵
1. Add $[0, a, b, c]$ to $[a, b, c, d]$ to get $[a, a + b, b + c, c + d]$↵
2. Add $[0, 0, a, a + b]$ to $[a, a + b, b + c, c + d]$ to get $[a, a + b, a + b + c, a + b + c + d]$↵
↵
Notice that the added array step 1 is basically `_mm_slli_si128(a, 4)` and step 2 is `mm_slli_si128(a, 8)`.↵
↵
We can store the maximum by using the function `_mm_max_epi32`.↵
↵
At the end, since we stored the maximum in 4 integers, to get the maximum integer among the 4 integers, we can use the same method for addition but replace addition with maximum.↵
↵
<spoiler summary="Code">↵
↵
~~~~~↵
#include <bits/stdc++.h> ↵
using namespace std;↵
↵
#include <nmmintrin.h>↵
↵
#pragma GCC target("avx2")↵
↵
int n;↵
int a[200005];↵
↵
int main() {↵
scanf("%d", &n);↵
for (int i = 0; i < n; i++) {↵
scanf("%d", &a[i]);↵
}↵
__m128i mx = _mm_setzero_si128();↵
__m128i sum = _mm_setzero_si128();↵
for (int i = 0; i < n; i += 4) {↵
__m128i tmpsum = _mm_loadu_si128((__m128i*) &a[i]);↵
tmpsum = _mm_add_epi32(tmpsum, _mm_slli_si128(tmpsum, 4));↵
tmpsum = _mm_add_epi32(tmpsum, _mm_slli_si128(tmpsum, 8));↵
sum = _mm_add_epi32(sum, tmpsum);↵
mx = _mm_max_epi32(mx, sum);↵
sum = _mm_shuffle_epi32(sum, 0b11111111);↵
}↵
mx = _mm_max_epi32(mx, _mm_slli_si128(mx, 4));↵
mx = _mm_max_epi32(mx, _mm_slli_si128(mx, 8));↵
int ans = _mm_extract_epi32(mx, 3);↵
printf("%d\n", ans);↵
return 0;↵
}↵
~~~~~↵
↵
</spoiler>↵
↵
↵
Thanks to ~iloveioi,2022-01-02 and ~jamessngg,2022-01-02 for proof reading.↵
↵
Introduction↵
==================↵
↵
SIMD stands for single instruction, multiple data. SIMD allows us to give vector instructions which will allow the code to run faster. Vector instructions are instructions that handle short (length 2-16) vectors of integers / floats / characters in a parallel way by making use of the extra bits of space to do operations simultaneously.↵
↵
The most common form of vectorisation is making use of pragmas such as↵
↵
~~~~~↵
#pragma GCC optimize("O3,unroll-loops")↵
#pragma GCC target("avx2,bmi,bmi2,lzcnt,popcnt")↵
~~~~~↵
↵
This form of vectorisation is known as auto-vectorisation as the compiler vectorises the code for you. However, for more complicated examples, the compiler might be unable to detect what to vectorise automatically, so in this case we have to vectorise our code manually by using SIMD vectorisation.↵
↵
Syntax↵
==================↵
↵
The compilation of all the code given in the syntax section is given here↵
↵
<spoiler summary="Code">↵
↵
~~~~~↵
#include <nmmintrin.h>↵
↵
#pragma GCC target("avx2")↵
↵
int main() {↵
/****** LOADING ******/↵
↵
__m128i zero = _mm_setzero_si128(); // set everything to 0↵
__m128i eight = _mm_set1_epi32(8); // set the vector of 4 integers to be equal to 8↵
↵
__m128i pi = _mm_setr_epi32(
// mm_setr_epi32(3, 1, 4, 1) -> first value == 3, second value == 1, third value == 4, forth value == 1↵
// mm_set_epi32(3, 1, 4, 1) -> first value == 1, second value == 4, third value == 1, forth value == 3↵
↵
int arr[8] = {0, 1, 2, 3, 4, 5, 6, 7};↵
__m128i a0 = _mm_loadu_si128((__m128i*) &arr[0]); // [0, 1, 2, 3]↵
__m128i a4 = _mm_loadu_si128((__m128i*) &arr[4]); // [4, 5, 6, 7]↵
↵
__m128i a2 = _mm_loadu_si128((__m128i*) &arr[2]); // [2, 3, 4, 5]↵
// _mm_insert_epi32(a, i, j) changes j-th number of a to value i↵
a2 = _mm_insert_epi32(a2, 99, 1); // [2, 99, 4, 5]↵
↵
/****** ARITHMETIC ******/↵
↵
__m128i s
__m128i
__m128i
↵
__m128i mul = _mm_mul_epi32(s
↵
/****** LOGICAL ARITHMETIC ******/↵
↵
__m128i three = _mm_set1_epi32(3);↵
__m128i mskl = _mm_cmplt_epi32(pi, three); // contains [0, 2^32-1, 0, 2^32-1] as only 1 < 3↵
__m128i mskg = _mm_cmpgt_epi32(pi, three); // contains [0, 0, 2^32-1, 0]↵
__m128i mske = _mm_cmpeq_epi32(pi, three); // contains [2^32-1, 0, 0, 0]↵
↵
__m128i mix = _mm_blendv_epi8(eight, pi, mskl); // contains [8, 1, 8, 1]↵
↵
/****** REORDERING ******/↵
↵
__m128i a = _mm_setr_epi32(1, 2, 3, 4);↵
a = _mm_shuffle_epi32(a, 39); // [4, 2, 3, 1]↵
↵
a = _mm_setr_epi32(1, 2, 3, 4);↵
a = _mm_slli_si128(a, 8); // [0, 0, 1, 2]↵
↵
a = _mm_setr_epi32(1, 2, 3, 4);↵
a = _mm_srli_si128(a, 4); // [2, 3, 4, 0]↵
↵
a = _mm_setr_epi32(1, 2, 3, 4);↵
__m128i b = _mm_setr_epi32(5, 6, 7, 8);↵
a = _mm_alignr_epi8(a, b, 8); // [7, 8, 1, 2]↵
↵
/****** EXTRACTING ******/↵
↵
int mxarr[4];↵
_mm_storeu_si128((__m128i*) mxarr, mx); // stores values of mx into mxarr↵
long long mularr[2];↵
_mm_storeu_si128((__m128i*) mularr, mul); // stores values of mul into mularr↵
↵
// mul = [12, 32]↵
long long mul0 = _mm_extract_epi64(mul, 0); // extract the 0-th element of mul (= 12)↵
// sm = [4, 6, 8, 10]↵
int s
}↵
~~~~~↵
↵
</spoiler>↵
↵
To make use of SIMD, we have to add the following at the top of the code.↵
↵
~~~~~↵
#include <nmmintrin.h>↵
↵
#pragma GCC target("avx2")↵
~~~~~↵
↵
Data types↵
------------------↵
↵
There are three 128 bit data types for integers, floats and doubles respectively.↵
↵
~~~~~↵
__m128i i; // stores 4 integers since each integer is 32 bit and 128/32=4↵
__m128 f; // stores 4 floats since each float is 32 bit and 128/32=4↵
__m128d d; // stores 2 doubles since each double is 64 bit and 128/64=2↵
~~~~~↵
↵
Note that `__m128i` can also be used to store 32 characters.↵
↵
Intrinsics↵
------------------↵
↵
In order to do operations on these data types, we will make use of SIMD intrinsics.↵
↵
All the intrinsic functions have the following form `_mm_instruction_datatype`, for example, `_mm_add_epi32(a, b)` is a function to add 32-bit integers from $a$ and $b$ together.↵
↵
Some examples of data-types are↵
↵
- `si128` — signed 128-bit integer↵
↵
- `epi8`, `epi32`, `epi64`, `epu8`, `epu32`, `epu64` — vector of signed 8-bit integers (character), signed 32-bit integers, signed 64-bit integers, together with the unsigned variations.↵
↵
- `ps` — 32-bit floats↵
↵
- `pd` — 64-bit doubles↵
↵
In the following examples, I will be focusing on signed 32-bit integers (i.e. `epi32`) as it is most commonly used.↵
↵
### Loading↵
↵
To make use of our `__m128i` data structure, we first have to load data into it.↵
↵
~~~~~↵
__m128i zero = _mm_setzero_si128(); // set everything to 0↵
__m128i eight = _mm_set1_epi32(8); // set the vector of 4 integers to be equal to 8↵
↵
__m128i pi = _mm_setr_epi32(3, 1, 4, 1
// mm_setr_epi32(3, 1, 4, 1) -> first value == 3, second value == 1, third value == 4, forth value == 1↵
// mm_set_epi32(3, 1, 4, 1) -> first value == 1, second value == 4, third value == 1, forth value == 3↵
↵
int arr[8] = {0, 1, 2, 3, 4, 5, 6, 7};↵
__m128i a0 = _mm_loadu_si128((__m128i*) &arr[0]); // [0, 1, 2, 3]↵
__m128i a4 = _mm_loadu_si128((__m128i*) &arr[4]); // [4, 5, 6, 7]↵
↵
__m128i a2 = _mm_loadu_si128((__m128i*) &arr[2]); // [2, 3, 4, 5]↵
// _mm_insert_epi32(a, i, j) changes j-th number of a to value i↵
a2 = _mm_insert_epi32(a2, 99, 1); // [2, 99, 4, 5]↵
~~~~~↵
↵
### Arithmetic↵
↵
~~~~~↵
// Continue from previous code block↵
__m128i s
__m128i mx = _mm_max_epi32(pi, a0); // [3, 1, 4, 3] (maximum of the i-th element of pi and i-th element of a0)↵
__m128i sb = _mm_sub_epi32(pi, a0); // [1, 3, -1, 0] (subtract i-th element of a0 from i-th element of pi)↵
↵
__m128i m
~~~~~↵
↵
Most of the simpler arithmetic operations work by doing the arithmetic operation on each index from 0 to 3. However, for `_mm_mul_epi32`, since the multiplication of two 32-bit integers will become 64-bit integers, we can only store two 64-bit integers in the result. Hence, `_mm_mul_epi32` works by multiplying the 0-th index and 2-nd index.↵
↵
### Logical Arithmetic↵
↵
~~~~~↵
// Continue from previous code block↵
__m128i three = _mm_set1_epi32(3);↵
__m128i m
~~~~~↵
↵
Most of the simpler arithmetic operations work by doing the arithmetic operation on each index from 0 to 3. However,
__m128i mskg = _mm_cmpgt_epi32(pi, three); // contains [0, 0, 2^32-1, 0]↵
__m128i mske = _mm_cmpeq_epi32(pi, three); // contains [2^32-1, 0, 0, 0]↵
↵
__m128i mix = _mm_blendv_epi8(eight, pi, mskl); // contains [8, 1, 8, 1] (more details below)↵
~~~~~↵
↵
Most comparison functions are in the form `_mm_
↵
`__m128i _mm_blendv_epi8 (__m128i a, __m128i b, __m128i mask)`↵
↵
This function blends 2 vector
↵
### Reordering↵
↵
Sometimes we might need to change the order of the 4 numbers↵
↵
`_mm128i _mm_shuffle_epi32 (__m128i a, int imm8)`↵
↵
imm8 is an 8-bit mask that represents which value is at which position. The position of the first element will be the value formed by the first two bits of imm8, the position of the second element will be the value formed by the third and fourth bits, and so on. ↵
↵
For example, if $a = [1, 2, 3, 4]$ and $\text{imm8} = 39 = \text{0b00100111}$, the result is equal to $[4, 2, 3, 1]$ as the first 2 bits is $\text{0b11} = 3$, the next 2 bits is $\text{0b01} = 1$, the following 2 bits is $\text{0b10} = 2$ and the last 2 bits is $\text{0b00} = 0$↵
↵
`_mm128i _mm_slli_si128 (__m128i a, int imm8)`↵
↵
This function can shift the numbers to the left (think of it as left-shift in binary form). However, this function shifts 8 bits at a time, so since we are using 32-bit integers, if we want to shift by 1 position, imm8 has to be equals to 4. For example, if $a = [1, 2, 3, 4]$ and $\text{imm8} = 4$, the result is equal to $[0, 1, 2, 3]$, and if $\text{imm8} = 8$, the result is equal to $[0, 0, 1, 2]$.↵
↵
`_mm128i _mm_srli_si128 (__m128i a, int imm8)`↵
↵
Instead of shifting to the left, this function shifts to the right. For example, if $a = [1, 2, 3, 4]$ and $\text{imm8} = 4$, the result is equal to $[2, 3, 4, 0]$.↵
↵
`_mm128i _mm_alignr_epi8 (__m128i a, __m128i b, int imm8)`↵
↵
This function concatenates $a$ and $b$ to form 8 elements $[b_0, b_1, b_2, b_3, a_0, a_1, a_2, a_3]$, then right-shift the 256 bits by $\text{imm8} \times 8$ bits, and then return the lowest 128 bits. **Note that b is before a** . Similar to the previous function, since this function shifts 8 bits at a time, if we want to shift by 1 position, imm8 has to be equals to 4.↵
↵
For example, if $a = [1, 2, 3, 4]$, $b = [5, 6, 7, 8]$ and $\text{imm8} = 4$, the result is equal to $[6, 7, 8, 1]$. If $\text{imm8} = 8$ instead, the result is equal to $[7, 8, 1, 2]$↵
↵
### Extracting↵
↵
~~~~~↵
// Continue from previous code block↵
int mxarr[4];↵
_mm_storeu_si128((__m128i*) mxarr, mx); // stores values of mx into mxarr↵
long long mularr[2];↵
_mm_storeu_si128((__m128i*) mularr, mul); // stores values of mul into mularr↵
↵
// mul = [12, 32]↵
long long mul0 = _mm_extract_epi64(mul, 0); // extract the 0-th element of mul (= 12)↵
// sm = [4, 6, 8, 10]↵
int s
~~~~~↵
↵
Examples↵
==================↵
↵
Multiplication↵
------------------↵
↵
Given 2 arrays $A$ and $B$ of length $N$, compute array $C$ where for all $1 \le i \le N$, $C_i = A_i \cdot B_i$.↵
↵
#### Solution↵
↵
Since `_mm_mult_epi32` can only multiply 2 numbers at a time, we make use of `_mm_shuffle_epi32` to load the integers to the correct position and store the result after multiplication using `mm_storeu_si128`.↵
↵
<spoiler summary="Code">↵
↵
Take note to buffer the array by at least 3 spaces 4 integers are processed at a time↵
↵
~~~~~↵
#include <bits/stdc++.h> ↵
using namespace std;↵
↵
#include <nmmintrin.h>↵
↵
#pragma GCC target("avx2")↵
↵
int n;↵
int arr[200005], brr[200005];↵
long long crr[200005];↵
↵
int main() {↵
scanf("%d", &n);↵
for (int i = 0; i < n; i++) {↵
scanf("%d", &arr[i]);↵
}↵
for (int i = 0; i < n; i++) {↵
scanf("%d", &brr[i]);↵
}↵
for (int i = 0; i < n; i += 4) {↵
__m128i a = _mm_loadu_si128((__m128i*) &arr[i]),↵
b = _mm_loadu_si128((__m128i*) &brr[i]); ↵
// 0b|00|01|00|00 first element stores a[0] and third stores a[1]↵
// 0b|00|11|00|10 first element stores a[2] and third stores a[3]↵
__m128i afirst = _mm_shuffle_epi32(a, 0b00010000),↵
asecond = _mm_shuffle_epi32(a, 0b00110010),↵
bfirst = _mm_shuffle_epi32(b, 0b00010000),↵
bsecond = _mm_shuffle_epi32(b, 0b00110010);↵
a = _mm_mul_epi32(afirst, bfirst);↵
b = _mm_mul_epi32(asecond, bsecond);↵
_mm_storeu_si128((__m128i*) &crr[i], a);↵
_mm_storeu_si128((__m128i*) &crr[i + 2], b);↵
}↵
for (int i = 0; i < n; i++) {↵
printf("%lld ", crr[i]);↵
}↵
printf("\n");↵
return 0;↵
}↵
~~~~~↵
↵
</spoiler>↵
↵
Maximum Prefix↵
------------------↵
↵
Given an array $A$ of length $N$, we define the prefix sum of $A$ as an array $P$ where $P_i = \sum_{j=1}^i A_j$ for all $1 \le i \le N$. Find the maximum value in array $P$.↵
↵
#### Solution↵
↵
Let us observe how we can get the prefix sums of 4 integers $[a, b, c, d]$ quickly. If we follow a fenwick tree / binary-indexed tree structure, we can observe that we only have to do the following↵
↵
1. Add $[0, a, b, c]$ to $[a, b, c, d]$ to get $[a, a + b, b + c, c + d]$↵
2. Add $[0, 0, a, a + b]$ to $[a, a + b, b + c, c + d]$ to get $[a, a + b, a + b + c, a + b + c + d]$↵
↵
Notice that the added array step 1 is basically `_mm_slli_si128(a, 4)` and step 2 is `mm_slli_si128(a, 8)`.↵
↵
We can store the maximum by using the function `_mm_max_epi32`.↵
↵
At the end, since we stored the maximum in 4 integers, to get the maximum integer among the 4 integers, we can use the same method for addition but replace addition with maximum.↵
↵
<spoiler summary="Code">↵
↵
~~~~~↵
#include <bits/stdc++.h> ↵
using namespace std;↵
↵
#include <nmmintrin.h>↵
↵
#pragma GCC target("avx2")↵
↵
int n;↵
int a[200005];↵
↵
int main() {↵
scanf("%d", &n);↵
for (int i = 0; i < n; i++) {↵
scanf("%d", &a[i]);↵
}↵
__m128i mx = _mm_setzero_si128();↵
__m128i sum = _mm_setzero_si128();↵
for (int i = 0; i < n; i += 4) {↵
__m128i tmpsum = _mm_loadu_si128((__m128i*) &a[i]);↵
tmpsum = _mm_add_epi32(tmpsum, _mm_slli_si128(tmpsum, 4));↵
tmpsum = _mm_add_epi32(tmpsum, _mm_slli_si128(tmpsum, 8));↵
sum = _mm_add_epi32(sum, tmpsum);↵
mx = _mm_max_epi32(mx, sum);↵
sum = _mm_shuffle_epi32(sum, 0b11111111);↵
}↵
mx = _mm_max_epi32(mx, _mm_slli_si128(mx, 4));↵
mx = _mm_max_epi32(mx, _mm_slli_si128(mx, 8));↵
int ans = _mm_extract_epi32(mx, 3);↵
printf("%d\n", ans);↵
return 0;↵
}↵
~~~~~↵
↵
</spoiler>↵
↵