I decided to write a blog on this as I was doing a problem on our local judge and I decided to try to speed up my brute force code. However, it was quite difficult to find resources on SIMD vectorisation, so I decided to try to compile some of the resources I found to hopefully allow more people to learn ~~to scam brute force solutions~~

Thanks to iLoveIOI and jamessngg for proofreading.

Introduction

SIMD stands for single instruction, multiple data. SIMD allows us to give vector instructions which will allow the code to run faster. Vector instructions are instructions that handle short (length 2-16) vectors of integers / floats / characters in a parallel way by making use of the extra bits of space to do operations simultaneously.

The most common form of vectorisation is making use of pragmas such as

#pragma GCC optimize("O3,unroll-loops")
#pragma GCC target("avx2,bmi,bmi2,lzcnt,popcnt")

This form of vectorisation is known as auto-vectorisation as the compiler vectorises the code for you. However, for more complicated examples, the compiler might be unable to detect what to vectorise automatically, so in this case, we have to vectorise our code manually by using SIMD vectorisation.

Syntax

The compilation of all the code given in the syntax section is given here

Code

#include <nmmintrin.h>

#pragma GCC target("avx2")

int main() {
    /******   LOADING   ******/

    __m128i zero = _mm_setzero_si128(); // set everything to 0
    __m128i eight = _mm_set1_epi32(8); // set the vector of 4 integers to be equal to 8

    __m128i pi = _mm_setr_epi32(3, 1, 4, 1); // NOTE: set and setr are opposites of each other
    // mm_setr_epi32(3, 1, 4, 1) -> first value == 3, second value == 1, third value == 4, forth value == 1
    // mm_set_epi32(3, 1, 4, 1) -> first value == 1, second value == 4, third value == 1, forth value == 3

    int arr[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    __m128i a0 = _mm_loadu_si128((__m128i*) &arr[0]); // [0, 1, 2, 3]
    __m128i a4 = _mm_loadu_si128((__m128i*) &arr[4]); // [4, 5, 6, 7]

    __m128i a2 = _mm_loadu_si128((__m128i*) &arr[2]); // [2, 3, 4, 5]
    // _mm_insert_epi32(a, i, j) changes j-th number of a to value i
    a2 = _mm_insert_epi32(a2, 99, 1); // [2, 99, 4, 5]

    /******   ARITHMETIC   ******/

    __m128i sm = _mm_add_epi32(a0, a4); // [4, 6, 8, 10] (sum i-th element of a0 with i-th element of a4)
    __m128i mx = _mm_max_epi32(pi, a0); // [3, 1, 4, 3] (maximum of the i-th element of pi and i-th element of a0)
    __m128i sb = _mm_sub_epi32(pi, a0); // [3, 0, 2, -2] (subtract i-th element of a0 from i-th element of pi)

    __m128i smallmul = _mm_mullo_epi32(sm, mx); // [12, 6, 32, 30] (multiply i-th element of sm with mx)
    __m128i mul = _mm_mul_epi32(sm, mx); // [12, 32] (4*3=12, 8*4=32)

    /******   LOGICAL ARITHMETIC   ******/

    __m128i three = _mm_set1_epi32(3);
    // _mm_cmplt_epi32(a, b) returns mask containing a less than b
    __m128i mskl = _mm_cmplt_epi32(pi, three); // contains [0, 2^32-1, 0, 2^32-1] as only 1 < 3
    // _mm_cmpgt_epi32(a, b) returns mask containing a greater than b
    __m128i mskg = _mm_cmpgt_epi32(pi, three); // contains [0, 0, 2^32-1, 0]
    // _mm_cmpeq_epi32(a, b) returns mask containing a equal to b
    __m128i mske = _mm_cmpeq_epi32(pi, three); // contains [2^32-1, 0, 0, 0]

    __m128i mix = _mm_blendv_epi8(eight, pi, mskl); // contains [8, 1, 8, 1]

    /******   REORDERING   ******/

    __m128i a = _mm_setr_epi32(1, 2, 3, 4);
    a = _mm_shuffle_epi32(a, 0b00100111); // [4, 2, 3, 1]

    a = _mm_setr_epi32(1, 2, 3, 4);
    a = _mm_slli_si128(a, 8); // [0, 0, 1, 2]

    a = _mm_setr_epi32(1, 2, 3, 4);
    a = _mm_srli_si128(a, 4); // [2, 3, 4, 0]

    a = _mm_setr_epi32(1, 2, 3, 4);
    __m128i b = _mm_setr_epi32(5, 6, 7, 8);
    a = _mm_alignr_epi8(a, b, 8); // [7, 8, 1, 2]

    /******   EXTRACTING   ******/

    int mxarr[4];
    _mm_storeu_si128((__m128i*) mxarr, mx); // stores values of mx into mxarr
    long long mularr[2];
    _mm_storeu_si128((__m128i*) mularr, mul); // stores values of mul into mularr

    // mul = [12, 32]
    long long mul0 = _mm_extract_epi64(mul, 0); // extract the 0-th element of mul (= 12)
    // sm = [4, 6, 8, 10]
    int sm0 = _mm_extract_epi32(sm, 2); // extract the 2-nd element of sum (= 8)

    /******   OTHERS   ******/

    __m128i llsm = _mm_cvtepi32_epi64(sm); // converts first 2 numbers of sm into 64-bit integers
    // _mm_extract_epi64(llsm, 0) == 4 && _mm_extract_epi64(llsm, 1) == 6
}

To make use of SIMD, we have to add the following at the top of the code.

#include <nmmintrin.h>

#pragma GCC target("avx2")

Data types

There are three 128-bit data types for integers, floats and doubles respectively.

__m128i i; // stores 4 integers since each integer is 32-bit and 128/32=4
__m128 f; // stores 4 floats since each float is 32-bit and 128/32=4
__m128d d; // stores 2 doubles since each double is 64-bit and 128/64=2

Note that __m128i can also be used to store 32 characters.

256 and 512-bit data types also exist, however, they are quite similar to 128-bit and are not as supported as 128-bit (Codeforces does not support 512-bit) so I will not go through them in much detail.

Intrinsics

To do operations on these data types, we will make use of SIMD intrinsics.

All the intrinsic functions have the following form _mm_{instruction}_{datatype}, for example, _mm_add_epi32(a, b) is a function to add 32-bit integers from $$$a$$$ and $$$b$$$ together.

Some examples of data-types are

si128 — signed 128-bit integer
epi8, epi32, epi64, epu8, epu32, epu64 — vector of signed 8-bit integers (character), signed 32-bit integers, signed 64-bit integers, together with their unsigned variants
ps — 32-bit floats
pd — 64-bit doubles

In the following examples, I will be focusing on signed 32-bit integers (i.e. epi32) as it is most commonly used.

Loading

To make use of our __m128i data structure, we first have to load data into it.

__m128i zero = _mm_setzero_si128(); // set everything to 0
__m128i eight = _mm_set1_epi32(8); // set the vector of 4 integers to be equal to 8

__m128i pi = _mm_setr_epi32(3, 1, 4, 1); // NOTE: set and setr are opposites of each other
// mm_setr_epi32(3, 1, 4, 1) -> first value == 3, second value == 1, third value == 4, forth value == 1
// mm_set_epi32(3, 1, 4, 1) -> first value == 1, second value == 4, third value == 1, forth value == 3

int arr[8] = {0, 1, 2, 3, 4, 5, 6, 7};
__m128i a0 = _mm_loadu_si128((__m128i*) &arr[0]); // [0, 1, 2, 3]
__m128i a4 = _mm_loadu_si128((__m128i*) &arr[4]); // [4, 5, 6, 7]

__m128i a2 = _mm_loadu_si128((__m128i*) &arr[2]); // [2, 3, 4, 5]
// _mm_insert_epi32(a, i, j) changes j-th number of a to value i
a2 = _mm_insert_epi32(a2, 99, 1); // [2, 99, 4, 5]

Arithmetic

// Continue from previous code block
__m128i sm = _mm_add_epi32(a0, a4); // [4, 6, 8, 10] (sum i-th element of a0 with i-th element of a4)
__m128i mx = _mm_max_epi32(pi, a0); // [3, 1, 4, 3] (maximum of the i-th element of pi and i-th element of a0)
__m128i sb = _mm_sub_epi32(pi, a0); // [3, 0, 2, -2] (subtract i-th element of a0 from i-th element of pi)

__m128i smallmul = _mm_mullo_epi32(sm, mx); // [12, 6, 32, 30] (multiply i-th element of sm with mx)
__m128i mul = _mm_mul_epi32(sm, mx); // [12, 32] (4*3=12, 8*4=32) (more details below)

Most of the simpler arithmetic operations work by doing the arithmetic operation on each index from 0 to 3. However, for _mm_mul_epi32, since the multiplication of two 32-bit integers will become 64-bit integers, we can only store two 64-bit integers in the result. Hence, _mm_mul_epi32 works by multiplying the 0-th index and 2-nd index.

However, if you are very sure that the multiplication of the numbers does not exceed a 32-bit integer, you can use _mm_mullo_epi32 which will do the multiplication for each index from 0 to 3 but store the result as a 32-bit integer.

Logical Arithmetic

// Continue from previous code block
__m128i three = _mm_set1_epi32(3);
// _mm_cmplt_epi32(a, b) returns mask containing a less than b
__m128i mskl = _mm_cmplt_epi32(pi, three); // contains [0, 2^32-1, 0, 2^32-1] as only 1 < 3
// _mm_cmpgt_epi32(a, b) returns mask containing a greater than b
__m128i mskg = _mm_cmpgt_epi32(pi, three); // contains [0, 0, 2^32-1, 0]
// _mm_cmpeq_epi32(a, b) returns mask containing a equal to b
__m128i mske = _mm_cmpeq_epi32(pi, three); // contains [2^32-1, 0, 0, 0]

__m128i mix = _mm_blendv_epi8(eight, pi, mskl); // contains [8, 1, 8, 1] (more details below)

Most comparison functions are in the form _mm_cmpxx_epi32. Since the functions compare 32-bit integers, if the comparison evaluates to true, all 32 bits will become 1, which is why it is equal to $$$2^{32}-1$$$.

__m128i _mm_blendv_epi8 (__m128i a, __m128i b, __m128i mask)

This function blends 2 vectors together by using the mask. If position $$$i$$$ of the mask is $$$0$$$, position $$$i$$$ of the result will be equal to $$$a$$$. Otherwise, if position $$$i$$$ of mask is $$$2^{32}-1$$$, position $$$i$$$ of the result will be equal to $$$b$$$.

Reordering

Sometimes we might need to change the order of the 4 numbers

_mm128i _mm_shuffle_epi32 (__m128i a, int imm8)

imm8 is an 8-bit mask that represents which value is at which position. The position of the first element will be the value formed by the first two bits of imm8, the position of the second element will be the value formed by the third and fourth bits, and so on.

For example, if $$$a = [1, 2, 3, 4]$$$ and $$$\text{imm8} = 39 = \text{0b00100111}$$$, the result is equal to $$$[4, 2, 3, 1]$$$ as the first 2 bits is $$$\text{0b11} = 3$$$, the next 2 bits is $$$\text{0b01} = 1$$$, the following 2 bits is $$$\text{0b10} = 2$$$ and the last 2 bits is $$$\text{0b00} = 0$$$. If $$$\text{imm8} = 170 = \text{0b10101010}$$$, the result is equal to $$$[3, 3, 3, 3]$$$.

_mm128i _mm_slli_si128 (__m128i a, int imm8)

This function can shift the numbers to the left (think of it as left-shift in binary form). However, this function shifts 8 bits at a time, so since we are using 32-bit integers, if we want to shift by 1 position, imm8 has to be equal to 4. For example, if $$$a = [1, 2, 3, 4]$$$ and $$$\text{imm8} = 4$$$, the result is equal to $$$[0, 1, 2, 3]$$$, and if $$$\text{imm8} = 8$$$, the result is equal to $$$[0, 0, 1, 2]$$$.

_mm128i _mm_srli_si128 (__m128i a, int imm8)

Instead of shifting to the left, this function shifts to the right. For example, if $$$a = [1, 2, 3, 4]$$$ and $$$\text{imm8} = 4$$$, the result is equal to $$$[2, 3, 4, 0]$$$.

_mm128i _mm_alignr_epi8 (__m128i a, __m128i b, int imm8)

This function concatenates $$$a$$$ and $$$b$$$ to form 8 elements $$$[b_0, b_1, b_2, b_3, a_0, a_1, a_2, a_3]$$$, then right-shift the 256 bits by $$$\text{imm8} \times 8$$$ bits, and then return the lowest 128 bits. Note that b is before a. Similar to the previous function, since this function shifts 8 bits at a time, if we want to shift by 1 position, imm8 has to be equal to 4.

For example, if $$$a = [1, 2, 3, 4]$$$, $$$b = [5, 6, 7, 8]$$$ and $$$\text{imm8} = 4$$$, the concatenated intermediate will be $$$[5, 6, 7, 8, 1, 2, 3, 4]$$$, so after right shifting by 1 position and taking the first 4 elements, the result is equal to $$$[6, 7, 8, 1]$$$. If $$$\text{imm8} = 8$$$ instead, the result is equal to $$$[7, 8, 1, 2]$$$

Extracting

// Continue from previous code block
int mxarr[4];
_mm_storeu_si128((__m128i*) mxarr, mx); // stores values of mx into mxarr
long long mularr[2];
_mm_storeu_si128((__m128i*) mularr, mul); // stores values of mul into mularr

// mul = [12, 32]
long long mul0 = _mm_extract_epi64(mul, 0); // extract the 0-th element of mul (= 12)
// sm = [4, 6, 8, 10]
int sm0 = _mm_extract_epi32(sm, 2); // extract the 2-nd element of sum (= 8)

Others

// Continue from previous code block
__m128i llsm = _mm_cvtepi32_epi64(sm); // converts first 2 numbers of sm into 64-bit integers
// _mm_extract_epi64(llsm, 0) == 4 && _mm_extract_epi64(llsm, 1) == 6

Examples

Multiplication

Given 2 arrays $$$A$$$ and $$$B$$$ ($$$1 \le A_i, B_i \le 10^9$$$) of length $$$N$$$ ($$$1 \le N \le 2 \cdot 10^5$$$), compute array $$$C$$$ where for all $$$1 \le i \le N$$$, $$$C_i = A_i \cdot B_i$$$.

Solution

Since _mm_mul_epi32 can only multiply 2 numbers at a time, we make use of _mm_shuffle_epi32 to load the integers to the correct position and store the result after multiplication using mm_storeu_si128.

Take note to buffer the array by at least 3 spaces as 4 integers are processed at a time.

Code

#include <bits/stdc++.h> 
using namespace std;

#include <nmmintrin.h>

#pragma GCC target("avx2")

int n;
int arr[200005], brr[200005];
long long crr[200005];

int main() {
    scanf("%d", &n);
    for (int i = 0; i < n; i++) {
        scanf("%d", &arr[i]);
    }
    for (int i = 0; i < n; i++) {
        scanf("%d", &brr[i]);
    }
    for (int i = 0; i < n; i += 4) {
        __m128i a = _mm_loadu_si128((__m128i*) &arr[i]),
                b = _mm_loadu_si128((__m128i*) &brr[i]); 
        // 0b|00|01|00|00 first element stores a[0] and third stores a[1]
        // 0b|00|11|00|10 first element stores a[2] and third stores a[3]
        __m128i afirst = _mm_shuffle_epi32(a, 0b00010000),
                asecond = _mm_shuffle_epi32(a, 0b00110010),
                bfirst = _mm_shuffle_epi32(b, 0b00010000),
                bsecond = _mm_shuffle_epi32(b, 0b00110010);
        a = _mm_mul_epi32(afirst, bfirst);
        b = _mm_mul_epi32(asecond, bsecond);
        _mm_storeu_si128((__m128i*) &crr[i], a);
        _mm_storeu_si128((__m128i*) &crr[i + 2], b);
    }
    for (int i = 0; i < n; i++) {
        printf("%lld ", crr[i]);
    }
    printf("\n");
    return 0;
}

Maximum Prefix

Given an array $$$A$$$ ($$$-10^4 \le A_i \le 10^4$$$) of length $$$N$$$ ($$$1 \le N \le 2 \cdot 10^5$$$), we define the prefix sum of $$$A$$$ as an array $$$P$$$ where $$$P_i = \sum_{j=1}^i A_j$$$ for all $$$1 \le i \le N$$$. Find the maximum value in array $$$P$$$.

Solution

Let us observe how we can get the prefix sums of 4 integers $$$[a, b, c, d]$$$ quickly. If we follow a fenwick tree / binary-indexed tree structure, we can observe that we only have to do the following

Add $$$[0, a, b, c]$$$ to $$$[a, b, c, d]$$$ to get $$$[a, a + b, b + c, c + d]$$$
Add $$$[0, 0, a, a + b]$$$ to $$$[a, a + b, b + c, c + d]$$$ to get $$$[a, a + b, a + b + c, a + b + c + d]$$$

Notice that the added array in step 1 is basically _mm_slli_si128(a, 4) and step 2 is mm_slli_si128(a, 8).

We can store the maximum by using the function _mm_max_epi32.

In the end, since we stored the maximum in 4 integers, to get the maximum integer among the 4 integers, we can use the same method for prefix sum but replace addition with maximum.

Code

#include <bits/stdc++.h> 
using namespace std;

#include <nmmintrin.h>

#pragma GCC target("avx2")

int n;
int a[200005];

int main() {
    scanf("%d", &n);
    for (int i = 0; i < n; i++) {
        scanf("%d", &a[i]);
    }
    __m128i mx = _mm_setzero_si128();
    __m128i sum = _mm_setzero_si128();
    for (int i = 0; i < n; i += 4) {
        __m128i tmpsum = _mm_loadu_si128((__m128i*) &a[i]);
        tmpsum = _mm_add_epi32(tmpsum, _mm_slli_si128(tmpsum, 4));
        tmpsum = _mm_add_epi32(tmpsum, _mm_slli_si128(tmpsum, 8));
        sum = _mm_add_epi32(sum, tmpsum);
        mx = _mm_max_epi32(mx, sum);
        sum = _mm_shuffle_epi32(sum, 0b11111111);
    }
    mx = _mm_max_epi32(mx, _mm_slli_si128(mx, 4));
    mx = _mm_max_epi32(mx, _mm_slli_si128(mx, 8));
    int ans = _mm_extract_epi32(mx, 3);
    printf("%d\n", ans);
    return 0;
}

855F - Nagini

Snakes have values $$$l$$$, $$$r$$$ and $$$k$$$ ($$$1 \le l \le r \le 10^5$$$, $$$-10^9 \le k \le 10^9$$$, $$$k \neq 0$$$), indicating that the snake occupies the number line from $$$l$$$ to $$$r$$$ and has a danger value of $$$|k|$$$. If $$$k$$$ is positive, the snake is a type 1 snake, otherwise, it is a type 2 snake. We will define a function $$$f(x)$$$ to be equal to $$$0$$$ if either type 1 or type 2 or both snakes are not present at position $$$x$$$. Otherwise, $$$f(x)$$$ is the sum of the minimum $$$|k|$$$ value of all type 1 snakes at position $$$x$$$ and the minimum $$$|k|$$$ value of all type 2 snakes at position $$$x$$$. There are $$$Q$$$ ($$$1 \le Q \le 5 \cdot 10 ^ 4$$$) queries of 2 types, the first is to add a snake, and the second is to calculate the sum of $$$f(x)$$$ for all $$$l \le x \le r$$$.

Solution

To add a snake, we can make use of the function _mm_min_epi32 to update the minimum $$$|k|$$$ for each type of snake.

To answer the query, we can make use of _mm_cmplt_epi32 to check whether a snake is occupying the position, and then use _mm_and_epi32 to combine type 1 and type 2 snakes. Using _mm_blendv_epi8 will allow us to set the positions where snakes are gone to be equal to $$$0$$$. Since $$$k$$$ can be up to $$$10^9$$$, we have to store the result in a 64-bit integer, so we use _mm_cvtepi32_epi64 to convert the first 2 integers into a 64-bit integer and add it to the answer.

141448511

Code

#include <bits/stdc++.h>
using namespace std;

#include <nmmintrin.h>

#pragma GCC target("avx2")

const int N = 100000, INF = 1000000010;

int q;
int pos[N], neg[N];

int main() {
    scanf("%d", &q);
    for (int i = 0; i < N; i++) {
        pos[i] = INF;
        neg[i] = INF;
    }
    const __m128i MZERO = _mm_setzero_si128();
    const __m128i MINF = _mm_set1_epi32(INF);
    while (q--) {
        int t, l, r; scanf("%d%d%d", &t, &l, &r);
        if (t == 1) {
            int k; scanf("%d", &k);
            if (k > 0) {
                __m128i ck = _mm_set1_epi32(k);
                int j = l;
                for (int i = l; i + 3 < r; i += 4) {
                    j = i + 4;
                    __m128i a = _mm_loadu_si128((__m128i*) &pos[i]);
                    a = _mm_min_epi32(a, ck);
                    _mm_storeu_si128((__m128i*) &pos[i], a);
                }
                for (int i = j; i < r; i++) {
                    pos[i] = min(pos[i], k);
                }
            } else {
                k = -k;
                __m128i ck = _mm_set1_epi32(k);
                int j = l;
                for (int i = l; i + 3 < r; i += 4) {
                    j = i + 4;
                    __m128i a = _mm_loadu_si128((__m128i*) &neg[i]);
                    a = _mm_min_epi32(a, ck);
                    _mm_storeu_si128((__m128i*) &neg[i], a);
                }
                for (int i = j; i < r; i++) {
                    neg[i] = min(neg[i], k);
                }
            }
        } else {
            long long ans = 0;
            __m128i msum = _mm_setzero_si128();
            int j = l;
            for (int i = l; i + 3 < r; i += 4) {
                j = i + 4;
                __m128i mpos = _mm_loadu_si128((__m128i*) &pos[i]),
                        mneg = _mm_loadu_si128((__m128i*) &neg[i]);
                __m128i msk = _mm_cmplt_epi32(mpos, MINF);
                // elements < INF are valid
                msk = _mm_and_si128(msk, _mm_cmplt_epi32(mneg, MINF));
                __m128i mcomb = _mm_add_epi32(mpos, mneg);
                mcomb = _mm_blendv_epi8(MZERO, mcomb, msk);
                __m128i mfirst = _mm_cvtepi32_epi64(mcomb),
                        msecond = _mm_cvtepi32_epi64(_mm_srli_si128(mcomb, 8));
                msum = _mm_add_epi64(msum, mfirst);
                msum = _mm_add_epi64(msum, msecond);
            }
            long long sumarr[2];
            _mm_storeu_si128((__m128i*) sumarr, msum);
            ans = sumarr[0] + sumarr[1];
            for (int i = j; i < r; i++) {
                if (pos[i] == INF || neg[i] == INF) continue;
                ans += pos[i] + neg[i];
            }
            printf("%lld\n", ans);
        }
    }
}

911G - Mass Change Queries

You have an array $$$A$$$ ($$$1 \le A_i \le 100$$$) of $$$N$$$ ($$$1 \le N \le 2 \cdot 10 ^ 5$$$) integers. You have to process $$$Q$$$ ($$$1 \le Q \le 2 \cdot 10^5$$$) queries, each one in the form $$$l, r, x, y$$$, ($$$1 \le l \le r \le N$$$, $$$1 \le x, y \le 100$$$), meaning that for all $$$l \le i \le r$$$, if $$$A_i = x$$$ change it to $$$A_i = y$$$. Output array $$$A$$$ after all the queries are completed.

Solution

For the replacement of $$$x$$$ to $$$y$$$, we can simply use _mm_cmpeq_epi32 and _mm_blendv_epi8. However, if we try to code it, we will realise that it causes a TLE verdict. To optimise our code even further, we will use 256-bit vectors and store 8-bit integers inside as $$$A_i, x, y \le 100 \le 2^8-1$$$.

To make use of 256-bit vectors, we need to #include <immintrin.h>. Then, replace all __m128i data type with __m256i data type, and rename the functions _mm_... with _mm256_... and ..._si128 with ..._si256.

To use 8-bit integers, we will store the arrays with char and rename the functions ..._epi32 with ..._epi8.

141464454

Code

#include <bits/stdc++.h>
using namespace std;

#include <nmmintrin.h>
#include <immintrin.h>

#pragma GCC target("avx2")

int n, q;
char a[200005];

int main() {
    scanf("%d", &n);
    for (int i = 1; i <= n; i++) {
        scanf("%hhi", &a[i]);
    }
    scanf("%d", &q);
    while (q--) {
        int l, r;
        char x, y; scanf("%d%d%hhi%hhi", &l, &r, &x, &y);
        int j = l;
        __m256i mx = _mm256_set1_epi8(x);
        __m256i my = _mm256_set1_epi8(y);
        for (int i = l; i + 31 <= r; i += 32) {
            j = i + 32;
            __m256i ma = _mm256_loadu_si256((__m256i*) &a[i]);
            __m256i msk = _mm256_cmpeq_epi8(ma, mx);
            ma = _mm256_blendv_epi8(ma, my, msk);
            _mm256_storeu_si256((__m256i*) &a[i], ma);
        }
        for (int i = j; i <= r; i++) {
            if (a[i] == x) {
                a[i] = y;
            }
        }
    }
    for (int i = 1; i <= n; i++) {
        printf("%hhi ", a[i]);
    }
    printf("\n");
    return 0;
}

Conclusion

~~All in all this blog is mostly useless 🤡~~ However, it might be useful for problem setters to code these SIMD solutions to make sure that brute force doesn't pass ~~or if the problem setters troll and this is the intended solution~~. Please feel free to correct me in the comments below if you find any mistakes or ask any questions about this topic.

If you would like to explore more SIMD functions, you can scroll through the first link in the references section. It is a very clear documentation of all the functions, and it also provides all the required targets for it to work (Codeforces does not support AVX-512, if you want to use it in Atcoder, you need to add #pragma GCC target("avx512vl") at the start)

~~Also, don't forget to upvote this blog if you learned something new.~~

References

Комментарии (12)

Написать комментарий?

maomao90

4 года назад, скрыть # |

Auto comment: topic has been updated by maomao90 (previous revision, new revision, compare).

→ Ответить

jamessngg

maomao90 orz

4 года назад, скрыть # ^ |

← Rev. 2 →

jamessngg orz

iLoveIOI

maomao90 jamessngg orz

iLoveIOI Most orz newbie orz orz orz

box

+13

Let's explore even weirder brute force!

epic blog by the way

sslotin

+71

Not many people know, but compilers actually allow you do define your own vector types that feel more like arrays with some operators overloaded to match the relevant intrinsics.

For example, in GCC, here is how you can define vector of 8 integers packed into a 256-bit (32-byte) register:

typedef int v8si __attribute__ (( vector_size(32) ));
// type ^   ^ typename          size in bytes ^

You can then use them in a numpy-like fashion:

v4si a = {1, 2, 3, 5};
v4si b = {8, 13, 21, 34};

v4si c = a + b; // elementwise addition

for (int i = 0; i < 4; i++)
    printf("%d\n", c[i]);

c *= 2; // multiply by scalar

for (int i = 0; i < 4; i++)
    printf("%d\n", c[i]);

Here is how to add two arrays together with it:

v4d a[1000/8], b[1000/8], c[1000/8];

for (int i = 0; i < 1000/8; i++)
    c[i] = a[i] + b[i];

Of course, sometimes you need mask moves, in-register shuffles and other specific operations, and you have to resort to intrinsics for that, but in most cases, vector types cover all your needs.

A large part of my book discusses SIMD and its applications in algorithms. Most of it is still in the draft stage, but the code may be worth looking at if you are interested.

clyring

+10

Since _mm_mult_epi32 can only multiply 2 numbers at a time, ...

_mm_mult_epi32 does not exist. _mm_mul_epi32 has the limitation you describe, but there is also the narrow multiplication primitive _mm_mullo_epi32 which doesn't.

Oops sorry for the typo.

Thanks for informing me about mm_mullo_epi32, I will add it to the blog.

UPD: I have made the changes. Thanks once again for pointing out my mistake

Alan233

Hello, I'm confused:

// Author: wlzhouzhuan
#pragma GCC optimize("Ofast,no-stack-protector,unroll-loops")
#pragma GCC target("sse,sse2,sse3,ssse3,sse4,sse4.1,sse4.2,popcnt,abm,mmx,avx,avx2,fma,tune=native")
#include <bits/stdc++.h>
#include <immintrin.h>
#include <nmmintrin.h>
using namespace std;

void debug_out1(__m128i a) {
    int qwq[4]; _mm_storeu_si128((__m128i*) qwq, a);
    for (int i = 0; i < 4; i++) printf("%d ", qwq[i]);
    puts("");
}

void debug_out2(__m256i a) {
    int qwq[8]; _mm256_storeu_si256((__m256i*) qwq, a);
    for (int i = 0; i < 8; i++) printf("%d ", qwq[i]);
    puts("");
}

int main(){
    __m256i a=_mm256_setr_epi32(20,15,20,15,20,20,20,20);
    a=_mm256_srli_si256(a,4);
    debug_out2(a);
    
    __m128i b=_mm_setr_epi32(20,15,20,15);
    b=_mm_srli_si128(b,4);
    debug_out1(b);
}

When I use _mm256i_srli_si256, it outputs 15 20 15 0 20 20 20 0 (expected 15 20 15 20 20 20 20 0). I think it's because _mm*_srli_si* makes it to (a, b, c, d) (d, e, f, g) and both shift them (b, c, d, 0) (e, f, g, 0). But I don't know how to fix it (I want (a, b, c, d, e, f, g, h) to (b, c, d, e, f, g, h, 0)).

Can you tell me how to deal with this problem? Thanks in advance :)

_mm256_srli_si256 does a left shift within two 128-bit lanes separately as if try were operations on 128-bit registers (many AVX/AVX2 operations do that lane separation for some reason, I guess it's easier to implement in hardware it that way). The Intel reference is quite detailed and provides a pseudocode to sort out issues like that.

I guess the easiest way to rotate a vector is to use a permutation. I haven't tested it, but it should go like this:

typedef __m256i reg;

// counterintuitively, values in "set" instructions go right to left
const reg perm_mask = _mm256_set_epi32(0, 7, 6, 5, 4, 3, 2, 1);
x = _mm256_permutevar8x32_epi32(x, perm_mask);

// if you really need to zero out the last element
const reg last_zero = _mm256_set_epi32(0, -1, -1, -1, -1, -1, -1, -1);
x = _mm256_and_si256(x, last_zero);

This whole sequence is slower, so it may make sense to just switch to plain srli with a single 128-bit register instead. By the way, what do you need it for?

№	Пользователь	Рейтинг
1	Benq	3792
2	VivaciousAubergine	3647
3	Kevin114514	3611
4	jiangly	3583
5	strapple	3515
6	tourist	3470
7	dXqwq	3436
8	Radewoosh	3415
9	Otomachi_Una	3413
10	Um_nik	3376

№	Пользователь	Вклад
1	Qingyu	164
2	adamant	150
3	Um_nik	146
4	Dominater069	144
5	errorgorn	141
6	cry	139
7	Proof_by_QED	136
8	YuukiS	135
9	chromate00	134
9	TheScrasse	134

Блог пользователя maomao90

Introduction

Syntax

Data types

Intrinsics

Loading

Arithmetic

Logical Arithmetic

Reordering

Extracting

Others

Examples

Multiplication

Solution

Maximum Prefix

Solution

855F - Nagini

Solution

911G - Mass Change Queries

Solution

Conclusion

References