std::generator and recursive lambdas in C++23

Hi everyone!

As Codeforces now supports C++23, it seems to be the right time to discuss some of the particularly interesting features.

Some noteworthy ones that were already mentioned elsewhere include:

views::zip that maps two ranges into pairs (A[i], B[i]).
views::enumerate that maps a range into pairs (i, a[i]).
views::adjacent<k> that maps a range into tuples (a[i], ..., a[i+k-1]).
views::cartesian_product that maps two ranges into pairs (A[i], B[j]) with all possible i and j.
Some more specialized views.
ranges::to<Container> that creates a container out of a view, e.g. to<vector>(views::iota(0, n)).
ranges::fold_left and ranges::fold_right, range versions of std::accumulate/std::reduce.
insert_range/append_range/prepend_range/assign_range for containers (not in GCC yet).
print/println for formatted printing (it seems that standard formatters for ranges are not in GCC yet).

But there are also two features that weren't covered in as much detail as they deserve: deducing this and generators.

Deducing this and recursive lambdas

Assume that you want to write a recursive lambda. What are your options? Naturally, you'd try something like this:

    auto fact = [&](int x) {
        return x ? x * fact(x - 1) : 1;
    };

This does not compile, because the lambda body uses fact before its type has been deduced. One way around it is to write function<int(int)> fact = ... instead. While attractive, this makes calls to fact more expensive, and in some cases it might even lead to TLE, e.g. if you use it for depth-first traversal of a big graph.

Until C++23, the best practice was to do something like this:

    auto fact = [&](auto &&self, int x) -> int {
        return x ? x * self(self, x - 1) : 1;
    };
    cout << fact(fact, n) << endl;

While much more efficient in practice, it looks very ugly. C++23 helps us here, allowing us to write this instead:

    auto fact = [&](this auto fact, int x) -> int {
        return x ? x * fact(x - 1) : 1;
    };
    cout << fact(n) << endl;

Just as if it was a proper recursive function!

std::generator

Another interesting feature that I haven't seen mentioned in competitive programming discussions at all is coroutines. Assume that you need to factorize a number. If you depend on Pollard's rho algorithm, your flow probably looks as follows:

    vector<uint64_t> factors; // uint64_t, so that large prime factors are not truncated
    void factorize(uint64_t m) {
        if(is_prime(m)) {
            factors.push_back(m);
        } else if(m > 1) {
            auto g = proper_divisor(m);
            factorize(g);
            factorize(m / g);
        }
    }

And it's always annoying that to store the result, you have to either keep a global vector, or take an output vector as an argument by reference (and pass it around each time). Ideally you'd want to return a vector, but then you might be afraid of accidentally getting a quadratic runtime, and even besides that you'd need to write extra code to merge the results. Coroutines allow us to rewrite it as follows:

    std::generator<uint64_t> factorize(uint64_t m) {
        if(is_prime(m)) {
            co_yield m;
        } else if(m > 1) {
            auto g = proper_divisor(m);
            co_yield std::ranges::elements_of(factorize(g));
            co_yield std::ranges::elements_of(factorize(m / g));
        }
    }
    
    for(uint64_t p: factorize(m)) {
        ...
    }

In this manner, factorize returns a generator, which is basically a view wrapped around the function above: it produces consecutive elements of the range on the fly, suspending execution of the function between accesses to the resulting range. This way, we avoid the need to store the results of the recursive function somewhere, as well as the need to incorporate external logic or callbacks into the function if we want to do something as soon as we get the next element.

From a performance perspective, of course, coroutines may be slightly inferior to having a global vector to store the results. For example, this submission for finding strongly connected components takes 278ms and 108 MB of memory, while its global vector version only needs 229ms and 65 MB. Still, I think coroutines are a pretty nice concept to keep in mind, and they might simplify code or make it better structured in a lot of cases.

    #include <bits/stdc++.h>
    using namespace std;
    namespace std::ranges::views {
        auto solve() {
            return iota(1, 5) | transform([](int a) { return a + 10; });
        }
    };
    int main() {
        for (int i : std::ranges::views::solve())
            cout << i << " ";
        cout << endl;
    }

    vector<uint64_t> factorize(uint64_t m) {
        vector<uint64_t> factors;
        auto inner = [&](this auto inner, uint64_t m) -> void {
            if(is_prime(m)) {
                factors.push_back(m);
            } else if(m > 1) {
                auto g = proper_divisor(m);
                inner(g);
                inner(m / g);
            }
        };
        inner(m);
        return factors;
    }

    std::generator<Node*> traverse(Node* n) {
        if (n) {
            if(n -> left) {
                co_yield std::ranges::elements_of(traverse(n -> left));
            }
            co_yield n;
            if(n -> right) {
                co_yield std::ranges::elements_of(traverse(n -> right));
            }
        }
    }

    #include <generator>

    using namespace std;
    using i64 = long long;

    constexpr int N = 1e6; // or 1e8

    std::generator<int> traverse(int n) {
        if (n < N) {
            if ((n << 1) < N)
                co_yield std::ranges::elements_of(traverse(n << 1));
            co_yield n;
            if ((n << 1 | 1) < N)
                co_yield std::ranges::elements_of(traverse(n << 1 | 1));
        }
    }

    int main() {
        for (int _ : traverse(1))
            ;
        return 0;
    }

    std::generator<int> traverse(int n) {
        if (n < N) {
            if(false) for(auto it: traverse(n << 1));
            else co_yield ranges::elements_of(traverse(n << 1));
            co_yield n;
            if(false) for(auto it: traverse(n << 1 | 1));
            else co_yield ranges::elements_of(traverse(n << 1 | 1));
        }
    }

    traditional function    result: 295201906    time: 133.8241ms
    this auto lambda        result: 295201906    time: 87.658835ms
    this auto &&lambda      result: 295201906    time: 94.996726ms
    auto lambda             result: 295201906    time: 94.969018ms
    auto &&lambda           result: 295201906    time: 94.40388ms

Comments (29)

Gapp1e

It smells like Python.

beaaaan

is there a way to make std::print not flush?

drdilyor

If you do

using namespace std;
using namespace std::ranges;
using namespace std::ranges::views;

then some stuff will become available from multiple namespaces, and C++ gives an ambiguity error. This happens, for example, with sort, which is available as both std::sort and std::ranges::sort. Same with iota and a lot of other things.

If you don't want to type ranges:: and views:: every time, then use this trick:

This way, in case of ambiguities, the compiler will resolve to views and ranges library.

I learned this trick from someone's submission, idr.

hey arent you that guy on the monkeytype discord

i am

Neev4321

Modifying the namespace std is strictly undefined behavior in C++. This is a bad idea.

I mean the compiler isn't going to sue us for that.

adamant

The compiler is legally allowed to do anything in undefined behavior, so this includes suing us...

bramar2

Can someone explain why using function<int(int)> fact = ... is more expensive?

std::function is a complex object that can store lambdas, function pointers and functors. If you store a lambda, it may allocate on the heap (a significant performance loss). Also, the compiler (and CPU (?)) can't reason about which function you are going to call (a similar performance loss as with function pointers).

Thanks for explaining it. I just tested and found that std::function can be 3x slower than auto&& for just a fibonacci lambda (even slower for more complex ones).
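
For anyone who wants to reproduce such a comparison, the two variants might look like the following minimal sketch. The names fib_function and fib_self are made up here, and no timing code is included; the point is only that both compute the same values while the first one routes every recursive call through std::function's type-erased call operator.

```cpp
#include <cassert>
#include <functional>

// Recursion through std::function: every call is an indirect, type-erased call.
int fib_function(int n) {
    assert(n >= 0);
    std::function<int(int)> fib = [&](int x) -> int {
        return x < 2 ? x : fib(x - 1) + fib(x - 2);
    };
    return fib(n);
}

// Recursion by passing the lambda to itself: the callee's exact type is known.
int fib_self(int n) {
    assert(n >= 0);
    auto fib = [](auto&& self, int x) -> int {
        return x < 2 ? x : self(self, x - 1) + self(self, x - 2);
    };
    return fib(fib, n);
}
```

Timing both on a moderately large n (with the same compiler flags) should show whether the gap is present on a given setup.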

Igorjan94

And it's always annoying that to store the result, you have to either keep a global vector, or take output vector as an argument by reference

Pardon?

Well, "global" in the sense that it should be defined outside of the recursive function's scope and managed externally (or outside the lambda, like in your example). I suppose one other way to circumvent it is for the function to take a callback as an argument and invoke it with each produced value.
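
A rough sketch of that callback variant, for comparison. Trial division (the hypothetical smallest_divisor) stands in for is_prime/proper_divisor, and the names factorize_cb and factors_of are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for proper_divisor: smallest d > 1 dividing m (m itself if m is prime).
std::uint64_t smallest_divisor(std::uint64_t m) {
    assert(m >= 2);
    for (std::uint64_t d = 2; d * d <= m; ++d)
        if (m % d == 0) return d;
    return m;
}

// Calls emit(p) for every prime factor p of m, with multiplicity.
template <typename Emit>
void factorize_cb(std::uint64_t m, Emit&& emit) {
    if (m <= 1) return;
    std::uint64_t g = smallest_divisor(m);
    if (g == m) {
        emit(m);
    } else {
        factorize_cb(g, emit);
        factorize_cb(m / g, emit);
    }
}

// The caller supplies the storage (or any other action) through the callback.
std::vector<std::uint64_t> factors_of(std::uint64_t m) {
    std::vector<std::uint64_t> out;
    factorize_cb(m, [&](std::uint64_t p) { out.push_back(p); });
    return out;
}
```

Compared to a generator, the consumer's logic has to live inside the callback, so something like stopping early needs extra plumbing.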

peltorator

Ok, fine... I replaced all the recursive function<>s with auto &&self monstrosity in my prewritten library. It was a sad decision to make.

If you're fine using C++23, deducing this should be a much nicer alternative :)

But C++23 does not exist everywhere yet, so I am not ready to put it in my templates.

Syrian

A question regarding the generator part, suppose I ran an in-order traversal on a perfect binary tree using this technique:

Ignore the bugs if any.

Suppose the tree consists of n vertices. Would the traversal be O(n) or O(n log n), given that each vertex v would need to propagate through depth(v) calls to reach the first call to "traverse"?

After timing the code below with N = 1e6 and N = 1e8 (see line 6), I believe that it's O(n log n).

(Compiled with -std=gnu++23, without optimization flags.)

This code used ~310ms on N = 1e6, and used more than 8s on N = 1e8.

P.S. If there's anything that I'm doing wrong, please tell me :)

Thanks for the analysis. I got similar results. I also made an iterative version and it was much faster than the recursive one.

While there is a certain overhead to using generators, I don't think it's just depth(v), because e.g. the following code runs quickly enough:

generator<int> traverse(int n) {
    if (n < N) {
        co_yield n;
        co_yield ranges::elements_of(traverse(n + 1));
    }
}

A true depth(v) overhead per element would make this quadratic. But I also don't know how exactly it is optimized out: running the code above from Gapp1e in Codeforces invocations indeed takes 108ms on N=1e6, but suddenly around 8s on N=1e7 (not N=1e8).

It's also worth pointing out that elements_of was specifically designed with avoiding the issue of depth(v) calls in mind (see P2168R3), and the intended behavior is for the coroutine to recursively delegate control when it yields elements_of, rather than to copy all elements into the returned view.

I think the underlying issue might simply be that constructing elements_of is very expensive for small subroutines. Compare the following three versions:

843ms, O(n log n)

std::generator<int> traverse(int n) {
    if (n < N) {
        if ((n << 1) < N)
            for(auto it: traverse(n << 1)) co_yield it;

        co_yield n;

        if ((n << 1 | 1) < N)
            for(auto it: traverse(n << 1 | 1)) co_yield it;
    }
}

155ms, O(n)

std::generator<int> traverse(int n) {
    if (n < N) {
        if ((n << 1) < N)
            co_yield std::ranges::elements_of(traverse(n << 1));

        co_yield n;

        if ((n << 1 | 1) < N)
            co_yield std::ranges::elements_of(traverse(n << 1 | 1));
    }
}

But at the same time:

1061ms, O(n log n)

std::generator<int> traverse(int n) {
    if (n < N) {
        for(auto it: traverse(n << 1)) co_yield it;
        co_yield n;
        for(auto it: traverse(n << 1 | 1)) co_yield it;
    }
}

1436ms, O(n)

std::generator<int> traverse(int n) {
    if (n < N) {
        co_yield std::ranges::elements_of(traverse(n << 1));
        co_yield n;
        co_yield std::ranges::elements_of(traverse(n << 1 | 1));
    }
}

So, elements_of seems to perform terribly when the wrapped coroutine yields just 1, or generally a small number of elements. Here's a more direct example:

1749ms, O(n), N = 1e7

std::generator<int> traverse(int n) {
    if (n < N) {
        if(N / n < 8) for(auto it: traverse(n << 1)) co_yield it;
        else co_yield ranges::elements_of(traverse(n << 1));
        co_yield n;
        if(N / n < 8) for(auto it: traverse(n << 1 | 1)) co_yield it;
        else co_yield ranges::elements_of(traverse(n << 1 | 1));
    }
}

bicsi

I don't understand why std::generator isn't just rewritten to callback argument containing the continuation...

Here is an even funnier example:

This works in 1921ms, but removing if(false) makes it run in 9233ms.

Also noteworthy: it seems to be Codeforces-specific. Locally they both run much faster for me, and N=1e8 runs in under 5s.

EarthMessenger

Should I use [](this auto&& self) or [](this auto self)? I’m curious about the difference between using && and not using &&.

I benchmarked recursive factorial implementations with a traditional function, this auto/this auto&& lambdas, and auto/auto&& lambdas.

benchmark code

#include <chrono>
#include <cstdint>
#include <print>
#include <string>

constexpr int M = 998'244'353;

template <typename F> void measure_time(std::string info, F &&f) {
  const auto t1 = std::chrono::high_resolution_clock::now();
  const auto res = f();
  const auto t2 = std::chrono::high_resolution_clock::now();
  const std::chrono::duration<double, std::milli> ms = t2 - t1;
  std::println("{}\tresult: {}\ttime: {}ms", info, res, ms.count());
}

constexpr int N = 10'000'000;

int fact(int x) {
  if (x == 0)
    return 1;
  return (long long)fact(x - 1) * x % M;
}

int main() {
  measure_time("traditional function", []() { return fact(N); });

  measure_time("this auto lambda", []() {
    auto fact = [](this auto self, int x) -> int {
      if (x == 0)
        return 1;
      return (long long)self(x - 1) * x % M;
    };
    return fact(N);
  });

  measure_time("this auto &&lambda", []() {
    auto fact = [](this auto &&self, int x) -> int {
      if (x == 0)
        return 1;
      return (long long)self(x - 1) * x % M;
    };
    return fact(N);
  });

  measure_time("auto lambda", []() {
    auto fact = [](auto self, int x) -> int {
      if (x == 0)
        return 1;
      return (long long)self(self, x - 1) * x % M;
    };
    return fact(fact, N);
  });

  measure_time("auto &&lambda", []() {
    auto fact = [](auto &&self, int x) -> int {
      if (x == 0)
        return 1;
      return (long long)self(self, x - 1) * x % M;
    };
    return fact(fact, N);
  });
}

With result:

It seems that the speed of all these lambdas is nearly the same, and the lambdas are faster than the traditional function (why?).

I think there is some comprehensive info about why std::function is really bad in nor's blog.

shsh

To anyone reading in the future, it looks like the recursive lambda syntax shown here is supported only on MSVC, according to this article. Although the article is from 3 years ago, I just tried both Clang and g++ with this syntax and neither worked.

If any other Mac users see this, please let me know if you found a workaround.

It is supported on GCC for sure, as it works on Codeforces and I tested it locally before posting. Generally,

For GCC, it compiles since GCC 14.1: https://godbolt.org/z/fnPTnMn7x.
For Clang, it compiles since 19.1.0: https://godbolt.org/z/11Trj46h3.

Your issue on Mac could be that it symlinks GCC to Clang, or that you don't pass the -std=c++23 flag.

Oh, thanks a lot! I was using the -std=c++23 flag, but I didn't realize I needed to be using Clang 19 as well (I was using v16 before).

adamant's blog
