### gabrielwu's blog

By gabrielwu, history, 3 years ago,

(You can find a slightly different version of this article, tailored more for a non-CP audience, on my website.)

Thank you to smax for his feedback on this post.

Prerequisites:

• expected value, linearity of expectation
• basic linear algebra
• familiarity with string algorithms such as KMP (only in certain sections)

## The Problem

Say you're given a string of coin flips, such as HTTH or TTTHHT. What is the expected number of times you must flip a coin until you encounter that string?

A common initial (incorrect) intuition about this problem is that all strings of length $n$ should have the same answer -- something like $2^n$. We sense that there should be some sort of symmetry between heads and tails, so it feels odd that HHHH should appear any earlier or later than HTHT. But in this case, our intuition is simply wrong. It is true that the answer will always be on the order of $2^n$ (specifically, bounded between $2^{n}$ and $2^{n+1}-2$), but the string itself does matter, as we shall see.

Let's consider a simple example. How many coins do you need to flip before getting a H? For this particular string, the number of flips you need creates the Bernoulli distribution, which has an expected value of $2$. One slick way to see this is to imagine computing the answer by conducting many consecutive "trials" by flipping the coin a large number of times (say you do $N = 10^9$ total flips). The total number of trials you end up completing is simply the number of heads you encounter, which should be close to $N/2$. And since the expected number of flips per trial is approximated by the average number of flips over a large number of trials, our answer is $\frac{N}{N/2} = 2$. While this method of considering a large number of consecutive trials works well for the string H, we will need to find a more general method for longer strings.

(why?)

## States

The solution is to use states, a technique common to expected value problems. The idea of states is simple: define a bunch of variables to represent different stages of progress towards a goal. Create relations among the variables corresponding to "transitions" from one state to another, then solve the system of (usually) linear equations for the answer. In our case, when given string $S$ of length $n$, we will use $n+1$ state variables: $e_0, e_1, \dots, e_n$. The variable $e_i$ will represent the expected number of additional flips required to obtain string $S$ given that we've just flipped the first $i$ characters of $S$. By definition, $e_n$ represents the expected number of flips to obtain $S$ if we've already flipped all of $S$, so $e_n = 0$. Our final answer will be $e_0$ (because we start out with nothing flipped).

Formally, what we're doing here is a creating "state function" $f: \{\text{H}, \text{T}\}^* \to \{0, 1, \dots, n\}$ that maps every possible string of coin flips to a unique state in $\{0, 1, \dots, n\}$. In this case, our state function $f$ takes in a string $s$ and outputs the largest nonnegative integer $i$ such that $|s| \geq i$ and the last $i$ characters of $s$ are the first $i$ characters of $S$ (note the difference between $s$ and $S$ -- little $s$ is the input to $f$, while big $S$ is our target string which acts as a global variable). For example, if $S$ is HTHH, then the empty string gets assigned state $0$, HHT gets state $2$, HHHTHTH gets state $3$, and THTHH gets state $4$.

If you're interested in some more formality...

On a more intuitive level, what we're doing is finding a way to compress the infinite set of coin flip strings into a finite set of states. We do this in a way that preserves all relevant information to the number of coin flips needed to get accepted, including how likely it is to transition from one state to another. This will allow us to build a finite system of equations (as opposed to an infinite system of equations) that we can solve to find the answer.

So what exactly is this system of equations? Take the example in which $S$ is HTHH. For each state we have an equation that relates its expected value to the expected values of the states it can transition to. If we currently have the first $0$ characters of $S$ (which would correspond to flip sequences such as the empty string, or TTT, or HTHTT), then if we flip an H next we will end up in state $1$. But if we flip a T next, we will remain in state $0$. By linearity of expectation, this lets us write the equation $e_0 = 1 + \frac{1}{2}e_1 + \frac{1}{2}e_0$. Similarly, if we currently are in state $1$, then flipping a H will keep us in state $1$, while flipping a T will transition us to state $2$. So $e_1 = 1 + \frac{1}{2}e_1 + \frac{1}{2}e_2$. Continuing this process, we get the following equations:

State New state (H flipped) New state (T flipped) Equation
$0$ (...) $1$ (...H) $0$ (...T) $e_0 = 1 + \frac{1}{2}e_1 + \frac{1}{2}e_0$
$1$ (...H) $1$ (...HH) $2$ (...HT) $e_1 = 1 + \frac{1}{2}e_1 + \frac{1}{2}e_2$
$2$ (...HT) $3$ (...HTH) $0$ (...HTT) $e_2 = 1 + \frac{1}{2}e_3 + \frac{1}{2}e_0$
$3$ (...HTH) $4$ (...HTHH) $2$ (...HTHT) $e_3 = 1 + \frac{1}{2}e_4 + \frac{1}{2}e_2$
$4$ (...HTHH) n/a n/a $e_4 = 0$

Now we can solve this system of four equations and four unknowns with our favorite linear system technique (for example, Gaussian elimination). In the end, we get $e_0 = 18$, which is our final answer to the question "What is the expected number of coin flips I need to perform before encountering HTHH?"

For a general string $S$, how do we systematically generate these equations? Each of these equations comes in the form $e_i = 1 + \frac{1}{2}e_j + \frac{1}{2}e_k$. In particular, for a given state $i$, we get that $j$ is the state of the string $S_{0\dots i-1}$ H (i.e. the first $i$ characters of $S$ followed by a H), while $k$ is the state of $S_{0\dots i-1}$ T.

This suffices as a complete algorithm to calculate the answer for any given string $S$. First, generate the $n$ equations by computing the states of the appropriate strings. Then, solve the system of equations for $e_0$. In the next section, I will analyze the time complexity of this algorithm and present two powerful optimizations to the naive strategy.

## Time complexity

Naively, we can compute the state of a string $s$ by checking if the $i$th suffix of $s$ is equal to the $i$th prefix of $S$ for all possible $i$. Each string comparison can be done in $O(n)$ and there are $O(n)$ possible $i$ values, so this takes $O(n^2)$ time. We will need to call this state function for each equation we generate, so it takes $O(n^3)$ time to generate all $n$ equations. Once we have these equations, Gaussian elimination takes an additional $O(n^3)$ time, so overall our naive algorithm has a cubic time complexity.

Technically, we must add on an extra factor of $n$ to account for computation using arbitrarily large numbers. The integer values encountered in this problem grow exponentially in $n$, so they can be $O(n)$ digits long. When the numbers are this big, even addition takes $O(n)$ time (and speicial BigInteger classes will have to be used in certain programming languages because the values will not fit into 64-bit words). But in this analysis I will treat the standard operations of addition and subtraction as taking only $O(1)$ time (perhaps we only need the answer modulo some large prime). This means I will omit the extra factor of $n$ in the time complexities.

Now that we have established a straightforward $O(n^3)$ algorithm, I will show how we can make this process much more efficient. The result will be linear-time ($O(n)$) algorithm. This optimization must affect both stages of the algorithm: we must be able to generate the $n$ equations in linear time, and we must be able to solve the resultant system in linear time.

## The Longest Prefix-Suffix (LPS) Array

As explained earlier, the problem of generating the $n$ equations reduces to computing the states of $S_{0\dots i-1}$ H and $S_{0\dots i-1}$ T for all $i$ from $0$ to $n-1$. For any given $i$, it is obvious which state $S_{0\dots i-1}S_i$ corresponds to: it's just state $i+1$. So the only question is what state the string $S_{0\dots i-1}\overline{S_i}$ corresponds to. To compute this efficiently, we will introduce the Longest Prefix-Suffix (LPS) array, which some readers may recognize as the cornerstone of the Knuth-Morris-Pratt (KMP) string searching algorithm. We define $LPS$ to be an integer array of length $n$, where $LPS[i]$ (for $0 \leq i \leq n-1$) represents the length of the longest proper prefix of $S_{0\dots i}$ that is also a suffix of $S_{0\dots i}$ (we will call this a proper prefix-suffix). For example, if $S =$ HHTHHTH, then $LPS = [0, 1, 0, 1, 2, 3, 4]$ because:

• H has length 1, so it has no positive-length proper prefix
• the first character of HH is the last character of HH
• there is no positive-length proper prefix of HHT that is also a suffix
• the first character of HHTH is the last character of HHTH
• the first two characters of HHTHH are the last two characters of HHTHH
• the first three characters of HHTHHT are the last three characters of HHTHHT
• the first four characters of HHTHHTH are the last four characters of HHTHHTH

In addition to $LPS$, we will have three more arrays of length $n$: $LIPS$, $LPSH$, and $LPST$. These arrays are not typically used in the KMP algorithm, but they are useful for solving this problem. We define them as follows. $LIPS[i]$ represents the length of the longest proper prefix-suffix of $S_{0\dots i-1}\overline{S_i}$ (the $I$ stands for "last character inverted"). $LPSH[i]$ represents the length of the longest proper prefix-suffix of $S_{0\dots i}$ that has H as the next character after the prefix, or $-1$ if this does not exist. In other words, for $LPSH[i] \neq -1$ we must have $S_{LPSH[i]} =$ H. Similarly, $LPST[i]$ represents the length of the longest proper prefix-suffix of $S_{0\dots i}$ that has T for the next character after the prefix, or $-1$ if this does not exist. In the example from above ($S =$ HHTHHTH), we have $LIPS = [0, 0, 2, 0, 0, 2, 0]$, $LPSH = [0, 1, 0, 1, 1, 3, 4]$, and $LPST = [-1, -1, -1, -1, 2, -1, -1]$.

It turns out that we can generate all four of these arrays in linear time using dynamic programming! Here's my C++ code:

Note
//initialize base cases
LIPS[0] = 0;
LPS[0] = 0;
if(S[0] == 'H'){
LPSH[0] = 0;
LPST[0] = -1;
} else {	// S[0] == 'T'
LPSH[0] = -1;
LPST[0] = 0;
}

for(int i=1; i<n; i++){

// set LPS[i] and LIPS[i]
if(S[i] == 'H'){
LPS[i] = LPSH[i-1] + 1;
LIPS[i] = LPST[i-1] + 1;
} else {	// S[i] == 'T'
LPS[i] = LPST[i-1] + 1;
LIPS[i] = LPSH[i-1] + 1;
}

// set LPSH[i] and LPST[i]
if(S[LPS[i]] == 'H'){
LPSH[i] = LPS[i];
LPST[i] = (LPS[i] > 0 ? LPST[LPS[i]-1] : -1);
} else {	// S[LPS[i]] == 'T'
LPST[i] = LPS[i];
LPSH[i] = (LPS[i] > 0 ? LPSH[LPS[i]-1] : -1);
}
}


Let's break down the transitions here. The first observation is that $LPS[i]$ must correspond to a prefix-suffix of $S_{0\dots i}$ that is one longer than a prefix-suffix of $S_{0\dots i-1}$. But it cannot extend just any prefix-suffix of $S_{0\dots i-1}$. It must extend one that agrees with $S_i$ on the next character. If $S_i =$ H, then this is exactly the prefix-suffix that $LPSH[i-1]$ represents, so we set $LPS[i] = LPSH[i-1] + 1$. Otherwise, we set $LPS[i] = LPST[i-1] + 1$. We can determine $LIPS[i]$ similarly -- just imagine $S_i$ is inverted and treat it like $LPS[i]$.

Now, without loss of generality, say that the longest prefix-suffix of $S_{0\dots i}$ has H as the next character after that prefix (the other case is handled symmetrically). Then we can simply set $LPSH[i]$ to be the same length as $LPS[i]$. We also know that $LPST[i]$ must be strictly smaller than $LPS[i]$, i.e. it must indicate the longest proper prefix-suffix of $S_{0\dots LPS[i]-1}$ that has a T as the next character (if this exists). But this is exactly what $LPST[LPS[i]-1]$ represents! This gives us $O(1)$ transitions for all four arrays.

Long story short, we can build the four arrays in linear time. How does this help us? Well, notice that the state of a string of the form $S_{0\dots i-1}\overline{S_i}$ is equal to its longest proper prefix-suffix. This is because its first $j \leq i$ characters are guaranteed to also be the first $j$ characters of $S$. But we already know how to generate the longest proper prefix-suffix of $S_{0\dots i-1}\overline{S_i}$ -- this is just $LIPS[i]$! Thus, we can easily generate our $n$ equations by reading off the values of $LIPS$.

## A Well-Behaved Matrix

All that's left is to show how we can solve our system of $n$ equations in linear time. In general, this is impossible. But in this specific problem there are many nice properties of our system. For example, it is extremely sparse: each equation $e_i = 1 + \frac{1}{2}e_j + \frac{1}{2}e_k$ relates at most three different variables. We also know that (swapping $j$ and $k$ if necessary) we get the conditions $j \leq i$ and $k = i+1$ for each equation. If we rewrite each equation in the form $-e_j + 2e_i - e_{i+1} = 2$ and aggregate them into the standard linear system matrix used in Gaussian elimination, the structure becomes clearer (the following matrix is for $S =$ HTHH):

$\begin{bmatrix} 1 & -1 & 0 & 0 & 0 & 2 \\ 0 & 1 & -1 & 0 & 0 & 2 \\ -1 & 0 & 2 & -1 & 0 & 2 \\ 0 & 0 & -1 & 2 & -1 & 2 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix}$

In all rows except the last, the main diagonal is filled with $2$ s, with a $-1$ just to the right. There is also a $-1$ somewhere to the left (or on top of) the $2$ (if it is on top of the $2$, it is just written as a $1$). As with any system of linear equations, our goal is now to reduce this matrix by adding rows to each other. Here's where the well-behaved structure of the matrix becomes useful. Define a row to be almost-reduced if it has a $1$ on the main diagonal, a $-1$ just to the right of it, and $0$ s everywhere else besides the last column. In the matrix above, the top two rows are almost-reduced.

Observe that if the first $i$ rows of the matrix are already almost-reduced, then it is easy to almost-reduce the $i+1$th row. Say that row $i+1$ has a $-1$ in position $j < i+1$, a $2$ in position $i+1$ and another $-1$ in position $i+2$. Then if we add all rows $j, j+1, \dots, i$ to row $i+1$, it will become almost-reduced because the $1$ s and $-1$ s end up telescoping in on each other.

$\begin{bmatrix} 1 & -1 & 0 & 0 & 0 & 2 \\ 0 & 1 & -1 & 0 & 0 & 2 \\ -1 & 0 & 2 & -1 & 0 & 2 \\ 0 & 0 & -1 & 2 & -1 & 2 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix} \implies \begin{bmatrix} 1 & -1 & 0 & 0 & 0 & 2 \\ 0 & 1 & -1 & 0 & 0 & 2 \\ 0 & 0 & 1 & -1 & 0 & 6 \\ 0 & 0 & -1 & 2 & -1 & 2 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix}$
The first two rows are added to the third row, resulting in it becoming almost-reduced.

In this manner, we can almost-reduce a row in $O(1)$ time by maintaining prefix sums of the values in the last column of the matrix. In fact, the only values we ever need to store besides these prefix sums are the locations of the left-most $-1$ in each row (which correspond to $LIPS[i]$). Once we have almost-reduced all rows, the final answer $e_0$ is the sum of the entire right-most column (the last prefix sum). All of this can be done in linear time.

## Conclusion

The algorithm presented here computes the expected number of coin flips required to achieve a given string in $O(n)$ time. As has been pointed out in the comments, variations of this problem have appeared on contests in the past, and there is a variety of potential approaches -- this is just one way. Thank you for reading!

Here is my final code.

• +185

By gabrielwu, history, 3 years ago,

UPD: Editorials have been released!

Thank you to everyone who participated in the mBIT Spring 2021 competition today! All of the information on this blog post is also available in our archive, which includes the full leaderboard.

# Results

First, we would like to congratulate the team Texas (Adam Bertelli, Dilhan Salgado, Zack Lee) for being the first team to find all of the Among Us references hidden in the problems! Without further ado, we are pleased to announce the division winners:

Winning Teams:

1. KoreshaMaksim Gorokhovskii, Ramazan Rakhmatullin

2. jharada fan clubHuaiyu Wu, Antonio Molina, Yuting Zhou, Maryam Bahrani

3. ඞYTN3M (HS team) — Elliot En-Yi Liu (National Experimental High School at Hsinchu Science Park), Ashley Aragorn Khoo (NUS High School), Udit Sanghi (Bhavan Vidyalaya), Panchkula Kostiantyn Savchuk (Lyceum No. 2 of Korostyshiv)

4. ScrubsRaymond Kang, Li Yao'an, Koh Li Chen

5. we can fill this out later but I have to put something to save (HS team) — Moses Xu (Appleby College), Eric Pei (Don Mills CI), Chris Trevisan (William Lyon Mackenzie CI), Victor Gao (Victoria Park CI)

6. Coast (HS team) — Anand John (Brandywine High School), Nathan Chen (Garnet Valley High School), Richard Qi (Princeton High School), Siyong Huang (Homestead High School)

### Standard Division

Winning HS Team: Remdi™Nishchay Bhutoria (Nehru Smarka Vidyalaya), Vishesh Saraswat (DPS RK Puram), Aryan Raina (Sri Guru Harkrishan Model School), Ritul Kumar Singh (DPS Bokaro Steel City)

First Place MS Team: The Purple DuckDaniel Wu (Tilden MS), Paul Trusov (Tilden MS), David Sy (Tilden MS)

• +136

By gabrielwu, history, 3 years ago,

The Montgomery Blair HS Computer Team is proud to present the fourth iteration of the semi-annual Montgomery Blair Informatics Tournament (mBIT), which will be held online from 10:00 AM — 2:00 PM EDT on Saturday, June 12 (6/12/21). If you are interested in competing, please register at https://mbit.mbhs.edu/. The contest will be hosted on our personal servers at https://mbit.live/. All problems were written by the Montgomery Blair Computer Team, including 12tqian, gabrielwu, galen_colin, smax, meiron03, Blastman, alien_lover, czhang2718, and spiralsim. Special thanks to our testers, Monogon and DenjellBoone!

mBIT is split into two divisions: Standard and Advanced. Teams may choose which division to compete in. Standard division problems are roughly USACO Bronze to Silver difficulty, while Advanced division problems range from USACO Silver to Platinum. Most Standard problems could be found on a Div 3 contest or early on in a Div 2 contest. Advanced problems are more comparable to Div 1 (and more difficult Div 2) problems.

mBIT allows teams of up to 4 competitors! Anybody may compete, regardless of their age or country. There will be Amazon gift cards reserved for the top high school teams in both divisions, in addition to the top overall teams in the Advanced division. To qualify for "high school status," all members of your team must be current high school (or middle school) students.

Prizes (per person):

• $50,$25, $25 for the first, second, and third HS teams in Advanced •$25, $10,$10 for the first, second, and third overall teams in Advanced that are not among the top three HS teams

• $25,$10 for the first HS team and first MS team in Standard

• All prize winners will get a Wolfram|Alpha Pro subscription, and the top middle school Standard team will receive AoPS coupons.

Here are the mBIT problems from our most recent contest last November:

All information, including problems from our other two contests, can be found on our website. Registration will remain open up until the day of the contest.

Message me or email mbit.organizers@gmail.com if you have any questions! We hope to see you compete!

mBIT is proudly sponsored by the amazing people at United Therapeutics, Wolfram Research, and AoPS! We are working in partnership with the Competitive Programming Initiative (CPI).

• +230

By gabrielwu, history, 4 years ago,

Thank you to everyone who participated in the mBIT Fall 2020 competition today! All of the information on this blog post is also available in our archive, which also includes the full leaderboard.

# Results

We are pleased to announce the winners!

Winning Teams:

1. Zagreb Oblutci (HS team) — Dorijan Lendvaj (XV. gimnazija Zagreb), Krešimir Nežmah (XV. gimnazija Zagreb), Patrick Pavić (XV. gimnazija Zagreb)

2. qncqwy6w69 (HS team) — Suneet Mahajan (Douglas Community School), Kostia Savchuk (коростишівське нвк "школа-ліцей"), Ashley Khoo (NUS High School), Fedor Romashov (The Advanced Educational Scientific Center MSU)

3. Peer Pressure Aaron to go on a Date with BessieViraj Maddur, Aditya Parulekar, Steven Cheng, Aaron Lamoreaux

4. cantsolvediv2a (HS team) — Hankai Zhang (Detroit Country Day School), Anthony Wang (Ladue Horton Watkins HS)

5. jharada fan clubHuaiyu Wu, Yuting Zhou, Antonio Molina Lovett, Maryam Bahrani

6. Coastal Demolishers (HS team) — Richard Qi (Princeton HS), Anand John (Brandywine HS), Nathan Chen (Garnet Valley HS), Shiva Oswal (Seven Springs Academy)

### Standard Division

Winning Team: <input class="input" placeholder="Insert Creative Name Here">Ryan Bai (Carmel Valley MS)

Second Place MS Team: Watermelon JuiceDaniel Wu (Tilden MS), David Sy (Tilden MS), Paul Trusov (Tilden MS)

• +95

By gabrielwu, 4 years ago,

If you are interested in participating in the Montgomery Blair Informatics Tournament (mBIT), which will be held online from 12:00-4:00 PM EST on Saturday, November 14 (11/14/20), please register at https://mbit.mbhs.edu/. The contest will be hosted on our personal servers at https://mbit.live/. All problems were written by the Montgomery Blair HS Computer Team, including 12tqian, gabrielwu, galen_colin, smax, meiron03, Blastman, czhang2718, and spiralsim. Special thanks to balbit for spending hours test solving our hardest problems!

mBIT is split into two divisions: Standard and Advanced. Teams may choose which division to compete in. Standard division problems are roughly USACO Bronze to Silver difficulty, while Advanced division problems range from USACO Silver to Platinum. Most Standard problems could be found on a Div 3 contest or early on in a Div 2 contest. Advanced problems are more comparable to Div 1 (and more difficult Div 2) problems.

mBIT allows teams of up to 4 competitors! Anybody may compete, regardless of their age or country. There will be Amazon gift cards reserved for the top high school teams in both divisions, in addition to the top overall teams in the Advanced division (this means that non-high school teams in the Advanced division are eligible for prizes). To qualify for "high school status," all members of your team must be current high school (or middle school) students.

UPDATE: Problems, editorials, and results are available in this blog post.

Prizes (per person, subject to change):

• $50,$25, $25 for the first, second, and third HS teams in Advanced •$25, $10,$10 for the first, second, and third overall teams in Advanced that are not among the top three HS teams

• $25,$10 for the first HS team and first MS team in Standard

• All prize winners will get an exclusive mBIT T-shirt (we can only ship within the US) and a Wolfram|Alpha Pro subscription (AoPS coupons for the MS team).

Here are the mBIT problems from our most recent contest in June:

All information, including problems from our contest last November, can be found on our website. Registration will remain open up until the day of the contest.

Message me or email mbit.organizers@gmail.com if you have any questions! We hope to see you compete!

UPD: mBIT is now proudly sponsored by the amazing people at United Therapeutics, Wolfram Research, and Art of Problem Solving (AoPS)!

• +163

By gabrielwu, 4 years ago,

If you are interested in participating in the Montgomery Blair Informatics Tournament, which will be held online from 1:00-4:00 PM EDT on Sunday (6/7/20), please register at https://mbit.mbhs.edu/.

Teams of up to 4 are allowed! Anyone can compete, but only teams of US high school (or middle school) students can win Amazon gift cards for prizes:

• $50 per person for 1st place in Advanced Division •$25 per person for 1st place in Standard Division, 2nd place in Advanced Division, and 3rd place in Advanced Division

If you want to get a sense of what mBIT problems are like, take a look at the problems from our contest last November:

Message me or email mbit.organizers@gmail.com if you have any questions!

UPDATE: The contest is over! You can view all results on our website.

Here are the problems:
Standard problems