Some reflections as a round setter

#	User	Rating
1	Benq	3792
2	VivaciousAubergine	3647
3	jiangly	3631
4	Kevin114514	3574
5	maroonrk	3521
6	strapple	3515
7	Radewoosh	3461
8	tourist	3428
9	turmax	3378
10	Um_nik	3376

#	User	Contrib.
1	Qingyu	161
2	adamant	147
3	Um_nik	145
4	Dominater069	142
5	errorgorn	140
6	cry	138
7	Proof_by_QED	136
8	YuukiS	135
9	chromate00	134
10	soullless	133

I was fortunate enough to have the opportunity to co-organize a div3 round Codeforces Round 974 (Div. 3) last week, and whilst my memory is fresh, I'd like to share some thoughts which might be of interest to future setters and users.

The project started as a mere aspiration 6 months ago, when I contributed to Codeforces Round 946 (Div. 3). It gave me some confidence to think about setting a round, and Robin Hood seemed like a fun theme to weave some legends around. Over the next months, I collected a few ideas and gathered enough courage to approach the div3 coordinator Vladosiya, who was surprisingly 'chill' about my obvious lack of experience and low rating. I enlisted ChairmanFMao and Filikec, and the round became a summer pastime for us.

So, don't be put off by your rating or experience. If you come across nice ideas, discuss them, collect them, you might be the next contributor. Setting problems was quite straightforward with the polygon system. Like everything else, there is a learning curve, but most things are intuitive. The reward is that you can go from having an idea to a working problem in literally ten minutes.

Statement writing can be tricky. It's obvious that, with 30k people reading it, your statement will be read in every possible way, so you have to make everything as clear as possible. We tried hard, but it's nearly impossible to achieve. For example, a few people in contest queried whether the average in 2014C - Robin Hood in Town should be rounded to an integer. This was not the case and was not implied, but still we sent out a notification. In hindsight, we could have added that in the original statement. At times, it was hard to make sure the problems and the Robin Hood stories are framed naturally and not in a contrived or frivolous way. Our writing skills were tested, but we are happy to have tried.

Problem difficulties are hard to judge. For example, I originally thought 2014G - Milky Days and its 'frugal' variation involve some 'gentle' implementation, and could occupy the postions of C, D or E. In fact, I still don't understand why it ended up being the hardest problem of the set. I mean, stack is surely not as advanced (or as difficult) as xor hashing? As a setter, you have days and weeks to think about a problem, so it can appear easier than it really is. I mean, I probably wouldn't solve G either in a contest, like most people. Testers helped us out, and we adjusted the questions as explained in the editorial.

Problem balance was very much on our mind. We rejected some problems and add others for that reason. I love Dijkstra and DP, but I still need to improve on those topics. I often mess up the implementation in contests myself, but thinking about these problems have definitely improved my own understanding of these algorithms. We varied problems to cover different tpoics, but I dare not set a proper number theory problem, because I have an allergy to Bézout's theorem, but I am working on it. Maybe in the future.

We checked all problems on ChatGPT, and were confident that it could not solve the later problems. In fact, it even struggles with 2014B - Robin Hood and the Major Oak. However, ChatGPT did bring up relevant information, and it could definintely help a contestant. New AI will likely do better, so it's inevitable this discussion would go on. As a setter, I find the incessant demand for AI-proof questions quite lame and ignorant. AI will march on, we will have to deal with it the best we can. For now, it's banned for CP, I support that and we should probably leave it at that.

It was thrilling to watch the contest unfold in real time. I was eagerly watching the leaderboard and trying to answer some of the incoming questions for clarification. It was more exciting than say, cricket world cup final, and I love watching cricket! I saw Neal was leading most of the way, and BurnedChicken overtook him on G! In contrast, I tried to stay away from the comment sections of the announcement and editorial of the round, as I didn't want to respond to one and not another, and I didn't have energy to respond to all. So apologies to people expecting direct answers from me on the blog.

Final thought: Given the chance, I would definitely do it again, and would probably do a better job next time, but it's really a labour of love. I would definitely encourage any new setters to give it a go!

Comments (31)

Show archived | Write comment?

The_Mad_Max

20 months ago, hide # |

+19

It was a well balanced contest RobinFromTheHood. Thanks for such contests. See you soon in Div 3 or maybe Div 2 again. Kudos :))

→ Reply

MathModel

← Rev. 7 →

-11

I want to give a go :) , btw The Round was wonderful (But I was supposed to be faster and Avoid Overflows T_T)

RobinFromTheHood

20 months ago, hide # ^ |

You should try!

Part of the reason I wrote the blog is to encourage others, and mention things to look out for.

Thanks for your encouragement too.

cbertake

The problems were really good :)

wasif2397

the problems were truly wonderful! So were they in your last round :)

Dominater069

← Rev. 2 →

-50

We checked all problems on ChatGPT, and were confident that it could not solve the later problems

Very interesting claim, I was pretty sure this was false, so i decided to investigate myself.

2014D - Robert Hood and Mrs Hood, 2014E - Rendez-vous de Marian et Robin and 2014F - Sheriff's Defense all got solved within 40 seconds. I am pretty sure the previous 3 would also be solved, but I didn't bother to check. In contest as well, I believe there was a person who solved A — F in 11 minutes and then later got removed from ranklist.

Does this mean Chatgpt is just too strong? No.....it still can't solve even d1A or some d2B even. Why? Well, it is incapable of independent thought. It can only perform well on reptitive standard problems or problems where the solution is very simple and guessy, because it is trained on those. For example, on H, it cannot get the winning condition, instead if gets the wrong condition of sum of largest half is <= sum of all elements / 2, and then uses Wavelet Matrix to implement it. However, once told that the winning condition is all elements frequency even, it immediately spits out the MO's algorithm and solves the task (with one tle due to not using global arrays which got fixed too upon request).

So to contest authors, please set original problems rather than blaming the advancement of AI. AI being able to perform at 1900 (A-F was 1900+ i believe) is not due to it being too strong (at least currently) Here is the chat link : https://chatgpt.com/share/66f58d2a-ab44-8003-8aac-63541f3941f9

bigSchrodinger

+15

Man stfu and let them have their moment. They are just trying to reflect and you come with your AI and standard problems bs.

-10

I replied to a very specific part in their blog, which is false....

If they had not mentioned that, I would have said nothing

johhnrade

which model did you use to test the problems ? maybe the contest author used the free version of chatgpt to conclude it couldnot solve them

Probably, i used o1 mini the best available to us ofcourse.

If the contest author does not want to use the advanced model, then they should not mention it is not solvable by AI. In their estimation, it had trouble solving B. In my estimation, it easily solved till F

-16

Okay, yes, I used the free version of ChatGPT.

BTW, I never claimed AI couldn't solve B, I said ChatGPT struggled. I should have clarified that it's the free version.

Rohak

-13

Bruh the free version? About time mike provides o1 API to problem setters. Don't you think testing with the free version was kinda stupid (no offence), given that all the talk was specifically about the "new" model?

I do CP for fun. I am not going to fight people who pays for the latest AI to compete in a Div3 contest.

Did not say you didn't. I liked the contest. I just think checking with o1 was a responsible thing to do, given that setting a problem affects many other. But then again it was Div 3 so meh.

I see your point. However, with AI reaching 1900 rating, most of the Div3 problems should be solvable by AI. Should we just cancel Div3/4 then? Personally, I don't think rounds need to be AI-proof, but it's sensible to check with the most accessible version of the ChatGPT. It's likely that future AI will solve all Div1 problems.

-8

I disagree if you are claiming that ChatGPT would solve D-F during contest. When I tried, I literally pasted code provided by ChatGPT to submit, and they did not pass system test. In all cases, human intervention was required. I believe E was the closest for ChatGPT with minor correction needed for the horse transition. I have not tried again after contest. Did you actually paste the code in and test them for D-F? Is it possible that ChatGPT has more information to solve the problem after the contest? I am not even sure what point you are trying to make here?

In any case, I am telling it as it was. I am not blaming AI, we tried to check for it because it's a practical issue. I don't agree with your view that setting original problem stops it being solved by AI.

-7

I linked the chat, i think you can see i did paste as is

Also, i was told there was a low rated contestant who was very fast on A — F. I believe he also used chatgpt during the contest

Ok, is your ChatGPT version different from the free version? Did you actually paste the code to submit & check AC? Could we see that? I don't know much about the low rated coder you refer to. BTW I didn't claim our questions are AI-proof, I was explaining what we checked using ChatGPT (the free version). I don't think there is anything to disagree on. I should have made it clear which version of ChatGPT I used.

SuperJ6

-9

I agree with your sentiments here, but where does "very simple and guessy" end and "original" begin, particularly on easier problem that still build upon the same lines of arguments?

I think the only difference is search space considerations, which the ai is still not great at considering in 40sec (it's not good at trying multiple possibilities "by hand" in my limited usage, it needs to be considering the right direction from the start). However, I strongly think you can add it in manually. You don't think that having a list of ~50 condition suggestions as you did with H, or even better having the program write a brute force and compare program to test a list of conditions it generates will allow it to consistently get d1A? Well, I'll try it myself and find out if not beat to it soon ig.

You don't think that having a list of ~50 condition suggestions as you did with H

I only suggested the right one lol

even better having the program write a brute force and compare program to test a list of conditions it generates will allow it to consistently get d1A

agreed, probably will

but where does "very simple and guessy" end and "original" begin

also agreed that it is rare if not impossible for simple problems.

tickbird

Stop telling the truth!

You are not only advocate use of ChatGPT for low division participants, but also demotivating problem setter.

myst-6

It's nice to have some problem setters from the UK. Thanks for putting so much work in.

cry

← Rev. 5 →

+16

Hi RobinFromTheHood, idk if you know me, but I’m one of the testers for the original set (when frugal was still in the mashup). I tested through Vlad directly.

I recall providing some harsh criticisms regarding the set. In my feedback (which I DMed directly to Vlad), my feedback included the following:

Remove frugal version. I didn’t see an easier way to approach this problem than the nonfrugal version.
That dijkstra and tree dp were way too standard even for Div.3. I didn’t explicitly say that they should be replaced, but I hoped you guys would see the issue.
Regarding the last xor hashing problem, I received a message from Vlad to test it before it was even prepared. In my feedback, I remember saying that it’s not a good problem and it should be replaced. I even provided an atCoder problem which is really similar to it. To my surprise, the exact same version ended up on the actual contest. It’s almost like my feedback didn’t matter at all. At the end, I also proposed a modification of the problem on a tree. I received no response from Vlad after that.

Since we were not communicating directly, I’m interested in how my feedback reached you. Were you able to receive all of my feedback, including criticism, directly from Vlad? I’m especially curious about the third situation which I listed. Did you guys receive my modification and think it was not necessary, or didn’t receive it at all?

Hi cry, thank you for testing and providing the feedback, which we received together with many others. We took out the frugal version because of your comment, and moved old B to C and invented the current B to fill the gap. I have a simple solution for the frugal version without even using a stack, I thought the pair of problem would be fun; but sadly the problem did not receive enough love and was rejected. At the time, I was staring at the stack solution for G, and thought many people would solve it. We definitely took your feedback onboard on that one, and G did prove to be the hardest. The feedback from lower rated testers were more positive on E&F. We realize they are 'simple' variations on classic themes, we felt they might be better received by the target audience. This brings an interesting question, should all problems be very 'original' and don't resemble any of the previous questions? I don't know. Some 'simple variation' might be trivial to a red coder, but are very educational and enjoyable for someone starting out. I think my top priority was for the questions to be fun and educational to solve, but there are definitely a lot to think about and learn.

← Rev. 3 →

I don’t mind standard problems. In fact. the last div.3 I set were all textbook problems (according to Um_nik). However, where I draw the line is when the problem feels like knowledge-checks. H is a good example of it. I don’t think the idea is difficult whatsoever, but knowing xor-hashing is the only bottleneck that makes it H. This is the main reason why I was against the problem.

The same can be said about the dijkstra problem. The same problem (with almost no modification) can be found in the first few listed problems in the shortest paths section of usaco.guide. The hardest part of the problem is implementing dijkstra (and knowing that it exists).

Yes, I remember Codeforces Round 962 (Div. 3), and it was well received! Different people draw the line differently, I think the comments in this blog reflect that, and should be interesting for future setters. If I had to do it again, I would definitely improve on those aspects.

wddd

I like your problems (the Dijkstra one and the xor hashing one but not the Milk Days :( ). I learned new things from them. And looking back, they really fit the purpose of education. I would say if people want more challenging problems, they can just attend Div1 or Div2 contests.

rangerscowboys

I liked the contest, but I feel like the issue with H was that it was a "know it or not" problem for XOR hashing. This is a great first contest that you created though. I have always been wanting to create problems :), it seems so interesting.

djm03178

Using a technique in an advanced way is always miles harder than merely knowing the technique itself.

Problem G was the former case. Although it wasn't so hard to reach to conclusion that a stack can be used there (although I ended up using a priority_queue that works just same as a stack), carefully considering all the cases and writing the formulas was very tricky. For me, the hardest part was to differentiate between a case where I need to leave some of today's milk and another case where the remainder should be added up with the older milk and recalculate the expiring day. It also required some extra works to make it easier to implement. For example, as you stated in the editorial, appending a fictitious entry with large day number and 0 pints trivializes an edge case, but it may not be too easy to come up with.

On the other hand, problem H didn't require any advanced thought and knowing either XOR hashing or Mo's was enough to solve it. A good thing to note is that many Div. 3 participants do study these techniques because learning a new existing algorithm is much easier than developing one's abilities to make advanced usages of the things they learned. It also barely required any implementation other than the basic XOR hashing or Mo's template, so there was nothing to struggle when writing the code.

Thanks for your thoughts, it makes sense. I was very surprised to see a 2200 rating for G, I wonder what the rating would be for the frugal version, where he drinks the oldest drinkable milk first.

-12

sorry for ping :(

RobinFromTheHood's blog