Ending Season 1 of Twitch Solves ISL

To all my loyal viewers —

After two years, 480 hours, and 102 episodes of Twitch Solves ISL, I’ve finally decided that it’s time for me to take a hiatus from my weekly Friday night streams. So, episode 102 will be the end of Season 1.

To answer the obvious question: I’m definitely open to having a Season 2, although it will probably have to wait until after the summer (as I will be away from my stream setup then), and it probably won’t last 100 episodes. Between now and then, I might also sporadically stream video game sessions if I’m in the mood for it, so if you come by Friday night, you might get to watch me play through something from my Steam library.

I’d like to thank all my viewers and 1800 followers on Twitch, for being such a great audience all this time. Having you all come by to support my stream means the world to me.

Until next time!

Sometimes the best advice is no advice


I get a lot of questions that are so general that there is no useful answer I can give, e.g., “how do I get better at geometry?”. What do you want from me? Go do more problems, sheesh.

These days, in my instructions for contacting me, I tell people to be as specific as possible e.g. including specific problems they recently tried and couldn’t solve. Unsurpisingly the same kind of people who ask me a question like that are also not the kind of people who read instructions, so it hasn’t helped much. 😛

But it’s occurred to me it’s possible to take this too far. Or maybe more accurately, it’s always better to ask a specific question, but sometimes the best answer will still be “go do more problems, sheesh”.

Here’s a metaphorical example.

Suppose that someone is learning the game of chess, and they just died to scholar’s mate. So they go on the Internet and write something like,

Some cheapskape used scholar’s mate on me, and I died in four turns WTF!? How do I defend this!? We played e4 e5 Bc4 Nc6 Qh5 Nf6 Qxf7.

Okay, it’s a specific question, at least.

If you’re a nice coach, you could give a straight answer to this. Like, “reply g6 if the queen is on h5 and Nf6 if the queen is on f3”. And maybe some nice words about not getting discouraged.

But I wonder if the best advice is really “go play 100 more games”.

Because, well, if someone (a) saw the queen move out, (b) died anyways, (c) can’t figure out what they should have done differently on turn 3, (d) can’t google the answer themself, and (e) is complaining on the Internet about it, then I think two things are clear. First, they have not played many games, and second, they desperately need to learn how to fish.

In some sense, the Scholar’s Mate issue will correct itself automatically after enough games. So I worry that I do a long-term disservice by giving a specific answer over the general answer, and implicitly suggesting that this is how the learning process should work.

Experience is the best teacher of all. No contest. A pupil who doesn’t internalize this, and instead tries to short-circuit the learning process by overfitting their internal models on too few data points, is going to hit a wall really soon.

Unfortunately for me, I don’t play the tough-coach card well, unlike some other people. So for the foreseeable future, I’m still likely to respond with g6/Nf6. Or maybe I will start linking this post when I’m out of patience, we’ll see.

Book pitch

This is a pitch for a new text that I’m thinking of writing. I want to post it here to solicit opinions from the general community before investing a lot of time into the actual writing.


There are a lot of students who ask me a question isomorphic to:

How do I learn to write proofs?

I’ve got this on my Q&A. For the contest kiddos out there, it basically amounts to saying “read the official solutions to any competition”.

But I think I can do better.


Calling into question the obvious, by insisting that it be “rigorously proved”, is to say to a student, “Your feelings and ideas are suspect. You need to think and speak our way.”

Now there is a place for formal proof in mathematics, no question. But that place is not a student’s first introduction to mathematical argument. At least let people get familiar with some mathematical objects, and learn what to expect from them, before you start formalizing everything.

— Paul Lockhart

There was a while I tried to look around to find an introduction-to-proofs textbook that I liked. I specifically wanted to have the following requirements:

  • Pragmaticism: the textbook should not start with foundational issues like logical quantifiers or set theory. I have held a long belief that these are emphatically not the right way to start proofs, because in practice when one really does proofs, one is usually not thinking too much about the axioms of set theory.
  • Substantial: the results one proves as practice should feel interesting. They should have meat. For example, the statement that a tree always has one fewer edge than vertex is not obvious at first, so when one sees the proof it gives an idea. I believe this is important because I want to develop a student’s intuition, rather than try to teach them to work against it.
  • Intuitive: I reject the approach of some other instructors in which students start by proving basic results from first principles like the well-ordering principle, “all right angles are congruent”, etc. I think this is an experience that is worth having, but it should not be the first experience one has. (This is the same reason people’s first programming language is Python and not assembly.)
  • Combinatorial: for competition reasons. My currently recommended combinatorics textbook by Pascal96 is a bit on the difficult side. It would be nice to cover some ground here.

The closest I got was Jospeh Rotman’s Journey Into Mathematics textbook, which satisfies the first three conditions but not the fourth (the book draws from algebra, geometry, and number theory). I adore Rotman’s book and the copy I read at age 12 is tattered from extended use. I’d like to get the combinatorics in, too.

Picking a fight

I should state now this is against common wisdom. Terence Tao for example describes mathematical education in three parts: pre-rigorous, rigorous, post-rigorous. Relevant quotes:

[In the rigorous stage], one is expected to be able to comfortably manipulate abstract mathematical objects without focusing too much on what such objects actually “mean”. … The transition from the first stage to the second is well known to be rather traumatic.

My thesis is that for high school students with an enriched math background, the rigorous and post-rigorous stages should be merged or even inverted. Attending a math circle, going to math camps, or participating in competitions gives you a much better intuition than a typical starting undergraduate would otherwise have access to. I propose that we take advantage of this intuition, rather than ignore or suppress it.


I’m eyeing graph theory as a topic to start off on, if not use wholesale. I think it is an amazing topic for teaching proofs with. Definitions that make sense, proofs that are intuitive but not obvious, lots of pictures that don’t lose rigor, and so on. I imagine I would start there and see where it takes me.

If I go through with it, I think it would take about a year for me to get some initial drafts available to the public.

Pay-what-you-want model

I want to try this out. I think it would look something like:

  1. You can download the nicely typeset PDF for 20 dollars;
  2. The entire source code is publicly readable on GitHub, so if you can’t pay or don’t want to pay just download the source and compile it. It might not have some formatting polishes or whatever but all the content is going to be there.
  3. If you don’t have a computer to compile things on, email me nicely and I’ll send you a copy.
  4. Pull requests welcome, and if you fix some sufficient number of typos or some major errors I’ll add your name to acknowledgments.

But I’m not sure yet.

Questions for the audience

  1. Is this something people would want to see?
  2. Is there any existing text that already satisfies my requirements?
  3. Is the payment model fair?
  4. Other comments or suggestions?

I reserved a Steam Deck!

I might be going too far with this Arch Linux brand loyalty, but I am so psyched I don’t care.

There’s three pieces of hardware that I’ve always kind of wanted but never been willing to spend money on in isolation:

  • A handheld gaming console
  • A laptop with a touch screen
  • A laptop with enough power to play games on a Steam library (my laptop is a bit too old for this)

So when I heard the Steam Deck existed and it would be running an Arch Linux system beneath it, I totally caved.

I’m just sad I didn’t hear about in time to be early enough in the queue to get it by Christmas. Ah well.

In unrelated news, careful readers might notice that the blog has been moved to blog.evanchen.cc. I think the old URL usamo.wordpress.com will continue to redirect for the foreseeable future, though. But I picked that wordpress username when I was like 12, so I’m glad to not be as tied to it anymore now. (The name “vEnhance” isn’t going anywhere, though.)

A short dissent on USA eligibility

Sometime this week the American Math Competitions released the following new policy:

For the IMO, EGMO, RMM, TSTSTs, and TSTs, and MOP, students must be US citizens or US permanent residents. Visas are not a valid substitute. (AMC Policies)

I want to make a rather brief statement on why I was opposed to this change. To do this I want to draw an analogy. In the American Math Competitions, students are asked what gender they identify with, which is used to determine whether the are eligible for the European Girl’s Math Olympiad and also for invitations to MOP.

This means that in theory, you could try to abuse the system by deliberately misrepresenting your self-identified gender. But in practice, nobody has attempted this. So, we continue to allow students to self-identify their gender, in order to make sure to be inclusive to students with gender dysphoria, and trusting our students to have the integrity to report their gender honestly. The other is to be exclusive and follow the sex assigned at birth, which closes a potential loophole that nobody was exploiting.

I think we should have done the same with visa-holding students.

A student with, say, an F-1 visa, is also in between two points, and there is a spectrum of how much time they have spent in the US. Sometimes they have been in the US for only a short time, other times, they have been studying in the USA for all of high school and participated multiple times in AMC/AIME/USAMO track; and then go on to MIT for another four years, and so on. No matter what they do, there will be some eyebrows raised. If they compete with the United States, some people will complain they are not proper citizens; if they compete with their home country, some people will complain they are too American.

Again we have two options. One is to be inclusive and allow anyone legally residing in the United States and feeling they identify as American a chance to represent us. This is what we have done for the last 20-ish years, and I stand behind these teams. The other is to be exclusive and follow legal citizenship, again closing a loophole that nobody was exploiting. Though I understand the rationale, I am still saddened to hear that we have changed to the latter, and I am worried about the message that we send when we tell certain people they are no longer welcome in the USA math community.

(I do believe that one should not compete for spots on more than one IMO country in the same calendar year, even as a dual national. This was the main reason I felt comfortable competing for Taiwan in 2014, because I was no longer eligible to compete for the United States.)

So that is my public statement, but I have one more thing to add. I am hearing lots and lots of rumors going around claiming that specific students (by name) should not have been allowed on the team or even were taking advantage of the USA system in bad faith. To that I have to ask: please don’t spread rumors like this if you don’t know the students in question. It’s not nice to talk about kids you don’t even know behind their backs; but also please take my word for it when I say some of you are really picking the wrong targets.

On a definitely-kinda-related note, the USEMO for this year is gonna be October 30 and October 31, and remains open to anyone who is studying in an American school, even on a visa :) or otherwise identifies as American. If you don’t identify as American you’re instead invited to come to grading or propose problems, or both!

MOP 2021 puzzle hunt

I worked with a few friends on writing a mini one-round puzzle hunt for this year’s MOP students. If you want to play, you can do so now at the following URL:


I’m not sure whether or not this is going to become a recurring tradition. On the one hand, it was loads of fun to make (including figuring out how to draw art on my iPad); on the other hand, I definitely didn’t blow a couple hundred hours of my life putting this together. ;)

Unrelated advertisement: the Carina Initiatives is currently looking to hire a junior data scientist. They’re a young philanthropic fund that’s interested in supporting initiatives that build actual math problem solving (like, the problem-solving you’d see on this blog, not the mainstream buzzword stuff). I’ve had a lot of fun talking with them over the last few months about problem-solving in general.

My impression is they’re looking for people who have finished a bachelor’s degree in some quantitative field and ideally have experience with any subset of {math/CS/etc competition/circle/camp, education, gaming, entrepreneurship, non-profit operations}. They’re based in SoCal but open to remote + quarterly travel.

If you’re interested in this and I know who you are, send me an email and I’ll connect you.

Unnecessarily detailed stories of my Mystery Hunt puzzles

As a follow-up to the previous post, I am recording here some details about the (numerous, varied, and deep) revisions that a bunch of my puzzles went through during the production process. These puzzles are in chronological order of when I wrote the puzzle.

Spoiler warning: Unlike the last post, these will completely spoil all of the puzzles below.

1. A Bit of Light

Surprisingly, the gist of the puzzle mostly survived revisions: the idea of an encoding-based puzzle with the key step to look at seven-segment displays between the Braille and resistor color encoding. One issue I ran into while constructing was that ABCDEF6 is actually too big in base 16 to store in an eight-bit number, which is part of why I ended up 1337-ing the seven-segment display bits.

Screenshot_2021-02-20 A Bit of Light — MIT Mystery Hunt 2021

The final presentation of the “A Bit of Light” puzzle; it looked pretty much like this at the start too, with worse encodings.

However, in the first draft I apparently had the bright idea that resistor colors should include the last digit being a 10^n multiplier as in the electronic color code. Unsurprisingly, nobody actually does this in puzzle hunts, and it made it impossible to extract the hex correctly because there was even “rounding off” of the hexadecimal numbers involved. So the first testsolve session went up in flames, oops.

I over-corrected a bit after this where, in addition to making the colors simple digits, I made the clue phrase for the next step a lot more explicit, something like “resist > 8d int > hex > SSD > hex”. The next test-solves returned a few more comments that the puzzle felt a bit like “here is what to do, now do it”. So over time, the clue phrase became less explicit until it was just “int=>hex=>SSD”. That got better reviews, since it made things a little more rewarding.

The other thing that happened is that the length of the puzzle (in terms of the number of cells) got increasingly short over versions. There was a lot of concern that this puzzle was quite tedious if you don’t know enough spreadsheet/coding to automate transcribing the colors and bits and so on. Earlier versions of the puzzle had 150ish cells; the final version ended up with just 56.

We also agreed that it was better to make the 1337-ing of the LED segments consistent; it used to be that either 4 or A could be used, for example, but it was changed to be always 4. This meant that only 16 color patterns ever appeared, which turned out to be not a big deal.

Screenshot_2021-02-20 Solution to A Bit of Light — MIT Mystery Hunt 2021

The various encodings, with the consistent 1337 -> 7SD step.

The title of the puzzle used to be “I See The Light”, but we agreed “A Bit of Light” was better. The flavortext wasn’t really used in any solve, but we kept it because it was nice confirmation. The extraction used to include a crossword type thing, which was quickly deleted because it was completely un-thematic and didn’t add anything.

One thing I’m grateful for was having everything automated at the beginning; I basically had a Python script the whole time for building prototypes and editing them.

2. The IMO Shortlist

The basic idea behind the puzzle was: solve some math problems, insert the proposing country into the problem in some way, get a new country, and extract somehow.

The first idea for how to insert the country was by using the calling code for the country as a way to convert the input to a number. It was immediately pointed out that this was totally unthematic, so I dropped that idea pretty quickly.

It’s at this point I got the idea to use host countries of recent IMO’s, say the year they hosted the IMO or the number of the IMO, etc. I thought it would be cute if we used a different mechanism for each subject, although this was in part necessary because it’s not easy at all to write a reasonable-looking geometry question that gives a function from R to R. So I chose the “three collinear points” gimmick for geometry, and then noticed it lined up with the columns of IMO timeline, so used that for the rest.

Anyways, this gave a way to change each old host year into a new year. My original idea for extraction was that you could take these new years and compare which of them had a real IMO problem attached by that year-number pair, and find out that you get one of P1 through P6 exactly once. The true proposing countries of these six problems could then be diagonalized to extract a six-letter answer.

This extraction crashed and burned during test-solving, for a few reasons. First, the transformation I was specifying was a bijection from a set to itself, and people usually expect this to give a cycle, which I didn’t use. Secondly, there were just too many possibilities to make something this far-fetched reasonable, e.g. whether to use the old year or the new year, the old country or the new country or the true country, etc. And third, data-mining the past shortlists was a royal pain (I did try to plant the relevant data on my website but that was after the fact). We basically agreed that to make this mechanism work, we would need a really explicit shell that would be sort of clunky and not too elegant.

Screenshot_2021-02-20 Evan Chen • Problems

Definitely not planting data on my website.

The good news was the feedback about the first part of the puzzle was mostly positive, although the puzzle was possibly too difficult (I remember hearing in feedback that the extraction felt “three steps too long”, and that even without the extraction this would still probably be one of the hardest puzzles in the Hunt.)

As I thought about this a bit more, I started thinking about the cycle idea that was pointed out (before this, I hadn’t known that bijections in puzzles are often cycles), and thinking about what I could do if instead the extraction led to a single cycle. It certainly occurred to me that there was one meme that I could exploit for this: that A/C/G/N is almost exactly the same as A/C/G/T. If I abbreviated “Number Theory” as “NT”, then I could use the resulting cycle to spell out a short phrase using DNA codons, the answer. This wasn’t too thematic, but then again it’s something that most people who have done math contests extensively have noticed at some point or other, and maybe it would be okay to use this idea now.

Screenshot_2021-02-20 puzzle pdf(1)

Using the abbreviation “NT”. It’s not an accident!

In building the cycle, there was an obvious bit of extra information: the shortlist numbers themselves are never used (e.g. A1 vs A2). So it was natural to have them be increasing in the correct order of the cycle, i.e. A1 came before A2 in the cycle, A2 came before A3, etc. This provided a strong confirmation that the cycle was the right thing to look at, that the number of the problems is unimportant, and it also gave a lot of error-correction to fix a few mistakes in the math problems.

Anyways, this involved redoing a lot of the math problems, which was “fun” :P

One other thing that happened during test-solving: one group failed to realize that combo was years instead of IMO number (like algebra) until after they had finished the entire puzzle (error-correction works, hooray). I fixed this by rewriting one of the problems (C3) so that it was obvious the input had to be a year rather than a small number to feasibly work.

The flavor-text is almost littered with clues (“stranded”, “timeline”, “official”). The first couple pages of the PDF file were also added to clue a bit more heavily what the solver was expected to do (originally, the proposing countries were alphabetical).

Screenshot_2021-02-20 puzzle pdf

The cover page of the IMO Shortlist.

3. Escape From Life

The author’s notes for this puzzle give a summary of some of the thoughts behind this puzzle. In short, Conway’s Game of Life has appeared before, but it never seemed to be used beyond simply “run one generation”. As I mentioned in my last post, I thought the elaborate list of names of still lifes was a silly/fun source of puzzle material, and wanted to use it.


An example of an ash with cartoons substituted in, featuring MOOSE ANTLERS.

The basic premise of the puzzle (i.e. what I started with) was as follows:

  • Solvers realize the silly images are Conway’s game of life structures.
  • Solvers run a set of specified life seeds and match it to pictures.
  • The life simulations have escaping gliders giving semaphore, so the matching now gives a grid of letters.

The presentation of the seeds for the simulation went some revisions. Initially, I specified the coordinates as just text, but changed this to a pictorial grid with numbers typed in since there was a bit of confusion on which way the axes were indexed. This was quite important for this puzzle — a tiny error makes the simulation completely useless, so it was imperative for solvers that the right way to input things was completely unambiguous. I learned this the hard way! (I also originally didn’t think to specify the orientation of the gliders, since I thought that teams would probably be using a codable simulator and just try all 4^4=256 ways. Ha, ha, ha. We quickly reversed this decision.)


The final format of the glider seeds. Solvers complete the squares to a glider in a unique way so they fly towards the center.

Since the order of the seeds didn’t matter, we sorted them alphabetically (which gave a strong confirmation of the semaphore being correct).

There was some discussion the whole way through of perhaps presenting the seeds in a standard Life format like RLE, because there was some mixed test-solving feedback about how much work it was to plug these into a Life simulator: essentially, some teams did the “right” thing and downloaded golly or similar, but others were using various browser apps that were more clunky, and the latter group of people did not have a good time running the simulations. We eventually kept the human-readable format rather than the machine-readable format because the step of figuring out how to fit the gliders to run towards the center was at least a little interesting, and because machine formats might carry “too much” information (e.g. solvers might start reading into grid size, starting location, etc.) and it was unclear why every scenario had exactly four gliders.

For me the most difficult part of setting this puzzle was what to do after obtaining the grid of letters. I had an idea that you could try and place the 16 things into a grid, and then run more Life so that the gliders collided further with each other. This turned out to be infeasible and the idea didn’t go anywhere.

The better idea I had was something like the following: when the semaphore is matched onto the 4×4 grid, the solver gets


and the cells with the capital letters above are tinted. I was hoping that the solver could read out the letters T-FIVE which would helpfully tell the solver to run the simulation five steps and end up with the correct answer FLYER.

There was one problem with this: I could not get any of my simulations to produce two orthogonal gliders, no matter how hard I tried. This was with my gaming PC doing glider collisions overnight.

Alas this meant I could not spell the letter F or R using this method. I tried to work around this by having one letter from each word replaced by X which the solver would have to fill in themselves, hoping that was good enough.

In test-solving, this went up in flames. The random X’s actually convinced the solvers that semaphore did not work, and even after this, the message T-FIVE was apparently not noticeable. We thought about a fix where we just told the solvers explicitly to mirror the image once they had it, which would at least work, but was not really elegant, in my opinion.

For the next week, I played around with all sorts of terrible alternative mechanisms to try and get something clean. Almost nothing I tried worked; the fundamental problem of being unable to construct the letter F or R when my assigned answer was FLYER made most things I did quite unnatural. I didn’t sleep well those nights, because all I could think about was how I was going to work my way out of this one.

(One approach would have been to just give up on having 4-glider syntheses, which I had decided on from the start. But simulations with more than four gliders were annoying, and the directions would become a lot more confusing.)

It wasn’t until I had the idea of using a word square that I finally made any progress. Word squares have a lot of redundancy, so if I could specify the letters in the “living” cells, the dead cells could also have any letter I wanted. A-ha! I went and scraped Google for a compilation of word squares, and started searching through it for ones that might work. I eventually found one that seemed like it could work.

So eventually, we managed to the get the 6×6 word square that did make it into the final version. I deliberately opted for a symmetric word square, because I wanted to keep the number of simulations less than 20 (because of earlier feedback about the tediousness of using a bad simulator). We added a “t+1” clue because “+1” was apparently not enough during testsolving.

Screenshot_2021-02-20 Escape From Life — MIT Mystery Hunt 2021

The word square, all done!

This was one of the few puzzles I wrote that have blanks in them, in the image where the still-lifes were first identified. Technically, I think some people could probably solve this puzzle without that image, if glider synthesis was clued a bit more in a different way. However, I thought this was a case where it’s more fun to break in by figuring out the (in some cases, ludicrous) names of the still-lifes gradually, and then being told explicitly to build the gliders with the resulting message.

4. Le Chiffre Indéchiffrable

To be honest, this puzzle was on the verge of not making it. In short, I had the idea of a Vigenere cipher using Unicode with a foreign language key… and that was pretty much it, lol.

Screenshot_2021-02-20 Le Chiffre Indéchiffrable — MIT Mystery Hunt 2021

It’s pretty minimalist as far as puzzle pages go.

The first draft was based on the idea that you have some famous song (think Taylor Swift) except it had been translated into some language L and back, and you needed to figure out the language L. This didn’t work at all, because there isn’t any good way to do this other than brute-force languages in Google Translate.

The second idea was that each key was the name of a KPop band and the plaintext was the song lyrics of one of their songs. It turns out there are a lot of KPop bands whose names can be read as a number like Iz*One, TWICE, GOT7, etc. Well, it turns out this was done already.

So I kind of gave up and just made a maximally uninspired extraction where you just take the first letters of song titles. We decided to get a test-solve of this, where it naturally got pretty low ratings, but by luck, Brian Chen testsolved it and mentioned Google Translate Sings as possibly thematic puzzle material.

I had never heard of Google Translate Sings, but when I did, I realized I now had puzzle data: there was a canonical episode number for all the songs. After doing a cursory search for “puzzle hunt” and “google translate” to confirm this hadn’t been done already, I went for it.

Anyways, at this point, the parameters I had to play with were:

  • Choose a language for the key
  • Choose a language for the plaintext
  • Choose the content of the key
  • Get an episode number for the plaintext

I had the idea that we could run a “do it again”, in which the key itself was a Unicode character, while the episode number from the plaintext was what you subtracted from. That left the languages free to vary, and I wanted to “use up” this information, so I rigged the first letters to spell “FANDOM” and “NUMBER”, to make sure the episode numbers were hinted specifically.

Screenshot_2021-02-20 Solution to Le Chiffre Indéchiffrable — MIT Mystery Hunt 2021

Rigging the languages to spell out a cluephrase.

One small wrench in this plan was that there was a Genius listing for Google Translate Sings that was (a) incomplete, (b) had wrong numbers. I had to carefully make sure to select songs that didn’t appear in Genius, in order to make sure anyone who found the Genius page knew immediately that it wasn’t what they were looking for.

The other wrench is that the Malinda Wiki is not that frequently updated, so some of the more recent songs I needed were not there. So, I am the proud author of the Never Gonna Give You Up page.

Screenshot_2021-02-20 Google Translate Sings Never Gonna Give You Up

The song is great, by the way.

It was also a bit constraining what Unicode characters I could use for the keys, because I needed to make sure the translation of the key back into English was recognizable (it wasn’t always). One other thing I did here was specifically use NULL (unicode point 0) as one of the characters, as a de-confirmation of several wrong approaches like trying to index by the Unicode number somehow.

We gave the first key to the solver because we wanted the solver to first have a confirmation they understand the Unicode Vigenere correctly before going on to try and do cryptanalysis.

5. Blind Calculation

This was pretty straightforward. I saw the answer PRODUCT TESTING in the pool, and looked at my abacus, and was like, “huh, the abacus has a grid”. So the idea to have a mapping Braille and the abacus was born.

Well, actually, to be pedantic, I own a soroban specifically (rather than the Chinese suanpan). Anyways, in keeping with the Japanese roots of the soroban, I thought it would be appropriate to use Japanese Braille rather than standard Braille, which was nice.


The soroban as it appears in the puzzle.

I was a bit worried about the cluephrase seihin tesuto not being unambiguous enough, though test-solvers seemed to get it fine. (There was a suggestion at some point of adding an enumeration 7 7, but by coincidence this is exactly the number of rods, which probably would have led to some red herrings.)

6. Clueless

During test-solving, most people figured out Hanabi and the letter encoding pretty quickly (and there is a lot of confirmation both are correct) As I mentioned in the previous post, there used to only be four scenarios, but we added a fifth easy one to make the rules more well-defined after the first test-solve group struggled. In addition, the plan was moved to the bottom and de-emphasized, since the first test-solving group tried to use them too early; we also added “starts” to make it clear that the first player starts.

Screenshot_2021-02-20 Clueless — MIT Mystery Hunt 2021

The eventual presentation of the puzzle.

Quoting further from the Author’s Notes for the original inspiration:

After playing maybe too many hours of Hanabi during quarantine, I knew I wanted to write a Hanabi puzzle for the Mystery Hunt as well, especially after seeing a Hanabi duck konundrum in the inaugural teammate hunt.

When I played Hanabi at summer math camps as a teenager, the only conventions we would have were play left, discard right, (forward) finesses, and the “fake finesses” (a worse form of the bluff). Plays and saves weren’t even clearly differentiated. Thus our win rate was quite low, and it was not until I played with a more well-developed convention set that I started appreciating the complexity (and fun) of Hanabi.

So, I wanted to develop a puzzle that would showcase at least some of that complexity. Since it’s so difficult to use “conventions” as a puzzle mechanic, I settled on the idea of using the notion of pace for a logic-only puzzle (without regards for efficiency, a measure of how much information is communicated by clues). The idea that the number of discards is limited, and clues can be used not just to indicate cards but also to stall for time, gave me a mechanic that I thought would be new even and interesting even to people who may have casually played Hanabi before. It was a lot of fun during test-solving to watch groups slowly realize (and then prove) that no cards could be discarded in the second scenario.

For extraction, choosing the plans was a fun exercise in Nutrimatic, since if I used a “random” string, it would be too easy to figure out which letters overlap just by looking at which cards were still left to play. Various strings that appeared in one version or other, often that barely didn’t work:


I also originally considered using tap code for the encoding of the letters, but this seemed like an un-thematic complication that was quickly scrapped.

I didn’t really want to use the crossword cluephrase FINAL PROPOSAL, but the current answer has the unfortunate property that it is almost uniquely determined by the first two letters, so this was necessary to prevent solvers from short-circuiting the puzzle too prematurely.

The title of the puzzle was suggested by Danny Bulmash.

One of the things that was hotly debated was the “Chances: m of n!” notation. Technically, the number m is not needed at all, and it seems like it would be more straightforward to say “n left” or similar. However, I wanted to include it anyways for a few reason

  • I felt it was thematic.
  • Writing n! communicated clearly that it was the deck order, rather than say turns left.
  • It emphasizes that the team is super unlikely to win (which I think is important for telling solvers the route is unique). For example, I think test-solve groups used the fact that 2 in 720 is not reduced.
  • It helps check-sum that the solution is right.

Incidentally, I’m almost more proud of the solution write-up than the puzzle itself — it took a lot of time for me to put together.

7. Nutrition Facts

This was also a pretty quick puzzle. One of the remaining puzzle answers was GOOD FAT, which motivated me to look at some nutrition fact containers and think about how the numbers there could be made into a puzzle. Well, numbers correspond to letters, so those bold numbers could be put to good use…

I was reminded about the 9-4-4 rule I learned in grade school, which meant there was a natural way to link the numbers “hidden in plain sight”. I originally wanted to let calories be the unknown number to compute, since it was the biggest number and if it was missing, everyone would think to compute it right away. However, it was hard to get that to work with A1Z26, since the calories number was usually going to be more than 100, making it awkward to naturally map the calorie numbers to letters, even if I allowed scaling by 10 or 100 or whatever.

That’s when I had the idea to make protein the hidden number instead, because for unknown reasons, the percent on protein is usually missing. That worked much better with A1Z26 and completed the puzzle.

Similar to Le Chiffre, I included one example food, a HERRING, so that the solvers knew what was going on. This also gave me one extra degree of freedom to let the total number of calories of all the foods equal 2000; not strictly necessary, but it was a nice artistic touch.


SVG pie chart. When you hover over in the real puzzle, the highlighted section changes and a different color is revealed.

Originally, I just had a pretty dorky pie chart with food images pasted on to it, but during the post-prodding I had a bit more fun making an interactive pie chart with HTML that hovered over the image.

8. Divided Is Us, written with Jakob Weisblat

As written in the author’s notes

This was the last puzzle to be written for the 2021 MIT Mystery Hunt. At a meeting in late December, the editors announced there was exactly one answer remaining, and it was a teamwork time puzzle. As it happens, I was thinking earlier that day how we didn’t have our obligatory Baba puzzle yet, and the idea for the puzzle was born.

(I should say now that actually there was one other puzzle that was written after this: Better Bridges, which was a last-minute backup puzzle.)

To look into whether this was feasible, I did some research on open-source Baba Is You libraries. I settled on using baba-is-auto, although the library is still a work-in-progress as far as I can tell, and I needed to do some editing of the library to make it work.

Anyways, yeah, the premise was co-operative Baba puzzles, where there were different blocks for the various players. The obvious options for the win condition would be “win if anyone reaches the goal”, or “win if everyone reaches the goal”, but the intermediate option of having YOU reach the goal turned out to work the best. (This was not immediately obvious to teams during test-solving even a few levels in, so we had to rewrite the intro level to make it more blatant.)

Meanwhile, I had the thought that it would be cool if the game slowly revealed that everyone was seeing parts of the same world, and say, around level 4 or so, you would suddenly realize this when someone moved their icon onto another player’s screen. Accordingly, we hid the “other player” viewports until after this revelation (at which point some people wondered if they had simply not noticed those viewports all along).


The fifth level of Divided Is Us. One player in each quadrant.

I’m pretty happy with the extraction: I think it’s a good example of how you can embed encodings in unexpected places. Here, I took advantage of the fact that the game was secretly one world in order to put semaphore and Braille (conveniently, flags are not too conspicuous) and it was a lot of fun to watch people realize this around level 6 or 7, as they were reaching the final level. However, I also couldn’t resist using the names of the various icons. It was convenient that many of them were four letter words. Anyways, the answer SECOND GRAND ALLIANCE is pretty long, so there was enough space to use one mechanism per word.

(I did sort of break one puzzle design rule here: you couldn’t finish the puzzle without solving every Baba level. This went okay in test-solving though.)

There were two design choices I wish I had done differently:

  • We tried to set the game so that each person got assigned a player, which wouldn’t change if they refreshed the page. This ended up causing issues when websockets got disconnected or the server went down or whatever, causing certain player slots to get locked out. It wasn’t too hard to work around (e.g. using private windows) but it did annoy a bunch of the teams.
  • We didn’t show the total number of levels, which I regret a bit. I think it would have been more fun for people to realize they were getting to the end. Also, in one case a team managed to prove to themselves the penultimate level was not solvable, and was unsure what to do from there. Had we put a more obvious level indicator, this could have been a bit clearer.

Finally, while I’m here, I’d like to thank Arvi Teikari for the permission to use the game concept and art for this puzzle.

Closing thoughts

I was pretty pleasantly surprised by how many puzzles I was able to produce over the span of the hunt-writing process. The larger puzzles here on average took over my life for a week, and Shortlist and Escape From Life took even longer. It helped a lot that I was taking time away from school this year; I think otherwise there isn’t any way I would have been able to find enough time to put these puzzles together.

On the flip side though, I think I’ve just about used up all my puzzle ideas for the foreseeable future. (At least I say that now.) But I don’t have any regrets about this. I’m not sure I’ll have a chance to write for the Mystery Hunt ever again, so I’m glad I was able to make the most of this chance I got, and I couldn’t have asked for a better team to have write the Mystery Hunt with.

Some puzzle-writing thoughts from an amateur

The 2021 Mystery Hunt concluded a while ago, and wow, what an experience. I was lucky enough to be on the organizing team, and I am so proud right now to be able to say I am a ✈️✈️✈️ Galactic Trendsetter ✈️✈️✈️. (If you don’t know what a puzzle hunt is, betaveros has a great introduction.)

I came in to the MIT Mystery Hunt with no writing experience at all, and ended up being listed as the author of a few. It was a trial by fire, to say the least.

In this post I want to say a bit about the parts of the puzzle-writing process that surprised me the most, in the hopes that maybe it would be helpful to future authors. A nice introduction is given by e.g. David Wilson, which I read many times before I actually attempted to write my first puzzle. Most of what’s said here repeats or builds off of what is already there.

Spoiler warning: this post contains minor spoilers for some of my (Evan’s) puzzles, but nothing that would completely ruin the puzzle. The eight puzzles I made major contributions for the 2021 hunt are A Bit of Light, Blind Calculation, Clueless, Divided Is Us, Escape From Life, Le chiffre indéchiffrable, Nutrition Facts, The IMO Shortlist.

0. Overview

When I first went in to puzzle writing, I had the idea that a puzzle was a series of steps that the solver takes to go from the initial data to an answer. The puzzle-designer’s job was to make figuring out the steps challenging but rewarding (rather than trivial or frustrating). This point of view is really focused on the mechanical part of the puzzle.

After my hunt experiences, I found myself internally guided by splitting puzzle-writing goals into two orthogonal parts: the mechanical half and the aesthetic half. (Or if you are feeling zen, “science” and “art”.) The mechanical part refers to the details of how the solver works through the puzzle, but the aesthetic part refers to things like the theming of the puzzle or so on. I’ll talk a bit more about these below.

1. Finding a core mechanism

I think the thing that was most helpful for me in starting was to be on the lookout for things things that “feel puzzly”. I’m still not sure exactly what I mean when I say that, but I think I vaguely mean things satisfying one or more of the following criteria:

  • Thing is very “well-defined”; for example, it has a canonical reference you can find on the Internet.
  • Thing has a grid.
  • Thing lends itself to puns or wordplay.
  • Thing has a natural mapping, e.g. a way to turn a number into a word (say, episode numbers of a show).
  • Thing has numbers that work well with A=1, B=2, …, Z=26.
  • Thing has a concept of compass directions for flag semaphore.

devjoe’s index is useful both for getting ideas, and also to make a routine search of whether an idea you have is already used. For example, when writing the IMO Shortlist, the first thing I did was look through to see if math contests more generally had been done in any Mystery Hunt (answer: no).

I’m already starting to mention encodings, so I’ll do that now for those of you that have never seen them. Basically, there is a list of puzzle encodings that all experienced solvers are expected to be able to notice in Mystery Hunt. This has the upshot that they can be used freely in puzzles! Examples are:

  • Anything with numbers could be A=1, B=2, …, Z=26.
  • Anything with a grid could be Braille.
  • Anything with directions at all could be semaphore.
  • Any time you have a bunch of strings, the first letters might spell a message.
  • Any time you have a sequence with exactly two kinds of objects (ones or zeros, on or off, true or false, red or blue, dot or dash) consider binary (which could then become letters by A1Z26) or occasionally Morse code.
  • Any time you have words of the same length, see if their letters overlap.
  • Any time you have a bijection f from a set to itself, you should see if that function forms a cycle. If so, take that as an ordering.

The extraction part of Brian Chen’s introduction to puzzle hunts goes over all of these in more detail.

Screenshot_2021-02-21 CodeSheet-201912 pdf - CodeSheet-201912 pdf

Table of common encodings from puzzledpint.com, click to open

Examples of moments in my life when I saw something and said “this would be good puzzle material”, and flew with it:

  • Blind Calculation came about because I was staring at an abacus on my table and saw a grid.
  • Escape From Life came about because I thought that this website (spoiler) was a puzzle gold-mine. (I had a lot of fun drawing a SHIPTIE as a ship with a tie.)
  • Nutrition Facts came about because I was staring at a nutrition facts label and noticed how there were a lot of numbers with relations between them.

One thing I occasionally found helpful was to start with the answer to the puzzle first to help narrow down the focus a bit. For the 2021 MIT Mystery Hunt, there was a big pool of several hundred answers, about 50 of which were for the Students round (shorter easy puzzles, like the fish puzzles from 2015). The Students answers were put in a big pool, and you could request one of them to start and then write a puzzle with it. Blind Calculation and Nutrition Facts were both started this way, where I was first given the answer to the puzzle, and then I started looking for objects that both felt puzzly and fit thematically. But it’s not necessary; the other six puzzles I wrote were because I had an idea in an area I knew well, came up with a mechanism, and got an answer assigned once I had a plan written out.

2. Aesthetic

I think the thing that surprised me most is that you can have a really successful puzzle which is almost trivial mechanically (in the sense the solver has pretty clear instructions on what to do the entire way through). For example, this year’s hunt featured a puzzle called Boggle Battle where most of the work to be done was playing cooperative Boggle with friends, and the only “puzzly” part was the extraction. Similarly, So You Think You Can Count has minimal puzzle elements, and doesn’t have an extraction at all (upon reaching 100, the solvers are simply given the answer). They both got high ratings during testsolving because the solvers had fun anyways.

Helpful Firefly, an iconic character from So You Think You Can Count?

I think the conclusion I drew from this was that while mechanics are important (e.g. things like checksumming, eliminating false trails, no extra information, etc.) caring about the aesthetics of the puzzle is more important. (Of course, like many things in life, they are not completely orthogonal: thematic coherence is one of the best confirmations a puzzle-solver can have when they have a guess at the next step.)

In particular, one useful guideline is to try and consider cutting any steps that didn’t feel thematically coherent.

But this was not always a prohibition! For example, despite the advice that it’s better to have only one a-ha per puzzle, I found out that I could generally get away with having multiple a-ha’s for a puzzle as long as all of them felt thematically coherent with each other. Le chiffre indéchiffrable is a decent example of this in my opinion, where despite multiple steps and moving parts, the steps all seem to resonate with each other.

An anti-example of this might be the extraction step of IMO Shortlist, which arguably felt a little out of place, although I ended up keeping it anyways for lack of a better alternative.

3. Writing

Once I had the general mechanism for the problem down, there were often still a lot of smaller local decisions to make as I composed the puzzle. Editors were helpful at this step to provide feedback; but there are some guidelines that I would constantly check against during the writing process (pre-emptively addressing concerns).

Some examples (by no means the only ones):

  • Do whatever you can to confirm correct paths, e.g. make it so that if a solver tries the right thing, they immediately get a confirmation of some sort.
  • Try to make sure all the information is used, and minimize arbitrary choices. For example, alphabetize things when the given order doesn’t matter. (There are some exceptions to this, e.g. one common puzzle mechanic is that you have a sea of letters and draw some lines, and look at the intersected letters. Thor from puzzle potluck 3 is an example. Necessarily, some of the “distracting” letters are never used.)
  • Provide enough redundancy or checkpoints so that a team can still solve the puzzle even with a few errors. For example, you might try to set things up so that one error leads to just a single garbled letter in a phrase which can be noticed and fixed; or, if you have a list of objects to identify, having them in alphabetical order will also help fix mistakes.

Basically, I would constantly put myself into the shoes of the solver. At each milestone in the puzzle, I would imagine what they were looking at, and how they might react it. I often paid special attention to thinking about what feedback they would get if they did the correct thing, and what feedback they would get if they made a mistake (e.g. “if they make a small error here, the number they get probably won’t be an integer, so they’ll know right away they made a mistake”), and what possible wrong paths the solver could take.

It’s also worth paying a bit of attention to accessibility of the puzzle. For my puzzles, this mostly consisted of:

  • Adding alt-texts to images (especially any text in the image).
  • Including enlarged versions of any image I used (typically, if you click an image, it opens the full-resolution image).
  • Keeping colorblind or black/white printing in mind when choosing colors (or choosing not to use colors). Though I had to violate this blatantly for Clueless, where the exact choice of colors was important to the puzzle.

4. Testsolving

If you are a software programmer, then debugging is part of your life. No matter how experienced you are, you are going to have some bugs in your code. The compiler will catch some of them; testing will catch others. You might have fewer bugs over time as you get more experienced, but they will never go away completely.

I think the same is largely true for puzzle-writing.

I don’t mean this in the sense that the puzzle will literally have an outright error or typo every so often (though if we’re being honest here, all my first drafts sure did). What I really mean is that, the first version of the puzzle will have unexpected issues, and often the only reasonable way to find them out is through test-solving. I had the common experience of watching my puzzle completely fall apart because the solvers were doing something totally different from what I wanted.

For example, in Clueless:

  • The “(starts)” text wasn’t there originally. I thought it was clear that 1 should start. The test-solvers didn’t, and tried to get scenarios to work by having 2, 3, or 4 start instead.
  • The texts like “FIVES WIN THIS” used to be at the top in bold, and test-solvers tried to use this information too early in ways that to me made no sense. I tried to move it the bottom to make it less prominent.
  • There used to only be four scenarios, but the solvers got stuck quickly. It was necessary to add one easy scenario to help as a foothold to clarify what rules were allowed.

In the programming analogy, the editors are the compilers: they try to point out preliminary issues. The test-solvers are the unit tests. Since test-solver time is a valuable resource, it’s important to have a version that is ready to testsolve for them, but at the same time, I wish I hadn’t agonized so hard sometimes on minor decisions before testsolving, because the feedback from testsolving often led to huge revisions anyways.

So, even though I mostly worked without coauthors, the editors and test-solvers were the primary source of feedback for revising the puzzle.

5. Closing thoughts

This is all for now. I’m hoping to make a later post with more details about the creation process of each of the eight individual puzzles mentioned above.

In the meantime, I’d like to give one last shout-out to everyone on Galactic who helped me so much in getting onboarded into the puzzle-writing process. Thanks y’all!

USEMO Problem Development, Behind the Scenes

In this post I’m hoping to say a bit about the process that’s used for the problem selection of the recent USEMO: how one goes from a pool of problem proposals to a six-problem test. (How to write problems is an entirely different story, and deserves its own post.) I choose USEMO for concreteness here, but I imagine a similar procedure could be used for many other contests.

I hope this might be of interest to students preparing for contests to see a bit of the behind-the-scenes, and maybe helpful for other organizers of olympiads.

The overview of the entire timeline is:

    1. Submission period for authors (5-10 weeks)
    2. Creating the packet
    3. Reviewing period where volunteers try out the proposed problems (6-12 weeks)
    4. Editing and deciding on a draft of the test
    5. Test-solving of the draft of the test (3-5 weeks)
    6. Finalizing and wrap-up

Now I’ll talk about these in more detail.

Pinging for problems

The USA has the rare privilege of having an extremely dedicated and enthusiastic base of volunteers, who is going to make the contest happen rain or shine. When I send out an email asking for problem proposals, I never really worry I won’t get enough people. You might have to adjust the recipe below if you have fewer hands on deck.

When you’re deciding who to invite, you have to think about a trade-off between problem security versus openness. The USEMO is not a high-stakes competition, so it accepts problems from basically anyone. On the other hand, if you are setting problems for your country’s IMO team selection test, you probably don’t want to take problems from the general public.

Submission of the problem is pretty straightforward, ask to have problems emailed as TeX, with a full solution. You should also ask for any information you care about to be included: the list of authors of the problem, any other collaborators who have seen or tried the problem, and a very very rough estimate of how hard the difficulty is. (You shouldn’t trust the estimate too much, but I’ll explain in a moment why this is helpful.)

Ideally I try to allocate 5-10 weeks between when I open submissions for problems and when the submission period ends.

It’s at this time you might as well see who is going to be interested in being a test-solver or reviewer as well — more on that later.

Creating the packet

Once the submission period ends, you want to then collate the problems into a packet that you can send to your reviewers. The reviewers will then rate the problems on difficulty and suitability.

A couple of minor tips for setting the packet:

  • I used to try to sort the packet roughly by difficulty, but in recent years I’ve switched to random order and never looked back. It just biases the reviewers too much to have the problem number matter. The data has been a lot better with random order.
  • Usually I’ll label the problems A-01, A-02, …, A-05 (say), C-06, C-07, …., and so on. The leading zero is deliberate: I’ve done so much IMO shortlist that if I see a problem named “C7”, it will automatically feel like it should be a hard problem, so using “C-07” makes this gut reaction go away for me.
  • It’s debatable whether you need to have subject classifications at all, since in some cases a problem might not fit cleanly, or the “true” classification might even give away the problem. I keep it around just because it’s convenient from an administrative standpoint to have vaguely similar problems grouped together and labelled, but I’ll explicitly tell reviewers to not take the classification seriously, rather as a convenience.

More substantial is the choice of which problems to include in the packet if you are so lucky to have a surplus of submissions. The problem is that reviewers only have so much time and energy and won’t give as good feedback if the packet is too long. In my experience, 20 problems is a nice target, 30 problems is strenuous, anything more than that is usually too much. So if you have more than 30 problems, you might need to cut some problems out.

Since this “early cutting” is necessarily pretty random (because you won’t be able to do all the problems yourself singe-handedly), I usually prefer to do it in a slightly more egalitarian way. For example, if one person submits a lot of problems, you might only take a few problems from them, and say “we had a lot of problems, so we took the 3 of yours we liked the most for review”. (That said, you might have a sense that certain problems are really unlikely to be selected, and so you might as well exclude those.)

You usually also want to make sure that you have a spread of difficulties and subjects. Actually, this is even more true if you don’t have a surplus: if it turns out that, say, you have zero easy algebra or geometry problems, that’s likely to cause problems for you later down the line. It might be good to see if you can get one or two of those if you can. This is why the author’s crude estimates of difficulty can still be useful — it’s not supposed to be used for deciding the test, but it can give you “early warnings” that the packet might be lacking in some area.

One other useful thing to do, if you have the time at this point, is to edit the proposals as they come in, before adding them onto the packet. This includes both making copy edits for clarity and formatting, as well as more substantial changes if you can see an alternate version or formulation that you think is likely to be better than the original. The reason you want to do this step now is that you want the reviewer’s eyes on these changes: making edits halfway through the review process or after it can cause confusion and desynchronization in the reviewer data, and increases the chances of errors (because the modifications have been checked fewer times).

The packet review process

Then comes the period where your reviewers will work through the problems and submit their ratings. This is also something where you want to give reviewers a lot of time: 6-8 weeks is a comfortable amount. Between 10 and 25 reviewers is a comfortable number.

Reviewers are asked to submit a difficulty rating and quality rating for each problem. The system I’ve been using, which seems to work pretty well, goes like follows:

  • The five possible difficulty ratings are “Unsuitable”, “Mediocre”, “Acceptable”, “Nice”, “Excellent”. I’ve found that this choice of five words has the connotations needed to get similar distributions of calibrations across the reviewers.
  • For difficulty, I like to provide three checkboxes “IMO1”, “IMO2”, “IMO3”, but also tell the reviewer that they can check two boxes if, say, they think a problem could appear as either IMO1 or IMO2. That means in essence five ratings {1, 1.5, 2, 2.5, 3} are possible.

This is what I converged on as far as scales that are granular enough to get reasonable numerical data without being so granular that they are unintuitive or confusing. (If your scale is too granular, then a person’s ratings might say more about how a person interpreted the scale than the actual content of the problem.) For my group, five buckets seems to be the magic number; your mileage may vary!

More important is to have lots of free text boxes so that reviewers can provide more detailed comments, alternate solutions, and so on. Those are ultimately more valuable than just a bunch of numbers.

Here’s a few more tips:

  • If you are not too concerned about security, it’s also nice to get discussion between reviewers going. It’s more fun for the reviewers and the value of having reviewers talk with each other a bit tends to outweight the risk of bias.
  • It’s usually advised to send only the problem statements first, and then only send the solutions out about halfway through. I’ve found most reviewers (myself included) appreciate the decreased temptation to look at solutions too early on.
  • One thing I often do is to have a point person for each problem, to make sure every problem is carefully attempted This is nice, but not mandatory — the nicest problems tend to get quite a bit of attention anyways.
  • One thing I’ve had success with is adding a question on the review form that asks “what six problems would you choose if you were making the call, and why?” I’ve found I get a lot of useful perspective from hearing what people say about this.

I just use Google Forms to collect all the data. There’s a feature you can enable that requires a sign-in, so that the reviewer’s responses are saved between sessions and loaded automatically (making it possible to submit the form in multiple sittings).

Choosing the draft of the test

Now that you have the feedback, you should pick a draft of the test! This is the most delicate part, and it’s where it is nice to have a co-director or small committee if possible so that you can talk out loud and bounce ideas of each other.

For this stage I like to have a table with the numerical ratings as a summary of what’s available. The way you want to do this is up to you, but some bits from my workflow:

  • My table is color-coded, and it’s sorted in five different ways: by problem number, by quality rating, by difficulty rating, by subject then quality rating, by subject then difficulty rating.
  • For the quality rating, I use the weights -0.75, -0.5, 0, 1, 1.5 for Unsuitable, Mediocre, Acceptable, Nice, Excellent. This fairly contrived set of weights was chosen based on some experience in which I wanted to the average ratings to satisfy a couple properties: I wanted the sign of the rating (- or +) to match my gut feeling, and I wanted to not the rating too sensitive to a few Unsuitable / Excellent ratings (either extreme). This weighting puts a “cliff” between Acceptable and Nice, which empirically seems a good place to make the most differential.
  • I like to include a short “name” in the table to help with remembering which problem numbers are which, e.g. “2017-vtx graph”.

An example of what a table might look like is given in the image below.

Here is a output made with fake data. The Python script used to generate the table is included here, for anyone that wants to use it.

WT_U = -0.75
WT_M = -0.5
WT_A = 0
WT_N = 1
WT_E = 1.5

# ---- Populate with convincing looking random data ----
import random
slugs = {
		"A-01" : r"$\theta \colon \mathbb Z[x] \to \mathbb Z$",
		"A-02" : r"$\sqrt[3]{\frac{a}{b+7}}$",
		"A-03" : r"$a^a b^b c^c$",
		"C-04" : r"$a+2b+\dots+32c$",
		"C-05" : r"$2017$-vtx dinner",
		"G-06" : r"$ST$ orthog",
		"G-07" : r"$PO \perp YZ$",
		"G-08" : r"Area $5/2$",
		"G-09" : r"$XD \cap AM$ on $\Gamma$",
		"G-10" : r"$\angle PQE, \angle PQF = 90^{\circ}$",
		"N-11" : r"$5^n$ has six zeros",
		"N-12" : r"$n^2 \mid b^n+1$",
		"N-13" : r"$fff$ cycles",

qualities = {}
difficulties = {}
for k in slugs.keys():
	# just somehow throw stuff at wall to get counts
	a,b,c,d,e,f = [random.randrange(0,3) for _ in range(6)]
	if c >= 1: a = 0
	if a >= 2: d,e = 1,0
	if e == 0: f = 0
	if a == 0 and b == 0: e *= 2
	qualities[k] = [WT_U] * a + [WT_M] * b + [WT_A] * (b+d+e) + [WT_N] * (c+d+e) + [WT_E] * (c+e+f)

for k in slugs.keys():
	# just somehow throw stuff at wall to get counts
	a,b,c,d,e = [random.randrange(0,5) for _ in range(5)]
	if e >= 4:
		b = 0
		c //= 2
	elif e >= 3:
		a = 0
		b //= 2
	if a >= 3:
		e = 0
		d //= 3
	elif a >= 2:
		e = 0
		d //= 2
	difficulties[k] = [1] * a + [1.5] * b + [2] * c + [2.5] * d + [3] * e

# ---- End random data population ----

import statistics
def avg(S):
	return statistics.mean(S) if len(S) > 0 else None
def median(S):
	return statistics.median(S) if len(S) > 0 else None

# criteria for inclusion on chart
criteria = lambda k: True

def get_color_string(x, scale_min, scale_max, color_min, color_max):
	if x is None:
		return r"\rowcolor{gray}"
	m = (scale_max+scale_min)/2
	a = min(int(100 * 2 * abs(x-m) / (scale_max-scale_min)), 100)
	color = color_min if x < m else color_max
	return r"\rowcolor{%s!%d}" %(color, a) + "\n"

def get_label(key, slugged=False):
	if slugged:
		return r"{\scriptsize \textbf{%s} %s}" %(key, slugs.get(key, ''))
		return r"{\scriptsize \textbf{%s}}" % key

## Quality rating
def get_quality_row(key, data, slugged = True):
	a = avg(data)
	s = ("$%+4.2f$" % a) if a is not None else "---"
	color_tex = get_color_string(a, WT_U, WT_E, "Salmon", "green")
	row_tex = r"%s & %d & %d & %d & %d & %d & %s \\" \
			% (get_label(key, slugged),
	return color_tex + row_tex
def print_quality_table(d, sort_key = None, slugged = True):
	items = sorted(d.items(), key = sort_key)
	print(r"\toprule Prob & U & M & A & N & E & Avg \\ \midrule")
	for key, data in items:
		print(get_quality_row(key, data, slugged))

## Difficulty rating
def get_difficulty_row(key, data, slugged = False):
	a = avg(data)
	s = ("$%.3f$" % a) if a is not None else "---"
	color_tex = get_color_string(a, 1, 3, "cyan", "orange")
	row_tex = r"%s & %d & %d & %d & %d & %d & %s \\" \
			% (get_label(key, slugged),
	return color_tex + row_tex
def print_difficulty_table(d, sort_key = None, slugged = False):
	items = sorted(d.items(), key = sort_key)
	print(r"\begin{tabular}{l ccccc c}")
	print(r"\toprule Prob & 1 & 1.5 & 2 & 2.5 & 3 & Avg \\ \midrule")
	for key, data in items:
		print(get_difficulty_row(key, data, slugged))

filtered_qualities = {k:v \
		for k,v in qualities.items() if criteria(k)}
filtered_difficulties = {k:v \
		for k,v in difficulties.items() if criteria(k)}

def print_everything(name, fn = None, flip_slug = False):
	if fn is not None:
		sort_key = lambda item: fn(item[0])
		sort_key = None
	print(r"\section{" + name + "}")
	if flip_slug:
		print_quality_table(filtered_qualities, sort_key, False)
		print_difficulty_table(filtered_difficulties, sort_key, True)
		print_quality_table(filtered_qualities, sort_key, True)
		print_difficulty_table(filtered_difficulties, sort_key, False)

# Start outputting content

\title{Example of ratings table with randomly generated data}


print(r"\section{All ratings}")

print("\n" + r"\newpage" + "\n")
print_everything("Beauty contest, by overall popularity",
		lambda p : (-avg(qualities[p]), p), False)
print_everything("Beauty contest, by subject and popularity",
		lambda p : (p[0], -avg(qualities[p]), p), False)
print("\n" + r"\newpage" + "\n")
print_everything("Beauty contest, by overall difficulty",
		lambda p : (-avg(difficulties[p]), p), True)
print_everything("Beauty contest, by subject and difficulty",
		lambda p : (p[0], -avg(difficulties[p]), p), True)

print(r"\section{Scatter plot}")
print(r"""\begin{axis}[width=0.9\textwidth, height=22cm, grid=both,
	xlabel={Average difficulty}, ylabel={Average suitability},
	every node near coord/.append style={font=\scriptsize},
print(r"""\addplot [scatter,
	only marks, point meta=explicit symbolic,
	nodes near coords*={\prob},
	visualization depends on={value \thisrow{prob} \as \prob}]""")
print(r"table [meta=subj] {")
for p in qualities.keys():
	x = avg(difficulties[p])
	y = avg(qualities[p])
	if x is None or y is None:
	print("%0.2f\t%0.2f\t%s\t%s" %(x,y,p[2:],p[0]))


Of course, obligatory warning to not overly rely on the numerical ratings and to put heavy weight on the text comments provided. (The numerical ratings will often have a lot of variance, anyways.)

One thing to keep in mind is that when choosing the problems is that there are two most obvious goals are basically orthogonal. One goal is to have the most attractive problems (“art”), but the other is to have an exam which is balanced across difficulty and subject composition (“science”). These two goals will often compete with each other and you’ll have to make judgment calls to prioritize one over the other.

A final piece of advice is to not be too pedantic. For example, I personally dislike the so-called “Geoff rule” that 1/2/4/5 should be distinct subjects: I find that it is often too restrictive in practice. I also support using “fractional distributions” in which say a problem can be 75% number theory and 25% combinatorics (rather than all-or-nothing) when trying to determine how to balance the exam. This leads to better, more nuanced judgments than insisting on four categories.

This is also the time to make any last edits you want to the problems, again both copy edits or more substantial edits. This gives you a penultimate draft of the exam.

Test solving

If you can, a good last quality check to do is to have a round of test-solving from an unbiased group of additional volunteers who haven’t already seen the packet. (For the volunteers, this is a smaller time commitment than reviewing an entire packet so it’s often feasible as an intermediate commitment.) You ask this last round of volunteers to try out the problems under exam-like conditions, although I find it’s not super necessary to do a full 4.5 hours or have full write-ups, if you’ll get more volunteers this way. A nice number of test-solvers is 5-10 people.

Typically this test-solving is most useful as a sanity check (e.g. to make sure the test is not obviously too difficult) and for any last minute shuffling of the problems (which often happens). I don’t advise making drastic changes at this point. It’s good as a way to get feedback on the most tricky decisions, though.


After any final edits, I recommend sending a copy of the edited problems and solutions to the reviewers and test-solvers. They’re probably interested to know what problems made the cut, and you want to have eyes going through the final paper to check for ambiguities or errors.

I usually take the time to also send out some details of the selection itself: what the ratings for the problems looked like, often a sentence or two for each problem about the overall feedback, and a documentation of my thought process in the draft selection. It’s good to give people feedback on their problems, in my experience the authors usually appreciate it a lot, especially if they decide to re-submit the problem elsewhere.

And that’s the process.