MOP 2021 puzzle hunt

I worked with a few friends on writing a mini one-round puzzle hunt for this year’s MOP students. If you want to play, you can do so now at the following URL:

I’m not sure whether or not this is going to become a recurring tradition. On the one hand, it was loads of fun to make (including figuring out how to draw art on my iPad); on the other hand, I definitely did blow a couple hundred hours of my life putting this together. ;)

Unrelated advertisement: the Carina Initiatives is currently looking to hire a junior data scientist. They’re a young philanthropic fund that’s interested in supporting initiatives that build actual math problem solving (like, the problem-solving you’d see on this blog, not the mainstream buzzword stuff). I’ve had a lot of fun talking with them over the last few months about problem-solving in general.

My impression is they’re looking for people who have finished a bachelor’s degree in some quantitative field and ideally have experience with any subset of {math/CS/etc competition/circle/camp, education, gaming, entrepreneurship, non-profit operations}. They’re based in SoCal but open to remote + quarterly travel.

If you’re interested in this and I know who you are, send me an email and I’ll connect you.

Unnecessarily detailed stories of my Mystery Hunt puzzles

As a follow-up to the previous post, I am recording here some details about the (numerous, varied, and deep) revisions that a bunch of my puzzles went through during the production process. The puzzles below are listed in chronological order of when I wrote them.

Spoiler warning: Unlike the last post, these will completely spoil all of the puzzles below.

1. A Bit of Light

Surprisingly, the gist of the puzzle mostly survived revisions: the idea of an encoding-based puzzle whose key step was passing through seven-segment displays between the Braille and resistor color encodings. One issue I ran into while constructing was that ABCDEF6 is actually too big in base 16 to store in an eight-bit number, which is part of why I ended up 1337-ing the seven-segment display bits.


The final presentation of the “A Bit of Light” puzzle; it looked pretty much like this at the start too, with worse encodings.

However, in the first draft I apparently had the bright idea that the resistor colors should include a final digit acting as a 10^n multiplier, as in the real electronic color code. Unsurprisingly, nobody actually does this in puzzle hunts, and it made it impossible to extract the hex correctly because it even involved “rounding off” the hexadecimal numbers. So the first testsolve session went up in flames, oops.

I over-corrected a bit after this: in addition to making the colors simple digits, I made the clue phrase for the next step a lot more explicit, something like “resist > 8d int > hex > SSD > hex”. The next test-solves returned a few comments that the puzzle felt a bit like “here is what to do, now do it”. So over time, the clue phrase became less explicit, until it was just “int=>hex=>SSD”. That got better reviews, since it made things a little more rewarding.

The other thing that happened is that the puzzle got progressively shorter (in terms of the number of cells) across versions. There was a lot of concern that the puzzle would be quite tedious if you didn’t know enough spreadsheet or coding tricks to automate transcribing the colors, bits, and so on. Earlier versions of the puzzle had 150-ish cells; the final version ended up with just 56.

We also agreed that it was better to make the 1337-ing of the LED segments consistent; it used to be that either 4 or A could be used, for example, but it was changed to be always 4. This meant that only 16 color patterns ever appeared, which turned out to be not a big deal.
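For concreteness, here is a sketch in Python of the glyph-to-byte step, with a hypothetical assignment of segments a through g to bit positions (the actual puzzle’s bit order may well differ):

```python
# Hypothetical mapping from seven-segment glyphs to bytes; the
# segment-to-bit assignment here is my own, not the puzzle's.
SEGMENTS = {  # which segments (a..g) are lit for each glyph
    "0": "abcdef", "1": "bc", "2": "abdeg", "3": "abcdg",
    "4": "bcfg", "5": "acdfg", "6": "acdefg", "7": "abc",
    "8": "abcdefg", "9": "abcdfg",
}

def glyph_to_byte(ch):
    """Pack the lit segments (a..g) into the low seven bits of a byte."""
    return sum(1 << "abcdefg".index(seg) for seg in SEGMENTS[ch])

# A 1337-ed "E" is displayed as the glyph "3":
print(hex(glyph_to_byte("3")))  # 0x4f with this bit order
```

Since only the sixteen hex glyphs ever appear, only sixteen distinct bytes (and hence color patterns) can come out of this step.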


The various encodings, with the consistent 1337 -> SSD step.

The title of the puzzle used to be “I See The Light”, but we agreed “A Bit of Light” was better. The flavortext wasn’t really used in any solve, but we kept it because it was nice confirmation. The extraction used to include a crossword type thing, which was quickly deleted because it was completely un-thematic and didn’t add anything.

One thing I’m grateful for was having everything automated at the beginning; I basically had a Python script the whole time for building prototypes and editing them.

2. The IMO Shortlist

The basic idea behind the puzzle was: solve some math problems, insert the proposing country into the problem in some way, get a new country, and extract somehow.

The first idea for how to insert the country was by using the calling code for the country as a way to convert the input to a number. It was immediately pointed out that this was totally unthematic, so I dropped that idea pretty quickly.

It’s at this point that I got the idea to use host countries of recent IMOs, say via the year they hosted the IMO or the number of that IMO. I thought it would be cute to use a different mechanism for each subject, though this was partly out of necessity: it’s not at all easy to write a reasonable-looking geometry question that gives a function from R to R. So I chose the “three collinear points” gimmick for geometry, and once I noticed it lined up with the columns of the IMO timeline, I used those columns for the rest.

Anyways, this gave a way to change each old host year into a new year. My original idea for extraction was to take these new years and check which of them had a real IMO problem attached to that year-number pair; it would turn out that each of P1 through P6 appeared exactly once. The true proposing countries of these six problems could then be diagonalized to extract a six-letter answer.

This extraction crashed and burned during test-solving, for a few reasons. First, the transformation I was specifying was a bijection from a set to itself, and people usually expect this to give a cycle, which I didn’t use. Secondly, there were just too many possibilities to make something this far-fetched reasonable, e.g. whether to use the old year or the new year, the old country or the new country or the true country, etc. And third, data-mining the past shortlists was a royal pain (I did try to plant the relevant data on my website but that was after the fact). We basically agreed that to make this mechanism work, we would need a really explicit shell that would be sort of clunky and not too elegant.


Definitely not planting data on my website.

The good news was the feedback about the first part of the puzzle was mostly positive, although the puzzle was possibly too difficult (I remember hearing in feedback that the extraction felt “three steps too long”, and that even without the extraction this would still probably be one of the hardest puzzles in the Hunt.)

As I thought about this a bit more, I started thinking about the cycle idea that had been pointed out (before this, I hadn’t known that bijections in puzzles are often cycles), and about what I could do if the extraction instead led to a single cycle. It certainly occurred to me that there was one meme I could exploit for this: A/C/G/N is almost exactly the same as A/C/G/T. If I abbreviated “Number Theory” as “NT”, then I could use the resulting cycle to spell out the answer, a short phrase, using DNA codons. This wasn’t too thematic, but then again it’s something that most people who have done math contests extensively have noticed at some point or other, and maybe it would be okay to use the idea now.
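The codon step can be sketched as follows. The cycle of subjects here is made up for illustration, and CODONS is a two-entry excerpt of the (real) standard genetic code with amino acid single-letter codes; the puzzle’s actual data and answer are different:

```python
# Illustration of the subjects-to-DNA step with made-up data.
CODONS = {"ATG": "M", "TGG": "W"}  # excerpt of the standard genetic code

def translate(subjects):
    # "NT" plays the role of T; A, C, G map to themselves
    bases = "".join("T" if s == "NT" else s for s in subjects)
    return "".join(CODONS[bases[i:i + 3]] for i in range(0, len(bases), 3))

print(translate(["A", "NT", "G", "NT", "G", "G"]))  # MW
```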


Using the abbreviation “NT”. It’s not an accident!

In building the cycle, there was an obvious bit of extra information: the shortlist numbers themselves are never used (e.g. A1 vs A2). So it was natural to have them be increasing in the correct order of the cycle, i.e. A1 came before A2 in the cycle, A2 came before A3, etc. This provided a strong confirmation that the cycle was the right thing to look at, that the number of the problems is unimportant, and it also gave a lot of error-correction to fix a few mistakes in the math problems.
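The cycle extraction itself is only a few lines of code; this is a generic sketch on a toy bijection, not the actual puzzle data:

```python
def cycle_order(bijection, start):
    """Follow a bijection from `start` until it loops, returning the cycle."""
    out, cur = [], start
    while True:
        out.append(cur)
        cur = bijection[cur]
        if cur == start:
            return out

# toy bijection 1 -> 3 -> 2 -> 4 -> 1:
assert cycle_order({1: 3, 3: 2, 2: 4, 4: 1}, 1) == [1, 3, 2, 4]
```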

Anyways, this involved redoing a lot of the math problems, which was “fun” :P

One other thing that happened during test-solving: one group failed to realize that combo used years rather than IMO numbers (which algebra used) until after they had finished the entire puzzle (error-correction works, hooray). I fixed this by rewriting one of the problems (C3) so that the input obviously had to be a year, rather than a small number, for it to feasibly work.

The flavor text is littered with clues (“stranded”, “timeline”, “official”). The first couple pages of the PDF file were also added to clue a bit more heavily what the solver was expected to do (originally, the proposing countries were alphabetical).


The cover page of the IMO Shortlist.

3. Escape From Life

The author’s notes for this puzzle give a summary of some of the thoughts behind this puzzle. In short, Conway’s Game of Life has appeared before, but it never seemed to be used beyond simply “run one generation”. As I mentioned in my last post, I thought the elaborate list of names of still lifes was a silly/fun source of puzzle material, and wanted to use it.


An example of an ash with cartoons substituted in, featuring MOOSE ANTLERS.

The basic premise of the puzzle (i.e. what I started with) was as follows:

  • Solvers realize the silly images are Conway’s game of life structures.
  • Solvers run a set of specified life seeds and match it to pictures.
  • The life simulations have escaping gliders giving semaphore, so the matching now gives a grid of letters.

The presentation of the seeds for the simulation went through some revisions. Initially, I specified the coordinates as plain text, but I changed this to a pictorial grid with numbers typed in, since there was some confusion about which way the axes were indexed. This mattered a lot for this puzzle: a tiny error makes the simulation completely useless, so it was imperative that the right way to input things be completely unambiguous. I learned this the hard way! (I also originally didn’t think to specify the orientation of the gliders, since I figured teams would be using a scriptable simulator and could just try all 4^4 = 256 ways. Ha, ha, ha. We quickly reversed this decision.)
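Incidentally, the throwaway simulator a solving team might write is genuinely tiny; here is a minimal sketch over a set of live (x, y) cells:

```python
from itertools import product

def step(live):
    """One generation of Conway's Game of Life on a set of live cells."""
    counts = {}
    for (x, y), (dx, dy) in product(live, product((-1, 0, 1), repeat=2)):
        if (dx, dy) != (0, 0):
            key = (x + dx, y + dy)
            counts[key] = counts.get(key, 0) + 1
    # birth on exactly 3 neighbors; survival on 2 or 3
    return {c for c, n in counts.items() if n == 3 or (n == 2 and c in live)}

# The standard glider travels one cell diagonally every four generations:
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
cells = glider
for _ in range(4):
    cells = step(cells)
assert cells == {(x + 1, y + 1) for (x, y) in glider}
```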


The final format of the glider seeds. Solvers complete the squares to gliders in the unique way such that they fly towards the center.

Since the order of the seeds didn’t matter, we sorted them alphabetically (which gave a strong confirmation of the semaphore being correct).

There was some discussion the whole way through of perhaps presenting the seeds in a standard Life format like RLE, because test-solving feedback was mixed about how much work it was to plug the seeds into a Life simulator. Essentially, some teams did the “right” thing and downloaded Golly or similar, but others used clunkier browser apps, and the latter group did not have a good time running the simulations. We eventually kept the human-readable format rather than a machine-readable one, for two reasons: figuring out how to fit the gliders so they run towards the center was at least a little interesting, and machine formats might carry “too much” information (solvers might start reading into the grid size, the starting location, why every scenario had exactly four gliders, and so on).
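For reference, parsing the machine-readable RLE format is also only a few lines. This is a sketch of the format’s core tags (b for dead, o for alive, $ for end of row, ! for end of pattern), not code from the hunt:

```python
import re

def parse_rle(pattern):
    """Parse the body of a Life RLE string into a set of live (x, y) cells."""
    cells, x, y = set(), 0, 0
    for count, tag in re.findall(r"(\d*)([bo$!])", pattern):
        n = int(count) if count else 1
        if tag == "b":          # run of dead cells
            x += n
        elif tag == "o":        # run of live cells
            cells |= {(x + i, y) for i in range(n)}
            x += n
        elif tag == "$":        # end of row(s)
            x, y = 0, y + n
        else:                   # "!" ends the pattern
            break
    return cells

# the canonical glider:
assert parse_rle("bo$2bo$3o!") == {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
```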

For me the most difficult part of setting this puzzle was what to do after obtaining the grid of letters. I had an idea that you could try and place the 16 things into a grid, and then run more Life so that the gliders collided further with each other. This turned out to be infeasible and the idea didn’t go anywhere.

The better idea I had was something like the following: when the semaphore is matched onto the 4×4 grid, the solver gets


and the cells with the capital letters above are tinted. I was hoping that the solver could read out the letters T-FIVE which would helpfully tell the solver to run the simulation five steps and end up with the correct answer FLYER.

There was one problem with this: I could not get any of my simulations to produce two orthogonal gliders, no matter how hard I tried, even with my gaming PC running glider collisions overnight.

Alas, this meant I could not spell the letter F or R using this method. I tried to work around this by replacing one letter from each word with an X, which the solver would have to fill in themselves, hoping that was good enough.

In test-solving, this went up in flames. The random X’s actually convinced the solvers that semaphore did not work, and even after this, the message T-FIVE was apparently not noticeable. We thought about a fix where we just told the solvers explicitly to mirror the image once they had it, which would at least work, but was not really elegant, in my opinion.

For the next week, I played around with all sorts of terrible alternative mechanisms to try and get something clean. Almost nothing I tried worked; the fundamental problem of being unable to construct the letter F or R when my assigned answer was FLYER made most things I did quite unnatural. I didn’t sleep well those nights, because all I could think about was how I was going to work my way out of this one.

(One approach would have been to just give up on having 4-glider syntheses, which I had decided on from the start. But simulations with more than four gliders were annoying, and the directions would become a lot more confusing.)

It wasn’t until I had the idea of using a word square that I finally made any progress. Word squares have a lot of redundancy, so if I could specify the letters in the “living” cells, the dead cells could also have any letter I wanted. A-ha! I went and scraped Google for compilations of word squares, and started searching through them for ones that might work, eventually finding one that did.

So eventually, we managed to get the 6×6 word square that made it into the final version. I deliberately opted for a symmetric word square, because I wanted to keep the number of simulations under 20 (because of the earlier feedback about the tediousness of using a bad simulator). We added a “t+1” clue because “+1” alone was apparently not enough during testsolving.
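The symmetry property I relied on is easy to state in code: in a symmetric word square, row i and column i read the same word, which is exactly why specifying only some letters can pin down the rest. A toy check, using a classic 4×4 square rather than the puzzle’s actual 6×6 one:

```python
def is_symmetric_square(words):
    """True if the words form a word square equal to its own transpose."""
    n = len(words)
    return all(len(w) == n for w in words) and all(
        words[i][j] == words[j][i] for i in range(n) for j in range(n)
    )

assert is_symmetric_square(["CARD", "AREA", "REAR", "DART"])
```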


The word square, all done!

This was one of the few puzzles I wrote that have blanks in them, in the image where the still-lifes were first identified. Technically, I think some people could probably solve this puzzle without that image, if glider synthesis was clued a bit more in a different way. However, I thought this was a case where it’s more fun to break in by figuring out the (in some cases, ludicrous) names of the still-lifes gradually, and then being told explicitly to build the gliders with the resulting message.

4. Le Chiffre Indéchiffrable

To be honest, this puzzle was on the verge of not making it. In short, I had the idea of a Vigenere cipher using Unicode with a foreign language key… and that was pretty much it, lol.


It’s pretty minimalist as far as puzzle pages go.

The first draft was based on the idea that you have some famous song (think Taylor Swift), except it had been translated into some language L and back, and you needed to figure out the language L. This didn’t work at all, because there isn’t any good way to do this other than brute-forcing languages through Google Translate.

The second idea was that each key was the name of a KPop band and the plaintext was the song lyrics of one of their songs. It turns out there are a lot of KPop bands whose names can be read as a number like Iz*One, TWICE, GOT7, etc. Well, it turns out this was done already.

So I kind of gave up and just made a maximally uninspired extraction where you take the first letters of song titles. We decided to get a test-solve of this anyway, where it naturally got pretty low ratings; but by luck, Brian Chen test-solved it and mentioned Google Translate Sings as possibly thematic puzzle material.

I had never heard of Google Translate Sings, but when I did, I realized I now had puzzle data: there was a canonical episode number for all the songs. After doing a cursory search for “puzzle hunt” and “google translate” to confirm this hadn’t been done already, I went for it.

Anyways, at this point, the parameters I had to play with were:

  • Choose a language for the key
  • Choose a language for the plaintext
  • Choose the content of the key
  • Get an episode number for the plaintext

I had the idea that we could run a “do it again”, in which the key itself was a Unicode character, while the episode number from the plaintext was what you subtracted from. That left the languages free to vary, and since I wanted to “use up” this information, I rigged the first letters of the languages to spell “FANDOM” and “NUMBER”, to make sure the episode numbers were hinted at specifically.


Rigging the languages to spell out a cluephrase.

One small wrench in this plan was a Genius listing for Google Translate Sings that was (a) incomplete and (b) full of wrong numbers. I had to carefully select songs that didn’t appear on Genius, so that anyone who found the Genius page would know immediately that it wasn’t what they were looking for.

The other wrench was that the Malinda Wiki is not updated that frequently, so some of the more recent songs I needed were not there. So, I am now the proud author of its Never Gonna Give You Up page.


The song is great, by the way.

It was also a bit constraining which Unicode characters I could use for the keys, because I needed the translation of the key back into English to be recognizable (it wasn’t always). One other thing I did here was specifically use NULL (Unicode code point 0) as one of the characters, as a de-confirmation of several wrong approaches, like trying to index by the Unicode number somehow.

We gave the first key to the solver because we wanted the solver to first have a confirmation they understand the Unicode Vigenere correctly before going on to try and do cryptanalysis.
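To make the mechanic concrete, here is one plausible reading of “Vigenère over Unicode”: shift each plaintext code point by the corresponding key code point. This is a guess at the convention, not necessarily the puzzle’s exact one; note how a NULL key (code point 0) is the identity shift:

```python
# Sketch of a code-point Vigenère; the puzzle's exact convention may differ.
def encrypt(plaintext, key):
    return "".join(
        chr(ord(p) + ord(key[i % len(key)])) for i, p in enumerate(plaintext)
    )

def decrypt(ciphertext, key):
    return "".join(
        chr(ord(c) - ord(key[i % len(key)])) for i, c in enumerate(ciphertext)
    )

assert encrypt("HELLO", "\x02") == "JGNNQ"
assert encrypt("HELLO", "\x00") == "HELLO"  # NULL key does nothing
```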

5. Blind Calculation

This was pretty straightforward. I saw the answer PRODUCT TESTING in the pool, looked at my abacus, and thought, “huh, the abacus has a grid”. So the idea of a mapping between Braille and the abacus was born.

Well, actually, to be pedantic, I own a soroban specifically (rather than the Chinese suanpan). Anyways, in keeping with the Japanese roots of the soroban, I thought it would be appropriate to use Japanese Braille rather than standard Braille, which was nice.


The soroban as it appears in the puzzle.

I was a bit worried about the cluephrase seihin tesuto being ambiguous, though test-solvers seemed to get it fine. (There was a suggestion at some point of adding an enumeration 7 7, but by coincidence this is exactly the number of rods, which probably would have led to some red herrings.)

6. Clueless

During test-solving, most people figured out Hanabi and the letter encoding pretty quickly (and there is a lot of confirmation that both are correct). As I mentioned in the previous post, there used to be only four scenarios, but we added a fifth, easy one to make the rules more well-defined after the first test-solve group struggled. In addition, the plans were moved to the bottom and de-emphasized, since the first test-solving group tried to use them too early; we also added “starts” to make it clear that the first player starts.


The eventual presentation of the puzzle.

Quoting further from the Author’s Notes for the original inspiration:

After playing maybe too many hours of Hanabi during quarantine, I knew I wanted to write a Hanabi puzzle for the Mystery Hunt as well, especially after seeing a Hanabi duck konundrum in the inaugural teammate hunt.

When I played Hanabi at summer math camps as a teenager, the only conventions we would have were play left, discard right, (forward) finesses, and the “fake finesses” (a worse form of the bluff). Plays and saves weren’t even clearly differentiated. Thus our win rate was quite low, and it was not until I played with a more well-developed convention set that I started appreciating the complexity (and fun) of Hanabi.

So, I wanted to develop a puzzle that would showcase at least some of that complexity. Since it’s so difficult to use “conventions” as a puzzle mechanic, I settled on the idea of building a logic-only puzzle around the notion of pace, without regard for efficiency (a measure of how much information is communicated by clues). The idea that the number of discards is limited, and that clues can be used not just to indicate cards but also to stall for time, gave me a mechanic that I thought would be new and interesting even to people who may have casually played Hanabi before. It was a lot of fun during test-solving to watch groups slowly realize (and then prove) that no cards could be discarded in the second scenario.
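For Hanabi players who haven’t seen it, pace has a standard one-line definition in the convention community; treat the formula below as my paraphrase of that definition, roughly the number of discards the team can still afford before a perfect score becomes impossible:

```python
# Pace, roughly as the Hanabi convention community defines it.
def pace(score, cards_in_deck, num_players, max_score=25):
    """When pace hits 0, every remaining turn must be a play or a stall clue."""
    return score + cards_in_deck + num_players - max_score

# A fresh 2-player no-variant game (50-card deck, 5 cards dealt to each):
assert pace(score=0, cards_in_deck=40, num_players=2) == 17
```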

For extraction, choosing the plans was a fun exercise in Nutrimatic, since if I used a “random” string, it would have been too easy to figure out which letters overlap just by looking at which cards were still left to play. Various strings appeared in one version or another, many of which just barely didn’t work:


I also originally considered using tap code for the encoding of the letters, but this seemed like an un-thematic complication that was quickly scrapped.

I didn’t really want to use the crossword cluephrase FINAL PROPOSAL, but the current answer has the unfortunate property that it is almost uniquely determined by its first two letters, so this was necessary to prevent solvers from short-circuiting the puzzle prematurely.

The title of the puzzle was suggested by Danny Bulmash.

One of the things that was hotly debated was the “Chances: m of n!” notation. Technically, the number m is not needed at all, and it seems like it would be more straightforward to say “n left” or similar. However, I wanted to include it anyways, for a few reasons:

  • I felt it was thematic.
  • Writing n! communicated clearly that it was the deck order, rather than say turns left.
  • It emphasizes that the team is super unlikely to win (which I think is important for telling solvers the route is unique). For example, I think test-solve groups used the fact that 2 in 720 is not reduced.
  • It helps check-sum that the solution is right.

Incidentally, I’m almost more proud of the solution write-up than the puzzle itself — it took a lot of time for me to put together.

7. Nutrition Facts

This was also a pretty quick puzzle. One of the remaining puzzle answers was GOOD FAT, which motivated me to look at some nutrition facts labels and think about how the numbers there could be made into a puzzle. Well, numbers correspond to letters, so those bold numbers could be put to good use…

I was reminded of the 9-4-4 rule I learned in grade school, which meant there was a natural way to link the numbers “hidden in plain sight”. I originally wanted calories to be the unknown number to compute, since it was the biggest number, and if it was missing, everyone would think to compute it right away. However, it was hard to make that work with A1Z26, since the calorie count was usually going to be more than 100, making it awkward to map the calorie numbers to letters naturally, even allowing scaling by 10 or 100 or whatever.

That’s when I had the idea to make protein the hidden number instead, because for unknown reasons, the percent on protein is usually missing. That worked much better with A1Z26 and completed the puzzle.
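The recovery step is simple arithmetic; a sketch, with made-up label numbers:

```python
# The 9-4-4 rule: fat has 9 kcal/g, carbohydrate and protein 4 kcal/g each,
# so a missing protein value can be backed out from the rest of the label.
def hidden_protein_grams(calories, fat_g, carb_g):
    protein_kcal = calories - 9 * fat_g - 4 * carb_g
    assert protein_kcal % 4 == 0, "label doesn't fit the 9-4-4 rule"
    return protein_kcal // 4

grams = hidden_protein_grams(calories=200, fat_g=12, carb_g=18)
assert chr(ord("A") + grams - 1) == "E"  # A1Z26 on the hidden number
```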

Similar to Le Chiffre, I included one example food, a HERRING, so that the solvers knew what was going on. This also gave me one extra degree of freedom to let the total number of calories of all the foods equal 2000; not strictly necessary, but it was a nice artistic touch.


SVG pie chart. When you hover over in the real puzzle, the highlighted section changes and a different color is revealed.

Originally, I just had a pretty dorky pie chart with food images pasted onto it, but during post-prodding I had a bit more fun making an interactive HTML pie chart that responds to hovering.

8. Divided Is Us, written with Jakob Weisblat

As written in the author’s notes:

This was the last puzzle to be written for the 2021 MIT Mystery Hunt. At a meeting in late December, the editors announced there was exactly one answer remaining, and it was a teamwork time puzzle. As it happens, I was thinking earlier that day how we didn’t have our obligatory Baba puzzle yet, and the idea for the puzzle was born.

(I should say now that actually there was one other puzzle that was written after this: Better Bridges, which was a last-minute backup puzzle.)

To look into whether this was feasible, I did some research on open-source Baba Is You libraries. I settled on using baba-is-auto, although the library is still a work-in-progress as far as I can tell, and I needed to do some editing of the library to make it work.

Anyways, yeah, the premise was co-operative Baba puzzles, where there were different blocks for the various players. The obvious options for the win condition would be “win if anyone reaches the goal”, or “win if everyone reaches the goal”, but the intermediate option of having YOU reach the goal turned out to work the best. (This was not immediately obvious to teams during test-solving even a few levels in, so we had to rewrite the intro level to make it more blatant.)

Meanwhile, I had the thought that it would be cool if the game slowly revealed that everyone was seeing parts of the same world, and say, around level 4 or so, you would suddenly realize this when someone moved their icon onto another player’s screen. Accordingly, we hid the “other player” viewports until after this revelation (at which point some people wondered if they had simply not noticed those viewports all along).


The fifth level of Divided Is Us. One player in each quadrant.

I’m pretty happy with the extraction: I think it’s a good example of how you can embed encodings in unexpected places. Here, I took advantage of the fact that the game was secretly one world to hide semaphore and Braille in it (conveniently, flags are not too conspicuous), and it was a lot of fun to watch people realize this around level 6 or 7, as they were reaching the final level. However, I also couldn’t resist using the names of the various icons; it was convenient that many of them were four-letter words. Anyways, the answer SECOND GRAND ALLIANCE is pretty long, so there was enough space to use one mechanism per word.

(I did sort of break one puzzle design rule here: you couldn’t finish the puzzle without solving every Baba level. This went okay in test-solving though.)

There were two design choices I wish I had done differently:

  • We tried to set up the game so that each person got assigned a player, which wouldn’t change if they refreshed the page. This ended up causing issues when websockets got disconnected or the server went down, causing certain player slots to get locked out. It wasn’t too hard to work around (e.g. by using private windows), but it did annoy a bunch of the teams.
  • We didn’t show the total number of levels, which I regret a bit. I think it would have been more fun for people to realize they were getting to the end. Also, in one case a team managed to prove to themselves the penultimate level was not solvable, and was unsure what to do from there. Had we put a more obvious level indicator, this could have been a bit clearer.
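For what it’s worth, the first issue could be mitigated by handing out player slots as expiring leases rather than permanent assignments, so a dropped connection eventually frees its slot. A sketch (all names hypothetical, not the actual hunt server’s code):

```python
import time

# Slots are keyed by a client token and expire after `ttl` seconds,
# so a dropped websocket frees its slot instead of locking it out.
class SlotAssigner:
    def __init__(self, num_slots, ttl=60.0):
        self.num_slots, self.ttl = num_slots, ttl
        self.leases = {}  # token -> (slot, expiry)

    def acquire(self, token, now=None):
        """Return this token's slot (renewing the lease), or None if full."""
        now = time.monotonic() if now is None else now
        self.leases = {
            t: l for t, l in self.leases.items() if l[1] > now or t == token
        }
        if token in self.leases:
            slot = self.leases[token][0]
        else:
            taken = {slot for slot, _ in self.leases.values()}
            free = [s for s in range(self.num_slots) if s not in taken]
            if not free:
                return None
            slot = free[0]
        self.leases[token] = (slot, now + self.ttl)
        return slot

a = SlotAssigner(num_slots=2, ttl=10.0)
assert a.acquire("alice", now=0.0) == 0
assert a.acquire("bob", now=1.0) == 1
assert a.acquire("carol", now=2.0) is None   # full
assert a.acquire("alice", now=5.0) == 0      # refresh keeps the same slot
assert a.acquire("carol", now=20.0) == 0     # expired leases free up
```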

Finally, while I’m here, I’d like to thank Arvi Teikari for the permission to use the game concept and art for this puzzle.

Closing thoughts

I was pretty pleasantly surprised by how many puzzles I was able to produce over the span of the hunt-writing process. The larger puzzles here on average took over my life for a week, and Shortlist and Escape From Life took even longer. It helped a lot that I was taking time away from school this year; I think otherwise there isn’t any way I would have been able to find enough time to put these puzzles together.

On the flip side, though, I think I’ve just about used up all my puzzle ideas for the foreseeable future. (At least, I say that now.) But I don’t have any regrets about this. I’m not sure I’ll ever have a chance to write for the Mystery Hunt again, so I’m glad I was able to make the most of the chance I got, and I couldn’t have asked for a better team to have written the Mystery Hunt with.

Some puzzle-writing thoughts from an amateur

The 2021 Mystery Hunt concluded a while ago, and wow, what an experience. I was lucky enough to be on the organizing team, and I am so proud right now to be able to say I am a ✈️✈️✈️ Galactic Trendsetter ✈️✈️✈️. (If you don’t know what a puzzle hunt is, betaveros has a great introduction.)

I came into the MIT Mystery Hunt with no writing experience at all, and ended up being listed as the author of quite a few puzzles. It was a trial by fire, to say the least.

In this post I want to say a bit about the parts of the puzzle-writing process that surprised me the most, in the hopes that it might be helpful to future authors. A nice introduction is given by e.g. David Wilson, which I read many times before I actually attempted to write my first puzzle. Most of what’s said here repeats or builds off of what is already there.

Spoiler warning: this post contains minor spoilers for some of my (Evan’s) puzzles, but nothing that would completely ruin them. The eight puzzles I made major contributions to for the 2021 hunt are A Bit of Light, Blind Calculation, Clueless, Divided Is Us, Escape From Life, Le chiffre indéchiffrable, Nutrition Facts, and The IMO Shortlist.

0. Overview

When I first went into puzzle writing, I had the idea that a puzzle was a series of steps the solver takes to go from the initial data to an answer. The puzzle-designer’s job was to make figuring out those steps challenging but rewarding (rather than trivial or frustrating). This point of view is really focused on the mechanical part of the puzzle.

After my hunt experiences, I found myself internally guided by splitting puzzle-writing goals into two orthogonal parts: the mechanical half and the aesthetic half. (Or if you are feeling zen, “science” and “art”.) The mechanical part refers to the details of how the solver works through the puzzle, but the aesthetic part refers to things like the theming of the puzzle or so on. I’ll talk a bit more about these below.

1. Finding a core mechanism

I think the thing that was most helpful for me in starting out was to be on the lookout for things that “feel puzzly”. I’m still not sure exactly what I mean when I say that, but I think I vaguely mean things satisfying one or more of the following criteria:

  • Thing is very “well-defined”; for example, it has a canonical reference you can find on the Internet.
  • Thing has a grid.
  • Thing lends itself to puns or wordplay.
  • Thing has a natural mapping, e.g. a way to turn a number into a word (say, episode numbers of a show).
  • Thing has numbers that work well with A=1, B=2, …, Z=26.
  • Thing has a concept of compass directions for flag semaphore.

devjoe’s index is useful both for getting ideas and for making a routine check of whether an idea you have has already been used. For example, when writing The IMO Shortlist, the first thing I did was look through it to see whether math contests more generally had appeared in any Mystery Hunt (answer: no).

I’m already starting to mention encodings, so I’ll do that now for those of you that have never seen them. Basically, there is a list of puzzle encodings that all experienced solvers are expected to be able to notice in Mystery Hunt. This has the upshot that they can be used freely in puzzles! Examples are:

  • Anything with numbers could be A=1, B=2, …, Z=26.
  • Anything with a grid could be Braille.
  • Anything with directions at all could be semaphore.
  • Any time you have a bunch of strings, the first letters might spell a message.
  • Any time you have a sequence with exactly two kinds of objects (ones or zeros, on or off, true or false, red or blue, dot or dash) consider binary (which could then become letters by A1Z26) or occasionally Morse code.
  • Any time you have words of the same length, see if their letters overlap.
  • Any time you have a bijection f from a set to itself, you should see if that function forms a cycle. If so, take that as an ordering.

The extraction part of Brian Chen’s introduction to puzzle hunts goes over all of these in more detail.
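To make a couple of these bullets concrete, here is a small Python sketch of the two most common encodings (the helper names are my own; the encodings themselves are the standard ones):

```python
def a1z26(numbers):
    """Decode A=1, B=2, ..., Z=26."""
    return "".join(chr(ord("A") + n - 1) for n in numbers)

def binary_to_letters(bits, width=5):
    """Split a bit string into fixed-width chunks, then decode via A1Z26."""
    chunks = [bits[i:i + width] for i in range(0, len(bits), width)]
    return a1z26([int(c, 2) for c in chunks])

print(a1z26([16, 21, 26, 26, 12, 5]))  # PUZZLE
print(binary_to_letters("0100001001"))  # 01000 -> 8 -> H, 01001 -> 9 -> I
```

Of course, in a real puzzle the hard part is noticing the encoding at all; the decoding itself is mechanical, as above.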


Table of common encodings (click to open)

Examples of moments in my life when I saw something and said “this would be good puzzle material”, and flew with it:

  • Blind Calculation came about because I was staring at an abacus on my table and saw a grid.
  • Escape From Life came about because I thought that this website (spoiler) was a puzzle gold-mine. (I had a lot of fun drawing a SHIPTIE as a ship with a tie.)
  • Nutrition Facts came about because I was staring at a nutrition facts label and noticed how there were a lot of numbers with relations between them.

One thing I occasionally found helpful was to start with the answer to the puzzle first to help narrow down the focus a bit. For the 2021 MIT Mystery Hunt, there was a big pool of several hundred answers, about 50 of which were for the Students round (shorter easy puzzles, like the fish puzzles from 2015). The Students answers were put in a big pool, and you could request one of them to start and then write a puzzle with it. Blind Calculation and Nutrition Facts were both started this way, where I was first given the answer to the puzzle, and then I started looking for objects that both felt puzzly and fit thematically. But it’s not necessary; the other six puzzles I wrote were because I had an idea in an area I knew well, came up with a mechanism, and got an answer assigned once I had a plan written out.

2. Aesthetic

I think the thing that surprised me most is that you can have a really successful puzzle which is almost trivial mechanically (in the sense the solver has pretty clear instructions on what to do the entire way through). For example, this year’s hunt featured a puzzle called Boggle Battle where most of the work to be done was playing cooperative Boggle with friends, and the only “puzzly” part was the extraction. Similarly, So You Think You Can Count has minimal puzzle elements, and doesn’t have an extraction at all (upon reaching 100, the solvers are simply given the answer). They both got high ratings during testsolving because the solvers had fun anyways.

Helpful Firefly, an iconic character from So You Think You Can Count?

I think the conclusion I drew from this was that while mechanics are important (e.g. things like checksumming, eliminating false trails, no extra information, etc.) caring about the aesthetics of the puzzle is more important. (Of course, like many things in life, they are not completely orthogonal: thematic coherence is one of the best confirmations a puzzle-solver can have when they have a guess at the next step.)

In particular, one useful guideline is to consider cutting any steps that don’t feel thematically coherent.

But this was not always a prohibition! For example, despite the advice that it’s better to have only one a-ha per puzzle, I found out that I could generally get away with having multiple a-ha’s for a puzzle as long as all of them felt thematically coherent with each other. Le chiffre indéchiffrable is a decent example of this in my opinion, where despite multiple steps and moving parts, the steps all seem to resonate with each other.

An anti-example of this might be the extraction step of IMO Shortlist, which arguably felt a little out of place, although I ended up keeping it anyways for lack of a better alternative.

3. Writing

Once I had the general mechanism for the problem down, there were often still a lot of smaller local decisions to make as I composed the puzzle. Editors were helpful at this step to provide feedback; but there are some guidelines that I would constantly check against during the writing process (pre-emptively addressing concerns).

Some examples (by no means the only ones):

  • Do whatever you can to confirm correct paths, e.g. make it so that if a solver tries the right thing, they immediately get a confirmation of some sort.
  • Try to make sure all the information is used, and minimize arbitrary choices. For example, alphabetize things when the given order doesn’t matter. (There are some exceptions to this, e.g. one common puzzle mechanic is that you have a sea of letters and draw some lines, and look at the intersected letters. Thor from Puzzle Potluck 3 is an example. Necessarily, some of the “distracting” letters are never used.)
  • Provide enough redundancy or checkpoints so that a team can still solve the puzzle even with a few errors. For example, you might try to set things up so that one error leads to just a single garbled letter in a phrase which can be noticed and fixed; or, if you have a list of objects to identify, having them in alphabetical order will also help fix mistakes.

Basically, I would constantly put myself into the shoes of the solver. At each milestone in the puzzle, I would imagine what they were looking at, and how they might react to it. I often paid special attention to thinking about what feedback they would get if they did the correct thing, and what feedback they would get if they made a mistake (e.g. “if they make a small error here, the number they get probably won’t be an integer, so they’ll know right away they made a mistake”), and what possible wrong paths the solver could take.

It’s also worth paying a bit of attention to accessibility of the puzzle. For my puzzles, this mostly consisted of:

  • Adding alt-texts to images (especially any text in the image).
  • Including enlarged versions of any image I used (typically, if you click an image, it opens the full-resolution image).
  • Keeping colorblind or black/white printing in mind when choosing colors (or choosing not to use colors). Though I had to violate this blatantly for Clueless, where the exact choice of colors was important to the puzzle.

4. Testsolving

If you are a software programmer, then debugging is part of your life. No matter how experienced you are, you are going to have some bugs in your code. The compiler will catch some of them; testing will catch others. You might have fewer bugs over time as you get more experienced, but they will never go away completely.

I think the same is largely true for puzzle-writing.

I don’t mean this in the sense that the puzzle will literally have an outright error or typo every so often (though if we’re being honest here, all my first drafts sure did). What I really mean is that the first version of the puzzle will have unexpected issues, and often the only reasonable way to find them is through test-solving. I had the common experience of watching my puzzle completely fall apart because the solvers were doing something totally different from what I wanted.

For example, in Clueless:

  • The “(starts)” text wasn’t there originally. I thought it was clear that 1 should start. The test-solvers didn’t, and tried to get scenarios to work by having 2, 3, or 4 start instead.
  • The texts like “FIVES WIN THIS” used to be at the top in bold, and test-solvers tried to use this information too early in ways that made no sense to me. I moved it to the bottom to make it less prominent.
  • There used to only be four scenarios, but the solvers got stuck quickly. It was necessary to add one easy scenario to help as a foothold to clarify what rules were allowed.

In the programming analogy, the editors are the compilers: they try to point out preliminary issues. The test-solvers are the unit tests. Since test-solver time is a valuable resource, it’s important to have a version that is ready to testsolve for them, but at the same time, I wish I hadn’t agonized so hard sometimes on minor decisions before testsolving, because the feedback from testsolving often led to huge revisions anyways.

So, even though I mostly worked without coauthors, the editors and test-solvers were the primary source of feedback for revising the puzzle.

5. Closing thoughts

This is all for now. I’m hoping to make a later post with more details about the creation process of each of the eight individual puzzles mentioned above.

In the meantime, I’d like to give one last shout-out to everyone on Galactic who helped me so much in getting onboarded into the puzzle-writing process. Thanks y’all!

USEMO Problem Development, Behind the Scenes

In this post I’m hoping to say a bit about the process that’s used for the problem selection of the recent USEMO: how one goes from a pool of problem proposals to a six-problem test. (How to write problems is an entirely different story, and deserves its own post.) I choose USEMO for concreteness here, but I imagine a similar procedure could be used for many other contests.

I hope this might be of interest to students preparing for contests to see a bit of the behind-the-scenes, and maybe helpful for other organizers of olympiads.

The overview of the entire timeline is:

    1. Submission period for authors (5-10 weeks)
    2. Creating the packet
    3. Reviewing period where volunteers try out the proposed problems (6-12 weeks)
    4. Editing and deciding on a draft of the test
    5. Test-solving of the draft of the test (3-5 weeks)
    6. Finalizing and wrap-up

Now I’ll talk about these in more detail.

Pinging for problems

The USA has the rare privilege of having an extremely dedicated and enthusiastic base of volunteers, who are going to make the contest happen rain or shine. When I send out an email asking for problem proposals, I never really worry I won’t get enough people. You might have to adjust the recipe below if you have fewer hands on deck.

When you’re deciding who to invite, you have to think about a trade-off between problem security versus openness. The USEMO is not a high-stakes competition, so it accepts problems from basically anyone. On the other hand, if you are setting problems for your country’s IMO team selection test, you probably don’t want to take problems from the general public.

Submission of the problems is pretty straightforward: ask to have problems emailed as TeX, with a full solution. You should also ask for any information you care about to be included: the list of authors of the problem, any other collaborators who have seen or tried the problem, and a very rough estimate of the difficulty. (You shouldn’t trust the estimate too much, but I’ll explain in a moment why this is helpful.)

Ideally I try to allocate 5-10 weeks between when I open submissions for problems and when the submission period ends.

It’s at this time you might as well see who is going to be interested in being a test-solver or reviewer as well — more on that later.

Creating the packet

Once the submission period ends, you want to then collate the problems into a packet that you can send to your reviewers. The reviewers will then rate the problems on difficulty and suitability.

A couple of minor tips for setting the packet:

  • I used to try to sort the packet roughly by difficulty, but in recent years I’ve switched to random order and never looked back. It just biases the reviewers too much to have the problem number matter. The data has been a lot better with random order.
  • Usually I’ll label the problems A-01, A-02, …, A-05 (say), C-06, C-07, …., and so on. The leading zero is deliberate: I’ve done so much IMO shortlist that if I see a problem named “C7”, it will automatically feel like it should be a hard problem, so using “C-07” makes this gut reaction go away for me.
  • It’s debatable whether you need to have subject classifications at all, since in some cases a problem might not fit cleanly, or the “true” classification might even give away the problem. I keep it around just because it’s convenient from an administrative standpoint to have vaguely similar problems grouped together and labelled, but I’ll explicitly tell reviewers not to take the classification seriously; it’s there only as a convenience.

More substantial is the choice of which problems to include in the packet if you are lucky enough to have a surplus of submissions. The problem is that reviewers only have so much time and energy and won’t give as good feedback if the packet is too long. In my experience, 20 problems is a nice target, 30 problems is strenuous, anything more than that is usually too much. So if you have more than 30 problems, you might need to cut some problems out.

Since this “early cutting” is necessarily pretty random (because you won’t be able to do all the problems yourself single-handedly), I usually prefer to do it in a slightly more egalitarian way. For example, if one person submits a lot of problems, you might only take a few problems from them, and say “we had a lot of problems, so we took the 3 of yours we liked the most for review”. (That said, you might have a sense that certain problems are really unlikely to be selected, and so you might as well exclude those.)

You usually also want to make sure that you have a spread of difficulties and subjects. Actually, this is even more true if you don’t have a surplus: if it turns out that, say, you have zero easy algebra or geometry problems, that’s likely to cause problems for you later down the line. It might be good to see if you can get one or two of those if you can. This is why the author’s crude estimates of difficulty can still be useful — it’s not supposed to be used for deciding the test, but it can give you “early warnings” that the packet might be lacking in some area.

One other useful thing to do, if you have the time at this point, is to edit the proposals as they come in, before adding them onto the packet. This includes both making copy edits for clarity and formatting, as well as more substantial changes if you can see an alternate version or formulation that you think is likely to be better than the original. The reason you want to do this step now is that you want the reviewer’s eyes on these changes: making edits halfway through the review process or after it can cause confusion and desynchronization in the reviewer data, and increases the chances of errors (because the modifications have been checked fewer times).

The packet review process

Then comes the period where your reviewers will work through the problems and submit their ratings. This is also something where you want to give reviewers a lot of time: 6-8 weeks is a comfortable amount. Between 10 and 25 reviewers is a comfortable number.

Reviewers are asked to submit a difficulty rating and quality rating for each problem. The system I’ve been using, which seems to work pretty well, goes like follows:

  • The five possible quality ratings are “Unsuitable”, “Mediocre”, “Acceptable”, “Nice”, “Excellent”. I’ve found that this choice of five words has the connotations needed to get similarly calibrated distributions of ratings across the reviewers.
  • For difficulty, I like to provide three checkboxes “IMO1”, “IMO2”, “IMO3”, but also tell the reviewer that they can check two boxes if, say, they think a problem could appear as either IMO1 or IMO2. That means in essence five ratings {1, 1.5, 2, 2.5, 3} are possible.

This is what I converged on as far as scales that are granular enough to get reasonable numerical data without being so granular that they are unintuitive or confusing. (If your scale is too granular, then a person’s ratings might say more about how a person interpreted the scale than the actual content of the problem.) For my group, five buckets seems to be the magic number; your mileage may vary!
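The checkbox scheme for difficulty can be sketched in a couple of lines (the function here is my own illustration, not code from any actual form-processing pipeline; it just averages the checked boxes):

```python
def difficulty_value(checked):
    """Convert a set of checked boxes (a subset of {1, 2, 3}) to a rating.

    One box gives its own value; two adjacent boxes average to a
    half-step, so the possible outputs are {1, 1.5, 2, 2.5, 3}.
    """
    return sum(checked) / len(checked)

print(difficulty_value({2}))     # 2.0
print(difficulty_value({1, 2}))  # 1.5
```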

More important is to have lots of free text boxes so that reviewers can provide more detailed comments, alternate solutions, and so on. Those are ultimately more valuable than just a bunch of numbers.

Here’s a few more tips:

  • If you are not too concerned about security, it’s also nice to get discussion between reviewers going. It’s more fun for the reviewers and the value of having reviewers talk with each other a bit tends to outweigh the risk of bias.
  • It’s usually advised to send only the problem statements first, and then only send the solutions out about halfway through. I’ve found most reviewers (myself included) appreciate the decreased temptation to look at solutions too early on.
  • One thing I often do is to have a point person for each problem, to make sure every problem is carefully attempted. This is nice, but not mandatory — the nicest problems tend to get quite a bit of attention anyways.
  • One thing I’ve had success with is adding a question on the review form that asks “what six problems would you choose if you were making the call, and why?” I’ve found I get a lot of useful perspective from hearing what people say about this.

I just use Google Forms to collect all the data. There’s a feature you can enable that requires a sign-in, so that the reviewer’s responses are saved between sessions and loaded automatically (making it possible to submit the form in multiple sittings).

Choosing the draft of the test

Now that you have the feedback, you should pick a draft of the test! This is the most delicate part, and it’s where it is nice to have a co-director or small committee if possible so that you can talk out loud and bounce ideas off each other.

For this stage I like to have a table with the numerical ratings as a summary of what’s available. The way you want to do this is up to you, but some bits from my workflow:

  • My table is color-coded, and it’s sorted in five different ways: by problem number, by quality rating, by difficulty rating, by subject then quality rating, by subject then difficulty rating.
  • For the quality rating, I use the weights -0.75, -0.5, 0, 1, 1.5 for Unsuitable, Mediocre, Acceptable, Nice, Excellent. This fairly contrived set of weights was chosen based on some experience in which I wanted the average ratings to satisfy a couple of properties: I wanted the sign of the rating (- or +) to match my gut feeling, and I wanted the rating to not be too sensitive to a few Unsuitable / Excellent ratings (either extreme). This weighting puts a “cliff” between Acceptable and Nice, which empirically seems to be the right place to draw the sharpest distinction.
  • I like to include a short “name” in the table to help with remembering which problem numbers are which, e.g. “2017-vtx graph”.
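As a toy illustration of that weighting (the weights are the ones from the bullet above; the vote lists are made up):

```python
# Quality weights: note the "cliff" between Acceptable (0) and Nice (1).
WEIGHTS = {"Unsuitable": -0.75, "Mediocre": -0.5, "Acceptable": 0,
           "Nice": 1, "Excellent": 1.5}

def quality_average(votes):
    """Weighted average quality rating for one problem."""
    return sum(WEIGHTS[v] for v in votes) / len(votes)

# A mostly-Acceptable problem hovers near zero:
print(quality_average(["Acceptable", "Acceptable", "Mediocre", "Nice"]))   # 0.125
# A couple of Nice/Excellent votes pull the sign clearly positive,
# even with one stray Unsuitable vote:
print(quality_average(["Nice", "Excellent", "Acceptable", "Unsuitable"]))  # 0.4375
```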

An example of what a table might look like is given in the image below.

Here is an example output made with fake data. The Python script used to generate the table is included here, for anyone who wants to use it.

WT_U = -0.75
WT_M = -0.5
WT_A = 0
WT_N = 1
WT_E = 1.5

# ---- Populate with convincing looking random data ----
import random
slugs = {
		"A-01" : r"$\theta \colon \mathbb Z[x] \to \mathbb Z$",
		"A-02" : r"$\sqrt[3]{\frac{a}{b+7}}$",
		"A-03" : r"$a^a b^b c^c$",
		"C-04" : r"$a+2b+\dots+32c$",
		"C-05" : r"$2017$-vtx dinner",
		"G-06" : r"$ST$ orthog",
		"G-07" : r"$PO \perp YZ$",
		"G-08" : r"Area $5/2$",
		"G-09" : r"$XD \cap AM$ on $\Gamma$",
		"G-10" : r"$\angle PQE, \angle PQF = 90^{\circ}$",
		"N-11" : r"$5^n$ has six zeros",
		"N-12" : r"$n^2 \mid b^n+1$",
		"N-13" : r"$fff$ cycles",

qualities = {}
difficulties = {}
for k in slugs.keys():
	# just somehow throw stuff at wall to get counts
	a,b,c,d,e,f = [random.randrange(0,3) for _ in range(6)]
	if c >= 1: a = 0
	if a >= 2: d,e = 1,0
	if e == 0: f = 0
	if a == 0 and b == 0: e *= 2
	qualities[k] = [WT_U] * a + [WT_M] * b + [WT_A] * (b+d+e) + [WT_N] * (c+d+e) + [WT_E] * (c+e+f)

for k in slugs.keys():
	# just somehow throw stuff at wall to get counts
	a,b,c,d,e = [random.randrange(0,5) for _ in range(5)]
	if e >= 4:
		b = 0
		c //= 2
	elif e >= 3:
		a = 0
		b //= 2
	if a >= 3:
		e = 0
		d //= 3
	elif a >= 2:
		e = 0
		d //= 2
	difficulties[k] = [1] * a + [1.5] * b + [2] * c + [2.5] * d + [3] * e

# ---- End random data population ----

import statistics
def avg(S):
	return statistics.mean(S) if len(S) > 0 else None
def median(S):
	return statistics.median(S) if len(S) > 0 else None

# criteria for inclusion on chart
criteria = lambda k: True

def get_color_string(x, scale_min, scale_max, color_min, color_max):
	if x is None:
		return r"\rowcolor{gray}"
	m = (scale_max+scale_min)/2
	a = min(int(100 * 2 * abs(x-m) / (scale_max-scale_min)), 100)
	color = color_min if x < m else color_max
	return r"\rowcolor{%s!%d}" %(color, a) + "\n"

def get_label(key, slugged=False):
	if slugged:
		return r"{\scriptsize \textbf{%s} %s}" %(key, slugs.get(key, ''))
	return r"{\scriptsize \textbf{%s}}" % key

## Quality rating
def get_quality_row(key, data, slugged = True):
	a = avg(data)
	s = ("$%+4.2f$" % a) if a is not None else "---"
	color_tex = get_color_string(a, WT_U, WT_E, "Salmon", "green")
	row_tex = r"%s & %d & %d & %d & %d & %d & %s \\" \
			% (get_label(key, slugged),
	return color_tex + row_tex
def print_quality_table(d, sort_key = None, slugged = True):
	items = sorted(d.items(), key = sort_key)
	print(r"\begin{tabular}{l ccccc c}")
	print(r"\toprule Prob & U & M & A & N & E & Avg \\ \midrule")
	for key, data in items:
		print(get_quality_row(key, data, slugged))
	print(r"\bottomrule \end{tabular}")

## Difficulty rating
def get_difficulty_row(key, data, slugged = False):
	a = avg(data)
	s = ("$%.3f$" % a) if a is not None else "---"
	color_tex = get_color_string(a, 1, 3, "cyan", "orange")
	row_tex = r"%s & %d & %d & %d & %d & %d & %s \\" \
			% (get_label(key, slugged),
	return color_tex + row_tex
def print_difficulty_table(d, sort_key = None, slugged = False):
	items = sorted(d.items(), key = sort_key)
	print(r"\begin{tabular}{l ccccc c}")
	print(r"\toprule Prob & 1 & 1.5 & 2 & 2.5 & 3 & Avg \\ \midrule")
	for key, data in items:
		print(get_difficulty_row(key, data, slugged))
	print(r"\bottomrule \end{tabular}")

filtered_qualities = {k:v \
		for k,v in qualities.items() if criteria(k)}
filtered_difficulties = {k:v \
		for k,v in difficulties.items() if criteria(k)}

def print_everything(name, fn = None, flip_slug = False):
	if fn is not None:
		sort_key = lambda item: fn(item[0])
	else:
		sort_key = None
	print(r"\section{" + name + "}")
	if flip_slug:
		print_quality_table(filtered_qualities, sort_key, False)
		print_difficulty_table(filtered_difficulties, sort_key, True)
	else:
		print_quality_table(filtered_qualities, sort_key, True)
		print_difficulty_table(filtered_difficulties, sort_key, False)

# Start outputting content

print(r"\title{Example of ratings table with randomly generated data}")
print(r"\maketitle")

print_everything("All ratings")

print("\n" + r"\newpage" + "\n")
print_everything("Beauty contest, by overall popularity",
		lambda p : (-avg(qualities[p]), p), False)
print_everything("Beauty contest, by subject and popularity",
		lambda p : (p[0], -avg(qualities[p]), p), False)
print("\n" + r"\newpage" + "\n")
print_everything("Beauty contest, by overall difficulty",
		lambda p : (-avg(difficulties[p]), p), True)
print_everything("Beauty contest, by subject and difficulty",
		lambda p : (p[0], -avg(difficulties[p]), p), True)

print(r"\section{Scatter plot}")
print(r"""\begin{axis}[width=0.9\textwidth, height=22cm, grid=both,
	xlabel={Average difficulty}, ylabel={Average suitability},
	every node near coord/.append style={font=\scriptsize},
print(r"""\addplot [scatter,
	only marks, point meta=explicit symbolic,
	nodes near coords*={\prob},
	visualization depends on={value \thisrow{prob} \as \prob}]""")
print(r"table [meta=subj] {")
for p in qualities.keys():
	x = avg(difficulties[p])
	y = avg(qualities[p])
	if x is None or y is None:
	print("%0.2f\t%0.2f\t%s\t%s" %(x,y,p[2:],p[0]))


Of course, obligatory warning to not overly rely on the numerical ratings and to put heavy weight on the text comments provided. (The numerical ratings will often have a lot of variance, anyways.)

One thing to keep in mind when choosing the problems is that the two most obvious goals are basically orthogonal. One goal is to have the most attractive problems (“art”), but the other is to have an exam which is balanced across difficulty and subject composition (“science”). These two goals will often compete with each other and you’ll have to make judgment calls to prioritize one over the other.

A final piece of advice is to not be too pedantic. For example, I personally dislike the so-called “Geoff rule” that 1/2/4/5 should be distinct subjects: I find that it is often too restrictive in practice. I also support using “fractional distributions” in which say a problem can be 75% number theory and 25% combinatorics (rather than all-or-nothing) when trying to determine how to balance the exam. This leads to better, more nuanced judgments than insisting on four categories.
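A hedged sketch of how such fractional classifications could be tallied when sanity-checking the balance of a draft (all problem names and fractions here are invented for illustration):

```python
from collections import Counter

# Hypothetical fractional subject classifications for a six-problem draft.
draft = {
    "P1": {"N": 1.0},
    "P2": {"C": 1.0},
    "P3": {"G": 1.0},
    "P4": {"N": 0.75, "C": 0.25},  # 75% number theory, 25% combinatorics
    "P5": {"A": 1.0},
    "P6": {"C": 0.5, "A": 0.5},
}

# Counter.update accepts a mapping and adds the values per key.
totals = Counter()
for fractions in draft.values():
    totals.update(fractions)

print(dict(totals))  # {'N': 1.75, 'C': 1.75, 'G': 1.0, 'A': 1.5}
```

With all-or-nothing labels, P4 and P6 would each be forced into a single bucket, and the apparent balance of the exam would shift accordingly.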

This is also the time to make any last edits you want to the problems, again both copy edits or more substantial edits. This gives you a penultimate draft of the exam.

Test solving

If you can, a good last quality check is to have a round of test-solving from an unbiased group of additional volunteers who haven’t already seen the packet. (For the volunteers, this is a smaller time commitment than reviewing an entire packet, so it’s often feasible as an intermediate commitment.) You ask this last round of volunteers to try out the problems under exam-like conditions, although I find it’s not strictly necessary to insist on a full 4.5 hours or complete write-ups if relaxing this gets you more volunteers. A nice number of test-solvers is 5-10 people.

Typically this test-solving is most useful as a sanity check (e.g. to make sure the test is not obviously too difficult) and for any last minute shuffling of the problems (which often happens). I don’t advise making drastic changes at this point. It’s good as a way to get feedback on the most tricky decisions, though.


Finalizing and wrap-up

After any final edits, I recommend sending a copy of the edited problems and solutions to the reviewers and test-solvers. They’re probably interested to know what problems made the cut, and you want to have eyes going through the final paper to check for ambiguities or errors.

I usually take the time to also send out some details of the selection itself: what the ratings for the problems looked like, often a sentence or two for each problem about the overall feedback, and documentation of my thought process in the draft selection. It’s good to give people feedback on their problems; in my experience the authors usually appreciate it a lot, especially if they decide to re-submit the problem elsewhere.

And that’s the process.

USA Special Team Selection Test Series for IMO 2021

A lot of people have been asking me how team selection is going to work for the USA this year. This information was sent out to the contestants a while ago, but I understand that there’s a lot of people outside of MOP 2020 who are interested in seeing the TST problems :) so this is a quick overview of how things are going down this year.

This year there are six tests leading to the IMO 2021 team:

  • USA TSTST Day 1: November 12, 2020 (3 problems, 4.5 hours)
  • USA TSTST Day 2: December 10, 2020 (3 problems, 4.5 hours)
  • USA TSTST Day 3: January 21, 2021 (3 problems, 4.5 hours)
  • RMM Day 1: February 2021 (3 problems, 4.5 hours)
  • APMO: March 2021 (5 problems, 4 hours)
  • USAMO: April 2021 (2 days, each with 3 problems and 4.5 hours)

Everyone who was at the virtual MOP in June 2020 is invited to all three days of TSTST, and then the top scores get to take the latter three exams as team selection tests for the IMO. Meanwhile, the RMM teams and EGMO teams are based on just the three days of TSTST.

Similar to past years, discussion of TSTST is allowed starting at noon Eastern time on the Monday after each day. That means you can look forward to the first set of three new problems coming out on Monday, November 16, and similarly for the other two days of TSTST.

To add to the hype, I’ll be doing a short one-hour-or-less Twitch stream at 8:00pm ET on Tuesday November 17 where I present the solutions to the TSTST problems of day 1.  If there’s demand, I’ll probably run a review session for the other two days of TSTST, as well.

EDIT: Changed stream time to Tuesday so more people have time to try the problems.

On choosing exercises

Finally, if you attempt to read this without working through a significant number of exercises (see §0.0.1), I will come to your house and pummel you with [Gr-EGA] until you beg for mercy. It is important to not just have a vague sense of what is true, but to be able to actually get your hands dirty. As Mark Kisin has said, “You can wave your hands all you want, but it still won’t make you fly.”

— Ravi Vakil, The Rising Sea: Foundations of Algebraic Geometry

When people learn new areas in higher math, they are usually required to do some exercises. I think no one really disputes this: you have to actually do math to make any progress.

However, from the teacher’s side, I want to make the case that there is some art to picking exercises, too. In the process of writing my Napkin, as well as taking way too many math classes, I began to see some patterns in which exercises or problems I tended to add to the Napkin, or which exercises I found helpful when learning myself. So, I want to explicitly record some of these thoughts here.

1. How not to do it

So in my usual cynicism I’ll start by saying what I think people typically do, and why I don’t think it works well. As far as I can tell, the criteria used in most classes are:

  1. The student is reasonably able to (at least in theory) eventually solve it.
  2. A student with a solid understanding of the material should be able to do it.
  3. (Optional) The result itself is worth knowing.

All of these criteria are good. My problem is that I don’t think they are sufficient.

To explain why, let me give a concrete example of something that is definitely assigned in many measure theory classes.

Okay example (completion of a measure space). Let {(X, \mathcal A, \mu)} be a measure space. Let {\overline{\mathcal A}} denote all subsets of {X} which are the union of a set in {\mathcal A} and a null set. Show that {\overline{\mathcal A}} is a sigma-algebra and that there is a unique extension of the measure {\mu} to it.
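(For concreteness, the extension in question is forced: writing a set in {\overline{\mathcal A}} as {A \cup N} with {A \in \mathcal A} and {N} null, the only possible definition is

```latex
\overline{\mu}(A \cup N) = \mu(A),
```

and the substance of the exercise is verifying that this is well-defined, i.e. independent of the chosen decomposition, and countably additive.)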

I can see why it’s tempting to give this as an exercise. It is a very fundamental result that the student should know. The proof is not too difficult, and the student will understand it better if they do it themselves than if they passively read it. And, if a student really understands measures well, they should find the exercise quite straightforward. For this reason I think this is an okay choice.

But I think we can do better.

In many classes I’ve taken, nearly all the exercises looked like this one. I think when you do this, there are a couple blind spots that sometimes get missed:

  • There’s a difference between “things you should be able to do after learning Z well” and “things you should be able to do when first learning Z“. I would argue that the above example is in the former category, but not the latter one — if a student is learning about measures for the first time, my first priority would be to make sure they get a good conceptual understanding first, and in particular can understand why the statement should be true. Then we can worry about actually proving it.
  • Assigning an exercise which checks if you understand X is not the same as actually teaching it. Okay exercises can verify if you understand something, great exercises will actively help you understand it.

2. An example that I found enlightening

In contrast, this year I was given an exercise which I thought was so instructive that I’ll post it here. It comes from algebraic geometry.

Exercise: The punctured gyrotop is the open subset {U} of {X = \mathrm{Spec} \mathbb C[x,y,z] / (xz, yz)} obtained by deleting the origin {(x,y,z)} from {X}. Compute {\mathcal O_X(U)}.

It was after I did this exercise that I finally felt like I understood why distinguished open sets are so important when defining an affine scheme. For that matter, it finally clicked why sheaves on a base are worth caring about.

I had read lots and lots of words and pushed symbols around all day. I had even proved, on paper already, that {\mathcal O(U \sqcup V) = \mathcal O(U) \times \mathcal O(V)}. But I never really felt it. This exercise changed that for me, because suddenly I had an example in front of me that I could actually see.
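For the curious, here is my reconstruction of the computation (spoilers for the exercise above; this sketch is mine, not the source's official solution). The scheme {X} is the plane {z = 0} glued to the line {x = y = 0} at the origin, so deleting the origin disconnects {U} into a punctured plane and a punctured line, and the sheaf axiom for disjoint unions gives

```latex
\mathcal O_X(U)
  \;\cong\; \mathcal O\!\left(\mathbb A^2 \setminus \{0\}\right)
    \times \mathcal O\!\left(\mathbb A^1 \setminus \{0\}\right)
  \;\cong\; \mathbb C[x,y] \times \mathbb C[z, z^{-1}].
```

The first factor is all of {\mathbb C[x,y]} because a regular function on the plane minus a point extends over the puncture (the missing point has codimension two), while on the punctured line {1/z} becomes a regular function.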

3. Some suggested additional criteria

So here are a few suggested guidelines which I think can help pick exercises like that one.

A. They should be as concrete as possible.

This is me yelling at people to use more examples, once again. But I think having students work through examples as an exercise is just as important as (if not more important than) reading them aloud in lecture.

One other benefit of using concrete examples is that you can avoid the risk of students solving the exercise by “symbol pushing”. I think many of us know the feeling of solving some textbook exercise by just unwinding a definition and doing a manipulation, or black-boxing some theorem and blindly applying it. In this way one ends up with correct but unenlightening proofs. The issue is that nothing written down resonates with System 1, and so the result doesn’t get internalized.

When you give a concrete exercise with a specific group/scheme/whatever, there is much less chance of something like that happening. You almost have to see the example in order to work with it. I really think internalizing theorems and definitions is better done in this concrete way, rather than through more abstract or general manipulations.

B. They should be enjoyable.

Math majors are humans too. If a whole page of exercises looks boring, students are less likely to do them.

This is one place where I think people could really learn from the math contest community. When designing exams like IMO or USAMO, people fight over which problems they think are the prettiest. The nicest and most instructive exam problems are passed down from generation to generation like prized heirlooms. (Conveniently, the problems are even named, e.g. “IMO 2008/3”, which I privately think helps a ton; it gives the problems a name and face. The most enthusiastic students will often be able to recall where a good problem was from if shown the statement again.) Imagine if the average textbook exercises had even a tenth of that enthusiasm put into crafting them.

Incidentally, I think being concrete helps a lot with this. Part of the reason I enjoyed the punctured gyrotop so much was that I could immediately draw a picture of it, and I had a sense that I should be able to compute the answer, even though I wasn’t experienced enough yet to see what it was. So it was as if the exercise was leading me on the whole way.

For an example of how not to do it, here’s what I think my geometry book would look like if done wrong.

C. They should not be too tricky.

People are always dumber than you think when they first learn a subject; things which should be obvious often are not. So difficulty should be used in moderation: if you assign a hard exercise, you should assume by default that the student will not solve it, so there had better be a good reason for adding that extra frustration.

I should at this point also mention some advice most people won’t be able to take (because it is so time-consuming): I think it’s valuable to write full solutions for students, especially on difficult problems. When someone is learning something for the first time, that is the most important time for the students to be able to read the full details of solutions, precisely because they are not yet able to do it themselves.

In math contests, the ideal feedback cycle is something like: a student works on a problem P, makes some progress (possibly solving it), then they look at the solution and see what they were missing or where they could have cleaned up their solution or what they could have done differently, et cetera. This lets them update their intuition or toolkit before going on. If you cut out this last step by not providing solutions, you lose the only real chance you had to give feedback to the student.

4. Memorability

I have, on more occasions than I’m willing to admit, run into the following situation. I solve some exercise in a textbook. Sometime later, I am reading about some other result and need an intermediate claim, which looks like it could be true, but I don’t know how to prove it immediately. So I look it up, and it turns out to be the exercise I did (and then I have to re-do the exercise, because I didn’t write up the solution).

I think you can argue that if you don’t even recognize the statement later, you didn’t learn anything from it. So I think the following is a good summarizing test: how likely is the student to actually remember it later?

USEMO sign-ups are open

I’m happy to announce that sign-ups for my new olympiad style contest, the United States Ersatz Math Olympiad (USEMO), are open now! The webpage for the USEMO is (where sign-ups are posted).

The US Ersatz Math Olympiad is a proof-based competition open to all US middle and high school students. Like many competitions, its goal is to develop interest and ability in mathematics, rather than to measure them. However, it is one of the few proof-based contests open to all such students. You can read more about the goals of this contest in the mission statement.

The contest will run over Memorial Day weekend:

  • Day 1 is Saturday, May 23, 2020, from 12:30pm to 5:00pm ET.
  • Day 2 is Sunday, May 24, 2020, from 12:30pm to 5:00pm ET.

In the future, assuming continued interest, I hope to make the USEMO into an annual tradition run in the fall.

Circular optimization

This post will mostly be focused on construction-type problems in which you’re asked to construct something satisfying property {P}.

Minor spoilers for USAMO 2011/4, IMO 2014/5.

1. What is a leap of faith?

Usually, a good thing to do whenever you can is to make “safe moves” which are implied by the property {P}. Here’s a simple example.

Example 1 (USAMO 2011)

Find an integer {n} such that the remainder when {2^n} is divided by {n} is odd.

It is easy to see, for example, that {n} itself must be odd for this to be true, and so we can make our life easier without incurring any worries by restricting our search to odd {n}. You might therefore call this an “optimization”: a kind of move that makes the problem easier, essentially for free.

But oftentimes such “safe moves” are not enough to solve the problem, and you eventually have to make “leap-of-faith moves”. For example, maybe in the above problem, we might try to focus our attention on numbers {n = pq} for primes {p} and {q}. This does make our life easier, because we’ve zoomed in on a special type of {n} which is easy to compute. But it runs the risk that maybe there is no such example of {n}, or that the smallest one is difficult to find.
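This isn't part of the original problem, but a quick brute-force search (the code and names are mine) confirms that such {n} exist, and illustrates why restricting to odd {n} is a safe move:

```python
def remainder_is_odd(n: int) -> bool:
    """Check whether the remainder of 2^n divided by n is odd."""
    return pow(2, n, n) % 2 == 1

# Safe move: if n is even, then 2^n and every multiple of n are even,
# so the remainder 2^n mod n is even too.  Hence we scan only odd n.
smallest = next(n for n in range(3, 10**4, 2) if remainder_is_odd(n))
print(smallest)  # → 25  (indeed, 2^25 mod 25 = 7)
```

Notice the answer is not of the form {pq} for distinct primes, so the leap-of-faith restriction discussed above would in fact have hurt here.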

2. Circular reasoning can sometimes save the day

However, a strange type of circular reasoning can sometimes happen, in which a move that would otherwise be a leap-of-faith is actually known to be safe because you also know that the problem statement you are trying to prove is true. I can hardly do better than to give the most famous example:

Example 2 (IMO 2014)

For every positive integer {n}, the Bank of Cape Town issues coins of denomination {\frac 1n}. Given a finite collection of such coins (of not necessarily different denominations) with total value at most {99 + \frac12}, prove that it is possible to split this collection into {100} or fewer groups, such that each group has total value at most {1}.

Let’s say in this problem we find ourselves holding two coins of weight {1/6}. Perhaps we wish to put these coins in the same group, so that we have one less decision to make. However, this could rightly be viewed as a “leap-of-faith”, because there’s no logical reason why the task must remain possible after making this first move.

Except there is a non-logical reason: this is the same as trading the two coins of weight {1/6} for a single coin of weight {1/3}. Why is the task still possible? Because the problem says so: the very problem we are trying to solve includes this case, too. If the problem is going to be true, then it had better be true after we make this trade.

Thus by a perverse circular reasoning we can rest assured that our leap-of-faith here will not come back to bite us. (And in fact, this optimization is a major step of the solution.)
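To make the trade concrete, here's a small backtracking sketch (my code, not part of the solution) on a scaled-down instance: total value at most {k - \frac12}, split into {k} groups of value at most {1}. Merging the two {1/6} coins into a {1/3} coin leaves the instance solvable, and any grouping of the merged instance unpacks into one for the original by splitting the {1/3} coin back in place:

```python
from fractions import Fraction

def splittable(coins, k):
    """Backtracking: can `coins` (a list of Fractions) be split into
    at most k groups, each with total value at most 1?"""
    coins = sorted(coins, reverse=True)  # try large coins first
    groups = [Fraction(0)] * k

    def place(i):
        if i == len(coins):
            return True
        tried = set()
        for g in range(k):
            if groups[g] in tried:  # skip symmetric group states
                continue
            tried.add(groups[g])
            if groups[g] + coins[i] <= 1:
                groups[g] += coins[i]
                if place(i + 1):
                    return True
                groups[g] -= coins[i]
        return False

    return place(0)

# Scaled-down instance: total 17/12 <= 2 - 1/2, to split into k = 2 groups.
original = [Fraction(1, d) for d in (2, 3, 4, 6, 6)]
merged = [Fraction(1, d) for d in (2, 3, 4)] + [Fraction(1, 3)]  # 1/6 + 1/6 -> 1/3
print(splittable(original, 2), splittable(merged, 2))  # → True True
```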

3. More examples of circular optimization

Here are some more examples of problems you can try that I think have a similar idea.

Problem 1

Prove that in any connected graph {G} on {2004} vertices one can delete some edges to obtain a graph (also with {2004} vertices) whose degrees are all odd.
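This doesn't spoil the proof, but a brute-force sanity check (my code) on a small stand-in for {2004} may be helpful. Note the vertex count must be even, since every graph has an even number of odd-degree vertices:

```python
from itertools import combinations

def all_odd_subgraph_exists(n, edges):
    """Brute force over edge subsets: can we keep a subset of the edges
    so that every one of the n vertices ends up with odd degree?"""
    for r in range(len(edges) + 1):
        for kept in combinations(edges, r):
            deg = [0] * n
            for u, v in kept:
                deg[u] += 1
                deg[v] += 1
            if all(d % 2 == 1 for d in deg):
                return True
    return False

# Path on 4 vertices: keeping just the two end edges gives degrees 1,1,1,1.
path4 = [(0, 1), (1, 2), (2, 3)]
print(all_odd_subgraph_exists(4, path4))  # → True
```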

Problem 2 (USA TST 2017)

In a sports league, each team uses a set of at most {t} signature colors. A set {S} of teams is color-identifiable if one can assign each team in {S} one of their signature colors, such that no team in {S} is assigned any signature color of a different team in {S}. For all positive integers {n} and {t}, determine the maximum integer {g(n,t)} such that: In any sports league with exactly {n} distinct colors present over all teams, one can always find a color-identifiable set of size at least {g(n,t)}.

Feel free to post more examples in the comments.

Meritocracy is the worst form of admissions except for all the other ones

I’m now going to say something explicitly that I hinted at in June: I don’t think a student deserves to make MOP more because they had a higher score than another student.

I think it’s easy to get this impression because the selection for MOP is done by score cutoffs. So it sure looks that way.

But I don’t think MOP admissions (or contests in general) are meant to be a form of judgment. My primary agenda is to run a summer program that is good for its participants, and we get funding for N of them. For that, it’s not important which N students make it, as long as they are enthusiastic and adequately prepared. (Admittedly, for a camp like MOP, “adequately prepared” is a tall order). If anything, what I would hope to select for is the people who would get the most out of attending. This is correlated with, but not exactly the same as, score.

Two corollaries:

  • I support the requirement of full attendance at MOP. I know it sucks for those star students who qualify for two conflicting programs and then have to choose. You have my apologies (and congratulations). But if you only come for 2 of the 3 weeks, you take away a spot from someone who would have attended the whole time.
  • I am grateful to the European Girls’ MO for giving MOP an opportunity to balance the gender ratio somewhat; empirically, it seems to improve the camp atmosphere if the gender ratio is not 79:1.

Anyways, given my mixed feelings on meritocracy, I sometimes wonder whether MOP should do what every other summer camp does and have an application, or even a lottery. I think the answer is no, but I’m not sure. Some reasons I can think of behind using score only:

  1. MOP does have a (secondary) goal of IMO training, and as a result the program is almost insane in difficulty. For this reason you really do need students with significant existing background and ability. I think very few summer camps should explicitly have this level of achievement as a goal, even secondarily. But I think there should be at least one such camp, and it seems to be MOP.
  2. Selection by score is transparent and fair. There is little risk of favoritism, nepotism, etc. This matters a lot to me because, basically no matter how much I try to convince them otherwise, people will take any admissions decision as some sort of judgment, so better make it impersonal. (More cynically, I honestly think if MOP switched to a less transparent admissions process, we would be dealing with lawsuits within 15 years.)
  3. For better or worse, qualifying for MOP ends up being sort of a reward, so I want to set the incentives right and put the goalpost at “do maximally well on USAMO”. I think we design the USAMO well enough that preparation teaches you valuable lessons (math and otherwise). For an example of how not to set the goalpost, take most college admissions processes.

Honestly, the core issue might really be cultural, rather than an admissions problem. I wish there was a way we could do the MOP selection as we do now without also implicitly sending the (unintentional and undesirable) message that we value students based on how highly they scored.

MOHS hardness scale

There’s a new addition to my olympiad problems and solutions archive: I created an index of many past IMO/USAMO/USA TST(ST) problems, rated by my opinion of their difficulty. You can grab the direct link to the file below:

In short, the scale runs from 0M to 50M in increments of 5M, and every USAMO / IMO problem on my archive now has a rating too.

My hope is that this can be useful in a couple ways. One is that I hope it’s a nice reference for students, so that they can better make choices about what practice problems would be most useful for them to work on. The other is that the hardness scale contains a very long discussion about how I judge the difficulty of problems. While this is my own personal opinion, obviously, I hope it might still be useful for coaches or at least interesting to read about.

As long as I’m here, I should express some concern that it’s possible this document does more harm than good, too. (I held off on posting this for a few months, but eventually decided to at least try it and see for myself, and just learn from it if it turns out to be a mistake.) I think there’s something special about solving your first IMO problem or USAMO problem or whatever and suddenly realizing that these problems are actually doable — I hope it would not be diminished by me rating the problem as 0M. Maybe more information isn’t always a good thing!