Laurent's notes

Mathematical optimization over Wordle decision trees

There are great write-ups on Wordle out there: statistical analyses, strategies, even solvers. However, almost all are heuristic. This means that while they provide reasonably good solutions, they don’t particularly care about mathematical optimality in a rigorous sense (this is sadly true in the popular press even when they use words like “best” or “optimal strategy”). As a result, simple questions remain unanswered: Is there a strategy that guarantees to find any one of the 12972 possible words within the 6 allowed guesses? Without resorting to luck, that is. In other words, does there exist a decision tree that lets you always win the game? If so, what about 5 guesses? 4? What is the minimum depth of such a decision tree?

The game looks innocent enough, so let’s code something quick. How hard can it be?

tl;dr

After 3500 lines of C code, a thousand hours of CPU time, and close to 50 hours of human time, it can be shown that:

Also:

Wordle

Wordle is an online puzzle game that’s become fairly popular in 2022. The objective of the player is to find an unknown 5-letter English word \(x\), which I will call here the target. At each turn, the player guesses a 5-letter word \(g\). Note that the letters of \(g\) must form an English word, not some random letter arrangement. In response, the game tells the player whether \(g = x\). In addition, if \(g \neq x\), it indicates, for every letter of \(g\), whether

Note that if \(g\) contains a duplicate letter, the number of green🟩/yellow🟨 squares for that letter will reflect the number of times it appears in \(x\). The game allows for 6 guesses. The player loses if all six guesses are wrong. Which words are “valid English words” is determined by a dictionary \(L\) of 12972 five-letter words, embedded in the JavaScript code of the game.

Decision trees

A decision tree acts as a fixed recipe that tells us exactly how to play at each turn of the game. In this case, it will have nodes labeled with player guesses, and branches labeled with game answers. I call an “answer” what the game responds to a guess, something like green-yellow-gray-gray-yellow 🟩🟨⬜⬜🟨, for example. Any guess could potentially yield an immediate win, so there is no necessity to store that outcome in the tree. A decision tree of depth \(d\) means that we need at most \(d + 1\) guesses to win the game. One uses a decision tree by playing the node labels as guesses, then following the branch corresponding to the game’s answer.

A decision tree of depth 2 corresponds to 3 guesses at most
A better decision tree of depth 1 (2 guesses at most)

Approach

A systematic approach to find an optimal decision tree for Wordle is to essentially enumerate over all possible decision trees. We can proceed as follows:

  1. We consider the set \(X \subseteq L\) of all the words that could be the target. When we start, the target \(x\) could be any word so \(X := L\).
  2. We enumerate all possible \(|L| = 12972\) guesses \(g \in L\) that we could take.
  3. Then, recursively for each guess, we enumerate all the possible answers \(a \in A\) from the game to that guess \(g\), where \(A\) is the set of all possible game answers. There are at most \(|A| = 3^5 = 243\) distinct answers (3 colors, 5 letters). For each such answer, we create a bucket \(Y_a\) containing all the target words \(x \in X\) that would yield that answer \(a\). Together, the buckets \(\{ Y_a : a \in A \}\) form a partition of \(X\).
  4. We then go to step 1, recursively, for each answer \(a\), but this time letting \(X := Y_a\).

When that is done, we can build an optimal decision tree by retaining only, after each “guess” level (step 2), the guess yielding the lowest recursion depth. On the illustrations below, only one guess is shown at each node, but all \(|L|\) of them have to be attempted, along with all their respective subtrees, before one is chosen. This includes, potentially, words that are not in their node’s \(Y_a\) set.

Tree of depth 2 with the corresponding X and Y_a sets
Tree of depth 1 with the corresponding X and Y_a sets

This method yields up to \(12972^{d+1} \times \left(3^5\right)^{d}\) iterations for each recursion depth \(d\). For \(d=5\) (i.e. 6 guesses), this yields \(4.04 \, \times \, 10^{36}\). At one iteration per cycle on a million 4GHz computers, that would take over \(2000 \times\) the estimated age of the universe. So we give up.

Of course not. “Age of the universe” arguments have their limitations. People routinely solve SATs, CPs, MIPs or TSPs with millions of variables, despite similarly alarming worst-case running times. There are many known tricks to make this type of enumeration work in practice. Let’s deploy the mathematical optimizer’s toolbox.

Pruning

If we are not interested in a decision tree of depth larger than \(\bar d\), large parts of the enumeration can be cut short. This happens, for example, once we found a tree of depth \(\bar d + 1\). This is called pruning, and it is the idea behind the “branch and bound” method. This happens not only at the root node, but all over the tree. At step 2, as soon as one of the guesses \(g \in L\) yields a subtree that disambiguates between the targets \(x \in X\), we have such an upper bound \(\bar d\).

Strong branching

Because of pruning, finding a decision tree of small depth early in the process can considerably reduce the size of the enumeration space. Therefore, it is advisable to attempt the most promising guesses first. A natural intuition is that a good guess \(g\) would partition \(X\) into many sets \(Y_a\) of small cardinality. This way, we hope that fewer subsequent guesses will be necessary to disambiguate between the members of each \(Y_a\) set. I chose to assign a score of \(\sum_a |Y_a|^2\) to each guess, sort the guesses by nondecreasing scores, and proceed to perform recursion (step 4) in that order.

What I do here in essence is using local information to select good candidate branches for depth-first search. This is similar to strong branching, an idea that dates back to the early TSP pioneers.

If we only want to quickly find a good decision tree (as opposed to finding the best possible decision tree, or proving that none exists), we can further accelerate the search by only recursing on the \(K\) most promising guesses, for some fixed constant \(K\). Of course, in that case, we lose the optimality guarantees, but \(K\) can be much smaller than 12972, yielding tremendous speedups.

Leaf logic

Say you are at depth \(\bar d\). You have one guess left, and it must be correct otherwise you lose (more specifically, because of pruning, this part of the enumeration tree would be cut off for being suboptimal). Obviously, if \(|X| \geq 2\), this is not going to work, and you can abandon this subtree.

Now if you are at depth \(\bar d - 1\), you have two guesses left. This means that your next guess must fully disambiguate between the remaining words in \(X\). In other words, we need \(|Y_a| \leq 1\) for all \(a\). Since \(\{ Y_1, \ldots, Y_{243} \}\) is a partition of \(X\), \(|Y_a| \leq 1\) will definitely not happen if \(|X| > 243\). Actually, we can already abandon the subtree if \(|X| > 212\), because we can show that none of the \(12972^2\) guess-target pairs ever yields more than 212 distinct answers.

This can be extended in various ways. If all of the words in \(X\) are identical in 4 of the 5 letter positions (e.g., sabed, safed, saned, sared, sated, saved, sawed, sayed), then we just need to find the remaining letter. However, with just one 5-letter guess, this can only possibly work if \(|X| \leq 6\).

Like with strong branching, we can trade speed for optimality guarantees, and heuristically limit \(|X|\) to a constant \(C\) at depth \(\bar d - 1\). Again, we lose the optimality guarantees, so we cannot do this when we prove that there exist no trees of a certain depth.

Lookup table

Implementation matters too. I wrote the enumeration in plain old C, which does not have a stellar reputation for the convenience of its string handling. That’s ok though, because except at the very beginning of the initialization, the code will never touch C strings.

The one operation that is costly – the bottleneck of the whole process – is to compute the game’s answer \(a \in A\) to a guess \(g \in L\), given a target \(x \in L\). However, since \(|L|\) is only 12972, we can precompute a \(12972 \times 12972\) lookup table, and this operation becomes as easy as a = table[g][x]. The answer \(a\) fits a uint8_t too, so the table is “just” 160 MB.

Even then, this operation remains a major bottleneck of the enumeration. Using my hperf tool to visualize the Linux perf trace, we see that constructing the \(Y_a\) partition of \(X\) takes over 95% of the running time. Digging into that hotspot reveals that over 60% of the overall running time is spent on a single assembly instruction, the one that fetches \(a\) from the lookup table!

Results

A decision tree of depth 6 (i.e., 7 guesses at most) is relatively easy to find: about 23 seconds of computing time. Depth 5 (6 guesses) is harder. It took close to 50 hours of CPU time, with \(\bar d = 5\), \(K = 200\), and a healthy bit of luck: The search space for the first guess was divided in 10 groups for parallel exploration, and the first guess of the 9th group, “spaer”, worked as a root node for a tree of depth 5. That decision tree has 14945 nodes.

As for lower bounds, it is easy to perform exhaustive search (i.e., \(K = |L|\) and \(C = |L|\)) at depth 2 (3 guesses), and prove that no tree of depth 2 exists: it takes about 23 seconds. For depth 3 (4 guesses), the search completed after over 95 hours, unsuccessfully as well. There is thus no tree of depth 3.

Unfortunately, depth 4 seems to be beyond the reach of my computational resources. It is thus still unknown whether all Wordle puzzles can be solved in 5 guesses. (Update 2022-01-25: This question is solved now, see below.)

A few caveats:

Update 2022-02-06: Indeed “spaer” is not the only starting word giving an optimal tree. Mark Fisher found a different depth-5 decision tree (it starts with “larnt”). His approach is much less brute-force than mine and as a result he has nice insights about which words are difficult to disambiguate, and how to tackle such word groups.

Update 2022-02-14: With the New York Times’ updated allowed-word list, the results are similar. I found a tree of depth 5 (6 guesses) in 51 CPU hours. Its starting word is “saber”, and its average number of guesses is 4.2361.

Variant: Cheat mode

If we look at the JavaScript code of the game, the dictionary \(L\) is technically built as the union \(L = L_x \cup L_+\) of two lists of words:

I think that it is more elegant to ignore this detail and consider that the target \(x\) is picked by the game from the entire list \(L\) containing 12972 words in total. Furthermore, if we know \(L_x\), then we may as well solve the game in a single guess by fetching the word corresponding to the current date. Still, we can imagine that we know \(L_x\) (but not its order) and check what happens in that case.

The only thing that changes is that now the list of targets is initialized with \(L_x\), i.e., \(X := L_x\) in step 1 instead of \(X := L\). Because \(|L_x|\) is much smaller than \(|L|\), the enumeration is much easier for this variant, and a decision tree of depth 4 (i.e., 5 guesses) can be found in under 5 seconds. This result is not new. Tyler Glaiel and Rémy Grünblatt both found such trees with clever heuristic approaches. Impressively, Alex Selby even gives trees that rigorously minimize the average number of guesses in this case. He also writes that there is no decision tree of depth 3 (4 guesses in the worst case). I can confirm the latter result: trees of depth 4 (5 guesses) are optimal in “cheat mode”.

Variant: Hard mode

Hidden in the game options, Wordle also features a “hard mode” in which “any revealed hints must be used in subsequent guesses”. Specifically, this means that

I thought that this variant would make the enumeration dramatically faster because at step 2, instead of enumerating all possible \(|L| = 12972\) guesses, we only need consider those that use revealed hints. Unfortunately, this is compensated by an explosion of the tree depth. So far, I have only found a tree of depth 13 (14 guesses) for hard mode. The reason for such a large depth can be understood by looking at the sequences of guesses for the target “wares”: “tares” (at this point we get ⬜🟩🟩🟩🟩 and all our subsequent guesses must finish in “ares”), then “bares”, “cares”, “dares”, “fares”, “gares”, “hares”, “lares”, “mares”, “nares”, “pares”, “rares”, “vares” and finally “wares”. Clearly, if we find many green 🟩 letters, subsequent guesses are then heavily constrained, and the tree devolves into an iterative enumeration. It is likely that the code could be improved to better avoid such situations.

Variant: Cheat+Hard mode

Even in “hard mode” though, knowing \(L_x\) can greatly help. In this mode, there is a tree of depth 4 (5 guesses), found in about 22h of CPU time. This is optimal (15 hours of CPU time).

Open questions

It should be noted that I use the word “proof” very loosely. When we find trees, we have an easy-to-check certificate: the decision tree itself. Instead, when a full enumeration (\(K = |L|\) and \(C = |L|\)) completes unsuccessfully, the “proof” that no tree exists really relies on me writing fully 100% certified bug-free code. So… yeah… 😬 Fortunately, this is a game, we don’t need to take this too seriously.

More excitingly, for the original game, it remains an open question whether a decision tree of depth 4 (i.e. 5 guesses) exists. Answering this question is tantalizingly close, though. As a rough estimate, I think it could be done in a few days with about $80,000 of EC2. If you have a billion CPU hours to spare, you can grab the code and try for yourself!

Update 2022-01-25: Just two days later, this is already settled, using cleverness as a substitute for a billion CPU hours. Alex Peattie shows that one cannot solve all Wordle puzzles in just 5 guesses. His proof is beautiful and self-contained. I summarize it here to the best of my ability. He first found 19 words that are identical except in their first letter: bills, fills, mills, … This first letter takes 19 distinct values \(L := \{ b, f, m, ... \}\). Then he makes the following observations:

  1. Select any two letters in \(L\), say \(b\) and \(f\). If we never play any guess containing these two letters, it is clear that we will never be able to distinguish between the corresponding words (bills and fills in this case). As a corollary, if ever the target word is one of those 19 words, in order to pinpoint it with certainty, we cannot have two such letters that are contained in none of our guesses. In other words, we need our guesses to have covered at least 18 of the 19 letters in \(L\).

  2. Now, to win in 5 guesses, we need to be sure about the target after the fourth guess. In other words, we need to cover at least 18 of the 19 letters in \(L\) with just 4 guesses (even though they total only 20 letters between them). This is a quasi-set-covering problem, which Alex first solves by hand with clever dictionary inspection, then verifies with a constraint programming solver (and I just checked this with a MIP formulation as well). In all cases, the conclusion is that there is no set of four words in the dictionary that satisfies our needs. It is thus impossible to solve every single one of the 12972 Wordle puzzle in 5 guesses or less.

    Update 2022-02-14 With the NYT’s updated word list, the proof still holds. Here is the updated MIP formulation.

I find it fascinating that a decision tree with 6 guesses is optimal. Wordle was clearly designed and finetuned with humans in mind – to great effect. And yet… the 6 allowed guesses are just enough to make it a fair game (it can always be solved), but not any bit easier than necessary, even for a computer.