A post on DataGenetics did the rounds last week, applying the might of statistical analysis to the game Hangman to try and guess what an optimal strategy might be. Many techniques were leveled at the problem, from basic analysis of letter frequencies to conditional probability, all in order to try and generate the best sequence you should call the letters.

Having read it I was slightly perplexed, it seemed like massive overkill for something that can be calculated fairly simply, so I created pyngman, a python script that generates optimal next guesses for Hangman. Input the state of the game and the letters you’ve called and it will tell you what letter to call next.

You supply the information as a state, such as ..e.., where .’s are unknown letters, followed by a list of letters you’ve tried:

$ pyngman -state ..e.. est
> Your best next guess is: a

It does this by using a dictionary (you must supply the dictionary, so the results will change depending on what you supply!), and looking at all possible words that could be the solution, and working out the letter with the highest probability of being present. So far I have been unable to find a word which causes the program to lose a game of hangman!

Grab pyngman from github and have a go yourself

Possible Improvements

It occurs to me that pyngman, while good, is not technically optimal. In hangman, you’re not trying to win, you’re trying not to lose. Therefore, you can do better than picking the letter which has the highest probability of being in the solution, you can pick the letter which minimizes the highest number of guesses required to reach any potential solution. This requires a bit more computation, but is not a technically challenging thing to implement.

One slight improvement (which I may include myself if I have time) is to adjust how letter frequencies are calculated. Currently every instance of every letter is counted, but you would only want to count any letter once per word to get the correct solution. The difference this makes to the supplied letters is probably minimal though.