Anagrams: An Investigation

10 Aug 2010
Progress: Completed

A while back I wrote my Anagram Finder. Its purpose was to find real words in a soup of random letters - perfect for cheating at Countdown. The first version was pretty inefficient and actually struggled to find words within the thirty seconds Countdown gives you, which was motivation to make the algorithm better. Once I'd gotten it down to less than ten seconds, though, I stopped.

Something was always at the back of my mind, though: would it be possible to pre-process every possible complete anagram?

This wouldn't be that difficult to do, so one day I set out to make a script that would analyse my word list and spit out every single whole-word anagram (sometimes called synanagrams, apparently) in the English language.

The first step was a script which split my dictionary into separate lists, one per word length. It output:

two letter words: 121
three letter words: 1,241
four letter words: 5,245
five letter words: 12,048
six letter words: 21,447
seven letter words: 32,048
eight letter words: 39,348
nine letter words: 38,377
ten letter words: 33,718
eleven letter words: 26,722
twelve letter words: 19,533
thirteen letter words: 13,384
fourteen letter words: 8,797
fifteen letter words: 5,454
sixteen letter words: 3,266
seventeen letter words: 1,816
eighteen letter words: 976
nineteen letter words: 512
twenty letter words: 246
twenty one letter words: 101
twenty two letter words: 41
twenty three letter words: 24
twenty four letter words: 16
twenty five letter words: 5
twenty six letter words: 1
twenty seven letter words: 2
twenty eight letter words: 2
twenty nine letter words: 2
thirty letter words: 1
thirty one letter words: 1
thirty two letter words: 1

total words in dictionary: 264,498
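For illustration, the splitting step could look something like the sketch below. This is not the original script: the function name `split_by_length` and the tiny sample list are made up for the example, and a real run would read the words from the dictionary file instead.

```python
from collections import defaultdict

def split_by_length(words):
    """Bucket words by their length: {length: [word, ...]}."""
    buckets = defaultdict(list)
    for word in words:
        word = word.strip().lower()
        if word:  # skip blank lines
            buckets[len(word)].append(word)
    return dict(buckets)

# Hypothetical sample standing in for the real word list.
sample = ["tea", "eat", "ate", "stop", "pots", "on", "no", "listen"]
buckets = split_by_length(sample)
for length in sorted(buckets):
    print(f"{length} letter words: {len(buckets[length])}")
```

Each bucket can then be processed on its own, which is what makes the later comparison step cheaper.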

The second step was to apply my little anagram routine to each word in a list, checking it against every other word in that list. Two words can only be anagrams if they have the same number of letters, which is why splitting the dictionary cuts the time down a bit, but it's still a huge task. The next time-saver is that once we've found an anagram we can remove both of those words from the list. This still leaves the problem of multiple anagrams, where a word has more than one partner. I couldn't think of a good way to deal with them at the time, so I just let the script run.
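Here is a rough sketch of what the pairwise check amounts to, next to a different technique that avoids the pairwise comparison altogether. Neither is the original routine: `is_anagram`, `pairwise_anagrams` and `grouped_anagrams` are names invented for this example. The second function swaps in the common "sorted letters as a key" trick, which also happens to handle multiple anagrams in a single pass.

```python
from collections import defaultdict

def is_anagram(a, b):
    """True if a and b are different words made of the same letters."""
    return a != b and len(a) == len(b) and sorted(a) == sorted(b)

def pairwise_anagrams(words):
    """Brute force: compare every word with every later word (quadratic)."""
    pairs = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if is_anagram(a, b):
                pairs.append((a, b))
    return pairs

def grouped_anagrams(words):
    """Alternative: group words by their sorted-letter signature."""
    groups = defaultdict(list)
    for word in words:
        groups["".join(sorted(word))].append(word)
    # Keep only signatures shared by at least two words.
    return [group for group in groups.values() if len(group) > 1]

words = ["tea", "eat", "ate", "dog"]
print(pairwise_anagrams(words))
print(grouped_anagrams(words))
```

The brute-force version does roughly n²/2 comparisons for a list of n words, which would go some way to explaining the hours of processing on lists of thirty-odd thousand words. The signature version touches each word once.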

Many hours of processing led to the third step: removing the duplicates that had appeared because some groups contain more than two anagrams. Another script got rid of these zippily.
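One way the clean-up could work, assuming the raw output is a list of word pairs with repeats and reversed copies in it (the original output format isn't shown, so this is a guess). The helper name `merge_pairs` is made up for the example.

```python
from collections import defaultdict

def merge_pairs(pairs):
    """Collapse overlapping or repeated anagram pairs into unique groups."""
    merged = defaultdict(set)
    for a, b in pairs:
        # Every word in a group shares the same sorted-letter signature,
        # so the signature works as a group identifier.
        merged["".join(sorted(a))].update((a, b))
    return sorted(sorted(group) for group in merged.values())

raw = [("tea", "eat"), ("eat", "tea"), ("tea", "ate"), ("stop", "pots")]
print(merge_pairs(raw))
```

Using a set per signature means a word that turns up in several pairs is only stored once, and the pairs belonging to a triplet end up in the same group.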

three letters: 728 anagrams in 320 groups
four letters: 3,122 anagrams in 1,187 groups
five letters: 6,169 anagrams in 2,327 groups
six letters: 8,970 anagrams in 3,578 groups
seven letters: 10,567 anagrams in 4,457 groups
eight letters: 8,983 anagrams in 3,968 groups
nine letters: 5,557 anagrams in 2,562 groups
ten letters: 2,802 anagrams in 1,336 groups
eleven letters: 1,277 anagrams in 616 groups
twelve letters: 610 anagrams in 296 groups
thirteen letters: 218 anagrams in 108 groups
fourteen letters: 111 anagrams in 55 groups
fifteen letters: 52 anagrams in 25 groups
sixteen letters: 36 anagrams in 18 groups
seventeen letters: 12 anagrams in 6 groups
eighteen letters: 10 anagrams in 5 groups
nineteen letters: 2 anagrams in 1 group

total: 49,226 anagrams in 20,865 groups

(The 'number of anagrams' is the actual number of words in the list, each of which has at least one anagram. The 'number of groups' is how many pairs/triplets/etc there are.)
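To make the two counts concrete, here is a small sketch that tallies them per word length from a list of groups. The function name `tally` and the sample groups are invented for the example.

```python
from collections import Counter

def tally(groups):
    """Return {word_length: (number_of_words, number_of_groups)}."""
    word_counts = Counter()
    group_counts = Counter()
    for group in groups:
        length = len(group[0])  # all words in a group have the same length
        word_counts[length] += len(group)
        group_counts[length] += 1
    return {n: (word_counts[n], group_counts[n]) for n in sorted(word_counts)}

# Hypothetical sample: two three-letter triplets and one four-letter pair.
groups = [["ate", "eat", "tea"], ["now", "own", "won"], ["pots", "stop"]]
for length, (words, n_groups) in tally(groups).items():
    print(f"{length} letters: {words} anagrams in {n_groups} groups")
```

So two triplets count as six anagrams in two groups, and a single pair as two anagrams in one group.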

Hoorah. Admittedly most of these were useless – alternative spellings of the same word, or slight differences you wouldn't care about. How can we find the most interesting of the lot?

My first thought was ranking them by how popular the words were, based on Google results for each word. Not difficult, but as it turned out not particularly helpful either. What I really needed was to find how different each side of the anagram was from the other, to find the cleverest or hardest-to-spot anagrams.

I tested several ways of doing this. Levenshtein distance, for instance, rated words like lookout and outlook as very different, which wasn't exactly helpful. Eventually, though, the script was printing them in a relatively useful order.
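For reference, here is a minimal sketch of Levenshtein distance, using the standard dynamic-programming formulation rather than whatever implementation the script actually used. It counts single-character insertions, deletions and substitutions. It has no idea that lookout and outlook are just "look" and "out" swapped round, so it sees a mismatch at nearly every position.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]

for a, b in [("lookout", "outlook"), ("kitten", "sitting")]:
    print(a, b, levenshtein(a, b))
```

A measure that only looks at individual characters will tend to score any block swap like this as a big change, even though a human spots the connection instantly.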

So here are the current lists.

Anagrams 3
Anagrams 4
Anagrams 5
Anagrams 6
Anagrams 7
Anagrams 8
Anagrams 9
Anagrams 10
Anagrams 11
Anagrams 12
Anagrams 13
Anagrams 14
Anagrams 15
Anagrams 16
Anagrams 17
Anagrams 18
Anagrams 19

There's probably more I should do with this, but I think I've had enough of anagrams for now.

You can check out the original anagram finder here and also the subword finder.