A Random Walk Through Four Letter Words

I got the idea from the longstanding four-letter word display. The original, developed in the early seventies, of necesity uses a very simple algorithm for generating random words.  With modern technology, I felt that I could do something a bit more sophisticated: a random walk through four letter words.

I started out by downloading an on-line spelling dictionary, and extracting the four letter words.  I wound up with 4776 of them.  Next, I computed which words were adjacent to each other: two words are adjacent if they differ by one letter (snap to knap), or if they are the reverse of each other (paws to swap).  I make my random walk by choosing the next word to display randomly from set of words adjacent to the current one (except that I never choose the next word to be the same one as the previous word).

One problem is that I have no guarantee that a random walk over these words would be "interesting".  I might start at a word that has no other adjacent words at all.  So, I analyzed the set of words and their adjacencies to find an "interesting subset".  The set of words and their "adjacencies" is an example of a mathematical construct called a "graph".  Therefore I could use graph theory ideas and programs (i.e., the Graph module for Perl) to do my analysis.

First, I eliminated all words which were not adjacent to at least 2 other words.  I repeated this eight times: there were some dead-end paths in the word graph consisting of a string of eight words.  Each time I pruned the word set, I removed the last word in the dead-end path.  I was left with 4439 words.

Next, I wanted to ensure that I could eventually move from any one word to any other -- that is, that the graph is connected.  I found two connected components, one with 4436 words, and one consisting of (tyum, tium, tuum).  Still, I wanted to ensure that the random walk was nicely random, that the graph has "good mixing".  There are a lot of ways to analyze this, but I settled on something simple: there should be no word that seperates two halves of the graph.  That is, I wanted a single "bi-connected component".  I found that there is a big biconnected component with 4413 words, and nine smaller components.  I used the 4413 word component for the walk.

The displays are alphanumeric LED arrays.  I only need to write the character into a register on the module to display it.  I used an Atmel ATMega32 microcontroller to un the program.  There is one problem, though.  While the ATMega32 has 32 Kbytes of flash storage, the word adjacency information requires nearly 128 Kbytes of storage.  I use a 1 Mbit flash eeprom (the small chip) to store the adjacency lists.

New words enter from the top and scroll down, about once every 2 seconds.

Here are some statistics about the word adjacencies.  The distribution of the number of adjacencies are:

 # adjacent 2 3 4 5 6 7 8 9 # words 219 225 253 257 247 278 244 240 # adjacent 10 11 12 13 14 15 16 17 # words 243 233 220 208 187 175 175 164 #adjacent 18 19 20 21 22 23 24 25 #words 167 123 111 109 79 67 57 32 #adjacent 26 27 28 29 30 31 32 33 #words 25 28 19 10 7 3 6 2

The most connected words are

 # adjacent word 33 mins 33 bons 32 pale 32 line 32 bots 32 mine 32 cone 32 pins 31 mare 31 Male 31 lins 30 bare 30 Mars 30 care 30 bart 30 bars 30 core 30 ware