A Random Walk Through Four Letter Words

 

I got the idea from the longstanding four-letter word display. The original, developed in the early seventies, of necesity uses a very simple algorithm for generating random words.  With modern technology, I felt that I could do something a bit more sophisticated: a random walk through four letter words. 

I started out by downloading an on-line spelling dictionary, and extracting the four letter words.  I wound up with 4776 of them.  Next, I computed which words were adjacent to each other: two words are adjacent if they differ by one letter (snap to knap), or if they are the reverse of each other (paws to swap).  I make my random walk by choosing the next word to display randomly from set of words adjacent to the current one (except that I never choose the next word to be the same one as the previous word). 

One problem is that I have no guarantee that a random walk over these words would be "interesting".  I might start at a word that has no other adjacent words at all.  So, I analyzed the set of words and their adjacencies to find an "interesting subset".  The set of words and their "adjacencies" is an example of a mathematical construct called a "graph".  Therefore I could use graph theory ideas and programs (i.e., the Graph module for Perl) to do my analysis.

First, I eliminated all words which were not adjacent to at least 2 other words.  I repeated this eight times: there were some dead-end paths in the word graph consisting of a string of eight words.  Each time I pruned the word set, I removed the last word in the dead-end path.  I was left with 4439 words.

Next, I wanted to ensure that I could eventually move from any one word to any other -- that is, that the graph is connected.  I found two connected components, one with 4436 words, and one consisting of (tyum, tium, tuum).  Still, I wanted to ensure that the random walk was nicely random, that the graph has "good mixing".  There are a lot of ways to analyze this, but I settled on something simple: there should be no word that seperates two halves of the graph.  That is, I wanted a single "bi-connected component".  I found that there is a big biconnected component with 4413 words, and nine smaller components.  I used the 4413 word component for the walk.

The displays are alphanumeric LED arrays.  I only need to write the character into a register on the module to display it.  I used an Atmel ATMega32 microcontroller to un the program.  There is one problem, though.  While the ATMega32 has 32 Kbytes of flash storage, the word adjacency information requires nearly 128 Kbytes of storage.  I use a 1 Mbit flash eeprom (the small chip) to store the adjacency lists.

New words enter from the top and scroll down, about once every 2 seconds. 

Here are some statistics about the word adjacencies.  The distribution of the number of adjacencies are:

# adjacent 2 3 4 5 6 7 8 9
# words 219 225 253 257 247 278 244 240
# adjacent 10 11 12 13 14 15 16 17
# words 243 233 220 208 187 175 175 164
#adjacent 18 19 20 21 22 23 24 25
#words 167 123 111 109 79 67 57 32
#adjacent 26 27 28 29 30 31 32 33
#words 25 28 19 10 7 3 6 2

The most connected words are

# adjacent word
33 mins
33 bons
32 pale
32 line
32 bots
32 mine
32 cone
32 pins
31 mare
31 Male
31 lins
30 bare
30 Mars
30 care
30 bart
30 bars
30 core
30 ware