A Random Walk Through Four Letter Words
I got the idea from the longstanding four-letter word display. The original, developed in the early seventies, of necesity uses a very simple algorithm for generating random words. With modern technology, I felt that I could do something a bit more sophisticated: a random walk through four letter words.
I started out by downloading an on-line spelling dictionary, and extracting the four letter words. I wound up with 4776 of them. Next, I computed which words were adjacent to each other: two words are adjacent if they differ by one letter (snap to knap), or if they are the reverse of each other (paws to swap). I make my random walk by choosing the next word to display randomly from set of words adjacent to the current one (except that I never choose the next word to be the same one as the previous word).
One problem is that I have no guarantee that a random walk over these words would be "interesting". I might start at a word that has no other adjacent words at all. So, I analyzed the set of words and their adjacencies to find an "interesting subset". The set of words and their "adjacencies" is an example of a mathematical construct called a "graph". Therefore I could use graph theory ideas and programs (i.e., the Graph module for Perl) to do my analysis.
First, I eliminated all words which were not adjacent to at least 2 other words. I repeated this eight times: there were some dead-end paths in the word graph consisting of a string of eight words. Each time I pruned the word set, I removed the last word in the dead-end path. I was left with 4439 words.
Next, I wanted to ensure that I could eventually move from any one word to any other -- that is, that the graph is connected. I found two connected components, one with 4436 words, and one consisting of (tyum, tium, tuum). Still, I wanted to ensure that the random walk was nicely random, that the graph has "good mixing". There are a lot of ways to analyze this, but I settled on something simple: there should be no word that seperates two halves of the graph. That is, I wanted a single "bi-connected component". I found that there is a big biconnected component with 4413 words, and nine smaller components. I used the 4413 word component for the walk.
The displays are alphanumeric LED arrays. I only need to write the character into a register on the module to display it. I used an Atmel ATMega32 microcontroller to un the program. There is one problem, though. While the ATMega32 has 32 Kbytes of flash storage, the word adjacency information requires nearly 128 Kbytes of storage. I use a 1 Mbit flash eeprom (the small chip) to store the adjacency lists.
New words enter from the top and scroll down, about once every 2 seconds.
Here are some statistics about the word adjacencies. The distribution of the number of adjacencies are:
# adjacent | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
# words | 219 | 225 | 253 | 257 | 247 | 278 | 244 | 240 |
# adjacent | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
# words | 243 | 233 | 220 | 208 | 187 | 175 | 175 | 164 |
#adjacent | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
#words | 167 | 123 | 111 | 109 | 79 | 67 | 57 | 32 |
#adjacent | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 |
#words | 25 | 28 | 19 | 10 | 7 | 3 | 6 | 2 |
The most connected words are
# adjacent | word |
33 | mins |
33 | bons |
32 | pale |
32 | line |
32 | bots |
32 | mine |
32 | cone |
32 | pins |
31 | mare |
31 | Male |
31 | lins |
30 | bare |
30 | Mars |
30 | care |
30 | bart |
30 | bars |
30 | core |
30 | ware |