A Random Walk Through Four Letter Words

I got the idea from the longstanding four-letter word display. The original, developed in the early seventies, of necesity uses a very simple algorithm for generating random words. With modern technology, I felt that I could do something a bit more sophisticated: a random walk through four letter words.

I started out by downloading an on-line spelling dictionary, and extracting the four letter words. I wound up with 4776 of them. Next, I computed which words were adjacent to each other: two words are adjacent if they differ by one letter (snap to knap), or if they are the reverse of each other (paws to swap). I make my random walk by choosing the next word to display randomly from set of words adjacent to the current one (except that I never choose the next word to be the same one as the previous word).

One problem is that I have no guarantee that a random walk over these words would be "interesting". I might start at a word that has no other adjacent words at all. So, I analyzed the set of words and their adjacencies to find an "interesting subset". The set of words and their "adjacencies" is an example of a mathematical construct called a "graph". Therefore I could use graph theory ideas and programs (i.e., the Graph module for Perl) to do my analysis.

First, I eliminated all words which were not adjacent to at least 2 other words. I repeated this eight times: there were some dead-end paths in the word graph consisting of a string of eight words. Each time I pruned the word set, I removed the last word in the dead-end path. I was left with 4439 words.

Next, I wanted to ensure that I could eventually move from any one word to any other -- that is, that the graph is connected. I found two connected components, one with 4436 words, and one consisting of (tyum, tium, tuum). Still, I wanted to ensure that the random walk was nicely random, that the graph has "good mixing". There are a lot of ways to analyze this, but I settled on something simple: there should be no word that seperates two halves of the graph. That is, I wanted a single "bi-connected component". I found that there is a big biconnected component with 4413 words, and nine smaller components. I used the 4413 word component for the walk.

The displays are alphanumeric LED arrays. I only need to write the character into a register on the module to display it. I used an Atmel ATMega32 microcontroller to un the program. There is one problem, though. While the ATMega32 has 32 Kbytes of flash storage, the word adjacency information requires nearly 128 Kbytes of storage. I use a 1 Mbit flash eeprom (the small chip) to store the adjacency lists.

New words enter from the top and scroll down, about once every 2 seconds.

Here are some statistics about the word adjacencies. The distribution of the number of adjacencies are:

# adjacent 2 3 4 5 6 7 8 9

# words 219 225 253 257 247 278 244 240

# adjacent 10 11 12 13 14 15 16 17

# words 243 233 220 208 187 175 175 164

#adjacent 18 19 20 21 22 23 24 25

#words 167 123 111 109 79 67 57 32

#adjacent 26 27 28 29 30 31 32 33

#words 25 28 19 10 7 3 6 2

The most connected words are

# adjacent word

33 mins

33 bons

32 pale

32 line

32 bots

32 mine

32 cone

32 pins

31 mare

31 Male

31 lins

30 bare

30 Mars

30 care

30 bart

30 bars

30 core

30 ware

# adjacent	2	3	4	5	6	7	8	9
# words	219	225	253	257	247	278	244	240
# adjacent	10	11	12	13	14	15	16	17
# words	243	233	220	208	187	175	175	164
#adjacent	18	19	20	21	22	23	24	25
#words	167	123	111	109	79	67	57	32
#adjacent	26	27	28	29	30	31	32	33
#words	25	28	19	10	7	3	6	2

# adjacent	word
33	mins
33	bons
32	pale
32	line
32	bots
32	mine
32	cone
32	pins
31	mare
31	Male
31	lins
30	bare
30	Mars
30	care
30	bart
30	bars
30	core
30	ware