Erik Nelson


A Random Sentence Generator Using Iterated String Substitution in Python


October, 2006





In my random sentence generator, the vocabulary/sentence data structure is a list of lists of strings. The first string on each sublist is the token that must be replaced when found in the output that is being built. Each additional string on the sublist is a thing that can be chosen to replace it. For example, the data structure in the first version I wrote looked like this:


[

["#n", "apple","banana","car","dog","elevator"] ,

["#v", "amble","bark","cry","dump"] ,

["#a", "able","big","circular","dry"] ,

]



and the initial string that it went to work on was


"The #a #n likes to #v ."


The program loops through the starting string, looking at each word in turn, and checks whether it matches the token at the head of one of the vocabulary sublists.

I have chosen #n to be my code for "noun", #v for verb, and #a for adjective. So when it sees #a, for example, it can pick "able", "big", "circular" or "dry" to replace it with.
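Here is a minimal sketch of that single pass, condensed from the full listing given below as example 1, just to make the mechanism concrete:

import random

vocab = [ ["#n", "apple","banana","car","dog","elevator"] ,
          ["#v", "amble","bark","cry","dump"] ,
          ["#a", "able","big","circular","dry"] ]

buildlist = []
for word in "The #a #n likes to #v .".split(" "):   # look at each word in turn
    for sublist in vocab:
        if word == sublist[0]:                      # it matches a token like #a
            word = random.choice(sublist[1:])       # so pick a replacement for it
    buildlist.append(word)
print " ".join(buildlist)    # e.g. "The big dog likes to bark ."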


So, picking words at random from the list, it can come up with


The big dog likes to bark.

The able elevator likes to cry.

The circular banana likes to bark.


Okay, then, so what? Not very impressive, and this is a roundabout, complicated way of doing things if this is all you want to do. But now it's time to come up with the more sophisticated version and introduce recursion. Make the vocabulary sublists more complex: they can contain phrases, not just words, and these phrases can have little codes within them that get replaced in the next iteration. Indeed, they can contain whole sentence structures. For instance, in my next version, the list of adjectives looked like this:


["#a", "able","big","circular","dry","eerie","fluid",
"un- #a","green","jumpy","fetid","squamous","raucous","wet",
"#n -like","flat","round", "#n -shaped", "strange", "enormous", "overt"]


Two things to point out here. First, I do not have to be strict about hierarchy. It is in fact possible to pick an adjective that contains another adjective-to-be-picked-later, so you can come up with weird constructions like "un- round" or even "un- un- un- un- dog -like". Also, I have to space weirdly around the hyphenations because each token needs to be a separate word. (Later versions had a few lines of code for getting rid of the extra spaces around punctuation after all the iterations have run.)
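To see the recursion at work, here is a small, self-contained sketch with a hypothetical, trimmed-down vocabulary; it just runs the same substitution pass several times, so a phrase chosen in one iteration can have its own tokens filled in on the next:

import random

vocab = [ ["#a", "big","round","un- #a","#n -like"] ,
          ["#n", "dog","banana"] ]

def one_pass(s):
    words = []
    for word in s.split(" "):
        for sublist in vocab:
            if word == sublist[0]:
                word = random.choice(sublist[1:])
        words.append(word)
    return " ".join(words)

s = "a #a thing"
for i in range(6):     # run the pass repeatedly so nested tokens get replaced too
    s = one_pass(s)
print s                # e.g. "a un- banana -like thing"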


Reading the source code, you will note that I do not use the find-and-replace string functions like I could have; rather, I use split() to divide the string into words-to-be-tested and then join() to put the sentence back together once substitutions have been made. There is a reason for this. replace() either substitutes the same choice for every occurrence of a token at once or, if you limit it to one replacement, only ever touches the first occurrence. I felt that if the program fell into an infinite regress, it would all happen at the beginning of the string, and the program would never replace things that were to the right of it. If I had an infinite regress, I felt, it should be uniformly distributed throughout the whole string. However, I have not yet seen a string that failed to complete after twenty iterations of the loop. (I had some pound-codes left over after ten, so I raised it to twenty and have left it there since.)
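For comparison, this hypothetical snippet (not part of the program) shows the behavior of the built-in replacement: with no count, replace() swaps the same randomly chosen word in for every occurrence of the token at once, and with a count of 1 it only ever touches the leftmost one.

import random

nouns = ["apple","banana","car","dog"]
s = "The #n chased the #n over the #n ."

print s.replace("#n", random.choice(nouns))
# every #n becomes the SAME word, e.g. "The dog chased the dog over the dog ."

print s.replace("#n", random.choice(nouns), 1)
# only the leftmost #n changes, e.g. "The car chased the #n over the #n ."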


Once this was all working, I noticed that I still needed to splice sentences together better. I wanted to be able to say, for example,

The big dog likes to bark, but even a car can cry.

instead of

The big dog likes to bark. but Even a car can cry.

I solved this problem by creating the functions puncteater() and capeater(), which look for the characters "<" and ">" respectively. The "<" means "get rid of the period and the extra space to the left of me," and the ">" means "get rid of the extra space and decapitalize the letter to the right of me." I refined this a little by adding another character, "^", which exists merely to protect the letter "I" from being decapitalized and is then evaporated from the string in a final cleanup.
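Here is a rough, self-contained sketch of the idea; the real puncteater() and capeater(), which work piece by piece with small helper functions, appear in the final listing below.

def eat_punct(s):        # "<" eats sentence-ending punctuation to its left
    pieces = s.split("<")
    cleaned = [ p.rstrip().rstrip(".?!") for p in pieces[:-1] ]
    return "".join( cleaned + [ pieces[-1] ] )

def eat_caps(s):         # ">" decapitalizes the word to its right
    pieces = s.split(">")
    out = [ pieces[0] ]
    for p in pieces[1:]:
        p = p.lstrip()
        out.append( p[:1].lower() + p[1:] )
    return "".join(out)

marked = "The big dog likes to bark . <, but > Even a car can cry ."
print eat_caps( eat_punct( marked ) )
# -> "The big dog likes to bark , but even a car can cry ."
# (the leftover spaces around the punctuation are removed in a later cleanup step)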


Once all that was working, I created a version that can read a user-defined vocabulary-and-structure listing from a text file. You, too, can write a vocabulary for this program.

Each item of a sublist goes on its own line, and sublists are separated by a blank line. You need to change to the directory containing the vocabulary file before running the program (or type the full path when the program asks for the filename).
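For instance, a small file in this format might look like the following (a trimmed-down, hypothetical version of the fuller example file reproduced at the end of this writeup):

#n
car
plane
boat

#v
go
stop
fly

#s
The #n can #v .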

Example 1, the simplest version of the program


#This is an experiment in random sentence generators.

#The data structure is a list of lists, with the substitutable thing at the

#head of each one and things it can turn into on that list.


#The next thing to do: make it a recursive grammar.

#This entails I must make the substitution loop into a function

#so I can cycle it through several times.


import random


vocab = [] #an empty list to begin with

vocab.append( ["#n", "apple","banana","car","dog","eel","frog","grape"] )

vocab.append( ["#v", "amble","bark","cry","dump","eat","fly"] )

vocab.append( ["#a", "able","big","circular","dry","eerie","fetid"] )


#print vocab

#print "the zeroth item in the zeroth list is ",

#print vocab[0][0]

#print "the 4th item in the 2nd list (starting the count at 0,0) is ",

#print vocab[2][4]


#print"============================"


teststring = "The #a #n likes to #v . Also, the #n is a #a #n ."


#print teststring


#print "and now we make the test string into a list"


stringbreakup = teststring.split(" ")

#print stringbreakup


newword=""

buildlist = []

for i in stringbreakup: #each i is a word in the list of source words
    newword = i
    for j in vocab: #each j is a LIST in the listoflists vocab
        if i == j[0]: #so j[0] is the substitutable token
            newword = random.choice(j[1:])
            #so newword is a choice from the rest of the list
            #if i is to be substituted for
    buildlist.append(newword)


#print buildlist


#print "====================="

#print "and now we join the words."


print " ".join(buildlist)

Example 2, with a recursive vocabulary and sentence structure


#By Erik Nelson, June 12, 2006

#This is an experiment in random sentence generators.

#The data structure is a list of lists, with the substitutable thing at the

#head of each one and things it can turn into on that list.


#The next thing to do: make it a recursive grammar.

#This entails I must make the substitution loop into a function

#so I can cycle it through several times.


import random


vocab = [] #an empty list to begin with

vocab.append(["#p", "\n It can be shown that #s This is true even though the #n says #s . However, #s", "\n In spite of it all, #s #s But even a #n will #v .", "\n In conclusion, #s but #s ", "#s #s But #s Furthermore, the #n doesn't believe that #s","#s #p", "#p #p","\n All this is meaningless and #a because the #n says #s To say #s would be an oversimplification."])

vocab.append(["#s", "The #a #n likes to #v .", "A #n can't #v if it has a #n !", "Also, the #a #n is a #n .","It isn't true that #s","A #n is a #a #n that can #v .","I have decided to become a #n even though I don't #v .", "People who #v do so at their own risk because of a #n .","I can't #v with a #n .","A #n can't #v because of a #n .","The #n will never learn not to #v .","A #n and a #n can't #v together if one of them is more #a than the other.","I am a #n .","The #n can #v ."])

vocab.append( ["#n", "apple","banana","car","dog","eel","frog","grape","#n that isn't #a","#n that can't #v","rock","hammer", "snowball","brick","map" ,"thing","house","#n -eater"] )

vocab.append( ["#v", "amble","bark at a #n","cry","dump on a #n until it turns #a","eat a #n","fly like a #n","absquatulate","vomit","#v until the #n comes home","fail to #v","#v like a #n", "become #a","be a #n","walk","write","speak", "say a #n is like a #n","ride a #n","run","go shopping"] )

vocab.append( ["#a", "able","big","circular","dry","eerie","fluid", "un- #a","green","jumpy","fetid","squamous","raucous","wet","#n -like","flat","round", "#n -shaped","strange","enormous","overt"] )




teststring ="#p #p \n #s But #s Furthermore, the #n doesn't believe that #s"

print teststring, "\n The above is the ORIGINAL STRING"


def replacetokens(instring):
    stringbreakup = instring.split(" ")
    #print stringbreakup

    newword=""
    buildlist = []
    for i in stringbreakup:
        #each i is a word in the list of source words
        newword = i
        #and it remains unchanged unless it's a token
        for j in vocab:
            #each j is a LIST in the listoflists vocab
            if i == j[0]:
                #so j[0] is the substitutable token
                newword = random.choice(j[1:])
                #so newword is a choice from list j
                #if the word is changed.
        buildlist.append(newword)

    #print buildlist
    #print "====================="
    #print "and now we join the words."

    outstring = " ".join(buildlist)
    return outstring


evolvestring = teststring

for i in range(1,20):
    teststring = replacetokens(teststring)
    print "\nThe ", i, "th iteration is============\n\n",teststring
    print "\n"

The latest version, with capeater(), puncteater(), and user-definable files


"""

#By Erik Nelson, June 12, 2006

#This is an experiment in random sentence generators.

#The data structure is a list of lists, with the substitutable thing at the

#head of each one and things it can turn into on that list.


#The next thing to do: make it a recursive grammar.

#This entails I must make the substitution loop into a function

#so I can cycle it through several times.


"""


import random


########## getvocab(filename) returns listoflistofstrings ######

"""

getvocab(filename) returns a list-of-lists-of-strings.

Rather than clutter up the code I will comment above it.

This takes a file and reads it into the list-of-list format

that wordsubfunc uses.


As each line of the file gets read in, it is appended to buildsublist.

If a blank line is read, the existing buildsublist (a list of strings) becomes an item in buildbiglist (a list of lists of strings).

Finally buildbiglist is returned when we are finished building it.


The line is cleaned up by removing the backslash-n before appending it.

This is done with the if-else statement that assigns the variable cleanline. A more compact way to do this would be welcome. (Without the conditional, the last character of the last line could get lost, because the last line might not end in a backslash-n.)


The final conditional append statement after the loop is needed because without it the last sublist doesn't get appended unless you end the input file with a blank line, which is a non-neat situation.


Note that it won't work unless the right path is already set.

"""




def getvocab(filename):
    infile = open(filename, 'r')
    buildbiglist = []
    buildsublist = []
    for line in infile:
        if (len( line ) > 1) and (line != None) :
            if line[-1] == '\n':
                cleanline = line[0:-1]
            else:
                cleanline = line
            buildsublist.append( cleanline )
        elif buildsublist != []:
            buildbiglist.append( buildsublist )
            buildsublist = []
    if buildsublist != []:
        buildbiglist.append( buildsublist )
    return buildbiglist
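# (A more compact way to do the newline cleanup the docstring wishes for
#  would be cleanline = line.rstrip('\n'); rstrip() leaves a line with no
#  trailing newline unchanged, so the if/else would not be needed.)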


######## getvocab() function definition ends here ################



vocab = [] #an empty list to begin with

vocab.append(["#p", "\n It can be shown that > #s This is true even though the #n says > #s However, > #s", "\n In spite of it all, > #s #s <, but even a #n will #v <.", "\n In conclusion, > #s <, even though > #s ", "\n #s #s But > #s Furthermore, the #n doesn't believe that > #s","\n #s #p", "#p #p","\n All this is meaningless and #a because the #n says > #s","\n To say > #s < would be an oversimplification."])


vocab.append(["#s", "The #a #n likes to #v .", "A #n can't #v if it has a #n !", "Also, the #a #n is a #n <.","It isn't true that > #s","A #n is a #a #n that can #v .","^I have decided to become a #n even though I don't #v .", "People who #v do so at their own risk because of a #n .","^I can't #v with a #n .","A #n can't #v because of a #n .","The #n will never learn not to #v .","A #n and a #n can't #v together if one of them is more #a than the other.","^I am a #n .","The #n can #v .", 'The #n says "So, > #s <, eh?"' , '" < #s <," quoth the #n .'])


vocab.append( ["#n", "apple","banana","car","dog","eel","frog","grape","#n that isn't #a","#n that can't #v","rock","hammer", "snowball","brick","map" ,"thing","house","#n -eater"] )


vocab.append( ["#v", "amble","bark at a #n","cry","dump on a #n until it turns #a","eat a #n","fly like a #n","absquatulate","vomit","#v until the #n comes home","fail to #v","#v like a #n", "become #a","be a #n","walk","write","speak", "say a #n is like a #n","ride a #n","run","go shopping"] )


vocab.append( ["#a", "able","big","circular","dry","eerie","fluid", "un- #a","green","jumpy","fetid","squamous","raucous","wet","#n -like","flat","round", "#n -shaped","strange","enormous","overt"] )


teststring ="#p #p \n #s But > #s Furthermore, the #n doesn't believe that > #s"


"""The following four functions splice sentences by removing period, exc. mark and question mark where the workstring contains < after it, and de-capitalizing where the workstring has a > before the capital letter. """


def decapitalize(thestring):
    if thestring == "":
        result = ""
    elif len(thestring) == 1:
        result = thestring.lower()
    else:
        result = thestring[0].lower() + thestring[1:]
    return result


def capeater(thestring):
    thelist = thestring.split(">")
    newlist = []
    newlist.append(thelist[0])
    for piece in thelist[1:]:
        newlist.append( decapitalize( piece.lstrip() ) )
    result = "".join( newlist )
    return result


def depunct( thestring ):
    if thestring == "":
        result = ""
    elif len(thestring) == 1:
        if thestring in ".!?":
            result = ""
        else:
            result = thestring
    else:
        if thestring[-1] in ".?!" :
            result = thestring[:-1]
        else:
            result = thestring
    return result


def puncteater(thestring):
    thelist = thestring.split( "<" )
    newlist = []
    for piece in thelist[:-1]:
        newlist.append( depunct( piece.rstrip() ) )
    newlist.append(thelist[-1])
    result = "".join(newlist)
    return result




print "This program will use the vocabulary in the file that you name."

print "It will use the old vocabulary if you don't name a filename."

print "What file do you want to read from? Just hit return for default vocabulary."

filechoice = raw_input()

if filechoice != '':
    vocab = getvocab(filechoice)


print teststring, "\n The above is the ORIGINAL STRING"


def replacetokens(instring):
    stringbreakup = instring.split(" ")
    #print stringbreakup

    newword=""
    buildlist = []
    for i in stringbreakup:
        #each i is a word in the list of source words
        newword = i
        #and it remains unchanged unless it's a token
        for j in vocab:
            #each j is a LIST in the listoflists vocab
            if i == j[0]:
                #so j[0] is the substitutable token
                newword = random.choice(j[1:])
                #so newword is a choice from list j
                #if the word is changed.
        buildlist.append(newword)

    #print buildlist
    #print "====================="
    #print "and now we join the words."

    outstring = " ".join(buildlist)
    return outstring


evolvestring = teststring

for i in range(1,20):
    teststring = replacetokens(teststring)
    print "\nThe ", i, "th iteration is============\n\n",teststring
    print "\n"

teststring = capeater( puncteater( teststring) )

# The above functions fix the punctuation and capitalization

# when the template calls for a sentence splice.


teststring = teststring.replace("^","")

# We use '^' to protect things like "I" from the above rule.

# Then the ^ disappears.


teststring = teststring.replace( " ." , "." )

# Eliminate spaces before periods (which are there because codetokens

# need to be separated from everything else by spaces).


teststring = teststring.replace( " -" , "-" )

teststring = teststring.replace( "- " , "-" )

teststring = teststring.replace(" ," , "," )

teststring = teststring.replace(" !" , "!")

teststring = teststring.replace(" ?", "?")

teststring = teststring.replace(' " ' , ' "')

# And similarly for hyphens, exclamation marks, question marks, commas

# The double quote call eliminates space after (lefthand) quote

# Since a lefthand quote looks the same as a righthand quote, be careful.

# Capeater will work between a punctuation mark and a (construed as right)

# quote.



print "***The final version with spliced punctuation and capitalization: **\n"

print teststring




Here is a sample session of the program, with some of the successive iterations omitted, because it gets repetitive. Final generated text is at the bottom.


[E:~] eriknels% python /\ erikProgramming/python\ exercises/fileexperiment/RSGwfileread/wordsubwfileread.py

This program will use the vocabulary in the file that you name.

It will use the old vocabulary if you don't name a filename.

What file do you want to read from? Just hit return for default vocabulary.


#p #p

#s But > #s Furthermore, the #n doesn't believe that > #s

The above is the ORIGINAL STRING


The 1 th iteration is============



All this is meaningless and #a because the #n says > #s

#s #p

" < #s <," quoth the #n . But > A #n is a #a #n that can #v . Furthermore, the #n that can't #v doesn't believe that > The #n will never learn not to #v .


The 2 th iteration is============



All this is meaningless and enormous because the eel says > A #n can't #v if it has a #n !

The #n will never learn not to #v .

To say > #s < would be an oversimplification.

" < A #n is a #a #n that can #v . <," quoth the eel . But > A dog is a fetid brick that can say a #n is like a #n . Furthermore, the #n -eater that can't write doesn't believe that > The rock will never learn not to bark at a #n .


The 3 th iteration is============



All this is meaningless and enormous because the eel says > A brick can't ride a #n if it has a banana !

The map will never learn not to be a #n .

To say > A #n and a #n can't #v together if one of them is more #a than the other. < would be an oversimplification.

" < A brick is a able map that can run . <," quoth the eel . But > A dog is a fetid brick that can say a eel is like a dog . Furthermore, the frog -eater that can't write doesn't believe that > The rock will never learn not to bark at a house .


The 4 th iteration is============



All this is meaningless and enormous because the eel says > A brick can't ride a #n that can't #v if it has a banana !

The map will never learn not to be a apple .

To say > A brick and a banana can't absquatulate together if one of them is more #n -shaped than the other. < would be an oversimplification.

" < A brick is a able map that can run . <," quoth the eel . But > A dog is a fetid brick that can say a eel is like a dog . Furthermore, the frog -eater that can't write doesn't believe that > The rock will never learn not to bark at a house .


The 5 th iteration is============



All this is meaningless and enormous because the eel says > A brick can't ride a rock that can't dump on a #n until it turns #a if it has a banana !

The map will never learn not to be a apple .

To say > A brick and a banana can't absquatulate together if one of them is more car -shaped than the other. < would be an oversimplification.

" < A brick is a able map that can run . <," quoth the eel . But > A dog is a fetid brick that can say a eel is like a dog . Furthermore, the frog -eater that can't write doesn't believe that > The rock will never learn not to bark at a house .


The 6 th iteration is============



All this is meaningless and enormous because the eel says > A brick can't ride a rock that can't dump on a dog until it turns enormous if it has a banana !

The map will never learn not to be a apple .

To say > A brick and a banana can't absquatulate together if one of them is more car -shaped than the other. < would be an oversimplification.

" < A brick is a able map that can run . <," quoth the eel . But > A dog is a fetid brick that can say a eel is like a dog . Furthermore, the frog -eater that can't write doesn't believe that > The rock will never learn not to bark at a house .


(Note that all the substitutable codes are gone after 6 cycles. Since the rest of the 19 iterations are the same, I am not including them.)


***The final version with spliced punctuation and capitalization: **



All this is meaningless and enormous because the eel says a brick can't ride a rock that can't dump on a dog until it turns enormous if it has a banana!

The map will never learn not to be a apple.

To say a brick and a banana can't absquatulate together if one of them is more car-shaped than the other would be an oversimplification.

"A brick is a able map that can run," quoth the eel. But a dog is a fetid brick that can say a eel is like a dog. Furthermore, the frog-eater that can't write doesn't believe that the rock will never learn not to bark at a house.


(The above is the final resulting text. Note that the < and > characters were there to tell capeater() to decapitalize and puncteater() to get rid of periods so that sentences could be spliced, and also surplus spaces around punctuation were cleaned up.)



The text of a user-defined file, so that you can see how to write your own vocabulary.


#n

car

plane

boat

passenger


#a

big

hairy

snub-nosed


#v

go

stop

fly

dig


#s

The #a #n can #v .

I am a #n and I can #v with a #a #n .


Once my program nested its quotation marks three deep, so I thought this was worth keeping:


To say "" "I have decided to become a banana even though I don't become big," quoth the rock," quoth the snowball," quoth the rock would be an oversimplification.