Balzac Text Data Mining:
How to pick apart Balzac's Comedie Humaine with a computer

Some techniques in Python (a computer programming language) that you can use to learn higher-order facts about texts, applied here to extracting various useful types of metadata from the Project Gutenberg collection of Balzac stories.


Importance of a Character in a Story
Measures the importance of a character in a story from the number of references to the character in the story.

The Balzac novels that Project Gutenberg provides end with nice lists of recurring characters and the other works that they appear in. Unfortunately, these lists give the reader little idea of how important a role a character plays in those other works. This reference-counting (word-counting) measure of character importance has been cooked up to overcome that limitation.

[countrefs1 (program), Input, Title Table (input), Output: Lousteau, Bixiou, Bianchon, d'Arthez]
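
The counting idea can be sketched in a few lines. This is only an illustration, not the countrefs1 program itself: the function names `count_references` and `importance_table` are made up here, and the sketch assumes each story is already loaded as a string and that a character is referred to by one fixed spelling of the name.

```python
import re
from collections import Counter


def count_references(text, names):
    """Return a Counter mapping each name to its whole-word occurrences in text."""
    counts = Counter()
    for name in names:
        # \b stops a name from matching inside a longer word; re.escape
        # protects names that contain regex metacharacters.
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        counts[name] = len(pattern.findall(text))
    return counts


def importance_table(stories, names):
    """stories maps title -> full text; returns {title: Counter of names}."""
    return {title: count_references(text, names) for title, text in stories.items()}


if __name__ == "__main__":
    # Tiny invented sample texts, standing in for the Gutenberg files.
    stories = {
        "Story A": "Lousteau met Bixiou. Lousteau laughed.",
        "Story B": "Bianchon and d'Arthez dined. Bianchon left.",
    }
    names = ["Lousteau", "Bixiou", "Bianchon", "d'Arthez"]
    for title, counts in importance_table(stories, names).items():
        print(title, dict(counts))
```

A raw count favours long stories, so dividing each count by the story's word count may give a fairer comparison between titles; that normalization is a suggestion, not something the measure above prescribes.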


Get All References to a Character in a Collection of Stories
Extracts references: finds all occurrences of a word (or, more generally, a regular expression) in a collection of texts and records their location and context (surrounding words). Uses string search. The output of this program can be used as input to the program 'countrefs1' above. This program is just a script imitation of "Find in files" as found in well-known programming editors such as the shareware "Editplus" editor, Microsoft's VC++ 6, or Python's IDLE. (This script is significantly slower than the editors I've just mentioned, but it can be used by a Python program instead of manually through the editor.) (Unixspeak: "grep" all regex-matching lines from a directory of files.)
[getrefs1 (program), Output]
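
A minimal "find in files" along these lines might look as follows. It is a sketch, not the getrefs1 script: the names `find_references` and `find_in_files`, the `*.txt` file pattern, and the 30-character context window are all assumptions made for the example.

```python
import re
from pathlib import Path


def find_references(text, pattern, context=30):
    """Yield (line_number, column, snippet) for every regex match in text."""
    regex = re.compile(pattern)
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in regex.finditer(line):
            # Keep up to `context` characters on either side of the match.
            start = max(match.start() - context, 0)
            end = match.end() + context
            yield line_number, match.start(), line[start:end]


def find_in_files(directory, pattern, glob="*.txt", context=30):
    """Collect (filename, line_number, column, snippet) over a directory of texts."""
    results = []
    for path in sorted(Path(directory).glob(glob)):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, column, snippet in find_references(text, pattern, context):
            results.append((path.name, line_number, column, snippet))
    return results


if __name__ == "__main__":
    sample = "Bianchon arrived.\nNobody spoke.\nThen Bianchon and Bixiou left."
    for hit in find_references(sample, r"Bianchon", context=5):
        print(hit)
```

Recording the line number and column along with the snippet is what lets a later program (or a viewer) jump back to the exact spot in the full text.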


Find Important Periods of Time in a Story
For every Balzac title, find the years referenced. This gives a rough timeline for the work. The distribution of years in a work provides an idea of which years are important in the work. Plotting a histogram of this distribution allows one to *visually* judge the importance of different periods of time in a work. (The years collected are restricted to the nineteenth century up to shortly before Balzac's death in 1850: 1800-1848.) [Input yearrefs1 (program), Output]
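
The year extraction can be sketched with one regular expression and a counter. This is an illustration rather than the yearrefs1 program: the names `YEAR_PATTERN`, `year_counts` and `text_histogram` are invented, and the plain-text bar chart stands in for whatever plotting the original uses.

```python
import re
from collections import Counter

# Four-digit years from 1800 to 1848, the range stated above.
YEAR_PATTERN = re.compile(r"\b(18[0-3][0-9]|184[0-8])\b")


def year_counts(text):
    """Return a Counter mapping each year found in text to its frequency."""
    return Counter(int(year) for year in YEAR_PATTERN.findall(text))


def text_histogram(counts, width=40):
    """Return one plain-text bar per year, in chronological order."""
    if not counts:
        return []
    peak = max(counts.values())
    lines = []
    for year in sorted(counts):
        # Scale the bar so the most frequent year is `width` characters long.
        bar = "#" * max(1, round(counts[year] * width / peak))
        lines.append(f"{year} {bar} {counts[year]}")
    return lines


if __name__ == "__main__":
    sample = "In 1819 he came to Paris; by 1822 he was ruined. In 1819, too, she left."
    print("\n".join(text_histogram(year_counts(sample), width=10)))
```

The word boundaries in the pattern keep longer digit strings (for example a five-digit sum of money beginning with 1819) from being counted as years.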


Provide readers with relative reading times for different titles
In this busy world it would be nice if a reader could estimate the amount of time a story is going to take so he can fit it into his busy schedule. Usually people eyeball the thickness of the book, or see how many pages there are. For an ebook they might look at how many bytes long the file is (e.g. 654 KB). Since we are interested in one collection of texts, namely stories written by Balzac, we can do better than this: we can rank his works by length and break them into groups. One simplifying assumption we make is that the time it takes to read a title is a linear function of the number of words (time = k * words). In fact, if a story has a complicated plot, there could be significant non-linearities, in which case the best way to measure reading time might be to time people actually reading it. Uses word counting. [wordsperfile2 (program), Output]
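
The ranking and the linear model time = k * words can be sketched as below. None of this is taken from wordsperfile2: the function names are invented, words are counted simply by splitting on whitespace, and the default of 250 words per minute (so k = 1/250 minutes per word) is an assumed figure that a reader would replace with his own speed.

```python
def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def rank_by_length(stories):
    """stories maps title -> text; return [(title, words)], shortest first."""
    pairs = ((title, word_count(text)) for title, text in stories.items())
    return sorted(pairs, key=lambda pair: pair[1])


def reading_minutes(words, words_per_minute=250):
    """Linear model: time = k * words, with k = 1 / words_per_minute (assumed)."""
    return words / words_per_minute


def group_titles(ranked, n_groups=3):
    """Split a ranked list into n_groups contiguous groups of near-equal size."""
    size, extra = divmod(len(ranked), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        end = start + size + (1 if i < extra else 0)
        groups.append(ranked[start:end])
        start = end
    return groups


if __name__ == "__main__":
    # Invented stand-ins for three titles of very different lengths.
    stories = {"Long": "w " * 1000, "Short": "w " * 10, "Medium": "w " * 100}
    for group in group_titles(rank_by_length(stories)):
        for title, words in group:
            print(title, words, round(reading_minutes(words), 2), "min")
```

Because only the relative ranking matters here, the exact value of k is unimportant; it scales every title's estimate by the same factor.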


Lookup extracted references in the full body of the text (GUI)
So you have references; what if you want to see what they refer to? Here we give you a very simple reference file viewer. Just double-click a reference line with your mouse and the larger file is opened and positioned at the referenced item, which is highlighted, just as in the programming editors VC++ 6, IDLE (Python), and Editplus.


Get word frequencies in the Comedie Humaine Corpus
Word frequencies in the total corpus are essential for calculating statistics to identify significant collocations used by Balzac in his descriptions (e.g. faces, noses, hats, rooms, food, ...). Since Balzac used the 18th century folk "science" of phrenology (still regarded as a science today by many where I live in Rangoon, Myanmar) to design his characters, one can look at a character's face as a control panel for the story's plot in Balzac. These "physiognomy collocations" are therefore an essential foundation for any simulation of Balzac's fictional world.
[wordcount0 (C++ program), wordcount00 (Python program calling wordcount0), wordcount1 (program) Output sample]
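
A pure-Python version of the corpus word count can be sketched with `collections.Counter`. This is not the wordcount0/wordcount1 code: the tokenizer (lower-cased runs of letters, with an internal apostrophe allowed) and the names `TOKEN`, `word_frequencies` and `relative_frequencies` are assumptions made for the example.

```python
import re
from collections import Counter

# A word is a run of letters, optionally joined by apostrophes (d'arthez).
TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def word_frequencies(texts):
    """Accumulate lower-cased word counts over an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN.findall(text.lower()))
    return counts


def relative_frequencies(counts):
    """Convert raw counts to proportions of the whole corpus."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


if __name__ == "__main__":
    corpus = ["The nose, the hat.", "The NOSE of d'Arthez"]
    counts = word_frequencies(corpus)
    print(counts.most_common(3))
    print(relative_frequencies(counts)["nose"])
```

The relative frequencies are what a collocation statistic would compare against: a word pair is interesting when it turns up together more often than the individual frequencies of its words would suggest.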


Recurring Character List
Balzac, like Proust, is famous for his recurring characters. The only problem is that in the Project Gutenberg edition of Balzac there isn't a master list, only partial lists at the end of each title that list the other works that the characters of that title appear in. Our program just reads these files and uses Python's dictionary data structure (Perl hash, C++/Java map) to accumulate these lists over all the titles in the Balzac collection.
[recur2, Output]

(Distribution of recurring characters. The average number of titles a character occurs in is about 2, but some, like the doctor Horace Bianchon, occur in or narrate in as many as 26 titles. Points plotted in the histogram.)
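
The dictionary accumulation can be sketched as follows. This is not the recur2 program, and the input layout is an assumption: each per-title list is taken to be a sequence of lines in which an unindented line names a character and the indented lines under it name the other works that character appears in. The function names are invented for the example.

```python
from collections import Counter, defaultdict


def parse_character_list(lines):
    """Yield (character, other_title) pairs from one title's character list.

    Assumed layout: unindented line = character, indented lines = titles.
    """
    character = None
    for line in lines:
        if not line.strip():
            continue  # skip blank separator lines
        if line[0].isspace():
            if character is not None:
                yield character, line.strip()
        else:
            character = line.strip()


def merge_appearances(lists_by_title):
    """lists_by_title maps a title -> its list lines; return {character: set of titles}."""
    master = defaultdict(set)
    for source_title, lines in lists_by_title.items():
        for character, other_title in parse_character_list(lines):
            # The character appears in the title carrying the list and in
            # each of the other works listed under the name.
            master[character].update((source_title, other_title))
    return master


def titles_per_character(master):
    """Distribution: number of titles -> number of characters with that many."""
    return Counter(len(titles) for titles in master.values())


if __name__ == "__main__":
    # Hand-made sample in the assumed layout, not copied from the real files.
    sample = {
        "Father Goriot": ["Bianchon, Horace", "  The Atheist's Mass", "",
                          "Rastignac, Eugene de", "  The Firm of Nucingen"],
        "The Atheist's Mass": ["Bianchon, Horace", "  Father Goriot"],
    }
    master = merge_appearances(sample)
    for character in sorted(master):
        print(character, sorted(master[character]))
```

Using a set per character means a title reported by several different lists is only counted once, which matters when computing the distribution of titles per character.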



Dummy Project
Whatever dummies like me do.



Python Code Snippets of Useful Functionality:

Regular Expressions

Word Lists


Text Tiling


Clustering