Balzac's World


 
 
     

    Computing for Poets

    Code Snippets

    Text Data Mining

    The novels of Balzac provide fertile ground for researching textual information retrieval in languages such as Python, Perl, or C++, using multiple-string search strategies such as the Aho-Corasick algorithm or DFAs (deterministic finite automata, also called deterministic finite-state machines).
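
    To make the multiple-string idea concrete, here is a minimal Python sketch of an Aho-Corasick matcher. It is an illustration only, not one of the programs used on the Balzac corpus; the function names `build_automaton` and `find_all` are hypothetical, and matching is case-sensitive unless you lowercase the text and the patterns first.

```python
from collections import deque


def build_automaton(patterns):
    """Build a trie with failure links (Aho-Corasick) for the given patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> fallback state on a mismatch
    out = [[]]    # out[state] -> patterns that end at this state

    # Phase 1: insert every pattern into the trie.
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
```

    The point of the technique is that the text is scanned once no matter how many search strings are loaded, which matters when the pattern set is a long list of descriptive phrases.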

    Stories or narratives are interesting special cases for information retrieval because they don't explicitly say what they are about. Rhetorical devices used in literature (metaphor, metonymy, irony) rely on indirection rather than explicit statement. Say you want to find every description of a character: you can't just search for the strings "character" and "description" the way you would for an ordinary subject in an internet search engine. To locate character descriptions you have to use a whole set of strings: the phrases used to describe the physique, clothing, and emotional life of characters. You then look for passages in the text where these strings occur at high density.
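
    The density idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it scores single words rather than multi-word phrases, treats blank lines as paragraph breaks, and uses a made-up descriptor list and an arbitrary threshold. The function names are hypothetical.

```python
import re


def split_paragraphs(text):
    """Split plain text into paragraphs on blank lines (an assumption about the input)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def descriptor_density(paragraph, descriptors):
    """Fraction of the paragraph's words that appear in the descriptor set."""
    words = re.findall(r"\w+", paragraph.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in descriptors)
    return hits / len(words)


def dense_paragraphs(text, descriptors, threshold=0.1):
    """Return (density, paragraph) pairs whose density meets the threshold."""
    scored = [(descriptor_density(p, descriptors), p) for p in split_paragraphs(text)]
    return [(d, p) for d, p in scored if d >= threshold]


# Hypothetical descriptor list for clothing; a real one would be much longer.
clothing = {"coat", "waistcoat", "blue", "grey"}
sample = "He wore a blue coat and a grey waistcoat.\n\nThe market opened at noon."

for density, paragraph in dense_paragraphs(sample, clothing):
    print(f"{density:.2f}  {paragraph}")
```

    In practice the threshold and the word lists need tuning per author and per translation, and phrase matching would replace the single-word lookup.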

    About midway through his career Balzac began writing his novels with an eye towards recurring characters and recurring social types embodied in characters. He described his characters with a surplus of detail, and this surplus is great raw material for textual search algorithms. All his works are available for free on the internet, in French and in English translation.

    The programs presented in this section mix the mundane (e.g. reformatting plain text as HTML, or a reader for plain-text books that allows bookmarking and annotation) with the more sophisticated (e.g. textual search strategies for extracting certain types of information from texts, typically referred to as information extraction, or IE). Here are some programs I've used on the corpus of Balzac's works:

    Here's an annotatable book that can be used with plain-text files such as those Project Gutenberg provides:

    Here is a Perl program for extracting paragraphs that describe characters, interiors, or food from novels by authors such as Dickens and Balzac: