Stylometry is defined as the technique of making a statistical analysis of some characteristics of literary style. Often stylometric studies focus on authorship questions, and often they rely on computers to count language features in a text and to perform some kind of statistical analysis on these counts. Computers have been used in stylometric authorship studies since the 1960s, when computers first became available for use by scholars in any field. Many of these studies have been controversial for a number of reasons. Most humanities scholars cannot understand the complexities of mathematical methods, nor are they inclined to trust a scientific approach to their own problems. On the other hand, the practitioner of stylometry may fail to understand fully important textual aspects of the authorship problem that seriously affect the assumptions of the statistical study. For example, evaluating a text's authorship by measuring something associated with punctuation will not be valid if scholars have determined that punctuation was later added or modified by a scribe or printer. The stylometric approach has also raised objections from literary scholars when a scientist has based an analysis on certain textual features that do not seem interesting or significant to the literary scholar. Finally, there have been controversies among the advocates of stylometry themselves regarding the best statistical methods to use or the best textual features to measure.
Not surprisingly, the field of Shakespearean authorship studies has been a frequent battleground for those who strongly advocate stylometry. For the most part, the accepted mainstream of Shakespearean textual scholarship has been unaffected by the results of any statistical authorship study, although a number of articles in the area have been published in journals devoted to computer applications in the humanities. When I began the research described in this paper, my goal was to attempt to evaluate the effectiveness of a number of stylometric techniques on a suitable Shakespeare problem (one for which it might just be possible to arrive at an answer). Another goal was to pay particular attention to the well-established body of knowledge developed by editors, bibliographers, and other Shakespearean scholars.
The possible collaboration between Fletcher and Shakespeare in The Two Noble Kinsmen and Henry VIII is an excellent test for evaluating the effectiveness of stylometry. Most scholars accept that Shakespeare and Fletcher are the possible authors involved, and that most scenes are probably wholly written by one playwright or the other. There is strong linguistic and stylistic evidence supporting existing attributions that the stylometric results can be measured against. In addition, there are a relatively large number of undisputed plays by both authors that can serve as control samples by which habits of composition can be identified. Finally, there is still some disagreement about the question of collaboration among Shakespearean scholars.
This paper summarizes parts of a large research project that led to my Ph.D. dissertation. In this study, common function words were counted and studied to see if they might be useful in distinguishing Shakespeare from Fletcher (as they have been in other studies). A statistical procedure known as discriminant analysis was used with this data, and both the data and procedure were thoroughly evaluated on samples of known authorship to determine the accuracy and limitations of this method for determining authorship. Finally, the procedure was applied to the scenes in Henry VIII and The Two Noble Kinsmen that contain at least five hundred words. The results indicate that both plays are collaborations and generally support the accepted divisions. However, a number of scenes thought by many to be Fletcher's work resemble Shakespeare more closely in their use of function words.
In order to study the question of collaboration in Henry VIII (H8) and The Two Noble Kinsmen (TNK), twenty-four Shakespeare and eight Fletcher plays were selected to represent the two dramatists. Of these texts, six were set aside as a test set. These were treated as if their authorship was unknown. All procedures that I evaluated were applied to samples from these six plays, and the results were used to judge their effectiveness. The remaining plays (twenty by Shakespeare and six by Fletcher) were used to establish each author's characteristics of composition. These texts are referred to as the control set.
At the time this study began, only two editions of Shakespeare's works existed in electronic form. The Riverside edition was not available to scholars at that time, but I was granted access to the quarto and folio texts prepared by T. H. Howard-Hill for the series of concordances published by Oxford University Press. Using early editions led to some problems with regard to experimental design and computer processing. However, Schoenbaum, Greg, and others have argued that authorship studies should be based on early printed editions or manuscripts. In addition, there are few modern editions of any plays by Fletcher. Of the fourteen plays that Hoy determined to be the unaided work of Fletcher, four existed in machine-readable form. Two of these were authoritative: the 1647 F1 text of Bonduca, and the Ralph Crane manuscript version of The Humorous Lieutenant (which is entitled Demetrius and Enanthe on the manuscript). Two more existing texts had been prepared from a copy of the second Beaumont and Fletcher folio of 1679, although the text for these plays was derived from F1. These texts were used as the test set for Fletcher. Four other texts were entered into the computer, using Bowers's old-spelling editions.
The use of early authoritative editions does introduce a variety of problems, including some that influence which undisputed plays could be included in the control set. A number of principles were used to select plays not to be included in the study. First, any play with serious questions of authenticity was omitted. Second, I only selected plays that have one authoritative source text, since otherwise I would have to bring together readings from several texts to form a composite version (in other words, become an editor). This principle was violated in minor ways in a number of plays (for example, the witch scenes in Macbeth and the passages of French in Henry V were ignored; the deposition scene from F1 of Richard II was added to the quarto text; and so on). The changes I made in the electronic copies of these copy texts required no editorial judgment other than a sound knowledge of the relationship of the various early versions of the plays. Similar principles were applied in selecting plays by Fletcher.
In choosing to use stylometry to approach a problem, the researcher faces two fundamental choices: what features to count or measure in the texts, and what statistical techniques are most appropriate to evaluate the significance of this data. There are many choices for each of these issues, and differences of opinion among advocates of stylometry have led to controversies within the field. Tallentire argues that the choice of what to measure can be made in two very different ways. The first approach is more natural for a traditional literary scholar: an analyst uses statistics to provide an objective component of judgment regarding features that have been first recognized through careful reading of a text. The second approach is often used by a statistician or scientist working in stylometry: the analyst measures many features (perhaps chosen arbitrarily) in control texts and uses statistics to find those features that produce statistically significant differences. For example, in applying the first approach to the current problem, we might use statistics to evaluate the presence of forms like ye and 'em which scholars have long noted are favored by Fletcher. A study based on the second approach might examine a large number of features (such as the set of all pronouns, sentence length, the type-token ratio of vocabulary richness, and so on) to identify which features show the most promise for distinguishing the two authors.
In using the second approach, the analyst trusts the final results not because of a belief in the importance of the literary features studied but because of a rigorous validation procedure which has been carried out on control samples. The scientist may accept the results even if there is no apparent logical explanation that could explain why the chosen features should indicate authorship, date of composition, and so on. In the past, the community of humanities scholars has mistrusted this approach, but for questions of authorship it is important to find textual features that reflect a writer's conscious stylistic decisions as little as possible. Traits apparent to a careful reader would of course be open to imitation. Thus many studies of authorship, chronology, or textual integrity have focused on features such as the occurrence or position of function words, vocabulary measures, sentence length, or word length. The primary results of this study are based on the rate of occurrence of common function words. Such words were also successfully used by Mosteller and Wallace in their well-known study of Hamilton's and Madison's The Federalist papers.
This study also uses a statistical approach that differs from many techniques of stylometry that have been used. Some researchers have measured language features in a number of control texts and then computed statistics (for example, the average together with the standard deviation) for a given author. Then the same feature is measured in a disputed play (or scene, and so on). This value is compared statistically to the value that characterizes all of the control texts as a group. Two problems arise from this approach. First, we must have some way of combining the results of a number of such comparisons. The process of combining mathematical results may not be straightforward if the features being studied are not statistically independent. Second, if there is a wide variation in the value being measured, then this variation may not be reflected when one describes, for example, thirty plays of Shakespeare by one numeric value. When a researcher uses this type of approach to compare the overall proportion for a language feature in a large Shakespeare control set to the proportion in a scene from Henry VIII, the researcher is asking in effect: How different is the observed value in the disputed scene from the best estimate of what it should be?
A better question might be: What is the likelihood of observing a sample of text with these values, given the pattern of occurrence observed in a number of similar samples by a given author? This second question seems to be closer to what we really want to ask. But it implies that we should measure textual features in samples similar in size and nature to what is disputed, and that the statistical method chosen should clearly address variation within these many samples. That is, if one wants to assign a disputed scene, one should compare it to members of a large set of scenes of known authorship, not just to a summary statistic like an average value. Thus, the researcher should use a quantitative method that compares like to like. The next paragraphs describe a multivariate statistical method that is based on this strategy.
The statistical method used to analyze word rates in this study, discriminant analysis, avoids the problems noted above. (Hand's text provides the best description of discriminant analysis as it is used in this study.) First, it is truly multivariate, in that it allows us to combine measurements for a number of features (or variables) and to produce one answer, taking statistical dependence into account. Second, it uses similar samples (the control set) of known classification (i.e., authorship) to create a classifier that decides to which group or class a given sample is closest. In other words, after choosing a set of features to measure and analyze, these features are measured in a large number of samples from both the control set and the disputed set of works. Each of these samples is now described by a set or vector of values, and the question becomes this: for each disputed sample, is its vector more like those in the set of Shakespeare samples or more like those in the set of Fletcher samples? For this problem, the samples of known classification are the scenes from the plays in the control set (i.e., undisputed works by Shakespeare and Fletcher). For each scene from H8 and TNK, the statistical method determines if that scene's vector is more like the set of all the vectors from Shakespeare's control set scenes or more like those from Fletcher's. In this context the term classifier is used to reference a particular combination of statistical method and specific set of variables. The variables might be any type of textual feature measured in a scene, but in this study they are the rate of occurrence of function words. If the set of words to be analyzed is varied, then a different classifier results (although these classifiers might all be based on the same discriminant analysis technique).
The statistics of discriminant analysis are complex, but a simple example using word-rate data from Shakespeare and Fletcher may clarify the general approach. In this example, the twenty plays in the control set written by Shakespeare and the six plays written by Fletcher are divided into acts. In each act, the rate of occurrence of the word in is measured along with the rate of the word of. (Shakespeare uses these two prepositions more often than Fletcher.) We now have two values associated with each act in these plays; each sample is represented by a two-dimensional vector. We can show this with a two-dimensional plot, shown in Figure 1, in which each sample is positioned according to its value for the rate of in (the horizontal axis) and of (the vertical axis). Acts by Shakespeare are plotted as solid circles, while Fletcher acts are shown as small diamonds. (In the actual study, measurements are made by scene; this example uses acts to keep the plot simple.)
Examination of the graph shows that there is wide variation within each author for these two word rates; clearly neither author is very consistent in how often he uses these words. If the writers did use these words at a consistent rate, then the plots for each would form small, tight clusters. But despite the large amount of variation, these two common words do a remarkable job separating the set of Shakespeare acts from the set of Fletcher acts. If we consider the curve shown in the figure as the boundary dividing the two authors, then we can see that two of the thirty Fletcher acts and eight of the one hundred one Shakespeare acts fall on the wrong side of the curve and would thus be assigned to the incorrect class (author). We can view this as a misclassification rate of 10/131 = 7.6 percent. If we added samples from H8 and TNK to the graph, then we could assign authorship based on which side of the curve these samples' points fell.
This example illustrates exactly how discriminant analysis works. The method uses the samples of known classification (plotted in the figure) to calculate the curve shown in the figure. This curve serves as a decision boundary for samples of unknown classification. Imagine if three words were used; then the graph would become a three-dimensional plot in which the authors' samples would be clouds of points floating in a three-dimensional space. (If the three variables do a good job in distinguishing authorship, these clouds would not overlap much.) In general, discriminant analysis treats each sample as a point in an n-dimensional space. If one then chooses any point in this space (say, any position in the plot in Figure 1), then the method can calculate the probability that a sample with those values belongs to one class (author) or another. The ratio of these probabilities becomes the likelihood that the sample belongs to a given class (i.e., that the scene was written by a given author). A sample falling exactly on the curve shown on the plot would have a fifty-fifty chance of belonging to either playwright, according to the probabilities calculated by the statistical method. The farther a sample lies from the boundary separating the authors, the higher the likelihood that it is safely assigned to one author. The statistical method allows us to use what is known as a reject option to define a region adjacent to the boundary. Samples falling in this region are considered too close to call and are not assigned to either class. (In the results described later, a reject option was used so that samples were not classified unless the likelihood was four-to-one for one author or another.)
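The two-word example and the reject option can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the procedure used in the study: it assumes each author's samples follow a two-dimensional Gaussian distribution with its own covariance (which yields a curved boundary), and the word rates below are invented placeholders rather than measurements from the plays. The function names are likewise hypothetical.

```python
import math

# Invented (rate of "in", rate of "of") pairs per 1,000 words; NOT data from the study.
SHAKESPEARE_ACTS = [(12.0, 18.0), (13.5, 20.0), (11.0, 17.5), (14.0, 19.0), (12.5, 21.0)]
FLETCHER_ACTS = [(8.0, 11.0), (9.0, 12.5), (7.5, 10.0), (8.5, 13.0), (9.5, 11.5)]


def fit_gaussian(samples):
    """Estimate the mean vector and 2x2 covariance of a list of (x, y) samples."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def log_density(point, model):
    """Log of the bivariate Gaussian density at `point` (assumed model form)."""
    (mx, my), (sxx, sxy, syy) = model
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy ** 2
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * maha - 0.5 * math.log(det) - math.log(2 * math.pi)


def classify(point, shakespeare, fletcher, threshold=4.0):
    """Assign an author only if the likelihood ratio reaches `threshold` to one."""
    log_ratio = log_density(point, shakespeare) - log_density(point, fletcher)
    if log_ratio >= math.log(threshold):
        return "Shakespeare"
    if log_ratio <= -math.log(threshold):
        return "Fletcher"
    return "rejected"  # too close to the boundary to call


shakespeare_model = fit_gaussian(SHAKESPEARE_ACTS)
fletcher_model = fit_gaussian(FLETCHER_ACTS)
print(classify((12.6, 19.1), shakespeare_model, fletcher_model))
print(classify((8.5, 11.6), shakespeare_model, fletcher_model))
```

Raising the threshold widens the reject region around the boundary, so fewer samples are assigned but those that are assigned lie farther from the curve.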
The skeptical reader may have several questions at this point. First, why should this statistical method be accurate in assigning a small sample from a possibly collaborative play to Shakespeare or Fletcher? There is no reason to think so unless one has a method for evaluating how successful this approach can be. This evaluation should be carried out before applying the classifiers to the disputed plays. For the example above we began such an assessment when we noted that the method failed for ten of the 131 acts in the control samples that were used to create the graph and make the calculations. Recall also that a set of plays was kept apart from the control samples; this test set is made up of two more Fletcher plays and four more Shakespeare texts. Acts from these plays could be assigned to one author or the other by the method, and the misclassification rate evaluated. These test set plays are not used in any other way to help establish a classifier, so they provide a completely independent test. This approach is used to quantify the accuracy of the classifier (i.e., the method together with the chosen variables) used in the final analysis. Our confidence in the likelihoods produced by the method is based solely on how well the method does in classifying all the samples in both the control and test sets (which are all of known authorship).
Another question might be: How does one decide which variables to use with the method? It is important to note that the statistical method discriminant analysis is really just a technique for assessing information collected from the plays, and thus it will not work well if the data measured in the samples does not really differ enough between the authors. The goal of this study was to try to find the best language features for discriminating between these two authors. Initially several simple statistical techniques were used to identify a large set of possible features that might suffice. Afterwards the discriminant analysis method itself was used with various smaller subsets of these features to find a small set of variables that are best for classification by author. Choices are based on which subsets produce the smallest misclassification rates on the control samples and/or the test samples. This process is known as feature selection. There are a number of statistical methods for carrying out feature selection (all beyond the scope of this paper). Each involves trying various combinations of variables in an effort to improve the accuracy of the classifier when applied to the samples of known classification.
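The idea of trying subsets of variables and keeping the one with the lowest misclassification rate can be sketched as follows. This is a hedged illustration only: it uses an exhaustive search and a simple nearest-centroid rule as a stand-in for discriminant analysis, neither of which is claimed to be the method used in the study, and the sample vectors are invented.

```python
from itertools import combinations


def centroid(vectors, cols):
    """Mean of the chosen columns over a group of sample vectors."""
    return [sum(v[c] for v in vectors) / len(vectors) for c in cols]


def distance2(vector, centre, cols):
    """Squared Euclidean distance restricted to the chosen columns."""
    return sum((vector[c] - m) ** 2 for c, m in zip(cols, centre))


def error_rate(groups, cols):
    """Misclassification rate on samples of known authorship (stand-in classifier)."""
    centres = {label: centroid(vectors, cols) for label, vectors in groups.items()}
    errors = total = 0
    for label, vectors in groups.items():
        for v in vectors:
            guess = min(centres, key=lambda k: distance2(v, centres[k], cols))
            errors += guess != label
            total += 1
    return errors / total


def best_subset(groups, n_features, size):
    """Try every subset of `size` variables and keep the one with the fewest errors."""
    return min(combinations(range(n_features), size),
               key=lambda cols: error_rate(groups, cols))


# Invented word-rate vectors (three hypothetical variables per sample).
groups = {
    "Shakespeare": [(10, 5, 3), (11, 1, 4), (12, 9, 3)],
    "Fletcher": [(2, 5, 3), (3, 1, 4), (1, 9, 5)],
}
print(best_subset(groups, n_features=3, size=1))
```

With realistic numbers of candidate words an exhaustive search quickly becomes expensive, which is one reason stepwise selection procedures exist.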
No matter what statistical method is used for a stylometric study, we must collect measurements from the texts themselves. Identifying and counting function words might seem to be a straightforward task, but homonyms, spelling variants and contracted forms in old-spelling dramatic texts present problems when using a computer. Many common function words have several variant spellings (e.g., been can be spelled beene, bene, bin, and so on). Other forms can represent a number of lexical forms; for example, besides the indefinite article, the single-letter word a can mean he, of, on, ah, and so on. Some forms of compound contractions involving function words are frequent (e.g., let's, o'th', 'tis, and the many contracted forms of is like it's, Caesar's, he's).
These problems were solved by developing a system to produce a revised version of the computer file containing each play in which all forms of common words can be recognized automatically. This system has three components. First, a set of special codes is added to texts' computer files to identify and distinguish homonyms and variants. For example, a#1 represents occurrences of he, a#3 represents on, and so on. Occurrences of beene, bene, and so on will later be automatically replaced by the standard form, but occurrences of bin meaning a container are marked bin#1 to avoid mistaking this as a form of the verb. Second, replacement and expansion lists give a "normalized" form for all of these forms involving common function words. For example, these lists include an entry showing that a#1 should be replaced by he in the new version of the computer text. These lists also give the expanded forms of many compound contractions, indicating for example that 'tis should be replaced by it is. Lists of compound contractions and their full forms were compiled from experience and with help from Partridge's book Orthography in Shakespeare and Elizabethan Drama. However, a simple replacement strategy is not powerful enough to handle apostrophe-s and -t forms. Special codes (similar to those shown above for homonyms, and so on) were inserted into the text files to distinguish apostrophe-s contractions involving is, us, his, and so on from possessive forms, and contractions of it from forms like banish't.
The third component of this system is a program named REPLACE that uses this system of codes and these replacement/expansion lists to create a "normalized and expanded" version of the computer file containing the text of a play. These versions contain no compound contractions, and all forms of common function words can be easily identified using a computer program. Once such versions were prepared for each play by both authors, we could determine the extent of each author's use of compound contractions. In almost every case, Fletcher uses more of these forms than Shakespeare, although the latter uses more contractions as his career progresses. Because of this secular change and the possibility of alterations introduced by scribes, compositors, or revisors, the expanded versions of the plays were used in the remainder of the study. Thus any subsequent mention of samples or scenes of, say, 1,000 words in this paper reflects the count after the expansion of compound contractions. This process revealed some interesting results based on differences in Shakespeare and Fletcher's rate of use of compound contractions. In particular, Fletcher's rate of use of compound contractions involving is is a great deal higher than Shakespeare's. While this finding might be useful in the context of this authorship study, such contractions are features that might easily have been modified by a scribe or compositor. Therefore this feature is not used in any of the statistical analyses described in this paper. Nevertheless, I cannot resist noting that H8 4.2, usually attributed to Fletcher, has an extremely low rate for contractions of is that is unparalleled in the scenes of the eight undisputed Fletcher plays examined in this study.
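The normalization and expansion step can be illustrated with a small sketch. The original REPLACE program is not described in enough detail to reproduce, so the code below is a hypothetical approximation: the two lists contain only the entries mentioned in this paper, the tokenizer is deliberately crude, and the choice to leave a coded form such as bin#1 untouched is an assumption.

```python
# Hypothetical excerpts of the replacement and expansion lists (entries taken
# from the examples in the text; the real lists were far longer).
REPLACEMENTS = {
    "a#1": "he",      # coded homonym: "a" meaning "he"
    "a#3": "on",      # coded homonym: "a" meaning "on"
    "beene": "been",  # spelling variants of the verb
    "bene": "been",
    "bin": "been",    # uncoded "bin" is the verb; the container is coded bin#1
}
EXPANSIONS = {
    "'tis": "it is",  # compound contraction expanded to its full form
}


def normalize(text):
    """Return the normalized, expanded word list for a stretch of coded text."""
    words = []
    for raw in text.split():
        # Crude tokenizer (an assumption): strip trailing punctuation, keep apostrophes.
        token = raw.strip(".,;:!?").lower()
        if not token:
            continue
        token = REPLACEMENTS.get(token, token)
        # An expansion may yield more than one word, which changes the word count.
        words.extend(EXPANSIONS.get(token, token).split())
    return words


print(normalize("'Tis true a#1 hath beene here."))
print(len(normalize("'Tis true a#1 hath beene here.")))
```

Counting words after expansion, as the last line does, corresponds to the convention stated above that sample sizes reflect the expanded text.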
Before beginning the discriminant analysis, I searched for a set of words that seemed to have the most potential as authorship markers: words that are used more often by one author than by the other. Starting with several hundred common words, several procedures were initially used to narrow these down to a smaller set of good markers. These procedures included the use of the well-known distinctiveness ratio (DR) and statistical t-test. Some words that exhibited relatively large degrees of variation in Shakespeare's works according to genre or date of composition were then eliminated. Finally, as noted above, the discriminant analysis technique itself was used as part of the feature selection technique to further reduce the number of words to be used for the final analysis.
The distinctiveness ratio is calculated for a given word by dividing the overall rate of occurrence for one author by the other author's rate. The DR values did identify some words that turned out to be good markers (such as dare, too, and which). It also found the forms that Shakespeare and Fletcher have been known to use differently (such as ye, hath, has, them, 'em). However, a large DR only occurs for words that one author uses very infrequently, and this method does not identify more frequent words that both authors use but at different rates. For example, the rate for the is very different in the two writers; Shakespeare's rate is 32.6 per 1,000 word occurrences, while Fletcher's is 23.7. The DR is 1.38, not a large value. Another criticism of DRs is that they do not take within-author variance into account at all. The t-test is a calculation normally used by statisticians to determine if the mean values calculated from two separate sets of samples are statistically equivalent; it does take within-sample variance into account. My results show that this test is probably more useful than the distinctiveness ratio test for identifying good markers.
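The two screening statistics can be computed as in the sketch below. The overall rates for the are those reported above; the per-scene rates passed to the t statistic are invented for illustration. The paper does not say which form of the t-test was used, so the unequal-variance (Welch) form shown here is an assumption.

```python
import math
import statistics


def distinctiveness_ratio(rate_a, rate_b):
    """Overall rate for one author divided by the other author's rate."""
    return rate_a / rate_b


def welch_t(sample_a, sample_b):
    """Welch's t statistic; unlike the DR, it reflects within-author variance."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Rates for "the" per 1,000 word occurrences, as reported in the text.
print(round(distinctiveness_ratio(32.6, 23.7), 2))  # 1.38

# Invented per-scene rates, used only to show the calculation.
shakespeare_rates = [30.0, 34.0, 33.0, 35.0]
fletcher_rates = [22.0, 25.0, 24.0, 23.0]
print(welch_t(shakespeare_rates, fletcher_rates))
```

A word such as the has a modest ratio but can still produce a large t value when each author's scene-by-scene rates vary little, which is the situation the ratio alone fails to detect.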
Perhaps these markers occur at different rates for reasons other than authorship. A statistical procedure called analysis of variance was used to test for variation in these word rates that might correspond to date of composition or genre. This was only done for acts of the twenty Shakespeare control-set plays. Many of the words showed significant variation, including some of the best markers identified in the previous steps. After some discussion with a statistician, I decided to eliminate any of the markers for which the average rate for one of the subgroups (i.e., comedies or early plays) did not differ significantly (based on the t-test) from the overall Fletcher rate. Thus this kind of variation among Shakespeare's works did not eliminate a marker from consideration if the extreme values were still different enough from the Fletcher rate.
Some of the variations within Shakespeare's works might raise a linguist's curiosity. Shakespeare's comedies are characterized by high rates for pronouns and a, together with low rates for the. Tragedies have low rates for a. The histories have very low rates for personal pronouns (as noted by Brainerd) and high rates for in, of and and. In occurs infrequently in the romances, while so is much more frequent in this genre than the other three. The late plays have a high rate for the.
There were a number of words that are used at different rates by the two dramatists but did not occur frequently enough to be used as individual variables. However, one way of possibly using them would be to take all of the Shakespeare markers and count them together, then do the same for the Fletcher markers. This produces two new variables, which I call "infrequent marker pooled sets." There are some statistical reasons why this approach is not completely sound. But I decided to see how this might work, hoping that using these two variables together with the other individual word markers might minimize any problems due to statistical dependence. These two new variables turned out to be the best individual discriminators. But because of their somewhat questionable nature, the final discriminant analysis was repeated twice: once using the infrequent marker pooled sets and once without.
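A pooled variable of this kind is simply the combined rate of a set of words. The sketch below shows the calculation; the marker set and the sample text are invented placeholders, since the membership of the two pooled sets is not listed here.

```python
def pooled_rate(tokens, markers):
    """Combined occurrences of all marker words per 1,000 word occurrences."""
    hits = sum(1 for token in tokens if token in markers)
    return 1000.0 * hits / len(tokens)


# Hypothetical marker set and sample; neither is taken from the study.
example_markers = {"ye", "dare"}
example_tokens = "ye dare not go and ye know it".split()
print(pooled_rate(example_tokens, example_markers))  # 375.0
```

Because the pooled words are counted together, one frequent member can dominate the total, which is related to the statistical reservations mentioned above.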
The process of evaluating marker words led to the discovery of a trait which, although it cannot be used in the discriminant analysis method, might provide additional evidence. When looking for spelling variants when preparing the replacement and expansion lists described earlier, I noticed that Shakespeare uses more words beginning with there- or where- (e.g., therefore, therein, wheresoever) than Fletcher. In fact, only eight such "there/where- compounds" (as I call them) occur in all eight Fletcher plays I studied, whereas Shakespeare uses twenty-seven in the twenty control-set plays. When Fletcher does use such forms, he uses them much less frequently; the combined rate is 1.53 per 1,000 word occurrences for Shakespeare but only 0.13 for Fletcher. This yields a distinctiveness ratio of 11.8, which is as high a value as that of the well-known marker hath.
These there/where- compounds do not occur very often, and some occurrences in Fletcher and the disputed plays occur in stock phrases (such as "and thereby hangs a tale") or in a song (that may have been borrowed). Because of their very low frequency they were not included in the formal statistical analysis. However, two scenes in H8 and TNK that are usually attributed to Fletcher contain what I feel are significant occurrences of these words. H8 1.3 contains two (wherewithall and thereunto); this would be an extremely high rate for Fletcher. TNK 4.3 contains one occurrence of thereto; this scene was attributed to Shakespeare by some early scholars. This is the first of the mad scenes of the jailor's daughter, with its echoes of Lear and some other Shakespeare plays. To anticipate the results of the final discriminant analysis, the use of function words in these two scenes is much more like Shakespeare than Fletcher.
After identifying good markers using the DR and t-test and then eliminating those that varied too much according to genre or date of composition, sixteen markers remained (fourteen individual words and two infrequent marker pooled sets). These seemed to have the best chance of discriminating between Shakespeare and Fletcher. Above are some simple statistics for the fourteen individual words used in the final analysis. The mean (or average) and the standard deviation (a measure of variability among the samples) are listed, each calculated using control-set scenes containing at least 1,000 words.
The discriminant analysis method is known to work better with a smaller number of variables, so the feature selection techniques (described above) were used to select the subset of variables that produced the best results in classifying the control set itself. I used two different methods to do this, which produced two slightly different sets of words. I then repeated the process, including the infrequent marker pooled set variables; one of the resulting subsets of variables misclassified more samples of known authorship than the other three, so I dropped it. Thus I had three sets of word-rate variables to use in the final analysis, each containing seven or eight words from the fourteen words listed above plus the two pooled sets of infrequent markers.
To determine how short a sample of text could be accurately classified using this method and these variables, I applied the classification method to scenes of different length from both the control and test sets (of known authorship) and looked at the misclassification rates. I first tried all scenes of 1,000 or more words, then 750 or more words, and finally 300 words. Not surprisingly, the method's performance decreases as smaller samples are included. A short scene has fewer occurrences of the markers and therefore less information that can be used by the statistical procedure. The five-hundred-word limit appeared to be a good choice; the accuracy was acceptable in my opinion, and most of the scenes in H8 and TNK are at least this length.
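[Table: classification results for scenes of known authorship of five hundred words or more. Columns: Correct, Incorrect, Rejected.]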
As noted above, I chose to use a reject option that allows samples to remain unassigned if the evidence is not strong. I used a rather strict threshold in comparison to most other statistical applications of discriminant analysis: any likelihood ratio less than four-to-one was "rejected." In addition, because I ended up with three sets of word-rate variables (and thus three distinct classifiers), I needed a decision rule to resolve conflicting or ambiguous results from these three different sets. After examining the performance results on the control and test sets, I classified scenes even where the probability for one of the three classifiers was less than the reject threshold as long as all three results were consistently in favor of one author.
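One way to express a decision rule of this kind in code is sketched below. It is an interpretation rather than the exact rule used in the study: it assumes that each classifier reports the author it favours together with a likelihood ratio, that a scene is assigned when all three agree and at most one ratio falls below the four-to-one threshold, and that strong but conflicting results are flagged separately. The labels "ambiguous" and "unassigned" are hypothetical names.

```python
def combine(results, threshold=4.0):
    """Combine (favoured_author, likelihood_ratio) pairs from three classifiers."""
    strong = {author for author, ratio in results if ratio >= threshold}
    if len(strong) > 1:
        return "ambiguous"  # strong evidence pointing to different authors
    favoured = {author for author, _ in results}
    weak = sum(1 for _, ratio in results if ratio < threshold)
    if len(favoured) == 1 and weak <= 1:
        return favoured.pop()  # consistent verdict, at most one weak result
    return "unassigned"


print(combine([("Shakespeare", 9.0), ("Shakespeare", 5.0), ("Shakespeare", 2.5)]))
print(combine([("Shakespeare", 9.0), ("Fletcher", 6.0), ("Shakespeare", 5.0)]))
print(combine([("Fletcher", 2.0), ("Fletcher", 3.0), ("Fletcher", 1.5)]))
```

Keeping the conflicting case distinct from the unassigned case preserves information that a simple majority vote would discard.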
We can now evaluate how well discriminant analysis performs on the scenes of known authorship in the control and test sets. If the word-rate information and the statistical method are not successful when classifying scenes for which we know the answer, there is no point applying the procedure to the disputed plays. The table above shows, for scenes of five hundred words or more, the number correctly assigned, the number incorrectly assigned, and the number that could not be assigned because of the reject option. These results demonstrate very acceptable performance on relatively small samples of known authorship. Almost 95 percent of the five-hundred-word scenes were assigned correctly, and an outright error was made for less than one-half of one percent of the scenes.
Two other important tests were performed before analyzing H8 and TNK. First, I took the six plays of the test set and extracted samples that were composed of the speeches of the individual characters (those who speak at least five hundred words). When the three classifiers were applied to these sixty-two samples, the misclassification rate increased only slightly. Although this is only a limited test, it suggests that characterization does not affect the use of these variables with this procedure to any great degree (at least for the purpose of distinguishing Fletcher from Shakespeare).
Second, to see how the classification procedure handled scenes of joint composition, I took text of known authorship (from the test set) and created twenty samples of about eight hundred fifty words that were roughly half Fletcher and half Shakespeare. These results were not so encouraging. Half of the scenes were assigned to one author or the other (using the same rules which produced the error rates described in the preceding paragraph). Twenty-five percent were left unclassified, and the other 25 percent produced ambiguous results: at least two of the classifiers produced high probabilities one way or the other, but these disagreed regarding which author wrote the scene. This interesting result only occurred for 3.4 percent of the test-set scenes analyzed earlier that were written wholly by one author. It appears that scenes of joint composition may be assigned to one author or the other by the method developed in this study. On the other hand, strong but conflicting results for a scene may well indicate joint composition.
Finally, the statistical method was applied to function word data collected from the possibly collaborative plays. The final procedure was applied to the scenes in H8 and TNK that contained at least five hundred words. As indicated in the introduction of this paper, the results of this function word analysis indicate that both plays are collaborations. Several scenes thought by many to represent Fletcher's work resemble Shakespeare more closely in their use of function words. Figures 2 and 3 show these results together with the length of each scene and the attribution by other authorities. The statistical results given for each scene of five hundred words or more include the probability and authorship attribution calculated using each of the three sets of markers chosen with feature selection. The last column indicates the verdict as determined by the decision rule described earlier. The strength of the result (based on the relative degree of the calculated probabilities) is indicated by preceding the verdict with "like" or "very like." Scenes that could not be clearly assigned by the decision rule are indicated by a question mark.
A detailed analysis of each scene is beyond the scope of this paper, but here are some of the more interesting results. Of the scenes that are classified as Shakespeare's, perhaps TNK 2.3 is the least exciting. No one else has ever wanted to claim this for him. Perhaps Shakespeare really did write it, or perhaps it is one of the errors we expect (with low probability) from the method. Or perhaps someone else wrote it. Discriminant analysis only tells us which of the known candidates is most likely, so it is not very useful for situations where one does not know all the possible classes.
The other interesting result in TNK is 4.3. This analysis concludes that the evidence for Shakespeare's authorship of this scene is strong. The function word results show that this scene is as much like Shakespeare's other scenes as the generally accepted portions of the first and last acts. The linguistic results described by Hoy do not really support Fletcher's claim here. This scene has one occurrence of has compared to two occurrences of hath, and no occurrences of ye but ten occurrences of you. Only nine of 106 of the Fletcher scenes in the control set contain two or more occurrences of hath, and there are no scenes with ten or more occurrences of you with no occurrences of ye. Also, this scene contains one of the there/where- compounds, thereto.
In H8, the results of the discriminant analysis are perhaps less clear than in TNK. Almost all of the scenes that are generally accepted as Shakespeare's (1.1-2, 2.4, 3.2a, and 5.1) have function word rates that are very unlike scenes by Fletcher. In the last thirty years, some scholars (e.g., Foakes and Hoy) have suggested that Fletcher touched up Shakespeare's work in 2.1-2, 3.2b, and 4.1-2. My results may support this theory of revision, since 2.2 and 3.2b have the "strong but disagreeing" results that characterized 25 percent of the "joint composition" scenes put together to test the method. The results for act 4 are more interesting: in both scenes the use of function words is very unlike Fletcher. Also, as noted earlier, the rate of contraction of is in 4.2 is lower than for any other scene by Fletcher examined in this study. Thus my results suggest that Fletcher had little or nothing to do with this act. The authorship of these scenes has been controversial; my results indicate that Katherine's final speeches cannot be credited to Shakespeare's collaborator, as some have argued. If this raises difficulties in the interpretation of the play's structure, then these problems cannot be explained away as Fletcher's misunderstanding of Shakespeare's intentions.
Two scenes in H8 that are usually assigned to Fletcher alone are much more like Shakespeare in terms of function word occurrences. The first is 5.4, the prose scene involving the porter and his man. Here the word rates are unusual for either author, but the scene is far more unlike Fletcher than Shakespeare. Although Fletcher hardly ever wrote prose, the scene has been assigned to him by other scholars because it contains thirteen occurrences of 'em to none of them, and eight occurrences of ye to fifteen of you. These rates for ye and 'em in one scene are unparalleled in Shakespeare's known work. Assuming the validity of the function word results, a very interesting question arises regarding the copy text: why does a scene by Shakespeare contain linguistic forms that he does not normally use? Revision by Fletcher is one possible explanation.
The second of these results in H8 is 1.3, which Mincoff calls "the most unmistakably Fletcherian scene in the whole play." The function word evidence here is not as strong as in other scenes assigned to Shakespeare, but it does indicate that the general pattern of use is more like him than Fletcher. Also, the scene contains two occurrences of there/where- compounds (wherewithall and thereunto). These two occurrences in a short scene are unlike any other Fletcher scene in either the control or test sets, and cannot be traced to the sources behind this scene. On the other hand, this scene contains some of the more convincing examples of Fletcher's stylistic traits, including proportions of 'em and ye that are not paralleled in any scenes by Shakespeare in the control set. Again, revision appears to be a possible explanation for the presence of these two types of contradictory evidence.
Inspection of Figure 3 shows that in H8 the three classifiers left many of the scenes usually assigned to Fletcher unclassified and did not assign many scenes to Fletcher with high probability. It appears that the method may not recognize Fletcher samples with the same certainty as Shakespeare. This may be due to the larger number of Shakespeare samples used to create the classifiers (twenty plays compared to six), or because the most frequent of the markers used are all Shakespeare "plus" words. I do not believe this casts doubt on the assignments of scenes to Shakespeare; the classification results on scenes of known authorship presented earlier are the best indicator of the procedure's accuracy and limitations.
Now that the results of applying this method to the two plays have been presented, a number of advantages of this study of function words should be noted. One of the strengths of my study is the amount of text examined. The classifiers are based on samples comprising more than 400,000 words of Shakespeare and almost 90,000 words of Fletcher. (For comparison, Mosteller and Wallace used 94,000 words of Hamilton and 114,000 words of Madison in their main study.) Naturally, there are more questions about the integrity of these plays than for the texts in some authorship problems, but the plays I selected were (for the most part) free from any serious textual problems.
The marker words examined are relatively frequent. Of the sixteen variables initially chosen, twelve were used in at least one of the three classifiers' sets of markers. These twelve represent about 10 percent of the total word occurrences in both authors. Further work needs to be done in the area of recognizing the effects of context. I relied on the discriminant analysis feature selection process to eliminate any word with serious (or frequent) context dependencies. However, I can give examples where rates for words like the and of are seriously affected by the subject matter. (In my complete description of this study the analysis of the results for the disputed plays was accompanied by a close examination of the word occurrences that led to the results.)
Before concluding, it is useful to reexamine the basic approach of this study. The first stages of the analysis of function words were an attempt to answer these questions: Is there a difference in the rate of occurrence of function words in these plays that corresponds to a difference in authorship? If so, how close is this correspondence, and how often does it lead to apparently incorrect decisions about undisputed samples? After finding useful answers to these questions, the next step was the application of the tests to the scenes in the two disputed works.
But how should one interpret the results? When the discriminant analysis method is applied to the word rates from a disputed scene, the "verdict" should perhaps be formulated along these lines:
The rates for these words in this scene are more similar to the rates found in undisputed scenes written by Author A than those scenes written by Author B.
The only grounds one has for making the jump to the statement:
Author A wrote this scene
are the results for samples of known authorship that showed that such a conclusion was correct for 96.5 percent of the 365 scenes in the control set and for 85.2 percent of the scenes in the test set (94.8 percent overall). The probabilities produced by the discriminant analysis procedures also provide some indications of the certainty of the decision (and a means of recognizing scenes that the method should leave unassigned). This is important, for as Hoy notes, "With linguistic evidence it is all, finally, a matter of more or less." In the scene-by-scene analysis of TNK and H8, one must recognize that the evidence for some scenes was much stronger than for others.
This study has shown that common function words can be studied as internal evidence of authorship, and one should consider how this evidence compares to other forms of internal evidence. This becomes especially important when the results of an analysis of function words do not agree with results based on more traditional forms of evidence. One reason that function words might be a reliable form of internal evidence is that, because of their frequency and lack of prominence, they are presumably less likely to be altered by scribes, printers, editors, or revisers. For the procedures used in this study, this will only be true if alterations in the counts (due to corruption or revision) do not affect the word rates enough to drastically change the probabilities of authorship calculated using discriminant analysis. It should be noted that one or two insertions or deletions of an infrequent marker such as dare can produce a large change in the word rate, especially in short scenes. Classification results that appear to be strongly affected by a rate for one infrequent marker are thus less trustworthy than a result due to one or more frequent markers such as all, the, of, and in. (Again, in the complete description of the study, I examine how the results for some scenes would be affected if small changes were to be made to the counts for one or two words in that scene.)
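The sensitivity of an infrequent marker's rate to a single insertion can be seen in a short arithmetic sketch. The counts below are invented for illustration and are not taken from the study, and expressing rates per 1,000 words is an assumption about units rather than a detail given in the text.

```python
def rate_per_thousand(count: int, scene_words: int) -> float:
    """Occurrences of a marker per 1,000 words of text."""
    return 1000.0 * count / scene_words


def relative_change(old_count: int, new_count: int, scene_words: int) -> float:
    """Fractional change in the rate when the count changes."""
    old = rate_per_thousand(old_count, scene_words)
    new = rate_per_thousand(new_count, scene_words)
    return (new - old) / old


# Hypothetical counts in a 500-word scene (not figures from the study).
print(rate_per_thousand(1, 500), rate_per_thousand(2, 500))  # infrequent marker
print(relative_change(1, 2, 500))    # one extra occurrence doubles the rate
print(relative_change(15, 16, 500))  # a frequent marker moves far less
```

A marker that occurs once in a short scene has its rate doubled by one added occurrence, while a marker that already occurs fifteen times shifts by only a few percent, which is why results driven by a single infrequent marker deserve more caution.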
My result for Henry VIII 5.4 probably represents the most serious disagreement between the analysis of function words and other examinations of linguistic evidence. The marker word rates in this scene are much more like those in Shakespeare's control set samples than in Fletcher's, but one cannot ignore those occurrences of ye and 'em. If one accepts that my result indicates Shakespearean authorship, one must then explain these very Fletcher-like proportions for the pronoun forms. Either Shakespeare was capable of breaking from his normal practice, or these forms were introduced into the copy by a scribe or by Fletcher as a reviser. Both of these explanations require a serious departure from the assumptions made by most textual scholars.
The nature of the copy text used in printing the play in the 1623 folio is central to reconciling these conflicting results. Most would agree that Henry VIII presents a more complex problem than does The Two Noble Kinsmen. Hoy has maintained (and my results support his conclusions) that Fletcher revised some scenes in this play that Shakespeare wrote. Their hands do not appear to be so closely intermingled in TNK. Neither the goals nor the methods of this study are intended to address new hypotheses regarding the play's copy text. However, the fact that two different forms of linguistic evidence could lead to opposite conclusions if considered on their own suggests that scholars should reevaluate the relation between the copy text of Henry VIII and the various internal evidence that it contains.
Such a reexamination would be unnecessary if the conclusions derived from a statistical analysis of function words could be easily dismissed. However, the function word data and the statistical method have been shown to be effective through a rigorous analysis of twenty-four plays by Shakespeare and eight by Fletcher. When applied to scenes of at least five hundred words in these plays, the method assigned almost 95 percent of them to the correct author. Applying this method to TNK and H8 confirms that both plays are collaborations. While the results agree with the generally accepted division between the two playwrights, a number of scenes considered by some to be Fletcher's work have a function word vocabulary much more like Shakespeare's. During the study other useful linguistic evidence was identified: the rate of contraction of is, and the use of there/where- compounds. In several instances, this evidence supports the function word results.
Figure 2. Discriminant analysis results for The Two Noble Kinsmen.
++ Except for the first 33 lines (276 words).
Figure 3. Discriminant analysis results for Henry VIII.
* Attribution of each scene according to Hoy.
Figure 1. Plot of the rate of occurrence of in against of measured in acts.
1. Thomas B. Horton, The Effectiveness of the Stylometry of Function Words in Discriminating between Shakespeare and Fletcher. Ph.D. diss., University of Edinburgh, 1987.
2. S. Schoenbaum, Internal Evidence and Elizabethan Dramatic Authorship: An Essay in Literary History and Method (London: Edward Arnold, 1966).
3. David Tallentire, "Confirming Intuitions about Style, Using Concordances," in The Computer in Literary and Linguistic Studies, ed. Alan Jones and R. F. Churchhouse, 309-28 (Cardiff: University of Wales Press, 1976).
4. Frederick Mosteller and David L. Wallace, Applied Bayesian and Classical Inference: The Case of the Federalist Papers (New York: Springer-Verlag, 1984); 2d ed. of Inference and Disputed Authorship: The Federalist (Reading, Mass.: Addison-Wesley, 1964).
5. D. J. Hand, Discrimination and Classification (Chichester: John Wiley & Sons, 1981).
6. A. C. Partridge, Orthography in Shakespeare and Elizabethan Drama: A Study of Colloquial Contractions, Elision, Prosody and Punctuation (London: Edward Arnold, 1964).
7. Barron Brainerd, "Pronouns and Genre in Shakespeare's Drama," in Computers and the Humanities 13 (1979): 3-16.
8. John Fletcher and William Shakespeare, The Two Noble Kinsmen, in Regents Renaissance Drama Series, ed. G. R. Proudfoot (London: Edward Arnold, 1970).
9. Cyrus Hoy, "The Shares of Fletcher and his Collaborators in the Beaumont and Fletcher Canon (VII)," in Studies in Bibliography 15 (1962): 71-90.
10. Ibid., 70.
11. Ibid., 89.
12. William Shakespeare, King Henry VIII, in The Arden Shakespeare, 3d ed., ed. R. A. Foakes (London: Methuen, 1957).
13. Hoy, "Shares," 79-81.
14. Marco Mincoff, "Henry VIII and Fletcher," Shakespeare Quarterly 12 (1961): 239-60.
15. Mosteller and Wallace, Applied Bayesian.
16. Horton, Effectiveness, 298-307; 316-29.
By THOMAS B HORTON