Language corpora are becoming available cheaply, sometimes free. The likely impact on language teaching will be profound--indeed the whole shape of linguistics may alter at speed. (Sinclair, 1997, p. 38)
One of the leading figures in corpus linguistics working with machine-readable collections, Leech (1997a), defined a corpus as "a body of language material which exists in electronic form, and which may be processed by computer for various purposes such as linguistic research and language engineering" (p. 1). The theoretical underpinnings, the technical development, and the study of such corpora have gained considerable ground in the past decades, signaling a trend away from decontextualized linguistics toward a study of language that takes account of context and is based on what is often referred to as "real" language. This chapter will review the growing literature that documents this enterprise.
The chapter is divided into six sections. The first will offer a discussion of the theoretical issue of performance versus competence, focusing on the contrasting views of Chomskyan generative linguistics and corpus-based linguistic analysis (2.1). Section 2.2 will give a brief historical overview of major corpora and clarify the types that have been established recently. The next section (2.3) will identify issues of representative design and some technical details of corpus development. Section 2.4 will further narrow the scope by identifying the link between computer-assisted language learning and data-driven learning. Section 2.5 will review work in the field of learner corpus linguistics, centering on the International Corpus of Learner English project. Finally, I will identify the benefits of applying corpora in language studies in section 2.6.
The concepts, definitions, and processes reviewed in this chapter will be central to the presentation of writing pedagogy at Janus Pannonius University and to the description and analysis of the JPU Corpus.
In the field, two competing traditions have emerged: 'asocial' linguistics that incorporates intuition to capture generic features and universals of language and of particular languages, and 'social' linguistics that investigates generic and language-specific notions based on observations of utterances (Wardhaugh, 1995, pp. 10-12). In the former paradigm, linguistic inquiry springs from a need to establish sufficient elements that can adequately describe the grammar of language; the latter engages the actual language community (or population) and extracts from it a corpus that is then used to test hypotheses. This section will offer a brief evaluation of these two traditions.
The most influential theoretical linguist of the twentieth century is Noam Chomsky, whose generative grammar is embedded in the asocial tradition. In defining linguistics as the study of grammar, he developed a set of strict principles and operators that language employs in generating all possible and grammatical utterances (Chomsky, 1957; 1965). The main focus, then, is on what is possible. This represents one of the main differences between the asocial and the social paradigms. In socially embedded linguistics, it is not only what is possible that is studied, but also what is probable. According to Kennedy (1998), this does not mean that linguistic theory does not benefit or is not "compatible with" (p. 8) the study of a corpus. On the contrary: as science requires evidence with which to refute or support a hypothesis, corpus linguistics provides a rich set of such evidence that allows for generalization.
As Fillmore (1992) noted, the two types of linguist should ideally "exist in the same body" (p. 35). Contrasting the images and concerns of what he called the "armchair linguist" with those of the corpus linguist, Fillmore pointed out that no corpus will ever offer all the evidence linguistics needs, but also that corpora have allowed linguistic scholarship to establish new facts about language, facts that one "couldn't imagine finding out about in any other way" (Fillmore, 1992, p. 35). But he also called attention to the importance of introspection and analysis by a native-speaker linguist. Biber (1996) likewise suggested that both generative linguistics and variation studies, which examine linguistic performance as derived from corresponding aspects of linguistic competence, represent valid positions.
The call for a combination of the two approaches is based on the assumption that native speakers are competent decision-makers on issues of syntax. While the claim may be a perfectly valid one, I would like to raise an issue related to the theoretical limitations of the basis of linguistic inquiry. Just as no corpus can ever fully represent the language performance of a community (see, for example, Partington, 1996, p. 146), so, too, are introspective linguists limited in their competence (Labov, 1996). This adds further support to the claim that theoretical linguistics and corpus linguistics can and should co-exist.
Such co-existence occurs in a social context. The notion of context (or setting) in which language competences materialize (Hymes, 1974), as well as its central importance, was further highlighted by Sinclair (1991), who claimed that because introspective linguists do not, as a rule, require a discourse context for their own examples, the naturalness of the evidence suffers. Defining naturalness as a choice of language that is appropriate to the context, Sinclair observed that because of the difficulty of simulating context, such examples are often unlikely "ever to occur in speech or writing" (1991, p. 6). This is why, he went on to argue, linguistics should be careful not to misrepresent what it aims to describe. In other words, what may be authentic (in that system, possible) to the individual linguist in a particular context for supporting a particular claim may not be authentic (in that system, probable) to the language community.
In terms of language education, corpus linguistics has helped direct attention to what constitutes authenticity of material, learning experience, and classroom language, key factors determining the relevance of learning, especially in the communicative language teaching tradition. Direct results of this approach are data-driven learning and the development of learner corpora (discussed in detail in 2.4 and 2.5). One of the proponents of this approach, Johns (1991a), posited that learning, especially at advanced levels, can greatly benefit from assisted and direct manipulation of corpus data. He argued against the stance held by such figures of applied linguistics as Widdowson (1979; 1991), who placed the emphasis not on the authenticity of the material but on that of the learning experience, arguing for the use of simplified texts to ensure authenticity and comprehensibility for the learner at the same time. As a consequence, Widdowson cast doubt on the relevance of corpus findings to the process of teaching and learning foreign languages (1991). Calling attention to the principle of pedagogic relevance, he made the following point:
Language prescription for the inducement of learning cannot be based on a database. They cannot be modelled on the description of externalised language, the frequency profiles of text analysis. Such analysis provides us with facts...but they do not of themselves carry any guarantee of pedagogic relevance. (1991, pp. 20-21)

As opposed to Widdowson, Johns (1991a) argued that authentic and unmodified language samples were essential in language learning. Widdowson (1979, 1991) focused on the learners' need to exploit materials that represented authenticity of purpose and were within their grasp. In Johns's argument, the requirement of no modification is central: for learning material to represent full authenticity, the original purpose and audience should not be altered. Schmied (1996) took a middle position whereby the corpus can be instrumental while pedagogical relevance is still maintained. In his view, examples and materials derived from a corpus, and modified where necessary, still have applicability: material can be adapted to various levels of learner development, and an example used to illustrate a language pattern may be valid if it comes from a corpus (Schmied, 1996, p. 193).
Taking a position similar to that expressed by Widdowson (1991), Owen (1996) criticized the application of corpus evidence in language education when it negated the appropriacy of intuition. Describing the problem of an advanced FL student who was primarily interested in receiving prescription rather than description, Owen argued that teachers' experience with language and their roles as standard-setters should not be ignored. He went on to claim that teachers can hardly clarify usage problems for their students based entirely on consulting a corpus. In fact, he suggested,
the tension between description and prescription is not automatically relieved by reference to a corpus. Intuitive prescription is fundamental to the psychology of language teaching and learning....Even if teachers had the time to check every prescription they want to make, the corpus would not relieve them of the burden of using their intuition. (Owen, 1996, p. 224)

This evaluation of a practical concern is in line with what other experts, such as Fillmore (1992) and Summers (1996), claimed. Biber (1996) summed up the advantages of text-based linguistic study. He identified four features that make the corpus linguistic endeavor particularly relevant. These were the following:
According to Kennedy (1998), there were five main applications of these pre-electronic corpora:
The development of dynamic digital corpora had its theoretical and experiential foundations in the pre-electronic projects, together with a growing awareness of the need to accumulate larger collections that could be captured and stored on computer to facilitate faster access, more refined analyses, and thus more reliable and valid findings. With the simultaneous advances in information technology, this was a time of convergence between linguistic interest and technological potential.
The Brown Corpus was developed to represent as wide a variety of written American English as was possible at the time. Since the enormous task of transferring analog data into an electronic format was done manually, the achievement is still considered a major one. The Brown Corpus contains such additional information as the origin of each sample and line numbering.
With these two language analysis resources, linguists had the opportunity to compare and contrast written U.S. and U.K. English texts, exploiting frequency and co-text information (for a comparison of frequency, see Kennedy, 1998, p. 98). In addition, the careful study of hapax legomena, word forms that occur only once in a corpus and typically represent the majority of word types in most large corpora, was now possible, with implications for lexicography, collocation studies, and language education.
The influence of these two first-generation corpora proved long-lasting: not only did they set standards for representation and structuring in sampling, but they also gave rise to other corpus projects of regional varieties. These included the Indian English Corpus, published in the late 1970s, and the New Zealand and Australian corpora of English, each modeled on the first two corpora. For the first time in linguistics, a large collection of objective data was available. But this advantage was relative: these corpora also contributed to the realization that the upper limit of one million words was a restriction that had to be re-assessed and abandoned. For analysis to be based on more representative samples, linguists needed larger sets, especially for studying lexis that occurred less frequently in earlier corpora and for contrastive analyses across subcorpora.
Work on corpus development sped up in the eighties, fueled partly by the recognition that studies incorporating objective evidence made investigations more valid and reliable, and partly by the increasing ease with which data could be stored and manipulated. Innovations such as optical readers and new software opened up the new vista of exploiting more spoken language. These developments gave rise to second-generation corpora, each based on earlier work but with different purposes and corresponding sampling principles. Another major difference between first- and second-generation corpora lies in the speed with which the results of linguistic analysis were incorporated in applied linguistics and language pedagogy. Of these new efforts, three projects stand out as most influential: the Bank of English, the British National Corpus, and the International Corpus of English. In each project, the activity of a national or international team, the funding of major academic and government organizations, and the economic viability of the results in the publication market continued to be operational factors.
Directed by Sinclair, the corpus was renamed the Bank of English in 1991, and it has by now reached a state in which some 2 million new words are added every month. The team has repeatedly made the "the bigger the better" claim, meaning that for truly reliable accounts of lexis and grammar, large collections are necessary. The current size is 500 million words of written and spoken text, with storage on high-tech media, including the internet. To serve the growing body of researchers and teachers, a sample of 50 million words, together with concordance and collocation search engines, is available via the COBUILD Direct service at the web site http://titania.cobuild.collins.ac.uk (reviewed by Horváth, 1999a).
As Sinclair noted (1991), data collection, corpus planning, annotation, updating and application continued to challenge the team. Seeking permission of copyright holders has always been among the hurdles, but there are signs of a changing publishing policy that may allow for automatic insertion of a copyrighted text for corpus research purposes.
The Bank of English has continued to innovate in all the related work: in the way corpus evidence is incorporated in learner dictionaries, in study guides and recently in a special series of concordance samplers, in the application of a lexical approach to grammar (Sinclair, 1991), and in the theoretical and technical field of marking up the corpus. Analyzing discrete meanings of words, collocations, phraseological patterning, significant lexical collocates and distributional anomalies makes available a set of new results that shape our understanding of language in use. As the reference materials produced are based on a constantly updated corpus, new revisions of these materials sustain and generate a market, making the venture economically viable, too.
The BNC was among the first megacorpora to adopt the standards of the Standard Generalized Markup Language (SGML; more about annotation in Section 2.4) as well as the guidelines of the Text Encoding Initiative, which aims to standardize tagging and encoding across corpora. By so doing, not only has the BNC become an example of a large corpus that has made use of earlier attempts to allow for comparability, but it has also sought to become a benchmark for other projects (Kennedy, 1998, p. 53; "Composition of the BNC," 1997). A sample of the corpus and its dedicated search engine, SARA (Burnard, 1996), have been made available at the web site http://info.ox.ac.uk/bnc.
The pedagogical use of the BNC has already received much attention, with Aston (1996, 1998) describing and evaluating the benefits advanced FL students in Italy gain from conducting their own linguistic inquiries with the corpus. Aston reported that by accessing and studying this large corpus, students were highly motivated, primarily because they could take a critical attitude to published reference works, contrasting them with their own conclusions.
The ICE project assembles text samples that represent educated language use; however, the definition of this notion is not left to the individual (subjective) decision of participating teams. Rather, the corpus will structure the language production of adult users of the national varieties of the regions. According to Greenbaum (1996a, p. 6) the texts included would be by speakers or writers aged 18 or over, with "formal education through the medium of English to the completion of secondary school." As the regional 1-million-word corpora will include written texts, identifying such factors will prove a rather difficult undertaking indeed.
Table 2: A typology of corpora
The steps of developing these corpora and the technology used to maintain them will be reviewed in the following section.
Corpus development needs to be cyclical because often neither the population to be represented nor the text types generated can be defined strictly in advance. To be able to adjust preliminary concepts, a pilot study is required that can inform the effort about the population and language variables to account for. Theoretical analysis can confirm and refine initial decisions, but it may also introduce new sampling procedures. When this phase has been finished, the next step is corpus design proper. This involves the specification of the length of each text component (with minimum and maximum word counts), the number of individual texts, the range of text types, and the identification and testing of a random selection technique that gives each potential text an equal chance of being selected for the corpus.
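The random selection step can be illustrated with a short piece of code. The following is a minimal sketch in Python, not part of Biber's model; the candidate file names and the sample size are hypothetical and serve only to show an equal-probability draw without replacement:

    # A minimal sketch of equal-probability text selection (hypothetical file names).
    import random

    candidate_texts = ["news_01.txt", "news_02.txt", "letters_01.txt",
                       "fiction_01.txt", "fiction_02.txt"]

    # random.sample draws without replacement, giving each candidate
    # text the same chance of entering the corpus.
    selected = random.sample(candidate_texts, k=3)
    print(selected)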
During the third stage of the cycle, a subcorpus is collected, and the specifications are tested against it in the fourth phase, when an empirical investigation takes place: the specifications are compared with the samples, and statistical measurements are taken to determine how reliably the corpus represents the target population. For any text that does not meet the requirements of the design, the specifications need to be revised, and either new design principles are identified or the problematic text is omitted. With each new sampling of a smaller unit of the corpus, constant checks and balances are in place to ensure the theoretical and empirical viability of the linguistic study that the corpus aims to serve. The Biber model is summed up in Figure 2.
Figure 2: Biber's (1994, p. 400) model of cyclical corpus design

Word frequency counts are strong indicators of reliability. For most general corpora, and especially those that aim to serve as bases of language teaching materials, such as learner dictionaries, establishing the frequencies of words is one of the main concerns. As this information has to be based on reliable sources, studies in representativeness provide a major contribution. According to Summers (1996), this information can then be applied in framing dictionary entries objectively and consistently, providing a dictionary that can list lexical units within a single entry according to frequency. Yet, she added, there is
still a need to temper raw statistical information with intelligence and common sense. The corpus is a massively powerful resource to aid the lexicographer, which must be used judiciously. Our aim at Longman is to be corpus-based, rather than corpus-bound. (Summers, 1996, p. 262)

The compilation of small and large corpora was described in detail by Inkster (1997), Krishnamurthy (1987) and Renouf (1987a). One concern after the design principles have been set is that the spoken and written texts to be collected can be stored on computer; another is that what is stored there be authentic. The incorporation of electronic media poses little challenge: besides obtaining the permission of copyright holders, one needs only to ensure that the text is in a format compatible with the program used to access the corpus. Capture from CD-ROMs is one such relatively trouble-free area. But the compilation of non-electronic forms of texts, such as the transcription of spoken material and the typing in (or keying in) of manuscripts, is far more prone to introducing error into the corpus.
Errors occurring during the entry of a text into the database should be avoided, as they would defeat the purpose of representation. This is why developers need to put in place and regularly check procedures that help maintain an error-free corpus. The clean-text policy is one such procedure (Sinclair, 1991): manuscripts and other texts to be input are double-checked once they have been entered into the corpus.
Besides the procedural approach of designing a corpus and the need for limiting errors, the markup of the raw corpus is the third crucial area of dealing with general and specialized corpora. Most present-day corpora make extensive use of some annotation system that assigns one tag from a set of categories to units occurring in individual texts (Garside, Leech & McEnery, 1997). This process, the annotation of the corpus, aims to interpret the data objectively. Annotation can be viewed as adding a metalanguage to the language sample in the corpus, often in some form of the Standard Generalized Markup Language (SGML), an international standard.
By adding linguistic data to the raw text, a subjective element is incorporated in an otherwise objective entity. According to Leech (1997a, p. 2), there "is no purely objective, mechanistic way of deciding what label or labels should be applied to a given linguistic phenomenon." Leech focused on three purposes of corpus annotation:
Origin/NN of/IN state/NN automobile/NN practices/NNS ./. The/DT practice/NN of/IN state-owned/JJ vehicles/NNS for/IN use/NN of/IN employees/NNS on/IN business/NN dates/VVS back/RP over/IN forty/CD years/NNS ./.

Grammatical annotation generally makes use of both automatic and manual techniques: special parsing software can be programmed to apply probabilistic techniques in determining classes of words. A second-generation megacorpus, the BNC, was annotated in such a way. Its annotation consists of two types of labels: header information (such as the source of the text) and the tagged text, using the system known as Claws (Constituent Likelihood Automatic Word-tagging System), which resulted in fairly reliable tagging; according to Garside (1997), the accuracy rate was 95 percent or higher.
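As an illustration of what such automatic tagging involves, the short Python sketch below assigns part-of-speech labels to the second sentence of the example above. It uses NLTK's default tagger and the Penn Treebank tagset, which differs from the Claws system discussed here; it is offered only as an approximation of the probabilistic tagging step, not as the BNC's procedure:

    # A hedged illustration of automatic POS tagging with NLTK
    # (Penn Treebank tagset, not Claws).
    import nltk

    # The tokenizer and tagger models are assumed to be available, e.g. via
    # nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
    sentence = ("The practice of state-owned vehicles for use of employees "
                "on business dates back over forty years.")
    tokens = nltk.word_tokenize(sentence)

    # Each token is paired with its most probable tag, e.g. ('practice', 'NN').
    print(nltk.pos_tag(tokens))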
As an innovative empirical effort, Garside, Fligelstone and Botley (1997) provided an example of annotating discourse information in a corpus. Whereas most other levels of tagging can benefit from high technology, the area of cohesive relations poses major difficulties. Reviewing models of markup, the team worked out a fairly consistent method and an additional set of guidelines that may be further trialed and adjusted. Already, the notation system can describe such elements as antecedents and noun phrase co-reference, central pronouns, substitute forms, ellipses, implied antecedents, metatextual references, and noun phrase predications. Any unit not adequately captured is noted by a question mark. Although the authors recognized that the field of discourse annotation "is at a fairly immature stage of development" (Garside, Fligelstone, & Botley, 1997, p. 83), exploiting SGML and refining the tagging algorithm may achieve the sophistication of other levels of annotation.
The program used for this dissertation was Conc 1.7, a Macintosh application that can process text files limited only by the size of the computer's hard disk and memory allocation. Minimally, the program requires 512 kilobytes of memory, and 160 KB of hard disk space. Conc was developed by the Summer Institute of Linguistics in Texas in 1992. I selected this shareware utility for its user-friendliness and reliability: during the five years of using it, it has proved stable. As the corpus analyst inevitably will have to share disks with others, another consideration was that of platform stability: In the Windows/Intel world, shutdowns resulting from malicious computer viruses have become a frequent occurrence. By contrast, the Macintosh system is virtually free of such troubles. Here is a description of the features Conc offers, together with screen shots that illustrate them.
For the description, I selected an earlier version of the first paragraph of this chapter: a 153-word minitext. After saving this paragraph in text-only format, I launched the concordancer and opened the file. The window presented in Figure 3 appeared.
Figure 3: The example text in Conc's main window

When generating a word concordance output to screen, the user has the option of sorting identical words (types) according to the words that follow them or according to their position in the original file. As Figure 4 shows, I selected the former choice.
Figure 4: Part of the Sorting Parameters dialog window in Conc

As users may not wish to display all words occurring in a text, the program lets them deselect words from the concordance. Three options are available for this, as Figure 5 demonstrates. When none of the options is selected, the program performs a full concordancing of the text. A combination of word omission features, however, makes it possible to focus, for example, on hapax legomena (by selecting the "Omit words occurring more than 1 times" option) or on words that occur a given number of times and are longer than six letters.
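The omission logic just described can be approximated outside Conc as well. The following Python sketch, which is not Conc's implementation, filters a hypothetical word list for hapax legomena and, alternatively, for repeated words longer than six letters:

    # A rough approximation of Conc's word-omission options (illustrative text only).
    from collections import Counter

    words = "corpora are becoming available and corpora are becoming cheaper".split()
    freq = Counter(words)

    # Keep only hapax legomena: words occurring exactly once.
    hapax = [w for w in words if freq[w] == 1]

    # Or keep words occurring at least twice that are longer than six letters.
    repeated_long = [w for w, n in freq.items() if n >= 2 and len(w) > 6]

    print(hapax)          # ['available', 'and', 'cheaper']
    print(repeated_long)  # ['corpora', 'becoming']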
Figure 5: The dialog window where words may be omitted from the concordance

Another module lets the user define how the display space is filled, offered as radio button options: all of it can be filled with concordance lines, resulting in truncated words; only full words can be shown; or the program can compute a compromise between the two. A check box offers the typographical option of showing key words in bold face (see Figure 6).
Figure 6: Part of the Display dialog box on the Options menu in Conc

When such parameters have been set, the program is ready to sort the text file accordingly. A new window appears as a result, of which Figure 7 presents a part.
Figure 7: Part of the Concordance window of the program

It is in the Concordance screen that the user can first study the co-texts of the keywords, shown in bold face, and centered as key word in context (KWIC) concordances. When a co-text does not reveal sufficient information, and thus should be enhanced with the fuller context, one can switch between the main window and the Concordance window. With the appropriate line of the concordance output selected, the main window can be superimposed and the full sentence studied. This is shown in Figure 8.
Figure 8: The Concordance and the main windows

Conc 1.7 can provide one type of index for texts: alphabetical. As this can be saved to a text file, a database program can be used for sorting words by frequency. (This procedure will be described in Chapter 4.) Figure 9 displays a screen shot of part of the Index window. First-occurrence word lists and frequency lists can be generated directly in other programs.
Figure 9: The Index window's scrolling list of the alphabetical index of the file

Simple statistical word count information is also provided. The example paragraph contained 153 tokens and 90 types (see Figure 10).
Figure 10: Part of the Statistics window on the Build menu

Conc 1.7 provides several file management options: new files can be added to the current text, another file may be opened, a selected concordance can be saved or printed, and current parameters can be saved as default options. Of course, the full concordance can be exported, too (see the menu selection screen shot in Figure 11).
Figure 11: The File menu of Conc
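The core outputs of this tour of Conc, the KWIC concordance, the frequency-sorted word list, and the token and type counts, can also be reproduced with a few lines of code. The Python sketch below is not Conc's implementation; the text and keyword are illustrative only:

    # A minimal sketch of a KWIC concordance, a frequency list, and
    # token/type statistics (illustrative text and keyword only).
    from collections import Counter

    text = ("Language corpora are becoming available cheaply. The likely impact "
            "of corpora on language teaching will be profound.")
    tokens = text.lower().replace(".", "").split()

    def kwic(tokens, keyword, span=4):
        """Return concordance lines with the keyword centered in its co-text."""
        lines = []
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - span):i])
                right = " ".join(tokens[i + 1:i + 1 + span])
                lines.append(f"{left:>30}  [{token}]  {right}")
        return lines

    for line in kwic(tokens, "corpora"):
        print(line)

    # Frequency list, from the most to the least frequent type.
    for word, count in Counter(tokens).most_common(5):
        print(word, count)

    # Simple statistics, as in Conc's Statistics window.
    print(len(tokens), "tokens,", len(set(tokens)), "types")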
To achieve this aim in a comparative study that investigates a set of linguistic features across texts in a corpus or between corpora, Biber, Conrad and Reppen (1998) described a procedure whereby a so-called normalization of linguistic variables is performed. In essence, this involves the identification of a unit of text that will serve as the basis of comparison. Table 3 shows one example of such a normalized comparative analysis. In this analysis of three news items and three conversations (identified as text files in the first column and as labels in the second), the length of each text is given as a word count (in column 3). For each of the three observations (verbs, adjectives and pronouns), the unit of analysis was one thousand words; the numbers indicate the occurrence of these types in each of the six texts per 1,000 words. Normalization applies a simple formula: the frequency of the observation is divided by the word count and multiplied by the unit in which the linguistic feature is analyzed. Normalization, then, refers to the process of establishing comparability among observations. According to Biber, Conrad and Reppen (1998, p. 263), it is a "statistical process of norming raw frequency counts of texts of different lengths." The results of normalization (the rates) for these observations are the quantitative data that can be compared across the texts, using statistical methods.
Table 3: An example of normalized comparative analysis (based on Biber, Conrad & Reppen, 1998, p. 273)
One type of frequently extracted statistical information is the mean score of individual items. Not only can the normalized frequency information on individual variables within a text be informative, but so can the mean score, for example, of text length within a register and across registers. Once means are computed, comparisons can be made. Studying Table 3, for example, we have evidence to suggest that news tends to contain more past tense forms than do conversation text types.
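The normalization formula and the subsequent computation of means can be summed up in a short sketch. The Python code below uses invented counts rather than the figures of Table 3 and serves only to illustrate the arithmetic of norming raw frequencies to a rate per 1,000 words:

    # Normalization: raw count divided by text length, multiplied by the
    # norming unit (here 1,000 words). All counts below are invented.
    def normalize(raw_count, text_length, per=1000):
        return raw_count / text_length * per

    texts = {
        "news_1": {"length": 2150, "verbs": 240},
        "conv_1": {"length": 1830, "verbs": 310},
    }

    rates = {name: round(normalize(t["verbs"], t["length"]), 1)
             for name, t in texts.items()}
    print(rates)  # rates per 1,000 words, comparable across texts

    # A mean rate can then be compared within and across registers.
    print(round(sum(rates.values()) / len(rates), 1))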
Statistical measures such as the mutual information score and the t-score, the analysis of variance (ANOVA) of lexical collocation, and chi-squared tests are used to determine whether a linguistic phenomenon occurs merely by chance or whether it is statistically significant. Corpus linguists have increasingly sought to establish whether any observed difference between normalized frequency counts is the result of chance or whether it is statistically significant. Such measurements have long been applied in other social sciences, and there has been growing linguistic interest in them (Biber, Conrad, & Reppen, 1998; Clear, 1993; Kennedy, 1998; Koster, 1996; McEnery & Wilson, 1996).
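For the two collocation statistics mentioned first, a common formulation compares the observed co-occurrence frequency of a node word and its collocate with the frequency expected by chance. The following sketch uses invented frequencies and a hypothetical node-collocate pair purely to show the arithmetic; it is not drawn from any of the studies cited above:

    # Mutual information and t-score for a node-collocate pair
    # (all frequencies invented for illustration).
    import math

    N = 1_000_000    # corpus size in tokens
    f_node = 1200    # frequency of the node word
    f_coll = 800     # frequency of the collocate
    f_pair = 60      # observed co-occurrences within the chosen span

    expected = f_node * f_coll / N           # co-occurrences expected by chance
    mi_score = math.log2(f_pair / expected)  # strength of association
    t_score = (f_pair - expected) / math.sqrt(f_pair)  # confidence in the association

    print(round(mi_score, 2), round(t_score, 2))  # high values suggest a true collocation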
In both language software and its mode of delivery, many CALL practitioners assumed that extended time spent online would result in better performance. Although there was little scholarly attention focusing on the effectiveness of CALL in the 70s and early 80s (reviewed by Chapelle & Jamieson, 1989), anecdotal evidence and the enthusiasm of scores of language educators and of students continued to attract financial and pedagogical investment. Stevens (1989), however, remarked that much CALL experience in the U.S. and elsewhere failed to revitalize the behaviorist orientation that assumed that learning would take place when discrete steps were planned properly. This is somewhat surprising, considering the amount of work put into this enterprise and the expansion of the approach supported by such organizations as TESOL and IATEFL. Arguing for a shift in this paradigm, Stevens called for computers and software in language education to be viewed and applied as facilitators of what he called humanistic learning.
This call for a pedagogical change meant that CALL software and its application had to be based on much more concrete applied linguistic principles. Although attention to sound methodological grounding was called for as early as 1986 by Jones, much CALL business remained within the confines of the grammar-translation tradition. Stevens (1989), aiming to synthesize SLA theory, specifically the hypotheses of Krashen (1985), summed up the features that were worth exploiting as follows. First of all, CALL software had to be able to create intrinsic motivation for the learner. In other words, such courseware would need to be relevant to student needs, offer authentic tasks, and create a no-risk environment, resulting in a low affective filter. Second, he proposed that CALL applications develop more fully the interactive potential of the technology. For example, programs can do this by adjusting their routines based on the input of the individual student, a principle gaining ground in computer-adaptive testing much more effectively than in teaching. Finally, Stevens made a call for non-CALL programs; the value of eclecticism lay, he argued, in that software "designed for other audiences and purposes" (1989, p. 35) could and should be adopted in the language class.
Wolff (1993) shared this view of applicable technologies in language learning. Also concerned with more direct integration of SLA research, he identified four principles for exploiting information technology in language education (p. 27):
DDL is viewed (Farrington, 1996; Sinclair, 1996, 1997) as a possible "new horizon" in CALL because it offers the foreign language student opportunities to engage in authentic tasks in a low-risk environment, truly interacting with authentic texts, and using appropriate tools. In short: DDL, in many ways, incorporates the values Stevens (1989) set forth. Without the rapid development in the field of corpus linguistics, however, and without its many lexicographic and grammar applications, the approach would not have become so effective. Johns (1991a) attributed the growing interest in DDL specifically to the COBUILD project.
DDL, which may be regarded as a subdivision of CALL, first appeared in the late 1980s and early 1990s in Johns's work with international students studying at British colleges (1991a, 1991b). Drawing on the results that CALL had established in the U.S. and the U.K. (Higgins & Johns, 1984; Pennington, 1989), he helped set up a program that would provide what he called "remedial grammar" tools and training for science students. Johns argued that advanced EFL students had a need to directly exploit the growing evidence a corpus was able to provide. He offered a model (shown in Figure 12) to explain the nature of the language awareness processes taking place in such a context.
Figure 12: Johns's model of data-driven learning (1991a, p. 27)

As Johns was primarily concerned with the development of language awareness as it related to the needs of advanced students, he hypothesized that those who aimed to develop accuracy in the foreign language had to be able to understand the relationship between how functions of discourse are realized in forms and how these forms are interpreted to satisfy them. Data is crucial in such a process: rather than the teacher inventing examples to explain how this happens, students and teachers need hard evidence of how forms are used in context. This is the rationale for the central position of data, with the role of the student enriched by that of the researcher during participation in classroom concordancing activities (such as those described in Tribble & Jones, 1990).
Data is authentic unmodified language extracted from a corpus (Johns, 1991b, p. 28). In Johns's remedial grammar and academic writing classes, students were actively involved in accessing, manipulating and exploring this data, partly by online classroom concordancing, and partly by participating in individual and pair work activities based on new types of exercises developed to take account of the data. One corpus used in the project was a 760,000-word sample of the journal New Scientist.
Data drives learning in the sense that questions are formed in relation to what the evidence suggests. Hypotheses are tested, examples are reviewed, patterns and co-texts are noted. The collaboration that evolves between students and the teacher, who may not know the answer without also consulting the corpus, is a further innovative element of this approach. Students also have the opportunity to focus on clearly defined units in the data (Higgins, 1991; Kowitz & Carroll, 1991; Stevens, 1991). A spin-off of the approach was presented by Johns (1997a): new CALL programs, such as his Contexts, can be designed by incorporating concordance tasks piloted in the classroom.
The materials developed are another outcome of the approach. The technique of on-line concordancing has allowed for the generation of new task types, such as the "one keyword, many co-texts" activity or the concordance-based vocabulary tasks described by Stevens (1991). Corpora also allow for the development of innovative and potentially effective approaches to and applications of pedagogical grammars (see, for example, "The Internet Grammar of English," 1997; Hunston & Francis, 1998). Research also investigates how what is presented in traditional language coursebooks may or may not be supported by the evidence of the corpus (Sinclair, 1997; Mindt, 1996, 1997). As DDL and corpus evidence in general become mainstream, as was suggested by Svartvik (1996), new FL materials, too, will benefit from the approach.
Gavioli's (1997) example offered yet another insight into the application of concordancing activities in language education. She introduced multilingual corpus analysis processes and interpretation tasks designed for a course for translators in Italy. Gavioli emphasized the importance of consulting reference materials to test hypotheses about language use. By analyzing and interpreting data in a corpus, and by corroborating their own discoveries, students can become the ones who describe features of language, rather than being offered such descriptions. The singular contribution of these applications of corpus materials in language education is the exploration of authentic texts, which raises awareness of significant patterns used in natural contexts. As suggested by Kennedy (1998), such inductive use of corpus texts in classroom concordancing helps FL students to "locate...all the tokens of a particular type which occur in a text...and note the most frequent senses" (p. 293), thus discovering collocational and colligational features. Leech (1997b) and Kirk (1996) were among those positing such applications as experimentation with real language, besides recognizing their value in academic study. Kirk underscored the change this brought to language teachers' roles: enriched by the role of provider of an authentic resource, teachers can co-ordinate research initiated by students (1996, p. 234). Clearly, this has the additional benefit of empowering students, mostly at intermediate and advanced levels, so that they, too, can gain experience in a new skill.
Another value of DDL lies in the manner in which teachers can establish and maintain a classroom-based research interest themselves. By applying corpora in their syllabus design and class materials development efforts, they are bridging the gap between research and pedagogic activities, a trend welcomed by Dörnyei (1997) and Ellis (1995, 1998), among others. One example of such involvement was offered by Tribble (1997), who described an innovative use of a multimedia product whose text component was used as a corpus. The author proposed that teachers who find it difficult to access large corpora, or who do not regard the use of one as relevant, can use multimedia encyclopedias as language learning resources. Targeting EFL students beginning to work with academic writing, the syllabus incorporated the multimedia product Encarta, a set of hypertexts, movies and graphics containing such diverse text types as articles by experts in the fields of physical science, geography, history, social science, language and the performing arts. Tribble claimed that using this resource not only caters to diverse student interests in the writing course but can also lead students to recognize differences in text organization and lexical preference across descriptive and discursive essays, process descriptions, physical descriptions and biographies.
The main objective is to gather objective data for the description of learner language, which Granger (1998a; in press) saw as crucial for valid theory and research. Besides, the ICLE's contribution has been in directing attention to the need for observation of this language so that the notion of L1 transfer may be analyzed under stricter data control. The obvious potential outcome is for materials development projects, which will help specific classroom practices. (Longman Essential Activator, 1997, was among the first dictionaries to incorporate learner data derived from the LonLC.) Focusing on error analysis and interlanguage (Selinker, 1992), the ICLE-based project enables researchers and educators to directly analyze and compare the written output of students from such countries as France, Germany, the Netherlands, Spain, Sweden, Finland, Poland, the Czech Republic, Bulgaria, Russia, Italy, Israel, Japan and China.
The developers of the ICLE, a part of the ICE project, identify the origins of interest in the analysis of learner language in early error analysis studies in SLA. Granger (1998a) pointed out that although the investigations and theoretical explanations made about learner errors were grounded in data observation, the corpora for those studies did not take full account of the variables that affected the samples. For example, the number of students, their learning experience, and often non-comparable test elicitation techniques raised doubts about the reliability of some of those observations. By contrast, the ICLE project has worked out a system of sampling scripts that allows for more reliable studies in the description phase as well as in contrasting individual subcorpora, or a subcorpus with an L1 corpus. With each script, detailed information is recorded in the contributor's profile. This not only ensures that the data comes from a valid source, but also allows for specific analyses of types of language use in clearly defined subcorpora. The descriptors include, according to Granger (1996, p. 16):
First, as a number of the assignments do not appear to involve much of the students' own deliberation, requiring them to present and support an argument no matter what their own positions are, the validity of a text being a student's "own" is dubious. Even if students have the chance of choosing a title or a theme, they cannot "entirely own" their writing as they play a limited role in deciding on the focus of their essays. For this reason, of the suggested titles, "Europe" may be regarded as the most authentic: it defines a clear enough focus, allowing students to develop an argument which is truly their own, yet it remains specific enough for lexical or rhetorical analysis when the text becomes part of the corpus.
As for the pedagogical implications of the preferred mode of submitting a student's "own" essay with "no help...sought from third parties," the authenticity of the task may be lessened. With so much written production viewed and undertaken as a collaborative, process-oriented effort in the L1 field, it is somewhat surprising that no peer or teacher involvement is allowed. The specification also raises the issue of audience: the themes appear to favor the production of writer-based prose, yet the task is defined as an argumentative one, where awareness of the position of the audience is crucial. Furthermore, why deny the opportunity of consulting a reader before the script is finalized if one were to follow, even for such a basically product-oriented enterprise as corpus development, a process syllabus? Considering the role that editors, colleagues and publishers play in the finalization of the written work of L1 authors (represented in L1 corpora), it stands to reason that such a restriction in the development of L2 corpora may bias the comparative analyses.
These constraints notwithstanding, the ICLE has ushered in a period of interest in more specific analyses of learner language. Each of the national subcorpora will be about 200,000 words, which allows for grammatical and lexical investigations but is too small for research into words and phrases of lower frequency (Granger, 1996, p. 16). However, the project has been instrumental in helping an international team of researchers and teachers to join forces in the field and in leading the way to new inquiries, such as the development of more specialized ESL and ESP corpora. Other areas where the ICLE has motivated research are the advanced spoken learner corpus and the intermediate corpus, both under development. Work on L2 corpora is gaining recognition, and the practical implications of these efforts may be seen shortly in new reference and teaching materials that take account of L2 learners' language use (Gillard & Gadsby, 1998; Granger, 1998a, 1998b; Granger & Tribble, 1998; Kaszubski, 1997, 1998).
Interest in applying corpora in linguistic analysis and materials development is on the rise in Hungary, too. Studies that are partly or entirely based on such corpora as the Bank of English represent a new trend in current Hungarian linguistics. Among these, Andor (1998), for example, applied a sample from this corpus, together with data elicited from forty native speakers of English, in the study of the mental representation and contextual basis of ellipsis and suggested that a combined use of psycholinguistic and corpus linguistic research methods would enable linguists to arrive at more valid and reliable conclusions. Csapó (1997) studied the viability of the convergence of pedagogical grammars and learner dictionaries, whereas Hollósy (1996, 1998) reported on work to develop a corpus-based dictionary of academic English.
The framework of DDL and the increasing interest in analyzing learner English on the basis of learner corpora will be applied in the following chapters: the next describes and analyzes writing pedagogy at the English Department of Janus Pannonius University, and the fourth gives an account and analysis of the JPU Corpus. The study of learner scripts contributes to the authenticity of writing pedagogy: those who collect, describe, and analyze L2 texts can test, in a valid and reliable way, hypotheses about the effectiveness of writing pedagogy. Such collections can also serve as a basis for an innovative type of learning material that can be applied directly in the writing classroom.