The ELT Two Cents Cafe
Last update May 2 2004
TOEIC: A Discussion and Analysis
Timothy M. Nall
October 31, 2003

Copyright © Tim Nall 2004.
All Rights reserved.
Do not copy without permission.

1 Overview and Description

Dubbed as "commercial TOEFL", the Test of English for International Communication (TOEIC) was [administered in China for the first time in 2003]. Also an ETS product, TOEIC has been recognized by over 5,300 large multinational companies including Microsoft, with three million examinees worldwide each year. As its market reputation grows continuously, it may become the number one test among all vocational English examinations in China.
- source: "What can foreign certificates bring you?"
retrieved on May 7 2004 from http://en.ce.cn/Insight/t20040428_761847.shtml

The Test of English for International Communication (TOEIC) is a commercially available test of English listening comprehension and reading skills, to be taken by non-native speakers of English. The general register of the TOEIC is everyday English in a business context. As such, the test items include settings and situations such as general business, manufacturing, finance, corporate development, travel, entertainment and health (Cunningham, 2002). While the TOEIC's vocabulary and usage are targeted at business environments, it does not require specialized knowledge or vocabulary from any specific field.

The TOEIC test serves primarily as an in-house proficiency test of employees' second-language English skills for large corporations worldwide, particularly in Japan and Korea (Nicholls and Shackleford, 2000). It has become a powerful and influential tool in these companies, as it is used within a wide range of contexts to make significant personnel decisions -- hiring, promotion, etc. (Moritoshi, 2003). Anecdotes abound of companies which require a certain level of proficiency on the TOEIC as a job qualification even for positions which do not require the use of any English; in this regard its importance has spread beyond even the widest interpretation of its original function. Moreover, English language schools and academic institutions use the TOEIC as a placement tool and to assess improvement/achievement in English language proficiency (Chauncey Group International, 2000).

The TOEIC is a norm-referenced test, meaning that a testee's score is determined only in relation to the scores of other testees. This fact makes its use as an achievement test problematic; the results do not measure a student's gains or losses against her own prior performance, but only compare her scores on each occasion against those of other testees. The test's format is multiple choice, with a total of 200 questions. It is divided into two timed sections corresponding to the two skill sets it measures, reading comprehension (75 minutes) and listening comprehension (40 minutes). Each section consists of 100 questions and is further divided into subsections based on question type. The first two subsections of the Reading Comprehension section contain cloze questions and error-analysis items. Other reading comprehension questions refer to readings that simulate common business texts such as announcements, advertisements, directions, notices, schedules, and signs (Chauncey Group International, 1999). The Listening Comprehension section contains questions and possible answers spoken in English, short monologues, two-party conversations, and statements describing a displayed photograph. The types of questions asked within both the Listening section (main idea, vocabulary, idioms, minimal pairs, or inference) and the Reading section (main idea/topic, inference, attitude/tone, vocabulary, idioms, or details/application within the passage) are similar to those found in other English-language tests (Gilfert, 1996).
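The norm-referenced principle can be illustrated by converting a score to a percentile rank within a reference group; the scores below are hypothetical, purely for illustration:

```python
def percentile_rank(score, reference_scores):
    """Percentage of the reference group scoring below the given score.
    Under norm-referencing, the same raw score earns a different rank
    whenever the reference group changes."""
    below = sum(1 for s in reference_scores if s < score)
    return 100 * below / len(reference_scores)

# A hypothetical cohort of TOEIC totals
cohort = [480, 520, 560, 600, 640, 700, 740, 820]
print(percentile_rank(640, cohort))  # -> 50.0: half the cohort scored lower
```

The point of the sketch is that the returned value says nothing about the examinee in isolation; it is defined entirely by the group she is compared against, which is why such scores are awkward measures of an individual's improvement over time.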

Each correct answer is awarded five points, with no penalty for incorrect answers (thus encouraging guessing). The maximum score is 990. According to Gilfert (1996):

A TOEIC score of 450 is frequently considered acceptable for hiring practices… 600 is frequently considered the minimum acceptable for working overseas. [An] engineer [who] is being considered for a posting overseas… must usually… score… about 625. A domestically-based desk-worker [should score] 600… For the same desk-worker to go overseas, she or he must usually… score… 685.
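Gilfert's thresholds rest on the simple scoring rule described above: five points per correct answer and no guessing penalty. A minimal sketch of that raw computation follows; note that operational TOEIC scores are converted to a scale, which is why the published maximum is 990 rather than 200 × 5 = 1000, so this illustrates only the per-item rule:

```python
def raw_toeic_score(correct_answers: int, points_per_item: int = 5) -> int:
    """Raw score under the simplified five-points-per-item rule.
    Incorrect or blank answers cost nothing, so guessing is never penalized."""
    return correct_answers * points_per_item

# 120 correct answers reach Gilfert's "minimum for working overseas"
print(raw_toeic_score(120))  # -> 600
```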

2 Evaluating the TOEIC test

In general, tests are qualitatively evaluated using four criteria: validity, reliability, practicality and washback (Bailey, 1998). There are different kinds of validity: face validity, construct validity, content validity and criterion validity. Criterion validity furthermore subsumes two component forms of validity: concurrent and predictive. Each of these aspects of a test is important in its own right, so this discussion will attempt to examine the TOEIC in light of each of these in turn.

Perhaps the most obvious strength of the TOEIC is its practicality. Its widespread availability and ease of administration are key selling points, and its scoring system facilitates reasonably fair and objective comparisons between examinees, even when the examinees are from different countries. Results can be obtained quickly: the TOEIC program introduced new software in early 2002 that enabled in-house test administrators with access to an NCS scanner or equivalent to administer the TOEIC test, scan the answer sheets themselves, email the raw data to TOEIC Services, and receive test results in less than two hours (Chauncey Group International, 2000). The multiple-choice format (as opposed to a direct, productive-skills-based test, such as an interview) enables many students to be tested as a group rather than individually. A large number of examinees can thus be tested with a minimal investment of time and money.

Washback is the degree to which an instructional or assessment instrument impacts all aspects of the learning process -- students' motivation and study habits, educators' curriculum and materials decisions, etc. It can be either negative or positive. Discussions of washback are relevant both in the academic arena and in a business context, since examinees who take the TOEIC in a business context are usually learners outside the strictly academic sphere, studying on their own or within private institutions or in-house programs that focus on test preparation.

Although a standardized test such as the TOEIC can have positive washback in individual students' cases, the most prevalent systemic impact that these tests have been observed to exert on the learning/teaching process is the practice of "teaching to the test": educators let the contents of the test shape their curriculum choices in order to secure higher test averages for their students. Rather than learning the broad range of content and skills that the test is intended to assess, students learn only the far smaller set of test-taking strategies and/or specialized content that are expected to bolster their test scores. Alderson and Wall (1993, as cited in Robb and Ercanback, 1999) summarize relevant research by suggesting that negative washback effects of teaching to the test may include narrowing or distortion of the curriculum, loss of instructional time, reduced emphasis on skills that require complex thinking or problem-solving, and test score "pollution," meaning gains in test scores without a parallel improvement in actual ability in the construct under examination.
On balance, it seems to this author that the negative washback effects of "teaching to the test" may very well outweigh whatever positive washback the high-stakes nature of the test may create in terms of extrinsic motivation to study, since it is rational for students who are motivated by the need to succeed on this test to invest their time and money in test-oriented study materials or programs rather than a more holistic language-learning approach.

Face validity is simply the perception, by the population of people who administer/take the test, that the test is valid in some meaningful way; this in turn depends largely on the degree to which the test's context, content and tasks are perceived to approximate real-world language use. In other words, face validity is for the most part simply the test's perceived construct and content validity. It is an extremely subjective quality, and can be unduly influenced by such things as effective advertising, unqualified acceptance of peers' evaluations, etc. Given the TOEIC test's broad popular acceptance and use in many different countries, and the number of different settings it has been applied in (business, academia, government), it is safe to suggest that the test's face validity is very high.

There is a great deal of overlap between construct and content validity (O'Sullivan, 2002). Construct validity, in the language-learning context, is the degree to which an assessment instrument accurately measures a testee's performance with respect to constituent skills or abilities that map directly to a theoretical construct or model, presumably derived from prior research, that describes the whole of a particular language skill such as listening or reading. In simpler terms, construct validity simply refers to whether or not the test is actually testing the criteria it claims to test (Cunningham, 2002). Content validity relates to the extent to which a test's content is proportionally representative of all of the construct's features (Moritoshi, 2001). However, a phenomenon as complex as a language skill does not lend itself easily to the creation of comprehensive constructs. For this reason the language test writer often chooses an inductive rather than deductive approach, selecting item content a posteriori based on some rationally chosen representative sample and then assuming that the sample adequately represents the whole. Content validity is then considered to be a proxy for construct validity.

In the case of the TOEIC, the rationale for its content selection is the result of needs analyses conducted by ETS, during which companies from many nations were asked to describe the language features they regard as necessary for English business communications (Chauncey Group International, 1999). Moritoshi notes that this approach cannot be validated from a theoretical standpoint and makes no guarantee about the proportionality of the TOEIC's presentation of its language features, but probably has high content validity from a practical standpoint (Moritoshi, 2001). However, it is important to note (as Moritoshi does) that the acceptance we are extending to the test's theoretical validity pertains only to the receptive skills of reading and listening. We cannot blithely assume that this validity carries over to the productive skills of speaking and writing.

Concurrent and predictive validity are both forms of criterion-related validity. Concurrent validity is determined by comparing results from one test format with those of another instrument which is assumed to be testing "the same thing," that is, to be held in reference to the same language construct. This is typically accomplished by examining the correlation between the tests' results, looking for a high positive correlation coefficient. ETS's own TOEIC documentation, under a section labeled "Construct-related validity," provides a definition of validity that does not differ from our definition of concurrent validity. It would seem that ETS accepts concurrent validity as sufficient proof of construct validity. Bachman, however, notes the shakiness of this approach (Bachman 1990, as cited in Moritoshi, 2002):

...evidence for the validity of the criterion itself is that it is correlated with other tests...which simply extends the assumption of validity to these other criteria, leading to an endless spiral of concurrent relatedness...[but] only the process of construct validation can provide this evidential basis of validity.

This practice then creates a daisy chain of mutually imputed validation, with each test basing its claim to validity upon correlation with other tests, and with no reference made to any overarching description or construct. Bachman suggests that a conclusion that an assessment instrument has a significant level of construct validity cannot be arrived at in this manner. Despite academic reservations about this approach, ETS's documentation provides its own rather extensive evidence of the TOEIC's validity in this light, measuring the test's coefficient of correlation with several direct tests of language proficiency. Some of the correlations in that documentation are detailed in the following table (compiled from Chauncey Group International, 1999):

Table 1 - Pearson Product Moment correlation values between the TOEIC® test scores and other measures of listening, reading, speaking and writing.

TOEIC Listening Comprehension
   Direct Speaking:  John Test (Part II)                                     r = .69 ***
   Direct Speaking:  Australian Second Language Proficiency Rating (ASLPR)   r = .70 ***
   Direct Speaking:  The Language Proficiency Interview (LPI)                r = .74 **
   Direct Listening: CASAS Listening Comprehension                           r = .85 *
   Direct Listening: Michigan Listening Comprehension Test                   r = .76 *

TOEIC Reading Comprehension
   Direct Reading:   CASAS Reading Comprehension                             r = .73 *
   Direct Reading:   Canadian Language Benchmarks Assessment (CLBA) Reading  r = .87 *
   Direct Writing:   In-house Test                                           r = .83 ***

*   Correlation is significant at the 0.01 level (two-tailed).
**  Correlation is significant, p < .01.
*** No significance level indicated.
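The Pearson product-moment coefficient r reported throughout Table 1 can be computed from paired scores as follows; the score pairs here are invented solely to demonstrate the formula:

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two paired score lists:
    covariance of the pairs divided by the product of the standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented TOEIC-section scores paired with invented direct-test scores
toeic_scores = [320, 350, 400, 420, 460]
direct_scores = [55, 60, 72, 70, 85]
print(round(pearson_r(toeic_scores, direct_scores), 2))  # -> 0.97
```

A value near +1 indicates that examinees are ranked almost identically by the two instruments, which is exactly the evidence Table 1 offers; as the surrounding discussion notes, a high r alone says nothing about whether either instrument measures the intended construct.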

Although these statistics seem impressive at first glance, a few troubling points arise upon closer examination. Listening and reading are receptive skills, as opposed to speaking and writing, which are productive skills. A fundamental assertion of the TOEIC test is that its scoring of the receptive skills of listening and reading has a high degree of validity with respect to examinees' proficiency in the productive skills of speaking and writing. Support for this assertion is scant, however. The fact that the level of significance of the correlation is reported for only one of the tests of productive skills may be a telling omission. Moreover, the only direct writing test described in the TOEIC's documentation is an in-house test, and as such its face validity and content validity have not been widely scrutinized by domain experts. Even ETS's own evidence for the TOEIC's validity with respect to productive language skills seems more tenuous than one might conclude after a cursory examination of their statistical findings.

Regrettably, this author could not find any satisfactory and unbiased statistical examinations of the level of correlation between the results of the TOEIC's indirect tests of receptive skills and direct tests of productive skills. Hirai (2002) administered both an oral interview and the BULATS Writing Test, and examined their correlation with TOEIC results:

While the overall correlation coefficient between Hitachi's [in-house] interview test and TOEIC scores was 0.78, endorsing ETS®'s research findings, the correlation coefficient for the intermediate level was as low as 0.49, supporting a widely-held perception that the TOEIC score is not representative of productive skills. The correlation coefficient between the BULATS Writing Test and TOEIC scores was 0.66, significantly lower than ETS's data showing the correlation between writing skill and the TOEIC reading test score.

However, since Hirai does not make reference to the level of significance of these correlations, his results must be viewed with some degree of reservation.

Finally, the predictive validity of a test is the degree to which the test accurately and consistently predicts the testees' future performance or behavior (Cunningham, 2003). Douglas (2000) considers the TOEIC to be "a good example of a well-constructed norm-referenced traditional multiple-choice test task, with no doubt high reliability, but extremely limited in the inferences it will allow about language knowledge". Moritoshi (2002) also observes a telling absence of any mention of predictive validity in the TOEIC's technical documentation. Since the TOEIC's raison d'être is to predict employees' success at using English on the job, this omission raises some concerns.

A test is said to be reliable if examinees' results would be consistent (i.e., not significantly different) if the examinees were to re-take the test soon after their first attempt. The TOEIC's reliability, estimated using the KR-20 reliability coefficient (Kuder & Richardson, 1937, as cited in Chauncey Group International, 1999), has been approximately 0.95 for the Total score, and 0.91 and 0.93 for the Listening Comprehension and Reading Comprehension sections, respectively (Chauncey Group International, 1999). None of the research this author examined seriously criticized the reliability of the TOEIC. Moritoshi in particular goes to some length to buttress the TOEIC's claims of reliability, though he criticizes its validity. The overall implication is that reliability is less of a concern than validity. This seems reasonable: a test that reliably produces invalid results cannot be considered useful.
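The KR-20 coefficient cited above estimates internal consistency from dichotomously scored (right/wrong) items. A sketch of the formula follows, applied to an invented response matrix; the population-variance convention used here is one common choice:

```python
def kr20(responses):
    """Kuder-Richardson Formula 20: (k/(k-1)) * (1 - sum(p*q) / variance),
    where k is the item count, p is each item's proportion correct,
    q = 1 - p, and the variance is that of examinees' total scores.
    `responses` is a list of answer vectors (1 = correct, 0 = incorrect)."""
    n = len(responses)                  # number of examinees
    k = len(responses[0])               # number of items
    p = [sum(row[i] for row in responses) / n for i in range(k)]
    pq_sum = sum(pi * (1 - pi) for pi in p)
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - pq_sum / variance)

# Perfectly consistent examinees yield a coefficient of 1.0 (up to rounding)
print(kr20([[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]))
```

Values near the TOEIC's reported 0.91-0.95 indicate that the items rank examinees very consistently; as the text stresses, that consistency is independent of whether the items measure a valid construct.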

3 Conclusions

The TOEIC has a powerful and far-reaching political and social impact in business, education and government in several countries across the world. As such an influential social phenomenon, its validity must be scrutinized closely, in the interest of fairness. The TOEIC is a multiple-choice, indirect test. The multiple-choice format in general is considered by many scholars to have high reliability but low validity when assessing communicative tasks, which are "… interaction-based… unpredictable… [have] context… [are] purposive… and behavior-based" (Davies, 2003). As Cunningham (2002) states, "…real-life interaction does not consist of multiple-choice options... the format does not require examinees to demonstrate an ability to use the language; neither are they required to manipulate it." Compounding this concern about the multiple-choice format is a further concern about the assertion that proficiency in receptive skills necessarily implies proficiency in productive skills. The explicit purpose of the TOEIC is to use multiple-choice, receptive testing to predict productive communication in the workplace (Cunningham, 2002). Unbiased evidence to support this "crossover validity," or commutative nature of validity, is remarkably scant. Finally, even if we were persuaded for the sake of discussion to set aside our concerns about the TOEIC's multiple-choice format and crossover construct validity, its validity is further corrupted by the widespread practice of "teaching to the test."

However, the test's ease of scoring and administration and relative cost-effectiveness, its entrenchment in various societal institutions and private industries and their concomitant vested interest in promoting its face validity, plus the lack of any viable alternatives of comparable practicality, all suggest that its widespread acceptance and use will continue for the foreseeable future.

REFERENCES

Alderson, J.C. & Wall, D. (1993). Does washback exist? Applied Linguistics, 14, 115-129.

Bachman, L. (1990) Fundamental Considerations in Language Testing. London, England: Oxford University Press.

Bailey, K. (1998) Learning about Language Assessment. Boston, U.S.A. : Heinle and Heinle.

Chauncey Group International Ltd. (1999) TOEIC® Technical Manual. www.toeic.com/pdfs/TOEIC_Tech_Man.pdf

Chauncey Group International Ltd. (2000) The Latest Word. www.toeic.ca/cae/TheLatestWord.pdf

Cunningham, C. (2002). The TOEIC test and Communicative Competence: Do Test Score Gains Correlate With Increased Competence? www.cels.bham.ac.uk/resources/essays/Cunndiss.pdf

Davies, A. (2003). "Three heresies of language testing research." Language Testing 20 (4), 355-368.

Douglas, D. (2000). Assessing languages for specific purposes. Cambridge, England: Cambridge University Press.

Gilfert, S. (1996). "A Review of TOEIC". The Internet TESL Journal 2.8. http://iteslj.org/Articles/Gilfert-TOEIC.html.

Hirai, M. (2002) "The correlation between Active Skill and Passive Skill Test Scores." Shiken: JALT Testing & Evaluation SIG Newsletter Vol. 6 No. 3 www.jalt.org/test/hir_1.htm

Kuder, G. F., & Richardson, M. W. (1937). "The theory of the estimation of test reliability." Psychometrika, 2, 151-160.

Miller, K. (2003). The Pitfalls of Implementing TOEIC Preparation Courses. http://www2.shikoku-u.ac.jp/english-dept/pitfalls.html

Moritoshi, P. (2003). The Test of English for International Communication (TOEIC): necessity, proficiency levels, test score utilization and accuracy. http://www.cels.bham.ac.uk/resources/Essays.htm

Nicholls, P. and Shackleford, N. (2000). Report for TESOLANZ National Executive on National Qualification Issues http://www.tesolanz.org.nz/qualifications.htm

O'Sullivan, B. (2002). "Using observation checklists to validate speaking-test tasks." Language Testing 19 (1), 33-56.

Robb, T. and Ercanback, J. (1999). A Study of the Effect of Direct Test Preparation on the TOEIC Scores of Japanese University Students http://www.kyoto-su.ac.jp/~trobb/toeic.html
