Journal of the English Linguistic Science Association of Korea. Vol. 4, January 2000

 

 

A Diagnostic Approach to English Test Items through Item Information Description (IID)

Sang-Kook Park

Kwang-Young High School

 

 

Sang-Kook Park. (skpark12@unitel.co.kr). Kwang-Young High School. A Diagnostic Approach to English Test Items through Item Information Description (IID). Journal of the English Linguistic Science Association of Korea, Vol. 4, January 2000.

 

This paper attempts to develop a descriptive way of reporting practical test item information for regular high school tests in Korea. Until now, regular tests have shown us little detail about the particular abilities of individual students, such as their special gifts and talents. In general, the minimal requirements of a test are reliability (rater reliability, inter-rater reliability), validity (construct validity, content validity, face validity), and considerations from item response theory (IRT), all of which have to be taken into account. As a matter of fact, however, it is very difficult for a school regular test to satisfy all of the elements discussed above. Performance assessment has, of course, been applied partially to the first grade of high school since 1999. Nevertheless, multiple-choice formats still dominate classroom English achievement testing, and their diagnostic power is limited. All teachers should set up a Multi-Purpose List for each Course (MPLC) by consensus before writing test items. Each course's MPLC consists of item number, behavioral characteristics, range of lesson, test content, item difficulty, correct response, score, and remarks. In the MPLC examined here, the item writer described items 1 and 13 as advanced, but the test results showed them to be base-level; items 3 and 18, described as base, turned out to be advanced; and item 7, described as advanced, turned out to be intermediate. Thus, this paper shows that these items need to be fed back to the item writer.

 

I. Introduction

 

In this paper, we will look in more detail at a practical school regular mid-term or end-of-term test and at the idea of item information description (hereafter IID), using the results of the first-semester end-of-term test at Kwang-Young High School in Seoul. In many cases, school traditional assessment methods (STAM) report results to us only as a rank based on total score, which is treated as if it were the same as the persons' abilities. Henning (1987) has discussed this problem, providing a simple and clear summary for language testers of the procedures of traditional analysis. In general terms, a traditional analysis of test scores typically provides information on item difficulty and item discrimination, as well as global estimates of overall test reliability. The difficulty of items will thus inevitably depend on the ability of the group that is used in trialing the item. Item discrimination is handled in terms of whether or not an item discriminates between those who do well on the test overall and those who do poorly. In other words, we would require of an item of moderate difficulty that it be answered correctly by a substantially higher proportion of those scoring well on the test overall than of those scoring poorly overall. An item which is answered correctly by those whose overall score on the test is low, but not by those with high overall scores, is considered unsatisfactory.

Recently, most high school students have been assessed by traditional test methods in regular achievement tests. As a result, we cannot show the details of the abilities of individual students, such as their special gifts and talents. STAM also makes it very difficult to describe the properties of individual test takers' abilities or the characteristics of raters, and most teachers and students receive no feedback from the results of regular school achievement tests. First of all, we need diagnostic item information that can be used to suggest strategies for remedial English teaching. If the trial students are typical of the intended test population, then there is reasonable coverage of the ability range. Because a test item has the greatest power to define the ability of test takers whose ability matches its difficulty, we will consider students in that range of ability; what is known technically as the item information function in Rasch measurement is greatest in this region. In order to illustrate the point to be made about IID, we will consider a real set of data derived from the 18 multiple-choice items of a STAM test taken by class 6 of the third grade at Kwang-Young High School. For the sake of simplicity, we will begin with a relatively straightforward data set and present the basic procedures of test analysis and the difficulty of items through the IID in more detail. Because of limited space in this volume, however, we will present only a brief treatment of the test items through the ideas of Rasch analysis.
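
To make the item information function concrete, here is a minimal Python sketch assuming the standard dichotomous Rasch model; the function names are illustrative and do not come from any particular package:

```python
import math

def rasch_prob(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_information(ability: float, difficulty: float) -> float:
    """Fisher information of a Rasch item, P * (1 - P); it peaks (at 0.25)
    where the candidate's ability equals the item's difficulty."""
    p = rasch_prob(ability, difficulty)
    return p * (1.0 - p)

# An item of difficulty 0.0 logits is most informative for candidates
# whose ability is also near 0.0 logits.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"ability {theta:+.1f}: P = {rasch_prob(theta, 0.0):.2f}, "
          f"info = {item_information(theta, 0.0):.3f}")
```

This is why an item tells us most about candidates whose ability matches its difficulty: away from that point, responses become too predictable to be informative.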

 

II. Review of literature

 

To describe a school English test, we should first discuss the theoretical background of language testing. ESL teachers, curriculum writers, and test developers of recent decades have been enthusiastic about reflecting authentic and communicative target language use (TLU) situations in the materials they produce (Bachman & Palmer, 1996), including audiotapes, videotapes, CD-ROM titles, and so on. The recent trend in language testing is also toward testing language use in a wider communicative sense. In second and foreign language (L2) listening comprehension testing, empirical work has been carried out to investigate the value of authentic and communicative factors such as kinesics (Kellerman, 1992; Pennycook, 1985), schemata (Jensen & Hansen, 1995; Long, 1989), and speech modifications (Chiang & Dunkel, 1992). To provide authentic language use situations, ESL teachers and test developers have been enthusiastic about introducing other input types from electronic media (e.g., computer, satellite TV, video) into listening test development. Among these, videotaped materials are expected to be the most cost-effective tools for delivering authentic language use contexts.

Furthermore, among testing tools, Rasch measurement theory offers attractive solutions to these practical problems of measurement. It enables estimates of a candidate's underlying ability to be made by analysing the candidate's performance on a set of items, after allowance has been made for the difficulty of those items.

Thus the ability estimates (known as measures) are not simply dependent on the particular items that were taken; we have avoided the trap of assuming that ability is transparently visible in raw scores. Similarly, the underlying difficulty of items can be estimated from the responses of a set of candidates by taking into account the ability of those candidates and the degree of match between the ability of the trial group and the difficulty of the items.

III. Description of specification (Multi-Purpose List)

Before writing items for a regular mid-term or end-of-term test, all teachers should set up a Multi-Purpose List of Course (MPLC). In designing testing materials, teachers and test developers still need evidence of test usefulness. Until now, however, multiple-choice formats have dominated English achievement testing in Korean high school mid-term and end-of-term tests, as well as selection and placement tests, and with them the diagnostic power of those tests. Each MPLC consists of item number, behavioral characteristics, range of lesson, test content, item difficulty, correct response, score, and remarks. Because this paper discusses only multiple-choice items, we will consider performance assessment items in more detail in the near future. To provide construct validity, reliability, evidence on rater effects, and support for the school assessment method, all teachers must reach a consensus on all items, which need to be judgmentally placed into levels of comprehension difficulty based on the MPLC, following the item specification (see MPLC).
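
As an illustration only, one MPLC row might be represented as a simple record. The field names below follow the column headings listed above, but the schema and the sample entry are our own assumptions:

```python
from dataclasses import dataclass

@dataclass
class MPLCEntry:
    """One row of a Multi-Purpose List of Course (MPLC).

    The field names follow the column headings given in the text;
    the types and the sample values are assumptions for illustration.
    """
    item_number: int
    behavior_characteristic: str  # e.g., "knowledge", "comprehension"
    lesson_range: str             # which lessons the item covers
    test_content: str             # what the item is meant to measure
    difficulty: str               # planned level: "base", "intermediate", "advanced"
    correct_response: int         # keyed option (1-5)
    score: float                  # points awarded for a correct answer
    remarks: str = ""

# A hypothetical entry: item 1, keyed to option 2, planned as advanced
# (the analysis in Section IV suggests it functioned as base-level).
item1 = MPLCEntry(1, "comprehension", "Lesson 1", "main idea of a passage",
                  "advanced", 2, 2.0)
```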

IV. Representing score

1. Raw score

Below (see Raw score), we present the raw scores (class 6 of the third grade) of the test results based on the Multi-Purpose List (the test specification). The raw-score table records the initial data for the candidates: each response is coded as correct (1) or incorrect (0), together with each candidate's total score, the number of correct responses per item, the total number correct, and so on. The information available to us is thus represented in the raw-score data matrix of numerically coded responses.

However, it is difficult to extract full details about the items from these data alone. If we make two simple reorganizations of the data matrix, one involving candidates and a second involving items, an obvious pattern begins to emerge, of the kind which is captured and summarized in Rasch analysis. Thus, we need a reorganization of the initial data matrix, as follows (see the first revision of the raw data).

2. First revision of the raw data

The first reorganization (see the first revision) involves arranging the responses of candidates not in name order but in order of the test takers' total scores. In other words, in the first revision of the raw data, the test takers with high scores (many correct answers) appear toward the top. The total score is shown in the right-hand column: the test takers with the highest scores are found at the top of the matrix, those with lower scores toward the bottom, and the line at the bottom of the page tells us how many people answered each item correctly. Correct answers (coded "1") thus cluster toward the top, while incorrect answers (coded "0") are scattered toward the bottom.

3. Second revision of the raw data

The second reorganization (see the second revision) involves arranging the items not in sequence order but in difficulty order, with the most difficult items to the right and the easiest to the left. The resulting form of the data matrix is shown in the second revision of the raw data, and it is the patterning in this table that will now be the focus of our discussion. It shows that the item writer misjudged the abilities of the test takers on 5 of the 18 items: in the MPLC, the writer described items 1 and 13 as advanced, but the test results showed them to be base-level; items 3 and 18, described as base, turned out to be advanced; and item 7, described as advanced, turned out to be intermediate. An obvious pattern now emerges in the data. The matrix has an area in the upper left quadrant where 1s predominate; this represents the area where the most able (highest-scoring) candidates are attempting the questions. In the lower right quadrant, by contrast, 0s and non-attempts predominate and there are relatively few 1s; this represents data from the lowest-scoring candidates on the most difficult questions. An intermediate zone running from upper right to lower left shows a fairly mixed pattern of 0s and 1s; this represents the responses of candidates to questions which are neither too easy nor too difficult for them, although this zone is not clearly visible in the present test. On most tests, candidates have approximately a 50 percent chance of getting a 0 or a 1 on such items, which can be seen as lying in the difficulty range of challenge, but not of impossibility, for the candidate. It is possible to define a candidate's ability in terms of which items represent this level of difficulty; that is, item difficulty can be used to define candidate ability.
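
Concretely, the two revisions are simple sorts of the scored data matrix. The following Python sketch illustrates them on an invented toy matrix (the paper's data set has 18 items and, judging from the correct rates, 49 candidates):

```python
import numpy as np

# Toy scored matrix: rows = candidates, columns = items
# (1 = correct, 0 = incorrect). The values are invented for illustration.
data = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
])

# First revision: order candidates by total score, highest scorers at the top.
rev1 = data[np.argsort(-data.sum(axis=1))]

# Second revision: order items by difficulty,
# easiest (most often answered correctly) at the left.
rev2 = rev1[:, np.argsort(-rev1.sum(axis=0))]

print(rev2)  # 1s now cluster in the upper left, 0s in the lower right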

V. Modelling the data

The analysis attempts to model the data matrix, that is, to summarize the observed patterning through a set of simple relations expressed formally in mathematical terms. The model is used to generate predictions about each observation. Any discrepancies between the predicted and the actual observations are noted and reported, and the success or otherwise of the attempt to model the data is evaluated. If most of the complexity can be accounted for by the model, then we have good fit, and the model is a satisfactory way of accounting for, and hence summarizing, the data in simpler terms. In its simplest form (the one used with correct/incorrect responses, as here), the Rasch model proposes that the observations in a data matrix such as this can be predicted by the following mathematically simple relation between the ability of the candidates and the difficulty of the items:

 

log(Pni / (1 - Pni)) = Bn - Di

   Pni = the probability of a correct response by a particular person (n) on a particular item (i),

   Bn = the ability (B) of person n, and

   Di = the difficulty (D) of item i

 

In other words, the chances of a correct response are a function of the difference between the person's ability and the difficulty of the item, and nothing else. Although we have expressed the probability of a right or wrong response on an item in mathematical notation above, it is a more difficult problem to work backwards and estimate item difficulties and person abilities from the observed responses. In Rasch analysis, this is done through a mathematical procedure known as maximum likelihood estimation.
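
As a rough illustration of the idea, the sketch below implements a crude joint maximum likelihood estimation by fixed-step gradient updates; real Rasch software uses more careful algorithms, and this toy version ignores known complications such as candidates with perfect or zero scores:

```python
import numpy as np

def rasch_jmle(data: np.ndarray, n_iter: int = 500, lr: float = 0.05):
    """Toy joint maximum likelihood estimation for the dichotomous Rasch model.

    data: 0/1 matrix, candidates x items. Returns (abilities, difficulties)
    in logits, with mean item difficulty fixed at zero for identification.
    """
    n_persons, n_items = data.shape
    b = np.zeros(n_persons)  # person abilities, Bn
    d = np.zeros(n_items)    # item difficulties, Di
    for _ in range(n_iter):
        # Model-predicted probability of a correct response for every cell.
        p = 1.0 / (1.0 + np.exp(-(b[:, None] - d[None, :])))
        resid = data - p                 # observed minus expected
        b += lr * resid.sum(axis=1)      # likelihood gradient w.r.t. abilities
        d -= lr * resid.sum(axis=0)      # likelihood gradient w.r.t. difficulties
        d -= d.mean()                    # anchor the scale
    return b, d
```

The residual (observed minus expected) drives both sets of estimates: persons who succeed more often than predicted move up in ability, and items answered correctly more often than predicted move down in difficulty, until the predictions match the data as closely as possible.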

VI. Items analysis list

Below, we show how information about each test item can be described in detail.

Item 1, for example, shows a correct-response rate of 91.8%; the response counts are 0 for option 1, 45 for option 2 (the key), and 1 each for options 3, 4, and 5. Examining the other items in the same way yields similar information for each test item.

Items Analysis List (class 6, third grade, first-semester end-of-term English test)

Item   Correct    1    2    3    4    5   Correct rate
  1       2       0   45    1    1    1      91.8%
  2       3       1    0   43    3    1      87.8%
  3       3       2    2   24   10   10      49.0%
  4       5       9    2    4    1   32      65.3%
  5       1      32    1    3    6    6      65.3%
  6       4       4    2   10   31    1      63.3%
  7       2       2   26   14    5    1      53.1%
  8       5       6    8    1    6   27      55.1%
  9       4       4    1    4   38    1      77.6%
 10       1      24    8    5    3    8      49.0%
 11       5       2    0    0    2   44      89.8%
 12       4       4    1    1   41    1      83.7%
 13       1      36    1    3    2    6      73.5%
 14       4       5    4    3   27    9      55.1%
 15       4       0    1    1   42    4      85.7%
 16       3       3    2   38    4    1      77.6%
 17       4       4    4    5   31    4      63.3%
 18       4      10   10    0   23    4      46.9%
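
These rows can be computed directly from the raw option choices. Below is a minimal Python sketch; the response vector is hypothetical, constructed only to reproduce item 1's 91.8% rate (45 of 49 candidates choosing keyed option 2), and the exact distractor counts are invented:

```python
from collections import Counter

def item_analysis(responses, key):
    """Count choices per option (1-5) and compute the correct-response
    rate, as in the Items Analysis List above."""
    counts = Counter(responses)
    rate = 100.0 * counts[key] / len(responses)
    return {opt: counts.get(opt, 0) for opt in range(1, 6)}, rate

# Hypothetical data: 49 candidates, 45 choosing the key (option 2).
responses = [2] * 45 + [1, 3, 4, 5]
opts, rate = item_analysis(responses, key=2)
print(opts, f"{rate:.1f}%")  # {1: 1, 2: 45, 3: 1, 4: 1, 5: 1} 91.8%
```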

VII. Conclusion

So far, we have discussed a practical school regular mid-term or end-of-term test and the basic idea of item information description. All teachers should set up an MPLC before writing test items. The raw-score table records the initial data for the candidates, coding each response as correct (1) or incorrect (0), together with total scores, numbers of correct responses, and so on. The first reorganization (see the first revision) arranges the responses of candidates not in name order but in order of total score. The second reorganization (see the second revision) arranges the items not in sequence order but in difficulty order.

We have not, however, carried out a more detailed analysis with MIRT (multidimensional item response theory), the logit scale (which has the advantage of being an interval scale: it can tell us not only that one item is more difficult than another but also how much more difficult it is), and so on. In another paper, we will take the opportunity to interpret item fit and person fit, to map skills and abilities (item-ability maps, skill-ability maps), and to apply Rasch analysis in research on foreign language performance assessment.

 

 

 

 

References

Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press.

Davidson, F. & Lynch, B. (1999). Testcraft. Unpublished manuscript, University of Illinois at Urbana-Champaign.

Henning, G. (1992). Dimensionality and construct validity of language tests. Language Testing, 9.

Lee, Y. S. (1998). An investigation into Korean markers' reliability for English writing assessment. English Teaching, 53(1).

McNamara, T. F. (1996). Measuring second language performance. New York: Longman.

Shin, D. I. (1997). Application of a multidimensional IRT model in understanding TOEFL listening test data. English Teaching, 52(4).

 

 

Author's Address

 

Sang-Kook Park

2-507, Jinju Apt., #22-2, Munrae 5ga, Youngdeungpo, Seoul, Korea 150-095

 

E-mail: skpark12@unitel.co.kr

Phone: 02)2634-8544/011-9885-9002

 

Received: November 24, 1999

Revision Received: December 20, 1999

Accepted: January 8, 2000