PERFORMANCE APPRAISAL METHODS AND FORMATS
A Note on Trait Ratings
The trait approach usually asks the rater to evaluate persons on the extent to which each possesses such traits as dependability, friendliness, carefulness, loyalty, ambition, kindness, courtesy, obedience, trustworthiness, and so on. A great deal of research indicates that traits are unstable within individuals and across situations. Most trait-rating approaches pay little or no attention to the context of behavior. Both height and physical attractiveness have been shown to be predictors of subsequent success in management and other areas. Trait scales are appropriate if the traits are operationally defined by specific job behavior, as they are, for example, with behaviorally anchored rating scales.
Weighted Checklists
A weighted checklist consists of statements, adjectives, or individual attributes that have been previously scaled for effectiveness. Most of the empirical research on checklists has involved comparisons with summated scales.
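The scoring idea behind a weighted checklist can be sketched as follows. The statements and effectiveness weights below are invented for illustration; the source does not specify a scale, so an 11-point effectiveness continuum is assumed here.

```python
# Hypothetical weighted checklist: each statement carries a previously
# scaled effectiveness weight; a ratee's score is the mean weight of the
# statements the rater checks as descriptive.

weights = {
    "Completes assignments ahead of schedule": 9.2,   # assumed weight
    "Requires frequent reminders about deadlines": 2.1,
    "Volunteers to help coworkers": 7.8,
    "Leaves tasks unfinished": 1.4,
}

def checklist_score(checked):
    """Mean effectiveness weight of the checked statements."""
    vals = [weights[s] for s in checked]
    return sum(vals) / len(vals)

rating = checklist_score([
    "Completes assignments ahead of schedule",
    "Volunteers to help coworkers",
])
print(round(rating, 2))  # (9.2 + 7.8) / 2 = 8.5
```

Because the rater never sees the weights, the checklist format is thought to reduce deliberate distortion in much the same way as forced choice.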
Summated Scales
Once the declarative statements have been written and the response format and number of scale points selected, the next step is to organize the sequence of the declarative statements on the rating format. Most summated scales are set up with a series of items, each followed by a format such as "Strongly agree, agree, undecided, disagree, and strongly disagree."
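A minimal sketch of scoring such a summated (Likert-type) scale follows. The five response options above are mapped to 1 through 5; the item wording and which items are reverse-keyed are assumptions for illustration.

```python
# Scoring a summated scale: responses map to points 1-5, negatively worded
# (reverse-keyed) items are flipped, and the item scores are summed.

SCALE = {"strongly disagree": 1, "disagree": 2, "undecided": 3,
         "agree": 4, "strongly agree": 5}

def summated_score(responses, reverse_keyed):
    """responses: one response label per item, in item order.
    reverse_keyed: set of item indices scored in the negative direction."""
    total = 0
    for i, r in enumerate(responses):
        pts = SCALE[r]
        if i in reverse_keyed:
            pts = 6 - pts  # flip a 1-5 scale
        total += pts
    return total

print(summated_score(["agree", "strongly agree", "disagree"], {2}))
# item scores: 4 + 5 + (6 - 2) = 13
```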
Critical Incidents
The critical incident method consists of collecting reports of behaviors that are considered "critical" in the sense that they make a difference in the success or failure of a particular work situation. The incident is defined as "critical" by an observer, who also makes a judgment as to its effectiveness.
Behaviorally Anchored Rating Scales
BARS may be described as graphic rating scales with specific behavioral descriptions at various points along each scale, as shown in Exhibit 4.9. Each scale represents a dimension or factor considered important for work performance. To summarize the original BARS procedure, the sequence was to be (1) observation, (2) inference, (3) scaling, (4) recording, and (5) summary rating. The process sought to define, to clarify, and to put into operation the implicit evaluative theory of the rater. The purpose was to encourage observation and explicit formulation of the implications and interpretations of behavior. It is really the emphasis on the approach as a method of enhancing future observations that distinguishes it from other approaches, such as forced-choice, summated, and simple graphic scales. The approach has nonetheless drawn several criticisms. One of these has to do with the potential nonmonotonicity of the scales due to the behavioral anchors used on each scale. Regardless of the minimum requirements set for the standard deviations of item effectiveness, distributions of incidents overlap in perceived effectiveness. Raters "often have difficulty discerning any behavioral similarity between a ratee's performance and the highly specific behavioral examples used to anchor the scale." To resolve this problem, Borman, Toquam, and Ross developed "behavior summary scales," in which the behavioral examples are not located by their specific levels of effectiveness.
Mixed-Standard Scales
Mixed-standard rating scales consist of sets of conceptually compatible statements (usually three)
that describe high, medium, and low levels of performance within a dimension. Statements are
randomized on the rating form, and the dimension that each statement represents is not obvious.
Mixed-standard scales (MSS) were designed to inhibit error in ratings, particularly regarding the
tendency to be lenient. Another supposed advantage of MSS is that the scoring system is such that
"illogical" raters can be identified.
Forced-Choice Scales
Forced choice is a rating technique specifically designed to increase objectivity and to decrease biasing factors in ratings. Introduced by Robert Wherry in the early 1940s to reduce error and to increase validity, a forced-choice scale is a checklist of statements that are grouped together according to certain statistical properties. The basic rationale underlying the approach is that the statements that are grouped have equal importance or social desirability. The rater is "forced" to select from each group of statements a subset of those statements that are "most descriptive" of each ratee. With this approach, raters have difficulty deliberately distorting ratings in favor of or against particular individuals because they have no idea which statements of each group will ultimately result in higher (or lower) ratings.
The Forced-Choice Format
Items for forced-choice scales are arranged according to at least two statistical properties of each of the statements. The first is an importance, favorability, or social desirability index, which indicates the extent to which a statement reflects the "niceness," attractiveness, or social desirability of the behavior or characteristic it describes. The second property is a discriminability index, which reflects the extent to which a statement describes a behavior or a characteristic that distinguishes superior employees from others; this index is thus a measure of validity.
The Construction of Forced-Choice Scales
The first step in the process is to obtain descriptions of worker behavior or characteristics. Because of the criteria for grouping items, a great many statements must be written at this initial stage of the process (i.e., so one can end up with the much smaller number likely to result from the grouping process). One method of computing the importance index is to combine the ratings on the most and least effective subordinate for each item and to take the average of this sum. In other words, by combining effective and ineffective subordinates' ratings, one computes the mean applicability of each statement. The discriminability index can be computed by finding, for each statement, the difference between the ratings for the effective subordinate and the ratings for the ineffective subordinate. A high discriminability index would indicate that the statement discriminates well between an effective and an ineffective subordinate. A t-test could also be used to determine the significance of the difference between ratings on the two subordinates. The next step in the process of constructing the instrument is to group the statements into items. The items are then grouped according to the category for which they were selected, their importance value, and their discriminability.
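The two indices just described can be sketched in code. The example assumes each statement was rated for applicability (here on a 1-9 scale) once for a "most effective" and once for a "least effective" subordinate by several judges; the rating values are invented for illustration.

```python
# Importance and discriminability indices for one forced-choice statement.

def importance_index(eff, ineff):
    """Mean applicability across ratings of both subordinates
    (the combined-and-averaged computation described in the text)."""
    combined = eff + ineff
    return sum(combined) / len(combined)

def discriminability_index(eff, ineff):
    """Mean rating for the effective subordinate minus the mean rating
    for the ineffective subordinate."""
    return sum(eff) / len(eff) - sum(ineff) / len(ineff)

eff_ratings = [8, 7, 9, 8]    # applicability to most effective subordinate
ineff_ratings = [3, 2, 4, 3]  # applicability to least effective subordinate

print(importance_index(eff_ratings, ineff_ratings))       # 5.5
print(discriminability_index(eff_ratings, ineff_ratings)) # 8.0 - 3.0 = 5.0
```

Statements with similar importance values but different discriminability are then grouped into items, so the rater cannot tell which choice is the valid one.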
The Distributional Measurement Model
Kane and Lawler stated that an appraisal system preoccupied with the measurement of a ratee's
"typical" performance ignores "the massive accumulation of evidence that performance...is
determined at least as much by variable intra- and extra-individual factors as by traits" (traits being
fixed intra-individual factors). This conception is at odds with the widely recognized view of
performance as a function of both ability (fixed factors) and effort (variable factors). In this view,
variations around a person's mean level of achievement are seen as reflecting variations in effort,
and they therefore constitute nonerror characteristics of the person's record of achievement. From
this alternative perspective, performance cannot be conceived as a point representing a person's
average or modal level of achievement; instead, it must be thought of in terms of a distribution of
occurrence rates over the range of possible achievement levels (on each performance dimension).
Thus, although the average level of performance is, of course, an important source of information,
it does not convey all meaningful differences in performance across individuals or within
individuals over appraisal periods. Kane maintained that a second problem with most appraisal
methods is their reliance on standardized scales to make comparisons between ratees.
Standardization can provide this capability only at the cost of losing the capability of accounting
for any unique circumstances that may affect individual work performance. Kane's third criticism
of the traditional appraisal methods is that they place an unnecessary cognitive demand on the
rater. That is, most methods require a recall of relative frequencies, an assessment of the average
performance on each dimension, and then an evaluation of the effectiveness of the average
performance. Thus, three problems beset almost all appraisal methods: they consider only
average performance, they fail to take into account extraneous factors that affect individual
performance, and they place unnecessary cognitive demands on the rater.
Behavioral Discrimination Scales
Briefly, the steps are as follows: First, a large pool of statements describing the full range of
effective-to-ineffective behaviors or outcomes is generated through a job analysis procedure
similar to that of the critical incident technique. Second, after editing, items are grouped by their
generic similarity, and one general statement is written for each group of items. Items are
grouped into performance specimens in order to reduce the number of items that need to be rated
and to increase reliability on the specimens rated. Then, as the third step, the performance
specimens are included in a questionnaire administered to a sample of at least 20 job incumbents
and/or their supervisors. In the fourth step, data from the two questionnaires are analyzed by
converting the responses to question 2 to percentages of question 1 (for both forms and within
each specimen). A t-statistic can then be used to test the difference between mean percentages for
the two forms for each specimen. If a t-value does not reach the .01 p-level for significance, that
specimen is dropped since there are apparently no definitive standards for satisfactory and
unsatisfactory performance. The rationale for this criterion is to retain only those items that
discriminate well between satisfactory and unsatisfactory performance. The fifth step in the
process is to compute each specimen's median-occurrence percentage (question 1) and mean
rating (question 3) across both forms. These data provide for the computation of the
extensity-scale value, the measure of occurrence-rate goodness. The sixth step is to construct the
rating form.
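The screening in step 4 can be sketched as follows. All data values are invented, and the hard-coded critical value 2.88 approximates the two-tailed .01 cutoff of the t distribution for the 18 degrees of freedom in this example; a real analysis would look up the critical value for its own sample sizes.

```python
# Step 4 sketch: occurrence responses have already been converted to
# percentages; a pooled-variance t statistic compares mean percentages for
# the "satisfactory" and "unsatisfactory" forms of one performance specimen.

from statistics import mean, variance

sat = [80.0, 75.0, 90.0, 85.0, 70.0, 88.0, 77.0, 82.0, 91.0, 79.0]
unsat = [30.0, 25.0, 40.0, 20.0, 35.0, 28.0, 33.0, 22.0, 38.0, 26.0]

def two_sample_t(a, b):
    """Pooled-variance t statistic for two independent samples."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

t = two_sample_t(sat, unsat)
keep_specimen = abs(t) > 2.88  # retain only clearly discriminating specimens
print(keep_specimen)           # True for this (invented) specimen
```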
The Performance Distribution Assessment Method
PDA is designed for jobs that have fewer than 20 incumbents for any one position. The actual
rating process consists of simply having the rater indicate the actual occurrence rates of the
outcomes specified for each job component.
Personnel-Comparison Systems
Personnel-comparison systems require the rater to make relative comparisons between the ratees in
terms of a statement or statements of performance or organizational worth.
Paired Comparisons
With the paired-comparison system, the number of comparisons required of the raters is based on
the simple formula [N(N - 1)]/2 = number of pairs.
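The formula above can be checked directly against an enumeration of the pairs; the ratee names are invented for illustration.

```python
# Number of judgments a rater must make under paired comparisons,
# computed by the formula and verified by listing the actual pairs.

from itertools import combinations

ratees = ["Avery", "Blake", "Casey", "Drew", "Ellis"]

n = len(ratees)
num_pairs = n * (n - 1) // 2           # [N(N - 1)]/2
pairs = list(combinations(ratees, 2))  # each pair the rater must judge

print(num_pairs)    # 5 * 4 / 2 = 10
print(len(pairs))   # 10 -- matches the formula
```

The quadratic growth of this count is the method's main practical drawback: 10 ratees require 45 comparisons, and 30 ratees require 435.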
Rank Ordering
With this procedure, one begins by first selecting the best person and then the worst person. Of
those who remain to be rated, the second best person is then selected followed by the second
worst person. Rank ordering produces only ordinal data. It is easy to convert ordinal data to
interval data if we can assume an underlying normal distribution.
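One common way to make that conversion, sketched below under the stated normality assumption, is to map each rank to a percentile and then to the corresponding normal deviate (z score). The mid-rank percentile formula used here is one conventional choice, not something specified by the source.

```python
# Converting ranks to interval-like scores under an assumed normal
# distribution: rank -> mid-rank percentile -> standard normal deviate.
# Mid-rank percentiles (rank - 0.5) / N keep the extreme ranks off the
# infinite tails of the distribution.

from statistics import NormalDist

def ranks_to_z(n):
    """z score for each rank 1 (best) through n (worst), best first."""
    nd = NormalDist()  # standard normal
    return [nd.inv_cdf((n - rank + 0.5) / n) for rank in range(1, n + 1)]

zs = ranks_to_z(5)
print([round(z, 2) for z in zs])  # [1.28, 0.52, 0.0, -0.52, -1.28]
```

Note that the resulting z scores are equal-interval only if the normality assumption holds for the group being ranked.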
Forced Distribution
With this method, the rater is usually given five or seven categories of performance and is instructed to "force" a designated portion of the ratees into each category.
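The forcing step can be sketched as below. The 10/20/40/20/10 percent split is a commonly used example, and the ratee names are invented; the source does not prescribe particular proportions.

```python
# Forcing a rank-ordered list of ratees into five performance categories
# with a designated proportion in each category.

def force_distribution(ranked, proportions):
    """ranked: ratees ordered best to worst.
    proportions: fraction of ratees per category, top category first."""
    n = len(ranked)
    out, start = {}, 0
    for i, p in enumerate(proportions, 1):
        size = round(p * n)
        if i == len(proportions):   # last category absorbs any remainder
            size = n - start
        out[f"category {i}"] = ranked[start:start + size]
        start += size
    return out

ranked = [f"ratee{i}" for i in range(1, 11)]
buckets = force_distribution(ranked, [0.10, 0.20, 0.40, 0.20, 0.10])
print([len(v) for v in buckets.values()])  # [1, 2, 4, 2, 1]
```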
Personnel comparisons usually require that the comparisons be based on a single statement of
ability or performance. While such a procedure will yield reliable ratings, the ratings may have
little or no validity. Another disadvantage is that they can create debilitating friction among those
who are rated. It is also difficult to compare the rankings across raters, work groups, and work
locations.
Management-by-Objectives
Management-by-objectives (MBO) is the popular name for a process of managing that can focus on the performance of individuals in
organizations. In general, it is a goal-setting process whereby objectives may be established for
the organization, each department, each manager within each department, and each employee.
MBO is not a measure of employee behavior; it is an attempt to measure employee effectiveness,
or contribution to organizational success and goal attainment. The MBO process usually consists
of three steps: Mutual goal setting. Freedom for the subordinate to perform. Performance
review. Covaleski and Dirsmith found evidence that goals may be dysfunctional. They
concluded that when MBO was used as a planning and control technique within a hospital setting,
it led to dysfunctional decision making, but that at the subunit level it may have served as a
catalytic agent in encouraging decentralization. The point here is that individual goals and the
system's goals may not be completely compatible.
Work Planning and Review
Work planning and review (WP&R) is similar to MBO, but it places a greater emphasis on the periodic review of work plans by the supervisor and the subordinate in order to identify goals attained, problems encountered, and needs for training.