AN ANALYSIS OF THE PENN STATE
STUDENT RATING OF TEACHING EFFECTIVENESS
A Report Presented to the University Faculty Senate
of The Pennsylvania State University
September 9, 1997
Michael J. Dooris
Office of Planning and Institutional Assessment
This report reviews the Penn State Student Rating of Teaching Effectiveness (SRTE), especially with respect to the ratings on the quality of instructors, in the context of the research literature on student ratings. That literature is extensive, with hundreds of studies dating from the 1920s.
The SRTE appears to be a dependable instrument that probably produces consistent student ratings of individual instructor quality, when enough observations are taken. Expected grades statistically explain only about five percent of the variation in student ratings of instructor quality. The effects of other factors (such as class size and course level) on student ratings of instructor quality are even weaker. The findings on the Penn State SRTE are mostly similar to the results of other research on student ratings. Various measures of quality (quality of the course, quality of the instructor, department-selected items) on the SRTE tend to confirm one another.
A number of questions cannot be addressed on the basis of SRTE data alone. What is the effect of student gender? Instructor gender? Instructor rank? Time of day in which a course occurs? Student major? In the research literature, these variables typically have not been shown to have substantive effects on student ratings.
The question of validity remains. Does the locally developed SRTE in fact measure the quality of courses and instructors? This is impossible to answer on the basis of SRTE data alone, and the question still appears to be unanswered. The extensive research literature shows that professionally developed student ratings are reliable and valid indicators of the quality of instruction. A handful of such instruments have been thoroughly tested, are nationally normed, are widely used, and are available.
Student ratings are not the only possible source of information about teaching effectiveness. They may not be the best source of that information. However, they are certainly the most widely used and visible means of teaching evaluation. Nationally, researchers have conducted hundreds of studies on the reliability and validity of student ratings. This analysis summarizes some of the findings from that research literature.
In April 1985, Penn State’s University Faculty Senate mandated the use of what has come to be known as the SRTE - the locally developed Student Rating of Teaching Effectiveness. A complete psychometric evaluation of the Penn State SRTE is beyond the scope of this report. This report does provide new information, and reviews other information, about the Penn State SRTE. The emphasis of this analysis is upon the use of the SRTE to evaluate instructors.
Research on Student Rating Instruments
There have been decades of extensive research on student ratings. That research can answer many persistent questions and dispel some widely held misconceptions. For example, many faculty and administrators believe that student ratings are only popularity contests that have little to do with learning. Such a hypothesis can be readily tested - for example, by analyzing student ratings and grades in large, multiple-section courses (such as introductory chemistry or physics) taught by several instructors who use common textbooks and give identical examinations. Many such studies have, in fact, been conducted. Arreola (1995), Aleamoni (1987), Feldman (1978), and Theall and Franklin (1990) have reviewed the hundreds of studies that date as far back as the 1920s on this topic. These authors acknowledge that student-rating systems do have limitations. There are differences in what specific instruments are intended to measure, how appropriate they are to different institutional settings, and how they should be used (for example, for teaching improvement or for personnel decisions). Nonetheless, none of the systematic reviews cited above supports blanket resistance to student ratings. The following conclusions are consistent and clear:
1. The evidence does not support the claim that student evaluations are simply popularity contests. It is true that students prefer friendly, humorous instructors. Nonetheless, separate ratings of traits such as instructor warmth have not correlated well with student ratings of teaching effectiveness. Much evidence shows that students are discriminating judges of instructional effectiveness. This is especially true for professionally developed (versus homegrown) ratings.
2. There is little support for the proposition that students can only make accurate judgments after they have been away from the course and/or the university for several years. This is a difficult hypothesis to test conclusively, but studies tend to show that five to ten years after graduation, alumni rate instructors about the same as do current students.
3. Professionally developed instruments are both valid and reliable. Rigorous psychometric procedures have been applied to develop and test a handful of readily available, nationally normed, widely used student rating forms. Reliability - the extent to which a scale produces the same results on repeated trials - can be readily tested and is quite high for such instruments; it is often in the range of .8 to .9. Validity - the extent to which ratings measure what they are intended to measure - is more difficult to test and in fact has not been tested as often. However, students’ ratings have been positively correlated with other indicators of teacher competence, such as peer reviews and measures of student learning.
4. Locally developed student rating forms are often of questionable reliability and validity, according to the same body of research. Validating such instruments is a significant undertaking.
5. It is a misconception that because students prefer smaller classes, instructors of large classes receive poorer ratings. Students do prefer smaller classes. However, various studies have not shown a consistent relationship between class size and student ratings.
6. There is mixed evidence on the effects of students’ gender and instructors’ gender on student ratings. Some studies have found no differences; some studies have found female students to give lower ratings to male instructors than to female instructors; some studies have found female instructors to receive higher ratings regardless of students’ gender. No consistent pattern has emerged in the literature on this topic.
7. Research suggests that the time of day that a course is offered does not influence student ratings, but this evidence is relatively limited.
8. The bulk of the evidence suggests that students who are required to take a course may rate it more poorly than do students taking it as an elective. However, some studies have shown no difference.
9. It appears that whether students are majors or non-majors has no effect on their ratings of a particular course. However, this question has not been as deeply researched as some others discussed here.
10. There is some indication that upper-division and graduate students tend to rate instructors somewhat more favorably than do lower-division students.
11. It is a misconception that students prefer senior faculty, or junior faculty, or experienced teachers. No relationship between student ratings and faculty rank or experience has been consistently demonstrated.
12. There are conflicting indications about how well one or two global items (as opposed to more specific items) reflect instructional effectiveness. The amount of research on this issue is relatively limited, and on balance it is not conclusive.
13. The single most frequently researched issue on student ratings is this: Do grades that students receive in a course correlate with their ratings of the course and instructor? Arreola (1995) reviewed over 400 studies on this question. The signals are mixed. Various studies have found mildly positive, zero, and even mildly negative correlations. The clearest interpretation from the relevant studies is this: the relationship between grades and ratings is weak.
Single Dimensions, Multiple Dimensions
As noted above, the research literature does not provide a clear indication about how well one or two global items, as opposed to more specific items, reflect instructional effectiveness. The interpretation here is that both approaches are defensible.
Unless teaching is viewed conceptually as a single behavior, there is theoretical justification for using multiple items to assess the quality of teaching. Also, to the extent that evaluations are to be used formatively, specific measures are necessary to identify particular strengths and behaviors upon which individual instructors can improve. On the other hand, there is evidence that global quality-of-instruction items are statistically stable and, as noted, that they can be psychometrically valid. Also, to the extent that student ratings are one source (among many) of information for personnel decisions, the use of one or two global items is desirable on the basis of practicality.
In short, a case can be made for a multidimensional approach to the evaluation of teaching. A case can also be made for the use of one or two global measures. Neither necessarily precludes the other.
Background on the Penn State SRTE
Faculty Senate documentation (Senate Committee on Faculty Affairs, 1984, 1985, 1989) traces the evolution of what is now the Student Rating of Teaching Effectiveness. It was in 1980 that a joint faculty-administrative committee on promotion and tenure policies made a recommendation for student evaluation of teaching, more or less along the lines of the SRTE. The process from that point forward involved rounds of Senate recommendations, legislation, discussion, comments to the Senate from Presidents Oswald and Jordan, input from external consultants, and various revisions. A review of that entire history would be tedious here, but it probably is worth highlighting the more persistent questions:
- Should each unit select its own method of evaluation, or should there be a more uniform system across the University?
- Should the University standardize the reporting of evaluation results (a separate issue from whether the evaluations should be standardized)?
- Should and/or can an evaluation system for course or instructor improvement also serve evaluation for promotion and tenure purposes?
- How much weight should student input have relative to peer input?
- How frequently should course and/or instructor evaluations be required?
- Who should be allowed to summarize, interpret, and otherwise use the results of teaching evaluations?
At its meeting on April 30, 1985, the University Faculty Senate endorsed a Senate Committee on Faculty Affairs report that included a statement of recommended practices for the evaluation of teaching effectiveness for promotion and tenure. This mandated the use of in-class student rating surveys, and outlined what would become the SRTE. For example, the 1985 report recommended a combination of University-wide global items with other questions selected from a cafeteria of items, all using a seven-point scale. That advisory and consultative report also contained the following statement (Senate Committee on Faculty Affairs, 1985, p. 5) that has, over the years, become part of Penn State’s administrative folklore:
Student evaluation surveys shall belong to the unit which administers them. The faculty member shall be furnished with a copy of all survey results. Results shall be included in promotion and tenure documentation.
In a 1989 report, the Faculty Affairs committee could state that the University had created the SRTE, incorporated it into the promotion and tenure policy guidelines, was administering the SRTE, and was using the results widely. However, in that same report, the Faculty Affairs committee also noted that “this in-class survey has not lived up to its promises.” Topics of concern raised in 1989 included the effectiveness of unit input into creation of the form; consistency of standards for distribution and administration; the fair use of the SRTE in personnel decisions; and the reporting and interpretation of SRTE results. Analysis of (and suggestions for changes to) the SRTE have continued, up to and including this report.
Penn State’s Student Rating of Teaching Effectiveness contains the following four University-wide items:
A1. Are you taking this course as an elective? (If uncertain, omit)
A2. What grade do you expect to earn in this course? A B C D F
A3. Rate the overall quality of this course.
A4. Rate the overall quality of the instructor.
The instrument also allows individual departments to select up to 15 items from a bank of 177 items. Instructors may also add 5 items. All items except for A1 and A2 use a response scale of 1 (lowest rating) to 7 (highest rating).
The SRTE data file does not contain data on actual grades received nor demographic data (such as gender, rank, or class standing) about instructors or students. The SRTE data also do not identify individual instructors or students.
SRTE participation rates as a percentage of course enrollments from 1991-92 through 1995-96 averaged 70 percent at University Park, and 80 percent at other locations (Office of Undergraduate Education, October 1996). Participation rates ranged from 58 percent in fall 1992 at University Park, to 99 percent in summer 1994 at other campuses. In fall 1995, there were 178,000 responses (out of 238,000 possible), for an overall participation rate of 75 percent (72 percent at University Park, 78 percent at other locations).
A 1993 study by Randy Deike investigated the SRTE in terms of its “dependability” (that is, the extent to which the measure yields the same results on repeated trials). The study included SRTE data for a subset of instructors who had taught two or more different courses, and were rated by five or more students, in Penn State’s College of the Liberal Arts in Spring 1993 - 128 courses with 7,236 SRTE observations. The research question concerned only the scores for the University-wide item that asks students to “Rate the overall quality of the instructor.” Deike’s research did not examine the relationship of those ratings to specific factors such as course level, section size, or gender. Instead, the study analyzed the scores’ variance across students and courses to see how consistent the scores were, given multiple observations of each instructor.
The generalizability coefficients that Deike found for given numbers of courses and students were similar to the results of other studies. For example, the generalizability coefficients of the SRTE were in the range of .45 to .51 when two courses, with five to 20 students, were included. He cited other research that found generalizability coefficients of about .40 to .53, for two courses with five to 20 students.
Although the Penn State data were limited to one semester of observations, those more extensive studies have found that global measures of instructor quality are highly stable when an adequate number of courses and students are included. Deike cited other research that estimated generalizability coefficients in the range of .62 to .84 for ratings covering five to ten courses and 20 students.
The policy implication is twofold. First, to evaluate a single instructor, SRTE quality-of-instructor data should come from at least five different courses and 20 or more students. Second, given a large enough sample, the measure is probably statistically dependable.
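The dependability logic above can be sketched numerically. In a generalizability-theory framing, the coefficient for an instructor's mean rating rises as course-to-course error shrinks with more courses and student-to-student error shrinks with more total ratings. The variance components in the sketch below are hypothetical values chosen purely for illustration; they are not Deike's estimates.

```python
# Minimal sketch of a generalizability coefficient for a design in which
# each instructor is rated in n_c courses by n_s students per course.
# The variance components below are HYPOTHETICAL, for illustration only;
# they are not the values estimated in Deike's study.

def g_coefficient(var_instructor, var_course, var_residual, n_c, n_s):
    """Generalizability of the mean rating for one instructor.

    Error due to course-to-course variation shrinks with more courses
    (n_c); error due to student-to-student variation shrinks with more
    total ratings (n_c * n_s).
    """
    error = var_course / n_c + var_residual / (n_c * n_s)
    return var_instructor / (var_instructor + error)

# Hypothetical components: instructor, instructor-by-course, residual.
v_i, v_c, v_r = 0.25, 0.35, 1.6

for n_c, n_s in [(2, 5), (2, 20), (5, 20), (10, 20)]:
    print(f"{n_c} courses x {n_s} students: "
          f"g = {g_coefficient(v_i, v_c, v_r, n_c, n_s):.2f}")
```

Whatever components are assumed, the pattern matches the report's policy implication: the coefficient climbs steadily as courses and students are added, so single-course, small-section ratings are much less dependable than averages over five or more courses.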
It is tempting to conclude that since high generalizability coefficients are possible, student rating scores reflect a single, true, underlying quality: specifically, the overall quality of each instructor. However, as Deike pointed out, his analysis did not address that issue. He stated that different measures of teaching effectiveness might or might not correlate with the SRTE. If teaching is a multi-faceted behavior, then other evaluation approaches (particularly multi-dimensional methods) might be more valid than the SRTE.
Analysis of Fall 1996 SRTE Data
For this report, Penn State’s fall 1996 SRTE data were analyzed using basic descriptive statistics and measures of association and dispersion. Results of analyzing the complete file for that semester (with over 186,000 observations) are presented in Tables 1-3. Quality of course was, on average, rated 5.38; quality of instructor was, on average, rated 5.63 on a seven-point scale.
Commonly raised questions about the SRTE involve the quality-of-instructor rating in relation to other factors (such as grades, course level, and so on).
The grades that students said they expected were relatively high, at 3.23 on average, as shown in Table 1. According to a December 1996 Senate report, the average grade point average for all baccalaureate degree students in spring 1996 was 2.91 (Senate Committee on Undergraduate Education, 1996). As a basis for comparison to that figure of 2.91, the average expected grade of SRTE respondents in undergraduate courses (not shown in these tables) was 3.12. Thirty-nine percent of SRTE respondents expected a final grade of A; 43 percent expected a final grade of B. Fewer than one percent of the respondents in the SRTE file expected to fail (this is true both in total and for 0-499 courses); according to the Senate’s data, about four percent of baccalaureate course grades are F’s. As noted, the SRTE file does not contain data on actual grades received.
Table 2 shows that the estimated correlation coefficients (r-values) for expected grades to quality of course and quality of instructor were .30 and .23, respectively. There is no hard-and-fast rule for interpreting r-values; a reasonable interpretation here is that there is a modest, positive linear relationship between expected grades and student ratings of quality. In other words, students expecting better grades tend to rate courses and instructors somewhat more positively than do students expecting poorer grades.
The proportion of variation in quality-of-course and quality-of-instructor ratings statistically explained by expected grades is the square of the r-values - only about nine percent and five percent, respectively.
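The arithmetic behind those percentages is simply the square of each correlation, as a two-line illustration makes clear:

```python
# The r-to-r-squared step from the text: squaring a correlation gives the
# proportion of variance statistically explained. The r-values are those
# reported in Table 2 for expected grades.
r_course, r_instructor = 0.30, 0.23

for label, r in [("quality of course", r_course),
                 ("quality of instructor", r_instructor)]:
    print(f"expected grade vs. {label}: r = {r:.2f}, "
          f"variance explained = {r * r:.1%}")
```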
Table 2 also shows convergence among the various quality indicators, with a quality-of-course and quality-of-instructor correlation coefficient of .78, and a department-selected items and quality-of-instructor correlation coefficient of .66.
Departments use the 177-item test bank heavily. Almost every record on the file for fall 1996 (186,808 of 186,832 observations) had at least one department-selected item, and 42 percent (78,744) had the maximum of 15. University-wide, departments clearly prefer some items over others: 89 percent of the forms used the item, “Rate the clarity of the instructor’s presentations,” while some of the 177 items in the test bank were never used.
As shown in Table 1, student ratings on average were slightly higher for the departmental items than for quality of instructor, and slightly higher for quality of instructor than quality of course. This study did not analyze individual responses to each of the 177 items in the test bank. The main point is that, as far as was determined, the various quality ratings were consistent with one another.
The relationship between grades and student ratings is, as noted, basically similar to that found in other research. The extremely weak relationships of the SRTE student ratings to section size, and student ratings to course level, are consistent with the research literature as well. Data in Table 3 show higher ratings for graduate and upper-division courses than for lower-division courses. Table 3 also suggests a possible curvilinear effect of section size, as students generally preferred smallish or very large sections; however, there was only one section (with about 800 respondents) in the largest category. Penn State students taking courses as electives rated the instructors slightly more favorably than did students taking courses as requirements. Consistent with the broader body of research, none of these differences are especially large.
(Table 3 does not show F and p values because for very large data sets, tests of significance mean little. In this case, the F values ranged from about 100 to over 2,000, and all p values were less than .001. Although statistically significant, the results do not address whether the relationships are of any educational importance, nor do they indicate anything about causality.)
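The point about significance tests in very large samples can be illustrated with a small simulation (synthetic ratings, not SRTE data): a difference of 0.05 points on a seven-point scale is educationally negligible, yet with tens of thousands of respondents per group it easily clears the threshold for p < .001.

```python
import math
import random

# Synthetic illustration (NOT SRTE data): two groups of ratings whose true
# means differ by only 0.05 on a seven-point scale, at a sample size in the
# neighborhood of the fall 1996 SRTE file.
random.seed(0)
n = 90_000
a = [random.gauss(5.60, 1.3) for _ in range(n)]
b = [random.gauss(5.65, 1.3) for _ in range(n)]

def mean(x):
    return sum(x) / len(x)

def var(x):
    m = mean(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

# Two-sample test statistic; at this n, the t and z versions coincide.
se = math.sqrt(var(a) / n + var(b) / n)
t = (mean(b) - mean(a)) / se
print(f"mean difference = {mean(b) - mean(a):.3f}, t = {t:.1f}")
# t far exceeds the ~3.3 needed for two-sided p < .001, even though the
# underlying difference in means is trivial.
```

This is exactly why the report presents effect sizes and mean differences rather than F and p values: with n in the hundreds of thousands, almost any difference is "significant."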
On average, students at non-University Park locations rate the quality of instructors slightly higher (5.74 compared to 5.56) than do students at University Park. A campus-by-campus tabulation, not reproduced in this report, shows quality-of-instructor averages ranging from a low of 5.56 at University Park to a high of 6.11 at Shenango Valley. (41 percent of the SRTE records were from non-University Park locations.)
This report has focused on the SRTE data for the quality-of-instructor rating. Some analysis was done on the quality-of-course data, as well. The results were generally similar. As already noted, there was a positive relationship to expected grades. There was also a mild positive relationship between quality-of-course and course level, but within the relatively narrow range of 5.33 to 5.61. There was a mild inverse relationship with section size, but again within a relatively narrow range: 5.04 to 5.46. The other effects were essentially negligible.
A report titled “Penn State: Quality of Instruction” was recently completed by Fern Willits as an Alumni Teaching Fellow project. Professor Willits’ study was independent of this SRTE analysis, and her methodology was completely different; nonetheless, her results were very similar. Willits surveyed thousands of students and faculty about the quality of instruction at Penn State. What she found was that students want to be challenged, that they want to learn, and that they are discriminating judges of good instruction. In spite of the mythology, the Willits study found that the most important predictors of student ratings of quality were not class size, or difficulty, or instructor enthusiasm, or expected grade. To the contrary, that analysis concluded (Willits, Moore, & Enerson, 1997, p. 32): “Perhaps the single most stunning finding in all of the student data reported here is that the most powerful predictor of students’ overall evaluations of a course was the amount they felt they had learned in the course.”
The most important question about the effectiveness of the SRTE is the validity question: does the SRTE, in fact, measure instructional effectiveness? It is precisely this question that cannot be answered from SRTE data alone. Other measures of student learning, or other evaluations of instructional effectiveness (such as peer reviews), are necessary to reach strong conclusions about validity of any student-rating instrument. The positive correlations among the several SRTE “quality” items (quality of course, quality of instructor, and department-selected items) provide some evidence that the instrument’s items are at least consistent with one another.
Aleamoni, Lawrence M. (1987). “Student rating myths versus research facts,” Journal of Personnel Evaluation in Education, (1) 111-119.
Arreola, Raoul A. (1995). Developing a comprehensive faculty evaluation system. Bolton, MA: Anker.
Deike, Randall C. (August 1993). The dependability of the Student Rating of Teaching Effectiveness. University Park, PA: The Pennsylvania State University.
Feldman, Kenneth A. (1978). “Course characteristics and college students’ ratings of their teachers: What we know and what we don’t,” Research in Higher Education, (9) 199-242.
Office of Undergraduate Education (October 1996). SRTE participation rates. Unpublished data. University Park, PA: The Pennsylvania State University.
Senate Committee on Faculty Affairs (May 1, 1984). Evaluating teaching effectiveness for promotion and tenure. University Park, PA: The Pennsylvania State University.
Senate Committee on Faculty Affairs (April 30, 1985). Evaluating teaching effectiveness for promotion and tenure. University Park, PA: The Pennsylvania State University.
Senate Committee on Faculty Affairs (February 21, 1989). Revision of statement of practices for the evaluation of teaching effectiveness for promotion and tenure. University Park, PA: The Pennsylvania State University.
Senate Committee on Undergraduate Education (December 3, 1996). Grade distribution report. University Park, PA: The Pennsylvania State University.
Theall, Michael and Jennifer Franklin (eds.) (1990). Student ratings of instruction: Issues for improving practice. New Directions for Teaching and Learning no. 43. San Francisco: Jossey-Bass.
Willits, Fern K., Betty L. Moore and Diane M. Enerson (1997). Penn State: Quality of instruction. Surveys of students and teachers at University Park. University Park, PA: Center for Excellence in Learning and Teaching.