I am a terrific teacher. How do I know? Easy answer: my student course evaluations of me as an instructor consistently conclude that, on a five-point scale, I get a cumulative score of 4.5 to 4.7 on overall effectiveness in the classroom. I know, you may be asking, why not a 5? Well, I never said I was perfect. The beauty of statistics is never having to say you are certain. With this confession off my chest (“terrific” but not “perfect”), I turn now to some analyses.
In an Academic Leader “Parting Shot” column in 2006, I raised questions in an article titled “Student Evaluations of Instructors: A Good Thing?” Those questions cast doubt on the entire process of students evaluating their instructors. I asked:
My answers to these queries were perhaps, possibly, in some cases, who knows? I quoted philosopher of science Michael Scriven, who surmised “All student evaluations are face invalid and [less reliable] than the polygraph in criminal cases.” I concluded that, indeed, student evaluations of instructors might not be, in the words of Martha Stewart, “a good thing.”
This issue in higher education continues to generate debate, with arguments from many sides of the controversy. In a 2014 study titled “An Evaluation of Course Evaluations,” researcher Philip B. Stark concluded that “the common practice of relying on averages of student teaching evaluation scores as the primary measure of teaching effectiveness should be abandoned for substantive and statistical reasons.” In this study of instructor evaluation surveys at the University of California, Berkeley, Stark points out that such surveys are popular because “the measurement is easy and takes little class or faculty time, ... have an air of objectivity simply by virtue of being numerical, ... and comparing an instructor’s average rating to departmental averages is easy” (p. 3). We academic administrators like the convenience of simple numbers that seem to measure complex issues like “teacher effectiveness.”
So, what’s the problem? Now that many, perhaps most, colleges and universities use online evaluations, response rates are not nearly as good as for in-class evaluations. Lower response rates, Stark points out, can skew statistical averages, especially because online surveys are affected by student motivation: those who are most unhappy, even angry, are more likely than are happy, satisfied students to complete the survey. In short, if the response rate is low, the data are not likely to be representative of the whole class. (p. 5)
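A small worked example may make the response-rate problem concrete. The sketch below is my own illustration, not Stark's data: the class size, the spread of opinions, and the response rates (`class_ratings`, `response_rate`) are all hypothetical, chosen only to show how an average drifts when unhappy students are more likely to respond.

```python
# Hypothetical class of 40 students: how many hold each opinion
# on a five-point scale (1 = worst, 5 = best). Invented numbers.
class_ratings = {1: 4, 2: 4, 3: 8, 4: 14, 5: 10}

# Hypothetical response rates: the unhappiest students are assumed
# to be the most motivated to fill out the online survey.
response_rate = {1: 0.9, 2: 0.7, 3: 0.3, 4: 0.3, 5: 0.3}


def mean_rating(counts):
    """Average rating given a mapping of rating -> number of students."""
    total = sum(counts.values())
    return sum(rating * n for rating, n in counts.items()) / total


# Expected number of respondents at each rating level.
expected_respondents = {
    rating: n * response_rate[rating] for rating, n in class_ratings.items()
}

true_mean = mean_rating(class_ratings)             # what the whole class thinks
observed_mean = mean_rating(expected_respondents)  # what the survey reports
overall_response_rate = (
    sum(expected_respondents.values()) / sum(class_ratings.values())
)

print(f"Response rate:        {overall_response_rate:.0%}")
print(f"Whole-class average:  {true_mean:.2f}")
print(f"Respondents' average: {observed_mean:.2f}")
```

Under these assumed numbers, a 40 percent response rate pulls the reported average roughly half a point below what the full class would have said, even though no student changed an opinion.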
Stark points to a number of other statistical problems with such rating scales: (1) you should not try to average ordinal categories where the numbers are really only labels, not values; (2) rating scales result in averages for a department, but “the mere fact that one instructor’s average rating is above or below the departmental average says little”; (3) it makes no statistical sense to compare scores “across seminars, studios, labs, prerequisites, large lower-division courses, required major courses, etc.” (p. 7); (4) even the student comment section is suspect, as students and faculty have quite varied definitions of terms such as “fair, professional, organized, and respectful,” and comparing comments across disciplines is difficult. (p. 8)
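The first of these points, about averaging ordinal categories, can be illustrated with a short sketch. The two instructors and their ratings below are hypothetical, invented only to show that an identical average can hide very different classroom experiences; the `summarize` helper is likewise my own and not part of any evaluation instrument.

```python
from collections import Counter
from statistics import mean

# Hypothetical ratings on a five-point scale (invented for illustration).
instructor_a = [3] * 10            # every student is lukewarm
instructor_b = [1] * 5 + [5] * 5   # half the class is unhappy, half delighted


def summarize(ratings):
    """Return the average and the full count of responses per category."""
    counts = Counter(ratings)
    return {
        "mean": mean(ratings),
        "distribution": {score: counts.get(score, 0) for score in range(1, 6)},
    }


summary_a = summarize(instructor_a)
summary_b = summarize(instructor_b)

print("Instructor A:", summary_a)
print("Instructor B:", summary_b)
```

Both hypothetical instructors average exactly 3, yet one class is uniformly indifferent and the other is sharply divided. Reporting the full distribution of responses, rather than the average alone, keeps that difference visible.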
This recent study concludes that “measuring learning is hard” and that “some students do not learn much no matter what an instructor does.” (p. 10) To infer cause and effect requires a controlled, randomized experiment, and student evaluations across myriad subjects and students are hardly that. The scores from the Berkeley instrument do show high correlations between teaching effectiveness scores and ratings for grade expectation, enjoyment of the course, gender, ethnicity, and the instructor’s age. (p. 13) These correlations cannot be considered a good thing. If, for example, a professor has rigorous course requirements and grading standards that lead some students to expect low grades, is it “fair” for those students to give that professor a low rating on “effectiveness”? Is this an accurate judgment or mere retaliation?
A second recent study by Lillian MacNell and her colleagues at North Carolina State University focused specifically on the question of how male and female faculty members are rated by students in college courses. It was an unusual controlled experiment conducted in an online course. The 43 students in the course were divided into four discussion groups of eight to 12 students each, with a female instructor teaching two of the groups and a male instructor the other two. Students never saw their instructors. The female instructor told one of her groups she was male, and the male instructor told one of his groups he was female. The end-of-course student rating scale asked students to evaluate the teaching on 12 traits related to teaching effectiveness and interpersonal skills. The instructor the students thought was male received higher ratings on all the traits related to teaching effectiveness and interpersonal skills, regardless of whether the instructor was actually male or female.
MacNell observed that the “male” instructor received markedly higher ratings on professionalism, fairness, respectfulness, giving praise, enthusiasm, and promptness. Granted, the small sample size undermines any claim of scientific validity, but the experiment itself illustrates how one factor—gender—can affect student perceptions and, hence, their ratings. In this study, the “female” professor was rated a full point lower on the item “returns work promptly” even though both professors returned work on exactly the same schedule.
Now, I am rethinking my claim to be a terrific teacher. It seems to me that a broader, more holistic view of my teaching effectiveness cannot be captured in a single number or even multiple statistical outcomes of student ratings. It has been said that statistics is the art of drawing a perfectly straight line from a faulty assumption to a fallacious conclusion. When it comes to the tricky business of student ratings of instructors, this seems to be the case. The value of such ratings is limited and should be viewed in a broad context; otherwise, a fallacious conclusion is likely.
It is more likely that my self-perceived pedagogical excellence is the product of factors such as these:
Naturally, this is but a partial list of factors that can affect student perceptions of teacher effectiveness and thus their ratings. These “perception enhancements” no doubt inflate my numerical ratings. I am fortunate indeed! But am I a terrific teacher? My answers to that question are much like the ones I provided in my earlier article: perhaps, possibly, in some cases, who knows? Student evaluations will give instructors and their deans some insight into the students’ perceptions of their experiences in the course, but we should not overgeneralize such results with glib conclusions—including one that says I am effective, let alone terrific.
MacNell, L., Driscoll, A., & Hunt, A. (December 2014). What’s in a name? Exposing gender bias in student ratings of teaching. Innovative Higher Education.
Stark, P. (2014). An evaluation of course evaluations. Center for Teaching and Learning, University of California, Berkeley.
Thomas R. McDaniel is senior vice president and professor of education at Converse College in Spartanburg, S.C. He serves on the Academic Leader editorial advisory board.